Commit

Update manual and cli
Itolstoganov committed Jun 5, 2023
1 parent b919f2c commit f3c221a
Showing 2 changed files with 94 additions and 49 deletions.
61 changes: 54 additions & 7 deletions README.md
@@ -26,12 +26,49 @@ Now to run SpLitteR move to folder `assembler/` and execute

### Input

#### Format

The tool requires

- Assembly graph file in [GFA 1.0 format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md), with scaffolds included as path lines.
- SLR library in YAML format. The tool supports SLR libraries produced using 10X Genomics Chromium and TELL-seq technologies. SLR library should be in FASTQ format with barcodes attached as BC:Z or BX:Z tags:
- SLR library in YAML format. The tool supports SLR libraries produced using 10X Genomics Chromium and UST TELL-Seq technologies. Other SLR technologies, such as stLFR or LoopSeq, can potentially be used as input if converted to the 10X or TELL-Seq format.

SpLitteR supports [LJA](https://github.com/AntonBankevich/LJA) and [Flye](https://github.com/fenderglass/Flye) assembly graphs out of the box. Other assembly graphs should preferably be converted into blunt format, e.g. with the [GetBlunted](https://github.com/vgteam/GetBlunted) utility.
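
Since the tool expects scaffolds as path lines, it can be useful to check a graph file before running it. The following is a minimal sketch, not part of SpLitteR: `GfaCounts` and `CountGfaRecords` are hypothetical names, and the only assumption about the input is the GFA 1.0 convention that each record starts with a one-letter type (`S`, `L`, `P`) followed by a tab.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of SpLitteR): tallies GFA 1.0 record types.
struct GfaCounts {
    std::size_t segments = 0;  // "S" records
    std::size_t links = 0;     // "L" records
    std::size_t paths = 0;     // "P" records (scaffolds are expected here)
};

GfaCounts CountGfaRecords(std::istream &in) {
    GfaCounts counts;
    std::string line;
    while (std::getline(in, line)) {
        // A record line is a one-character type followed by a tab.
        if (line.size() < 2 || line[1] != '\t')
            continue;
        switch (line[0]) {
            case 'S': ++counts.segments; break;
            case 'L': ++counts.links; break;
            case 'P': ++counts.paths; break;
            default: break;  // headers, comments and other record types are ignored
        }
    }
    return counts;
}
```

Calling `CountGfaRecords` on a `std::ifstream` opened on the graph file and finding `paths == 0` would suggest that the graph carries no scaffold information.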

#### UST TELL-Seq format

A TELL-Seq library should include barcodes, left reads, and right reads as three separate FASTQ files.

For example, if you have a TELL-Seq library

``` bash
tellseq_reads_I1.fastq.gz
tellseq_reads_R1.fastq.gz
tellseq_reads_R2.fastq.gz
```

the YAML file should look like this:

``` yaml

[
{
orientation: "fr",
type: "tell-seq",
right reads: [
"/FULL_PATH_TO_DATASET/tellseq_reads_R2.fastq.gz"
],
left reads: [
"/FULL_PATH_TO_DATASET/tellseq_reads_R1.fastq.gz"
],
aux: [
"/FULL_PATH_TO_DATASET/tellseq_reads_I1.fastq.gz"
]
}
]
```
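
For several libraries, the description can be generated rather than written by hand. The sketch below is an illustration only and not part of SpLitteR: `MakeTellSeqYaml` is a hypothetical helper, and it assumes the `_I1`, `_R1`, `_R2` file-name suffixes used in the example above. The keys and values (`orientation`, `type`, `right reads`, `left reads`, `aux`) are copied from that example.

```cpp
#include <string>

// Hypothetical helper (not part of SpLitteR): builds a TELL-Seq dataset
// description for files named <dir>/<prefix>_{I1,R1,R2}.fastq.gz.
// The file-name suffixes are an assumption taken from the README example.
std::string MakeTellSeqYaml(const std::string &dir, const std::string &prefix) {
    const std::string base = dir + "/" + prefix;
    std::string yaml;
    yaml += "[\n";
    yaml += "  {\n";
    yaml += "    orientation: \"fr\",\n";
    yaml += "    type: \"tell-seq\",\n";
    yaml += "    right reads: [\n";
    yaml += "      \"" + base + "_R2.fastq.gz\"\n";
    yaml += "    ],\n";
    yaml += "    left reads: [\n";
    yaml += "      \"" + base + "_R1.fastq.gz\"\n";
    yaml += "    ],\n";
    yaml += "    aux: [\n";  // barcode reads go into the "aux" entry
    yaml += "      \"" + base + "_I1.fastq.gz\"\n";
    yaml += "    ]\n";
    yaml += "  }\n";
    yaml += "]\n";
    return yaml;
}
```

Writing the returned string to a file such as `tellseq_dataset.yaml` gives a description of the shape shown above; `dir` should be the full path to the dataset.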

#### 10X Genomics Chromium format

A 10X library should be in FASTQ format with barcodes attached as BC:Z or BX:Z tags:

```
@COOPER:77:HCYNTBBXX:1:1216:22343:0 BX:Z:AAAAAAAAAACATAGT
@@ -74,8 +111,9 @@ Main options:

- `-t` Number of threads to use (default: 1/2 of available threads)
- `--mapping-k` k-mer length for read mapping (default: 31)
- `-Gmdbg|-Gblunt` Assembly graph type (mDBG or blunted)
- `-Gmdbg|-Gblunt` Assembly graph type: mDBG (LJA) or blunted (Flye)
- `-Mdiploid|-Mmeta` Repeat resolution mode (diploid or meta)
- `--assembly-info` Path to metaFlye assembly_info.txt file (meta mode, metaFlye graphs only)

Barcode index construction:
- `--count-threshold` Minimum number of reads for barcode index
@@ -84,19 +122,28 @@ Barcode index construction:
- `--linkage-distance` Reads are assigned to the same fragment on long edges based on the linkage distance
- `--min-read-threshold` Minimum number of reads for path cluster extraction
- `--relative-score-threshold` Relative score threshold for path cluster extraction
- `--sampling-factor` Downsample input SLR reads by this factor

Repeat resolution:
- `--score` Score threshold for link index.
- `--tail-threshold` Barcodes are assigned to the first and last <tail_threshold> nucleotides of the edge.
- `--scaffold-links` Use scaffold links in addition to graph links for repeat resolution

Developer options:
- `--ref` Reference path for repeat resolution evaluation
- `--statistics` Produce additional read cloud library statistics
- `--bin-load` Load binary-converted reads from tmpdir
- `--debug` Produce lots of debug data
- `--bin-load` Load read-to-graph alignment
- `--debug` Produce lots of debug data, save read-to-graph alignment
- `--tmp-dir` Scratch directory to use
- `-h, --help` Print help message

Example command lines:

- Assembly produced by LJA from a HiFi diploid human dataset, with a 10X SLR library (HPC compressed)\
`splitter lja_output/mdbg/mdbg.hpc.gfa 10x_dataset.yaml output -Mdiploid -Gmdbg`
- Assembly produced by metaFlye from a metagenomic dataset, with a TELL-Seq SLR library\
`splitter metaflye_output/assembly_graph.gfa tellseq_dataset.yaml output --assembly-info metaflye_output/assembly_info.txt -Mmeta -Gblunt`

### Output

SpLitteR stores all output files in the output directory `<output_dir>`, which is set by the user.
82 changes: 40 additions & 42 deletions assembler/src/projects/splitter/main.cpp
@@ -27,13 +27,13 @@ using namespace cont_index;
using namespace path_extend::read_cloud;

enum class GraphType {
Blunted,
Multiplexed
Blunted,
Multiplexed
};

enum class ResolutionMode {
Diploid,
Meta
Diploid,
Meta
};

struct gcfg {
@@ -51,7 +51,6 @@ struct gcfg {
ResolutionMode mode = ResolutionMode::Diploid;
bool bin_load = false;
bool debug = false;
bool statistics = false;

//barcode_index_construction
size_t frame_size = 40000;
@@ -83,38 +82,37 @@ static void process_cmdline(int argc, char** argv, gcfg& cfg) {

auto cli = (
graph << value("graph (in binary or GFA)"),
file << value("SLR library description (in YAML)"),
output_dir << value("path to output directory"),
(option("--dataset") & value("yaml", file)) % "dataset description (in YAML)",
(option("-l") & integer("value", cfg.libindex)) % "library index (0-based, default: 0)",
(option("--assembly-info") & value("assembly-info", assembly_info))
% "Path to metaflye assembly_info.txt file (meta mode, metaFlye graphs only)",
(option("-t") & integer("value", cfg.nthreads)) % "# of threads to use",
(option("--mapping-k") & integer("value", cfg.mapping_k)) % "k for read mapping",
(option("--tmp-dir") & value("tmp", tmpdir)) % "scratch directory to use",
(option("--ref") & value("reference", refpath)) % "Reference path for repeat resolution evaluation (developer option)",
(option("--bin-load").set(cfg.bin_load)) % "load binary-converted reads from tmpdir (developer option)",
(option("--debug").set(cfg.debug)) % "produce lots of debug data (developer option)",
(option("--statistics").set(cfg.statistics)) % "produce additional read cloud library statistics (developer option)",
(option("--sampling-factor") & value("sampling-factor", cfg.sampling_factor)) % "Sampling factor for read downsampling",
(with_prefix("-G",
option("mdbg").set(cfg.graph_type, GraphType::Multiplexed) |
option("blunt").set(cfg.graph_type, GraphType::Blunted)) % "assembly graph type (mDBG or blunted)"),
(with_prefix("-M",
option("diploid").set(cfg.mode, ResolutionMode::Diploid) |
option("meta").set(cfg.mode, ResolutionMode::Meta)) % "repeat resolution mode (diploid or meta)"),
(option("--frame-size") & value("frame-size", cfg.frame_size)) % "Resolution of barcode index",
(option("--linkage-distance") & value("read-linkage-distance", cfg.read_linkage_distance)) %
"Reads are assigned to the same fragment based on linkage distance",
(option("--score") & value("score", cfg.graph_score_threshold)) % "Score threshold for link index",
(option("--rel-threshold") & value("rel-threshold", cfg.rel_threshold)) % "Relative score threshold for vertex resolution",
(option("--tail-threshold") & value("tail-threshold", cfg.tail_threshold)) %
"Barcodes are assigned to the first and last <tail_threshold> nucleotides of the edge",
(option("--count-threshold") & value("count-threshold", cfg.count_threshold))
% "Minimum number of reads for barcode index",
(option("--scaffold-links").set(cfg.scaffold_links)) % "Use scaffold links in the vertex resolution",
(option("--length-threshold") & value("length-threshold", cfg.length_threshold))
% "Minimum scaffold graph edge length (meta mode option)"
file << value("SLR library description (in YAML)"),
output_dir << value("path to output directory"),
(option("--dataset") & value("yaml", file)) % "dataset description (in YAML)",
(option("-l") & integer("value", cfg.libindex)) % "library index (0-based, default: 0)",
(option("--assembly-info") & value("assembly-info", assembly_info))
% "Path to metaflye assembly_info.txt file (meta mode, metaFlye graphs only)",
(option("-t") & integer("value", cfg.nthreads)) % "# of threads to use",
(option("--mapping-k") & integer("value", cfg.mapping_k)) % "k for read mapping",
(option("--tmp-dir") & value("tmp", tmpdir)) % "scratch directory to use",
(option("--ref") & value("reference", refpath)) % "Reference path for repeat resolution evaluation (developer option)",
(option("--bin-load").set(cfg.bin_load)) % "load binary-converted reads from tmpdir (developer option)",
(option("--debug").set(cfg.debug)) % "produce lots of debug data (developer option)",
(option("--sampling-factor") & value("sampling-factor", cfg.sampling_factor)) % "Sampling factor for read downsampling",
(with_prefix("-G",
option("mdbg").set(cfg.graph_type, GraphType::Multiplexed) |
option("blunt").set(cfg.graph_type, GraphType::Blunted)) % "assembly graph type (mDBG or blunted)"),
(with_prefix("-M",
option("diploid").set(cfg.mode, ResolutionMode::Diploid) |
option("meta").set(cfg.mode, ResolutionMode::Meta)) % "repeat resolution mode (diploid or meta)"),
(option("--frame-size") & value("frame-size", cfg.frame_size)) % "Resolution of barcode index",
(option("--linkage-distance") & value("read-linkage-distance", cfg.read_linkage_distance)) %
"Reads are assigned to the same fragment based on linkage distance",
(option("--score") & value("score", cfg.graph_score_threshold)) % "Score threshold for link index",
(option("--rel-threshold") & value("rel-threshold", cfg.rel_threshold)) % "Relative score threshold for vertex resolution",
(option("--tail-threshold") & value("tail-threshold", cfg.tail_threshold)) %
"Barcodes are assigned to the first and last <tail_threshold> nucleotides of the edge",
(option("--count-threshold") & value("count-threshold", cfg.count_threshold))
% "Minimum number of reads for barcode index",
(option("--scaffold-links").set(cfg.scaffold_links)) % "Use scaffold links in addition to graph links for repeat resolution",
(option("--length-threshold") & value("length-threshold", cfg.length_threshold))
% "Minimum scaffold graph edge length (meta mode option)"
);

auto result = parse(argc, argv, cli);
@@ -154,16 +152,16 @@ struct TimeTracerRAII {
};

gfa::GFAReader ReadGraph(const gcfg &cfg,
debruijn_graph::Graph &graph,
io::IdMapper<std::string> *id_mapper) {
debruijn_graph::Graph &graph,
io::IdMapper<std::string> *id_mapper) {
switch (cfg.graph_type) {
default:
FATAL_ERROR("Unknown graph representation type");
case GraphType::Multiplexed: {
gfa::GFAReader gfa(cfg.graph);
gfa.to_graph(graph, id_mapper);
INFO("GFA segments: " << gfa.num_edges() << ", links: " << gfa.num_links() << ", paths: "
<< gfa.num_paths());
INFO("GFA segments: " << gfa.num_edges() << ", links: " << gfa.num_links() << ", paths: "
<< gfa.num_paths());
return gfa;
}
}
@@ -281,7 +279,7 @@ cont_index::VertexResults GetRepeatResolutionResults(const gcfg &cfg,
case ResolutionMode::Meta: {
auto repetitive_edges = ParseRepetitiveEdges(graph, cfg.assembly_info, id_mapper);
auto repeat_predicate = [&repetitive_edges](const debruijn_graph::EdgeId &edge) {
return repetitive_edges.find(edge) == repetitive_edges.end();
return repetitive_edges.find(edge) == repetitive_edges.end();
};
contracted_graph::DBGContractedGraphFactory factory(graph, repeat_predicate);
factory.Construct();