Simplifies README.md, pointing to readthedocs
samhorsfield96 committed Jan 7, 2023
1 parent 58837f9 commit 9bffe14
Showing 2 changed files with 7 additions and 231 deletions.
235 changes: 4 additions & 231 deletions README.md
@@ -4,6 +4,10 @@ ggCaller traverses [Bifrost](https://github.com/pmelsted/bifrost) graphs constru

ggCaller incorporates [Balrog](https://github.com/salzberg-lab/Balrog) to filter ORFs to improve specificity of calls and [Panaroo](https://github.com/gtonkinhill/panaroo) for pangenome analysis and quality control.

## Documentation

Guides for installation, usage and a tutorial can be found [here](https://ggcaller.readthedocs.io/en/latest/).

## Installation

ggCaller is available on Linux and macOS. If you are running Windows 10/11, Linux can be installed via the Windows Subsystem for Linux ([WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10)).
@@ -32,237 +36,6 @@ git clone --recursive https://github.com/samhorsfield96/ggCaller
python setup.py install
```

## Gene-call mode
In gene-call mode, ggCaller identifies ORFs within a Bifrost graph, filters them using Balrog and clusters them using Panaroo.

ggCaller takes a list of fasta files (one file per line), or a Bifrost GFA file and Bifrost Colours file generated by ```Bifrost build```. See the [Bifrost](https://github.com/pmelsted/bifrost) repository for installation.

ggCaller additionally uses FM-indexing via [SDSLv3](https://github.com/xxsds/sdsl-lite) to remove artificial sequences generated by incorrectly phased nodes in the DBG. This is only applied to genomes designated as references.

#### If Bifrost GFA and Colours file do not exist:

To build a new Bifrost graph using assembled genomes and/or reads, specify ONE OR BOTH:
- ```--refs <refs.txt>``` List of absolute paths to reference sequence fastas (one file per line)
- ```--reads <reads.txt>``` List of absolute paths to read fastas (one file per line)

Note: Ensure assembled genome files are exclusively passed to the ```--refs``` argument, and read files exclusively to the ```--reads```
argument. Bifrost uses k-mer coverage filtering for read files to remove read errors, but does not do this for assembled genomes.
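
The list files described above can be written by hand or generated. The following is a minimal sketch, assuming the assemblies sit in a single directory and share a `.fasta` extension; the helper name `write_path_list` and the directory layout are illustrative and not part of ggCaller.

```python
from pathlib import Path


def write_path_list(fasta_dir, list_file, pattern="*.fasta"):
    """Write absolute paths of matching files to list_file, one per line."""
    # resolve() turns each match into an absolute path, as the list format expects
    paths = sorted(p.resolve() for p in Path(fasta_dir).glob(pattern))
    with open(list_file, "w") as handle:
        for path in paths:
            handle.write(f"{path}\n")
    return paths
```

For example, `write_path_list("assemblies", "refs.txt")` would produce a file suitable for ```--refs```; a separate call on a directory of reads would produce the file for ```--reads```, keeping the two input types apart.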

#### If Bifrost GFA and Colours file already exist:

To run ggCaller using an existing Bifrost GFA file and Colours file, specify BOTH:
- ```--graph <graph.gfa>``` Input GFA
- ```--colours <colours.bfg_colors>``` Input colours file

To inform ggCaller of which files are references, supply them as a list to the ```--refs``` argument as above.
- ```--refs <refs.txt>``` List of absolute paths to reference sequence fastas (one file per line)

If ```--refs``` is not specified, ggCaller assumes all sequences are assembled genomes. If none of the sequences are assembled genomes, additionally specify:
- ```--not-ref```

Note: Ensure the sequences used to build the graph are in the same directories as when the graph was built.
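
Before reusing an existing graph, it may be worth confirming that the files in the list still exist at their recorded locations. This is a minimal sketch of such a check; `missing_paths` is a hypothetical helper, not a ggCaller function, and it only tests that each listed path is a file on disk.

```python
from pathlib import Path


def missing_paths(list_file):
    """Return entries of a path list file that do not exist on disk."""
    missing = []
    with open(list_file) as handle:
        for line in handle:
            entry = line.strip()
            # skip blank lines; report anything that is not an existing file
            if entry and not Path(entry).is_file():
                missing.append(entry)
    return missing
```

An empty return value means every listed sequence file was found where the list says it is.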

#### Additional helpful arguments
- ```--kmer``` k-mer size for graph building. Only used for building of new graphs (default: 31 bp).
- ```--no-filter``` Do not conduct ORF filtering. ggCaller will return all ORFs present.
- ```--threads``` Number of threads (default: 1).
- ```--clean-mode {sensitive, moderate, strict}``` Specify stringency for Panaroo quality control (default: 'sensitive'). See [Panaroo parameters](https://gtonkinhill.github.io/panaroo/#/gettingstarted/params) for details.
- ```--annotation {fast, sensitive, ultrasensitive}``` Annotate clusters. Specify ```fast``` to use diamond in its default mode, ```sensitive``` to use diamond in sensitive mode, or ```ultrasensitive``` to use diamond + HMMscan.
- ```--diamonddb``` Specify the diamond annotation database: either one of the datasets downloaded with ggCaller, ```Bacteria``` (default) or ```Viruses```, or a path to a custom database in fasta format. Both default annotation datasets are from [Uniprot](https://www.uniprot.org/).
- ```--hmmdb``` Specify the HMMscan annotation database, which should be a pre-trained HMMER profile database. Default is the Pfam dataset from [Prokka](https://github.com/tseemann/prokka), downloaded with ggCaller.
- ```--alignment {core, pan}``` Generate alignments and VCFs for the core genome (```core```) or all clusters (```pan```).
- ```--aligner {def, ref}``` Use with ```--alignment``` to specify whether to align all genes in a cluster together at once (```def```) or via a reference-guided approach (```ref```). ```ref``` is faster when more genomes are used to build the graph.
- ```--out``` Output directory (default: 'ggCaller_output').

## Examples
- Build graph using assembled genomes, with a k-mer size of 31 bp and strict clean mode.

```ggcaller --refs refs.txt --kmer 31 --clean-mode strict --out output_dir```

- Build graph using reads, with fast annotation using diamond.

```ggcaller --reads reads.txt --annotation fast --out output_dir```

- Build graph using assembled genomes and reads, with reference-based pangenome alignment.

```ggcaller --refs refs.txt --reads reads.txt --alignment pan --aligner ref --out output_dir```

- Use existing graph, specifying which genomes are assembled and disabling ORF filtering.

```ggcaller --graph graph.gfa --colours colours.bfg_colours --refs refs.txt --no-filter --out output_dir```

- Use existing graph which was built using only reads.

```ggcaller --graph graph.gfa --colours colours.bfg_colours --not-ref --out output_dir```
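
When ggCaller is driven from a script, the input rules above can be encoded once. This is a minimal sketch, assuming `ggcaller` is on the `PATH`; the function `ggcaller_command` and its parameter names are illustrative, and it only uses flags documented in this README.

```python
import shlex


def ggcaller_command(out_dir, refs=None, reads=None, graph=None, colours=None,
                     not_ref=False, extra=()):
    """Assemble a ggcaller argument list from the input options described above."""
    # an existing graph needs BOTH the GFA and the colours file
    if (graph is None) != (colours is None):
        raise ValueError("--graph and --colours must be given together")
    # a new graph needs --refs, --reads, or both
    if graph is None and refs is None and reads is None:
        raise ValueError("give --refs and/or --reads, or an existing graph")
    cmd = ["ggcaller"]
    if graph is not None:
        cmd += ["--graph", graph, "--colours", colours]
    if refs is not None:
        cmd += ["--refs", refs]
    if reads is not None:
        cmd += ["--reads", reads]
    if not_ref:
        cmd.append("--not-ref")
    cmd += list(extra)
    cmd += ["--out", out_dir]
    return cmd


cmd = ggcaller_command("output_dir", refs="refs.txt",
                       extra=["--kmer", "31", "--clean-mode", "strict"])
print(shlex.join(cmd))
# prints: ggcaller --refs refs.txt --kmer 31 --clean-mode strict --out output_dir
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.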

Test data from [Bentley et al. 2006](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020031) is available in the ```data``` directory.

### Outputs

ggCaller generates a number of outputs depending on the arguments specified.

#### Always generated
- Annotated gene calls in fasta format
- Gene presence/absence matrix in Panaroo, Roary and RTAB format
- Structural variant presence/absence matrix
- Pangenome reference, containing largest sequence for each gene cluster
- Pre/post filtered Panaroo graph
- Rarefaction curve
- Gene frequency plot
- Cluster size plot
- Neighbour joining tree based on gene presence/absence
- Summary statistics on gene frequency

#### If ```--annotation fast/sensitive``` specified
- Annotated gene calls in GFF format

#### If ```--alignment pan``` or ```--alignment core``` specified
- Neighbour joining tree based on core genome alignment
- Core genome alignment

#### If ```--alignment pan``` specified
- Aligned sequences for each cluster
- Variant calls for each cluster in VCF format

## Query mode
ggCaller can query pre-called genes within a graph. This requires ggCaller to have been run on a prior dataset, and the outputs saved.

### Example
First generate a set of gene calls and save them. This will generate a data directory within the output directory, ```output_dir/ggc_data```

```ggcaller --graph graph.gfa --colours colours.bfg_colours --save --out output_dir```

Then supply ggCaller with the graph and the data directory, along with the queries in fasta format, writing the results to the same output directory:

```ggcaller --graph graph.gfa --colours colours.bfg_colours --data output_dir/ggc_data --query query.fasta --out output_dir```

### Output
ggCaller generates a fasta file detailing the genes overlapping with each query, and in which genomes they are found.
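
The two query-mode steps share the same graph, colours file and output directory, so they can be generated together. This is a minimal sketch under that assumption; `query_workflow` is a hypothetical helper, and the `ggc_data` directory name is taken from the example above.

```python
from pathlib import Path


def query_workflow(graph, colours, query, out_dir):
    """Return the two ggcaller invocations for query mode as argument lists."""
    # the --save run writes its data directory inside the output directory
    data_dir = str(Path(out_dir) / "ggc_data")
    save_cmd = ["ggcaller", "--graph", graph, "--colours", colours,
                "--save", "--out", out_dir]
    query_cmd = ["ggcaller", "--graph", graph, "--colours", colours,
                 "--data", data_dir, "--query", query, "--out", out_dir]
    return save_cmd, query_cmd
```

Running the first list and then the second, for instance with `subprocess.run(..., check=True)`, reproduces the example above while keeping the data directory path consistent between the two calls.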

## All I/O options
```
usage: ggcaller [-h] [--graph GRAPH] [--colours COLOURS] [--not-ref] [--refs REFS] [--reads READS] [--query QUERY] [--codons CODONS] [--kmer KMER] [--save] [--data DATA] [--all-seq-in-graph] [--out OUT]
[--max-path-length MAX_PATH_LENGTH] [--min-orf-length MIN_ORF_LENGTH] [--score-tolerance SCORE_TOLERANCE] [--max-ORF-overlap MAX_ORF_OVERLAP] [--min-path-score MIN_PATH_SCORE]
[--min-orf-score MIN_ORF_SCORE] [--max-orf-orf-distance MAX_ORF_ORF_DISTANCE] [--query-id QUERY_ID] [--no-filter] [--no-write-idx] [--no-write-graph] [--repeat] [--no-clustering] [--no-refind]
[--identity-cutoff IDENTITY_CUTOFF] [--len-diff-cutoff LEN_DIFF_CUTOFF] [--family-threshold FAMILY_THRESHOLD] [--merge-paralogs] [--clean-mode {strict,moderate,sensitive}]
[--annotation {fast,sensitive,ultrasensitive}] [--diamonddb ANNOTATION_DB] [--hmmdb HMM_DB] [--evalue EVALUE] [--truncation-threshold TRUNCATION_THRESHOLD] [--search-radius SEARCH_RADIUS]
[--refind-prop-match REFIND_PROP_MATCH] [--remove-invalid-genes] [--min-trailing-support MIN_TRAILING_SUPPORT] [--trailing-recursive TRAILING_RECURSIVE]
[--edge-support-threshold EDGE_SUPPORT_THRESHOLD] [--length-outlier-support-proportion LENGTH_OUTLIER_SUPPORT_PROPORTION] [--remove-by-consensus {True,False}]
[--high-var-flag CYCLE_THRESHOLD_MIN] [--min-edge-support-sv MIN_EDGE_SUPPORT_SV] [--no-clean-edges] [--alignment {core,pan}] [--aligner {def,ref}] [--core-threshold CORE] [--no-variants]
[--ignore-pseduogenes] [--quiet] [--threads THREADS] [--version]
Generates ORFs from a Bifrost graph.
optional arguments:
-h, --help show this help message and exit
Input/Output options:
--graph GRAPH Bifrost GFA file generated by Bifrost build.
--colours COLOURS Bifrost colours file generated by Bifrost build.
--not-ref If using an existing graph, the graph was not built exclusively with assembled genomes. [Default = False]
--refs REFS List of reference genomes (one file path per line).
--reads READS List of read files (one file path per line).
--query QUERY List of unitig sequences to query (either FASTA or one sequence per line)
--codons CODONS JSON file containing start and stop codon sequences.
--kmer KMER K-mer size used in Bifrost build (bp). [Default = 31]
--save Save graph objects for sequence querying. [Default = False]
--data DATA Directory containing data from previous ggCaller run generated via "--save"
--all-seq-in-graph Retains all DNA sequence for each gene cluster in the Panaroo graph output. Off by default as it uses a large amount of space.
--out OUT Output directory
ggCaller traversal and gene-calling cut-off settings:
--max-path-length MAX_PATH_LENGTH
Maximum path length during ORF finding (bp). [Default = 20000]
--min-orf-length MIN_ORF_LENGTH
Minimum ORF length to return (bp). [Default = 90]
--score-tolerance SCORE_TOLERANCE
Length probability tolerance for shorter alternative start sites. If within tolerance, ggCaller will check if start coverage and BALROG score are both higher in shorter ORF. [Default = 0.2]
--max-ORF-overlap MAX_ORF_OVERLAP
Maximum overlap allowed between overlapping ORFs. [Default = 60]
--min-path-score MIN_PATH_SCORE
Minimum total Balrog score for a path of ORFs to be returned. [Default = 100]
--min-orf-score MIN_ORF_SCORE
Minimum individual Balrog score for an ORF to be returned. [Default = 100]
--max-orf-orf-distance MAX_ORF_ORF_DISTANCE
Maximum distance for graph traversal during ORF connection (bp). [Default = 10000]
--query-id QUERY_ID Ratio of query k-mers required to match in graph. [Default = 0.8]
Settings to avoid/include algorithms:
--no-filter Do not filter ORF calls using Balrog. Will return all ORF calls. [Default = False]
--no-write-idx Do not write FMIndexes to file. [Default = False]
--no-write-graph Do not write Bifrost GFA and colours to file. [Default = False]
--repeat Enable traversal of nodes multiple times. [Default = False]
--no-clustering Do not cluster ORFs. [Default = False]
--no-refind Do not refind uncalled genes [Default = False]
Gene clustering options.:
--identity-cutoff IDENTITY_CUTOFF
Minimum identity at amino acid level between two ORFs for clustering. [Default = 0.98]
--len-diff-cutoff LEN_DIFF_CUTOFF
Minimum ratio of length between two ORFs for clustering. [Default = 0.98]
--family-threshold FAMILY_THRESHOLD
protein family sequence identity threshold (default=0.7)
--merge-paralogs don't split paralogs
Panaroo run mode options:
--clean-mode {strict,moderate,sensitive}
The stringency mode at which to run panaroo. Must be one of 'strict', 'moderate' or 'sensitive'. Each of these modes can be fine tuned using the additional parameters in the 'Graph
correction' section. strict: Requires fairly strong evidence (present in at least 5% of genomes) to keep likely contaminant genes. moderate: Requires moderate evidence (present in at
least 1% of genomes) to keep likely contaminant genes. sensitive: Does not delete any genes and only performs merge and refinding operations. Useful if rare plasmids are of interest as
these are often hard to distinguish from contamination. Results will likely include higher number of spurious annotations.
Panaroo gene cluster annotation options:
--annotation {fast,sensitive,ultrasensitive}
Annotate genes using diamond default (fast), diamond sensitive (sensitive) or diamond and HMMscan (ultrasensitive). If not specified, no annotation done
--diamonddb ANNOTATION_DB
Diamond database. Defaults are 'Bacteria' or 'Viruses'. Can also specify path to fasta file for custom database generation
--hmmdb HMM_DB HMMER hmm profile file. Default is Uniprot HAMAP. Can also specify path to pre-built hmm profile file generated using hmmbuild
--evalue EVALUE Maximum e-value to return for DIAMOND and HMMER searches during annotation
--truncation-threshold TRUNCATION_THRESHOLD
Sequences in a gene family cluster below this proportion of the length of the centroid will be annotated as 'potential pseudogene'
Panaroo gene-refinding options:
--search-radius SEARCH_RADIUS
the distance in nucleotides surrounding the neighbour of an accessory gene in which to search for it
--refind-prop-match REFIND_PROP_MATCH
the proportion of an accessory gene that must be found in order to consider it a match
Panaroo graph correction stringency options:
--remove-invalid-genes
removes annotations that do not conform to the expected Prokka format such as those including premature stop codons.
--min-trailing-support MIN_TRAILING_SUPPORT
minimum cluster size to keep a gene called at the end of a contig
--trailing-recursive TRAILING_RECURSIVE
number of times to perform recursive trimming of low support nodes near the end of contigs
--edge-support-threshold EDGE_SUPPORT_THRESHOLD
minimum support required to keep an edge that has been flagged as a possible mis-assembly
--length-outlier-support-proportion LENGTH_OUTLIER_SUPPORT_PROPORTION
proportion of genomes supporting a gene with a length more than 1.5x outside the interquartile range for genes in the same cluster (default=0.01). Genes failing this test will be re-annotated at the shorter length
--remove-by-consensus {True,False}
if a gene is called in the same region with similar sequence a minority of the time, remove it. One of 'True' or 'False'
--high-var-flag CYCLE_THRESHOLD_MIN
minimum number of nested cycles to call a highly variable gene region (default = 5).
--min-edge-support-sv MIN_EDGE_SUPPORT_SV
minimum edge support required to call structural variants in the presence/absence sv file
--no-clean-edges Turn off edge filtering in the final output graph.
Gene alignment options:
--alignment {core,pan}
Output alignments of core genes or all genes. Options are 'core' and 'pan'. Default: 'None'
--aligner {def,ref} Specify an aligner. Options:'ref' for reference-guided MSA and 'def' for default standard MSA
--core-threshold CORE
Core-genome sample threshold (default=0.95)
--no-variants Do not call variants using SNP-sites after alignment.
--ignore-pseduogenes Ignore ORFs annotated as 'potential pseudogenes' in alignment
Misc. options:
--quiet suppress additional output
--threads THREADS Number of threads to use. [Default = 1]
--version, -v show program's version number and exit
```

## Citation

If you use this code, please cite:
3 changes: 3 additions & 0 deletions docs/installation.rst
@@ -1,5 +1,8 @@
Installation
============

ggCaller is available on Linux and macOS. If you are running Windows 10/11, Linux can be installed via the Windows Subsystem for Linux (`WSL <https://docs.microsoft.com/en-us/windows/wsl/install-win10>`_).

The easiest way to install is through conda, which will also install the
dependencies::
