Reporting #204

matinnuhamunada · 2022-10-12T14:36:05Z

matinnuhamunada
Oct 12, 2022
Maintainer

To do list:

- Drop unused rules
- Generate report for important rules
- Categorize rules into subgroups based on data units
- Remove the all servers

OmkarSaMo · 2022-10-13T11:09:01Z

OmkarSaMo
Oct 13, 2022
Maintainer

I recommend restructuring the reports to make them more appealing and user-friendly. I suggest a data-unit-based structure for the pages: each page can have multiple sections, one for each of the relevant rules.

For example, the Genomes page can have a BiGSCAPE section:

- If bigscape is True, show a table with Genome ID as the index and #Known GCFs, #Unknown GCFs, #Unique GCFs, etc. as columns. Additionally, add a column for each BiGSCAPE class giving the number of BGCs of that class per genome.
- If bigscape is False, show a message saying that the table is currently empty because the bigscape rule was False.
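A minimal sketch of this conditional section, assuming the rule flags are available as a plain dict. The names `config` and `bigscape_section` and the genome ID in the usage note are hypothetical, not taken from the BGCFlow code.

```python
# Hypothetical message shown when the rule was not run.
EMPTY_MESSAGE = "This table is currently empty because the bigscape rule was False."

# Columns proposed in the thread; per-class BGC columns could be appended here.
HEADER = ["Genome ID", "#Known GCFs", "#Unknown GCFs", "#Unique GCFs"]


def bigscape_section(config, rows):
    """Return the markdown lines of the BiGSCAPE section of the Genomes page.

    `config` maps rule names to True/False (an assumption about how the
    flags are stored); `rows` is a list of dicts keyed by the HEADER names.
    """
    if not config.get("bigscape", False):
        return [EMPTY_MESSAGE]
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "| " + " | ".join("---" for _ in HEADER) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[name]) for name in HEADER) + " |")
    return lines
```

Calling `bigscape_section({"bigscape": False}, [])` returns only the message, while a True flag with one row for a genome such as `G1` returns a three-line markdown table.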

Classifying rules to Data units

| Rules | Data units |
| --- | --- |
| seqfu | Genomes |
| gtdbtk | Genomes |
| prokka-gbk | Genomes |
| antismash | Genomes, BGCs |
| bigscape | Genomes, BGCs, GCFs |
| query-bigslice | Genomes, BGCs, GCFs |
| mash | Genomes, Pangenome |
| automlst-wrapper | Pangenome |
| roary | Genomes, Pangenome |
| eggnog-roary | Pangenome |
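To see which sections each page would contain, the classification can be inverted from rule-to-data-units into data-unit-to-rules. This is only a sketch: the dict and the function name `rules_per_unit` are hypothetical, and the real categories might live in a config file instead.

```python
from collections import defaultdict

# Rule -> data units, copied from the classification proposed in this thread.
RULE_TO_UNITS = {
    "seqfu": ["Genomes"],
    "gtdbtk": ["Genomes"],
    "prokka-gbk": ["Genomes"],
    "antismash": ["Genomes", "BGCs"],
    "bigscape": ["Genomes", "BGCs", "GCFs"],
    "query-bigslice": ["Genomes", "BGCs", "GCFs"],
    "mash": ["Genomes", "Pangenome"],
    "automlst-wrapper": ["Pangenome"],
    "roary": ["Genomes", "Pangenome"],
    "eggnog-roary": ["Pangenome"],
}


def rules_per_unit(rule_to_units):
    """Invert the mapping so each data unit lists the rules that feed its page."""
    unit_to_rules = defaultdict(list)
    for rule, units in rule_to_units.items():
        for unit in units:
            unit_to_rules[unit].append(rule)
    return dict(unit_to_rules)


UNIT_TO_RULES = rules_per_unit(RULE_TO_UNITS)
```

With this classification the Genomes page would draw on eight rules, while the GCFs page would only have sections for bigscape and query-bigslice.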

3 replies

matinnuhamunada Oct 13, 2022
Maintainer Author

I think we should strictly assign each rule to exactly one group?

Genomes:
  -  gtdbtk
  -  prokka-gbk
  -  antismash
  ...

OmkarSaMo Oct 13, 2022
Maintainer

The reports (tables and figures) will be unique to each combination of rule and data group.

For example:

- Genomes:antismash will have info on BGCs per genome
- BGCs:antismash will have info on each BGC from antismash output
- Genomes:BiGSCAPE will have info on known GCFs, unique GCFs per genome
- BGCs:BiGSCAPE will have info on GCF assignment of BGCs
- GCFs:BiGSCAPE will have info on number of BGCs in GCF

OmkarSaMo Oct 13, 2022
Maintainer

Let me know if this is too complex either for coding or for user understanding.

OmkarSaMo · 2022-10-13T12:28:59Z

OmkarSaMo
Oct 13, 2022
Maintainer

`Genomes` report:

Description

This page will include data and results from various analyses per genome. By default, the page will provide annotation statistics and the fasta and genbank files for download.

Rules to consider for enriched reports:

prokka-gbk, ncbi (not a rule but default?), seqfu, gtdbtk, mash, antismash, bigscape, query-bigslice, roary

Sections are included in the Genomes report page depending on whether the corresponding rule was True. If a rule was not True, print a default message saying that running the rule will provide more information.

The following are the details of each report when the specific rule was True.

`prokka-gbk`

Title

Genome annotation

Description of the report

Provide information on the version of prokka used to annotate the genomes, along with the command. Provide the prokka-db table of reference genomes used for annotation.
Explain that users can download the fasta (.fna) and genbank (.gbk) files by clicking the links in the table.

Figures

Histogram of the number of locus tags per genome

Tables

| Genome ID | Organism name | GTDB species | Fasta | Genbank | Genome Length | CDSs | rRNAs | tRNAs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |

Use the file located at `data/interim/prokka/{genome_id}/{genome_id}.txt` to collect the above values. Is it easy to add GTDB species here?
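A rough sketch of collecting those values, assuming the prokka `.txt` summary is a list of `key: value` lines (for example `bases`, `CDS`, `rRNA`, `tRNA`). The function name `parse_prokka_summary` is hypothetical and the numbers in the example are made up for illustration.

```python
def parse_prokka_summary(text):
    """Parse 'key: value' lines into a dict, converting whole numbers to int."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        value = value.strip()
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats


# Illustrative content only; real files would be read from
# data/interim/prokka/{genome_id}/{genome_id}.txt
example = """organism: Genus species strain
contigs: 3
bases: 8500000
CDS: 7400
rRNA: 18
tRNA: 66
"""

row = parse_prokka_summary(example)
```

Each parsed dict could then become one row of the table, with `bases` feeding Genome Length and `CDS`, `rRNA`, `tRNA` feeding the count columns.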

`ncbi`

Title

Strain metadata from NCBI

Description of the report

Provide information on NCBI metadata

Figures

Pie chart of the number of genomes at each of the four NCBI assembly levels

Tables

| Genome ID | Organism name | GTDB species | Assembly level | BioProject | BioSample | Date | Isolation source | Isolation country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |

Use the file located at `data/processed/{project}/tables/df_ncbi_meta.csv` to collect the above values.
Can we add links to NCBI data?
Consider adding strain isolation source and country information from NCBI datasets, inspired by https://github.com/NBChub/BGC_analytics/blob/main/notebooks/15_ncbi_datasets_meta.ipynb (the code needs to be converted into a rule in snakemake).

`seqfu`

Title

Genome assembly statistics

Description of the report

Provide information on the version of seqfu used to assess the assembly statistics

Figures

Scatter plot with genome assembly statistics

Tables

| Genome ID | Organism name | GTDB species | Genome length | GC content | Contigs | N50 | N90 | AuN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |

Use the file located at data/processed/{project}/tables/df_seqfu.csv to collect the above values.
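For readers of the report, it may help to spell out what the N50, N90 and AuN columns mean. The sketch below only illustrates the standard definitions on a toy list of contig lengths; it is not how seqfu computes them, and the function names are hypothetical.

```python
def nx(lengths, percent):
    """Nx statistic: length of the contig at which the running sum of
    contig lengths (longest first) reaches `percent` % of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs given")


def aun(lengths):
    """Area under the Nx curve: sum of squared lengths over total length."""
    return sum(length * length for length in lengths) / sum(lengths)


contigs = [50, 30, 10, 10]  # toy contig lengths, total 100
n50 = nx(contigs, 50)       # the 50 bp contig alone covers half the assembly
n90 = nx(contigs, 90)       # 50 + 30 + 10 are needed to reach 90 %
area = aun(contigs)         # (2500 + 900 + 100 + 100) / 100
```

For this toy assembly N50 is 50, N90 is 10 and AuN is 36.0.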

`gtdbtk`

Title

Taxonomic classification

Description of the report

Provide information on the version of gtdbtk and database used among other details

Figures

Two pie charts with the number of genomes per genus and species (take the top 10 if too many)
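A small sketch of the "top 10" selection for the pie charts. Collapsing the remaining labels into an "Other" slice is my assumption and can be dropped if only the top entries should be shown; the function name and the genus labels are illustrative.

```python
from collections import Counter


def top_n_with_other(labels, n=10):
    """Count labels, keep the n most common, and sum the rest as 'Other'."""
    counts = Counter(labels)
    top = counts.most_common(n)
    other = sum(counts.values()) - sum(count for _, count in top)
    if other:
        top.append(("Other", other))
    return top


# Illustrative genus labels, one per genome.
genera = ["Streptomyces"] * 5 + ["Kitasatospora"] * 2 + ["Nocardia"]
slices = top_n_with_other(genera, n=2)
```

Here `slices` holds the two most frequent genera followed by a single "Other" entry for the remaining genome.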

Tables

| Genome ID | Organism name | Family | Genus | Species |
| --- | --- | --- | --- | --- |

Use the file located at data/processed/{project}/tables/df_gtdb_meta.csv to collect the above values.

TO ADD

`mash`

`antismash`

`bigscape`

`query-bigslice`

`roary`

0 replies

OmkarSaMo · 2022-10-14T08:05:48Z

OmkarSaMo
Oct 14, 2022
Maintainer

`BGCs` report:

Description

This page will include data and results from various analyses per BGC. By default, the page will provide BGC statistics and a genbank file per BGC for download.

Rules to consider for enriched reports:

antismash, bigscape, query-bigslice

Sections are included in the BGCs report page depending on whether the corresponding rule was True. If a rule was not True, print a default message saying that running the rule will provide more information.

The following are the details of each report when the specific rule was True.

`antismash`

Title

Detected BGCs

Description of the report

Provide information on the version of antismash used to mine the genomes, along with the command.
Explain that users can download the genbank (.gbk) files by clicking the links in the table.

Figures

TBD

Tables

| BGC ID | Type | Contig Edge | Biosynthetic genes | Genome ID | Organism name | GTDB species | Genbank | Halogenase | Oxygenase | Glycosylase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

Use the BGC genbank files to extract column metadata

`bigscape`

Title

GCF assignment

Description of the report

Provide information on the version of bigscape used to mine the genomes, along with the command and parameters.
GCF IDs are defined differently than in the BiGSCAPE software. Here, we assign each connected network (connected component) a separate GCF ID.

Figures

TBD

Tables

Choose the BiGSCAPE cut-off for raw distance (default 0.30).

| BGC ID | Type | BiGSCAPE Class | GCF ID | GCF type | Known compounds | MIBIG ID | Contig Edge | Genome ID | Organism name | GTDB species |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |

Use the table `df_bgcs.csv` from the cytoscape output.
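To illustrate "each connected network gets its own GCF ID", here is a minimal union-find sketch over a BGC similarity network. It assumes the edges come as `(bgc_a, bgc_b, raw_distance)` tuples, that every BGC in an edge is also listed in `bgc_ids`, and that an edge counts when its distance is at or below the cut-off. The function name is hypothetical and this is not necessarily how the existing code does it.

```python
def assign_gcf_ids(bgc_ids, edges, cutoff=0.30):
    """Give every connected component of the network its own integer GCF ID."""
    parent = {bgc: bgc for bgc in bgc_ids}

    def find(node):
        # Follow parent links to the component root, shortening the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the components joined by every edge that passes the cut-off.
    for bgc_a, bgc_b, distance in edges:
        if distance <= cutoff:
            parent[find(bgc_a)] = find(bgc_b)

    # Number the components in the order their first BGC appears.
    gcf_of_root = {}
    assignment = {}
    for bgc in bgc_ids:
        root = find(bgc)
        if root not in gcf_of_root:
            gcf_of_root[root] = len(gcf_of_root) + 1
        assignment[bgc] = gcf_of_root[root]
    return assignment


# Toy network: A-B-C are linked below the cut-off, D is only linked at 0.6.
gcfs = assign_gcf_ids(
    ["A", "B", "C", "D"],
    [("A", "B", 0.10), ("B", "C", 0.25), ("C", "D", 0.60)],
)
```

In the toy network A, B and C end up in GCF 1 even though A and C share no direct edge, and D becomes a singleton GCF 2.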

TO ADD

`query-bigslice`

0 replies

OmkarSaMo · 2022-10-14T09:12:21Z

OmkarSaMo
Oct 14, 2022
Maintainer

`GCFs` report:

Description

This page will include data and results from various analyses per GCF. By default, the page will provide GCF statistics.

Rules to consider for enriched reports:

bigscape, query-bigslice, mash

`bigscape`

Title

GCF assignment

Description of the report

Provide information on the version of bigscape used to mine the genomes, along with the command and parameters.
GCF IDs are defined differently than in the BiGSCAPE software. Here, we assign each connected network (connected component) a separate GCF ID.

Figures

TBD

Tables

Choose the BiGSCAPE cut-off for raw distance (default 0.30).

| GCF ID | BiGSCAPE Class | GCF type | Known compounds | MIBIG ID | BGCs | Incomplete BGCs(%) | Genomes |
| --- | --- | --- | --- | --- | --- | --- | --- |

Use the table `df_families.csv` from the cytoscape output.
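A sketch of how the BGCs, Incomplete BGCs(%) and Genomes columns could be derived from per-BGC rows. It assumes that "incomplete" means the BGC lies on a contig edge, and the keys `gcf_id`, `genome_id` and `contig_edge` are hypothetical names rather than the actual columns of the cytoscape tables.

```python
from collections import defaultdict


def summarize_gcfs(bgc_rows):
    """Aggregate per-BGC rows into one summary dict per GCF."""
    groups = defaultdict(list)
    for row in bgc_rows:
        groups[row["gcf_id"]].append(row)

    summary = {}
    for gcf_id, rows in groups.items():
        # Assumption: a BGC on a contig edge is counted as incomplete.
        incomplete = sum(1 for row in rows if row["contig_edge"])
        summary[gcf_id] = {
            "BGCs": len(rows),
            "Incomplete BGCs(%)": round(100 * incomplete / len(rows), 1),
            "Genomes": len({row["genome_id"] for row in rows}),
        }
    return summary


# Toy rows for illustration only.
toy_rows = [
    {"gcf_id": 1, "genome_id": "g1", "contig_edge": False},
    {"gcf_id": 1, "genome_id": "g1", "contig_edge": True},
    {"gcf_id": 1, "genome_id": "g2", "contig_edge": False},
    {"gcf_id": 2, "genome_id": "g2", "contig_edge": True},
]
gcf_table = summarize_gcfs(toy_rows)
```

For the toy rows, GCF 1 has three BGCs from two genomes with one on a contig edge, and GCF 2 is a single incomplete BGC.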

TO ADD

`query-bigslice`

`mash`

0 replies

OmkarSaMo · 2022-10-14T09:19:58Z

OmkarSaMo
Oct 14, 2022
Maintainer

`Pangenome` report:

Description

This page will include data and results from various analyses for the entire pangenome of the project. By default, the page will provide pangenome and mash statistics.

Rules to consider for enriched reports:

mash, automlst-wrapper, roary, eggnog-roary

`automlst-wrapper`

Title

Phylogenetic tree

Description of the report

Provide information on the version of automlst used, along with the command and parameters. Mention that 30 core genes were used for MLST.

Figures

Tree visualization based on the R notebook

Use the automlst tree (midpoint rooted)

`mash`

Title

MASH based phylogroups

Description of the report

Provide information on the version of mash used, along with the command and parameters. Cite the E. coli mash paper and describe how the clustering was carried out.
The report is further enriched with other rules such as roary and bigscape.
We report the number of genes in the core, pan and specific genomes of each phylogroup.
We report the number of total and specific GCFs in each phylogroup.

Figures

Clustermap with color codes per phylogroup

Tables

| Phylogroup ID | Genomes | Core genome | GCFs | Specific GCFs | Pangenome | Coregenome | Specific genome |
| --- | --- | --- | --- | --- | --- | --- | --- |

Use the mash, cytoscape and roary outputs.
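A sketch of the per-phylogroup gene counts using set operations. It assumes a gene presence set per genome (for example derived from the roary output) and defines "specific" as genes found in the phylogroup but in no genome outside it, which is my reading of the column. All names are hypothetical, and both dicts are expected to have the same genome IDs as keys.

```python
def phylogroup_gene_stats(genes_by_genome, phylogroup_of):
    """Count pan, core and specific genes for every phylogroup."""
    stats = {}
    for group in sorted(set(phylogroup_of.values())):
        members = [g for g in genes_by_genome if phylogroup_of[g] == group]
        inside = [genes_by_genome[g] for g in members]
        outside = [
            genes_by_genome[g] for g in genes_by_genome if phylogroup_of[g] != group
        ]
        pan = set().union(*inside)               # genes in any member genome
        core = set.intersection(*inside)         # genes in every member genome
        specific = pan - set().union(*outside)   # genes absent outside the group
        stats[group] = {
            "Genomes": len(members),
            "Pangenome": len(pan),
            "Core genome": len(core),
            "Specific genome": len(specific),
        }
    return stats


# Toy data: three genomes in two phylogroups.
toy_genes = {"g1": {"a", "b", "c"}, "g2": {"a", "b", "d"}, "g3": {"a", "e"}}
toy_groups = {"g1": "P1", "g2": "P1", "g3": "P2"}
phylo_table = phylogroup_gene_stats(toy_genes, toy_groups)
```

The same pattern would apply to the GCFs and Specific GCFs columns by swapping gene sets for the sets of GCF IDs per genome.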

TO ADD

`roary`

`eggnog-roary`

0 replies

matinnuhamunada · 2022-10-17T10:45:23Z

matinnuhamunada
Oct 17, 2022
Maintainer Author

Some coding elements that determine the reports:

BGCFlow snakemake:

Assign the rule category in workflow/rules.yaml. This might be subject to change or be dropped.
Extend reports in:
- workflow/Report (Snakefile)
- workflow/rules/report.smk

BGCFlow wrapper:

Index template: /home/bgcflow_admin/user_home/testdir/bgcflow_wrapper/bgcflow_wrapper/mkdocs.py. This generates the mkdocs.yaml in the report folder

0 replies

OmkarSaMo · 2022-10-17T15:27:19Z

OmkarSaMo
Oct 17, 2022
Maintainer

The more I think about it, I am not sure if this restructuring is necessary. Let's discuss this in person later

0 replies
