Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add actions to fetch data from BV-BRC. #202

Merged
merged 21 commits into from
Oct 3, 2024

Conversation

VinzentRisch
Copy link
Contributor

@VinzentRisch VinzentRisch commented Aug 20, 2024

closes #200

  • Adds new action get_bv_brc_genomes to fetch genome sequences outputted as GenomeData[DNASequences] with corresponding taxonomy data as FeatureData[Taxonomy].

  • Adds new action get_bv_brc_metadata to fetch any type of metadata outputted as immutable metadata.

  • Adds new action get_bv_brc_genome_features to fetch genome feature sequences outputted as GenomeData[Genes] and GenomeData[Proteins]. Additionally taxonomy as FeatureData[Taxonomy] is provided and GenomeData[Loci]

  • All actions make use of the Data API by BV-BRC.

  • The data that should be downloaded can be queried with three options. First for all actions an RQL style query can be provided. Second by providing IDs/values and a corresponding data field, you can retrieve all data associated with those specific values in that data field. And as a third option a metadata column can be provided, to use metadata obtained with the action get-bv-brc-metadata as a new query.

Run it locally

  1. First, clone the repo and checkout the PR branch:
git clone https://github.com/bokulich-lab/rescript.git
cd RESCRIPt
git fetch origin pull/202/head:pr-202
git checkout pr-202
pip install -e .
  1. Test it out!
qiime rescript get-bv-brc-metadata --p-data-type genome --p-rql-query "in(genome_id,(224308.43,2030927.4755))" --output-dir metadata_rql
qiime rescript get-bv-brc-metadata --p-data-type genome --p-ids 224308.43 2030927.4755 --p-data-field "genome_id" --output-dir metadata_ids 
qiime rescript get-bv-brc-metadata --p-data-type genome --m-ids-metadata-file  metadata_ids/metadata.qza  --m-ids-metadata-column "species" --output-dir metadata_mcolumn

qiime rescript get-bv-brc-genomes --p-rql-query "in(genome_id,(224308.43,2030927.4755))" --output-dir genome_sequences_rql
qiime rescript get-bv-brc-genomes --p-ids 224308.43 2030927.4755 --p-data-field "genome_id" --output-dir genome_sequences_ids
qiime rescript get-bv-brc-genomes --m-ids-metadata-file  metadata_ids/metadata.qza  --m-ids-metadata-column genome_id --output-dir genomes_mcolumn

qiime rescript get-bv-brc-genome-features --p-rql-query "in(genome_id,(2030927.4755))" --output-dir features_rql
qiime rescript get-bv-brc-genome-features --p-ids 224308.43 2030927.4755 --p-data-field "genome_id" --output-dir features_ids
qiime rescript get-bv-brc-genome-features --m-ids-metadata-file  metadata_ids/metadata.qza  --m-ids-metadata-column genome_id --output-dir features_mcolumn

@mikerobeson
Copy link
Collaborator

mikerobeson commented Aug 20, 2024

Thanks for working on this @VinzentRisch! When I try and install this branch I obtain the following error:

from q2_types.genome_data import GenomeData, Loci, Proteins, Genes, DNASequence
ImportError: cannot import name 'DNASequence' from 'q2_types.genome_data' (/home/.../.conda/envs/qiime2-amplicon-2024.5-test/lib/python3.9/site-packages/q2_types/genome_data/init.py)

I suggest that you update / merge your master fork with the master branch of our repo. Then merge/sync your master to the your get_bv_brc_data branch. Then your PR should be good to review. This PR is currently 3 commits behind the current branch, and fails with the above error.

@VinzentRisch
Copy link
Contributor Author

Hi @mikerobeson
I updated the branch. Now it should be ready to review.

@mikerobeson
Copy link
Collaborator

mikerobeson commented Aug 21, 2024

I think there is still an issue. I can pull and install the plugin from head, but unable to run qiime --help. When I do still get:

>from q2_types.genome_data import GenomeData, Loci, Proteins, Genes, DNASequence
ImportError: cannot import name 'DNASequence' from 'q2_types.genome_data' (/Users/.../AnalysisSoftware/miniconda3/envs/qiime2-amplicon-2024.5-test/lib/python3.9/site-packages/q2_types/genome_data/init.py)

Never mind, I was mistakingly not running this in the developer version of QIIME 2. 🤦‍♂️

Testing it now. :-)

Copy link
Collaborator

@mikerobeson mikerobeson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start @VinzentRisch! I was able to run your example code via the dev version of QIIME 2.

There should be more content within the parameter_descriptions and descriptions. Also, it would be nice to have as part of your PR description, and the code help text / descriptions, the "use cases" for these actions. That is, what does a user do once they have these QZA files? How do they work with it? Perhaps consider constructing a RESCRIPt tutorial to eventually go along with this?

I'd also recommend changing the names of your actions so that they resemble the other actions:

  • get-bv-brc-genome-features
  • get-bv-brc-genomes
  • get-bv-brc-metadata
  • get-bv-brc-taxonomy

🤔 I wonder if it'd be simpler to combine to one action, with options for all the other downloadable items? Or combine a couple of the simple ones like metadata and taxonomy? I currently do not have an opinion on this as I've never used BV-BRC before, and do not know what people routinely use it for.... I mean I have some ideas but...

'genomes': 'genomes',
},
name='fetch genomes',
description="fetch genomes",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide a detailed description of what this action does. What is the use case for this? Can you provide an example?

In all of the descriptions please define and explain what BV-BRC stands for, etc... Explain what is this tool used for?

output_descriptions={
'metadata': 'metadata'},
name='Fetch BV-BCR metadata.',
description="Fetch BV-BCR metadata for a specific data type with an RQL "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, please provide a detailed description of what this action retrieves and what it outputs. What is the downstream use case for this? Can you provide an example?

rescript/plugin_setup.py Outdated Show resolved Hide resolved
rescript/plugin_setup.py Outdated Show resolved Hide resolved


plugin.methods.register_function(
function=fetch_taxonomy_bv_brc,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all of these new actions supposed to be used together for something? Nothing about these actions or their use case have been explained. What is the user expected to do with all of this output?

rescript/plugin_setup.py Outdated Show resolved Hide resolved
rescript/plugin_setup.py Outdated Show resolved Hide resolved

},
name='Fetch genome features from BV-BRC.',
description='Fetch DNA and protein sequences of genome features from '
Copy link
Collaborator

@mikerobeson mikerobeson Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should explicitly describe the difference between fetch-genomes-bv-brc and fetch-genome-features-bv-brc. For example, a user might be confused as to why they need a genome if they have the features? Or vice versa?

Each action description should contain enough information so that the user understands why they are using a given action. Even better if the user knows how that data can be used downstream. See the other RESCRIPt code help text / examples.

@VinzentRisch
Copy link
Contributor Author

Hi @mikerobeson

  • So I now removed the taxonomy action and added taxonomy outputs to the actions for genomes and genome features. I think having three actions makes it easy enough but still flexible to be specific with data querying.
  • I added more options to query the data more effectively.
  • I have not yet added usage examples to the descriptions. I will come up with some examples on Monday during our dev meeting. Because I am not so sure myself what all can be done with this data in Qiime.

@mikerobeson
Copy link
Collaborator

Hi @VinzentRisch, I apologize for taking a while to get to this. 😬

I really like to improvements. I was able to run everything w/o issue. Again, as I've never really used this tool, I'm not sure what the next steps are with the QZA outputs. Once I have an idea of what the "final" steps are I should be able to finalize my review. Other than that, nothing really stands out to me to fix.. at least for the moment. :-)

@nbokulich
Copy link
Collaborator

Hey @mikerobeson thank you for reviewing this! This is an action with yet-unrealized potential for downstream re-use. The output types are the same as used in moshpit, so these outputs could be used, e.g., to create a reference genome catalog or to perform a comparative genomics analyses. But I think new actions and transformers are needed to really utilize these data, as the types are very new. So it's a chicken-vs-egg type problem, and @VinzentRisch has just created a chicken. Next I think we need to figure out how to lay some eggs...

@mikerobeson
Copy link
Collaborator

Thanks for that @nbokulich! I am okay to merge this as it is, if you think it is ready. I've nothing more to add. 👍

@mikerobeson mikerobeson merged commit 58a0ef6 into bokulich-lab:master Oct 3, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add support to fetch data from BV-BRC.
3 participants