Pull data directly from COG-UK Data #329

Open
joverlee521 opened this issue Jul 25, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

joverlee521 commented Jul 25, 2022

Context

There has been a significant drop-off in UK sequences in the NCBI data since ~April 2022 (the issue was originally raised in Slack):

[Image: genbank-uk]

Description

We can update the pipeline to pull metadata and sequences directly from COG-UK Data instead of waiting on them to submit to NCBI.

We would have to use the ena_sample.secondary_accession column in their accessions TSV to drop duplicates from GenBank via the BioSample accession.
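A minimal sketch of that de-duplication step, assuming the GenBank records are available as dictionaries with a BioSample field. The field name `biosample_accession`, the `central_sample_id` column, and both helper functions are hypothetical names used for illustration; only `ena_sample.secondary_accession` comes from the accessions TSV mentioned above.

```python
import csv
import io


def load_cog_uk_biosamples(accessions_tsv, column="ena_sample.secondary_accession"):
    """Collect the non-empty BioSample accessions listed in the COG-UK accessions TSV."""
    biosamples = set()
    for row in csv.DictReader(accessions_tsv, delimiter="\t"):
        value = (row.get(column) or "").strip()
        if value:
            biosamples.add(value)
    return biosamples


def drop_cog_uk_duplicates(genbank_records, cog_uk_biosamples,
                           biosample_field="biosample_accession"):
    """Yield only GenBank records whose BioSample accession is not in the COG-UK set."""
    for record in genbank_records:
        if record.get(biosample_field) not in cog_uk_biosamples:
            yield record


# Toy data with placeholder accessions; real inputs would be the downloaded files.
accessions = io.StringIO(
    "central_sample_id\tena_sample.secondary_accession\n"
    "sample-1\tSAMEA0000001\n"
    "sample-2\t\n"
)
genbank = [
    {"strain": "A", "biosample_accession": "SAMEA0000001"},
    {"strain": "B", "biosample_accession": "SAMN0000002"},
]
kept = list(drop_cog_uk_duplicates(genbank, load_cog_uk_biosamples(accessions)))
```

Records without a BioSample accession are kept, since there is nothing to match them against.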

joverlee521 added the enhancement label Jul 25, 2022
huddlej commented Jul 26, 2022

We discussed a couple of options to address this during triage:

  1. Reach out to the COG-UK group via Slack to see whether they plan to resume submitting to NCBI more regularly.
  2. Add COG-UK to ingest, which will require a way to convert the metadata and sequences to NDJSON before applying transforms.

@joverlee521 will continue working on the scripts for the latter and then revisit this issue.
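For reference, the NDJSON conversion in option 2 could look roughly like the sketch below. It is only an illustration under assumptions: the `sequence_name` join column and both function names are hypothetical, and the sequences are held in memory, which would not scale to the full COG-UK dataset without streaming or indexing.

```python
import csv
import io
import json


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_ndjson(metadata_csv, fasta, output, id_column="sequence_name"):
    """Write one JSON object per metadata row, attaching its sequence when one exists.

    Rows without a matching FASTA entry get "sequence": null.
    Returns the number of records written.
    """
    sequences = dict(read_fasta(fasta))  # in-memory lookup; fine for a sketch only
    count = 0
    for row in csv.DictReader(metadata_csv):
        row["sequence"] = sequences.get(row[id_column])
        output.write(json.dumps(row) + "\n")
        count += 1
    return count
```

Each output line is a standalone JSON record, so the existing transforms could consume it line by line.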

joverlee521 commented

Prompted by @corneliusroemer, this is my general idea of how to switch to directly pulling data from COG-UK instead of relying on their submissions to GenBank:

  1. Update the current patch of COG-UK data to remove all COG-UK records from the GenBank data. It will be less confusing if all COG-UK data comes from a single source instead of a mix of sources. I also think this is the best way to ensure that we do not have duplicate COG-UK records.

  2. Add a rule to fetch the COG-UK sequences. I think this should be the All sequence FASTA, since we do our own alignment and masking. (We already fetch the COG-UK metadata CSV.)

  3. The COG-UK metadata CSV is formatted differently from the GenBank data, so I think we can run it through its own transform pipeline with some combination of tsv-utils, csvtk, and/or the upcoming augur curate command. The resulting TSV + FASTA can then be appended to the GenBank files before upload to S3.
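As a rough illustration of step 3, here is a plain-Python sketch of the rename-and-append idea. It does not use tsv-utils, csvtk, or augur curate. The column mapping and function names are hypothetical, and the real mapping would depend on the actual COG-UK and GenBank column layouts.

```python
import csv
import io

# Hypothetical mapping from COG-UK CSV columns to the GenBank-style TSV columns.
COLUMN_MAP = {"sequence_name": "strain", "sample_date": "date", "country": "country"}


def transform_cog_uk(metadata_csv, column_map=COLUMN_MAP):
    """Yield COG-UK rows renamed to the target columns; unmapped columns are dropped."""
    for row in csv.DictReader(metadata_csv):
        yield {target: row.get(source, "") for source, target in column_map.items()}


def append_rows(genbank_tsv, cog_uk_rows, output):
    """Copy the GenBank TSV to output, then append COG-UK rows under the same header.

    Columns that the COG-UK rows lack are written as empty strings.
    """
    reader = csv.DictReader(genbank_tsv, delimiter="\t")
    writer = csv.DictWriter(
        output,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    for row in cog_uk_rows:
        writer.writerow(row)
```

Reusing the GenBank header as the output schema keeps the combined TSV consistent, at the cost of dropping any COG-UK-only columns.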

Status: Backlog