Surface Nextclade version + Nextclade dataset version in final metadata output #458

Closed
joverlee521 opened this issue Jul 8, 2024 · 10 comments · Fixed by #467

@joverlee521
Contributor

Prompted by discussion in blab/forecasting project

Naively, we could include the Nextclade version and Nextclade dataset version in the output of join-metadata-and-clades.

However, if #457 is implemented, the metadata will come from a single Nextclade version and dataset version. In that case, these versions should be surfaced through the file name or file metadata.

@joverlee521 added the enhancement (New feature or request) label on Jul 8, 2024
@huddlej
Contributor

huddlej commented Jul 8, 2024

One specific example of the data we want to communicate would be:

  • Nextclade version: 3.8.0
  • Nextclade dataset name: sars-cov-2 (a shortcut for the full name of nextstrain/sars-cov-2/wuhan-hu-1/orfs)
  • Nextclade dataset version: 2024-07-03--08-29-55Z

This would allow users to install the specific Nextclade software (e.g., conda install -c bioconda nextclade=3.8.0) and then download the specific dataset (e.g., nextclade dataset get -n sars-cov-2 --tag "2024-07-03--08-29-55Z").

One specific implementation of that idea could be a JSON file named like metadata_details.json (or something better) stored alongside the metadata in the S3 bucket with contents like:

{
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
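
As a rough sketch of how a consumer might use such a details file, the Python below parses the example contents and rebuilds the two commands quoted earlier in this comment. The helper name reproduction_commands is hypothetical, and the keys are assumed to match the example exactly.

```python
import json

# Example contents of the proposed details file, copied from the example above.
DETAILS_JSON = """
{
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
"""


def reproduction_commands(details: dict) -> list[str]:
    """Return shell commands that pin the Nextclade software and dataset.

    Hypothetical helper: the keys are assumed to match the example file.
    """
    version = details["nextclade_version"]
    name = details["nextclade_dataset_name"]
    tag = details["nextclade_dataset_version"]
    return [
        f"conda install -c bioconda nextclade={version}",
        f'nextclade dataset get -n {name} --tag "{tag}"',
    ]


commands = reproduction_commands(json.loads(DETAILS_JSON))
for command in commands:
    print(command)
```

The commands are only formatted as strings here, mirroring the ones written in this comment; nothing is executed.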

Another implementation could be storing that information in the file name of the metadata like metadata_nextclade-3.8.0_sars-cov-2_2024-07-03--08-29-55Z.tsv.xz.

Another approach would be to nest the metadata in a directory structure with the information like nextclade-3.8.0/sars-cov-2/2024-07-03--08-29-55Z/metadata.tsv.xz. This nested approach is how the nextclade_data repo is organized (e.g., one of the SARS-CoV-2 datasets).

One nice aspect of using an additional details file is that the metadata URI and the details URI would be stable. On the other hand, decoupling the metadata from the details could leave the two files inconsistent for some period of time during updates.

Encoding the information in the filename nicely couples the metadata contents with its version details. The format is more ambiguous to parse, but it isn't complicated once you know what the underscore-delimited fields are.
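
To illustrate the parsing side of the filename option, here is a minimal sketch that splits the example filename into its fields. The pattern and the function name parse_metadata_filename are assumptions based only on the single example above, not an agreed naming scheme.

```python
import re

# Assumed layout, inferred from the one example filename in this comment:
#   metadata_nextclade-<version>_<dataset name>_<dataset version>.tsv.xz
FILENAME_PATTERN = re.compile(
    r"^metadata_nextclade-(?P<nextclade_version>[^_]+)"
    r"_(?P<nextclade_dataset_name>[^_]+)"
    r"_(?P<nextclade_dataset_version>[^_]+)\.tsv\.xz$"
)


def parse_metadata_filename(filename: str) -> dict:
    """Split a versioned metadata filename into its underscore-delimited fields."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized metadata filename: {filename!r}")
    return match.groupdict()


parsed = parse_metadata_filename(
    "metadata_nextclade-3.8.0_sars-cov-2_2024-07-03--08-29-55Z.tsv.xz"
)
print(parsed)
```

This only works while none of the fields contain an underscore, which is part of the ambiguity mentioned above.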

@corneliusroemer
Member

Storing version details of the cache in a separate file makes sense: it makes it easy to determine quickly whether the cache should automatically be invalidated.

Regarding putting version info in metadata:

  • We already store Nextclade dataset version in the intermediary metadata, as an extra column, which helps us debug cache invalidation issues. So outputting that should be easy.
  • We should add an additional Nextclade version column to also capture that. Or maybe this is even in there already as well.

Embedding the versions in the path makes sense as well if we want to avoid adding two or three columns that are essentially always identical, though the overall overhead is small given the size of our existing rows and their compressibility.

@huddlej
Contributor

huddlej commented Jul 18, 2024

I shared the three main options discussed in this issue with Evan Ray and folks from the forecasting hub group as follows:

  1. put the Nextclade versions in the file names of the metadata on S3 (so file names will change with each new version but the contents remain the same)
  2. create a separate JSON file that contains the Nextclade versions and that lives alongside the metadata in S3 (file names remain the same through time but you have to fetch both metadata and Nextclade versions files)
  3. keep the version information as columns in the metadata on S3 (file names remain the same through time, you only have to fetch one file, but the metadata contains two extra columns for every record).

Evan replied that:

...option 2 on your list seems cleanest to us: it seems nice to store the date/timestamp separately in json rather than adding it as columns within the data file or encoding the information in the file name. Noting here that we would potentially want/need to pull past versions of that file from S3, but that is possible. Options 1 and 3 also seem viable though, with perhaps a slight preference for 1 over 3.

@jameshadfield
Member

Re: option 2, can we include a hash of the metadata file in the meta-metadata JSON, so one can ensure the files are matched up?
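
A minimal sketch of that idea, assuming a hypothetical metadata_sha256sum key in the JSON (the key name and helper names are illustrative, not something decided in this thread):

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large metadata files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_details(metadata_path: str, versions: dict) -> dict:
    # "metadata_sha256sum" is a hypothetical key name, not an agreed-upon field.
    return {**versions, "metadata_sha256sum": sha256_of_file(metadata_path)}


def metadata_matches_details(metadata_path: str, details: dict) -> bool:
    """Check that the metadata file is the one the details file was written for."""
    return sha256_of_file(metadata_path) == details["metadata_sha256sum"]


with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = os.path.join(tmp_dir, "metadata.tsv")
    with open(metadata_path, "w") as handle:
        handle.write("strain\tclade_nextstrain\n")

    details = build_details(metadata_path, {"nextclade_version": "3.8.0"})
    matches_before = metadata_matches_details(metadata_path, details)

    # Simulate the metadata being updated without refreshing the details file.
    with open(metadata_path, "a") as handle:
        handle.write("hypothetical-strain\t24A\n")
    matches_after = metadata_matches_details(metadata_path, details)

print(matches_before, matches_after)
```

A mismatch would tell a consumer that the two files were fetched mid-update and should be fetched again.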

@joverlee521
Contributor Author

If we don't want to track it in a separate file, the other option is to add it to the AWS S3 user-defined object metadata.

We already use this to store the sha256sum in upload-to-s3, which can be accessed with

aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text
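
For consumers working from Python rather than the AWS CLI, a rough equivalent might look like the sketch below. It assumes a boto3 S3 client is passed in; the stub client, bucket, and key are placeholders so the example runs without AWS credentials.

```python
from typing import Optional


def get_sha256sum(s3_client, bucket: str, key: str) -> Optional[str]:
    """Read the user-defined sha256sum metadata entry from an S3 object.

    Mirrors the head-object CLI call above. `s3_client` is expected to behave
    like boto3.client("s3"), whose head_object response exposes user-defined
    metadata under the "Metadata" key.
    """
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response.get("Metadata", {}).get("sha256sum")


class StubS3Client:
    """Stand-in for a real client, for illustration only."""

    def head_object(self, Bucket: str, Key: str) -> dict:
        # Placeholder value, not a real checksum.
        return {"Metadata": {"sha256sum": "placeholder-checksum"}}


# With real credentials this would be: get_sha256sum(boto3.client("s3"), bucket, key)
checksum = get_sha256sum(StubS3Client(), "example-bucket", "example/metadata.tsv")
print(checksum)
```

Version details stored the same way could be read by swapping the sha256sum key for whatever metadata key is chosen.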

@joverlee521
Contributor Author

To slightly complicate things, we are technically running Nextclade twice with different Nextclade datasets. We use the clade + QC metrics from the general SARS-CoV-2 dataset, but we spike in immune_escape + ace2_binding from the 21L dataset.

This is probably not a huge issue since we only care about the clade assignments.
However, if we want to keep track of all data provenance, the version JSON file might need to look something like

{
    "nextclade_version": "3.8.0",
    "nextclade_datasets": [
        {
            "name": "sars-cov-2",
            "version": "2024-07-03--08-29-55Z",
            "columns": [
                "clade_nextstrain",
                "clade_who",
                "Nextclade_pango",
                "missing_data",
                "divergence",
                "nonACGTN",
                "coverage",
                "rare_mutations",
                "reversion_mutations",
                "potential_contaminants",
                "QC_missing_data",
                "QC_mixed_sites",
                "QC_rare_mutations",
                "QC_snp_clusters",
                "QC_frame_shifts",
                "QC_stop_codons",
                "QC_overall_score",
                "QC_overall_status",
                "frame_shifts",
                "deletions",
                "insertions",
                "substitutions",
                "aaSubstitutions"
            ]
        },
        {
            "name": "sars-cov-2-21L",
            "version": "2024-07-03--08-29-55Z",
            "columns": [
                "immune_escape",
                "ace2_binding"
            ]
        }
    ]
}
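
If the file did take this nested shape, a consumer could look up where each metadata column came from. The sketch below uses an abbreviated copy of the example (only a few columns) and a hypothetical helper named column_provenance.

```python
# Abbreviated copy of the example above; most columns are omitted for brevity.
DETAILS = {
    "nextclade_version": "3.8.0",
    "nextclade_datasets": [
        {
            "name": "sars-cov-2",
            "version": "2024-07-03--08-29-55Z",
            "columns": ["clade_nextstrain", "clade_who", "Nextclade_pango"],
        },
        {
            "name": "sars-cov-2-21L",
            "version": "2024-07-03--08-29-55Z",
            "columns": ["immune_escape", "ace2_binding"],
        },
    ],
}


def column_provenance(details: dict) -> dict:
    """Map each metadata column to the Nextclade dataset that produced it."""
    provenance = {}
    for dataset in details["nextclade_datasets"]:
        for column in dataset["columns"]:
            provenance[column] = {
                "nextclade_version": details["nextclade_version"],
                "dataset_name": dataset["name"],
                "dataset_version": dataset["version"],
            }
    return provenance


provenance = column_provenance(DETAILS)
print(provenance["immune_escape"]["dataset_name"])
print(provenance["clade_nextstrain"]["dataset_name"])
```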

@huddlej
Contributor

huddlej commented Jul 23, 2024

@joverlee521 Good call. It would be slick and probably helpful in the long term to know which columns came from which Nextclade dataset. Like you mentioned, we only care about the clade assignments in the short term. If we stick with the simpler JSON format example above, your example here shows the need for a way to migrate the JSON schema over time. So maybe we at least need a schema version in the simple JSON like this?

{
    "json_schema_version": "v1",
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
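
One way a reader of this file might use the schema version is to dispatch on it and fail loudly on anything unrecognized. This is only a sketch: the function name load_details and the normalized output shape are assumptions, and only the "v1" layout shown above is handled.

```python
def load_details(raw: dict) -> dict:
    """Normalize a version details document based on its schema version.

    Only the simple "v1" layout is known here. A later schema (for example,
    one listing several datasets) would get its own branch.
    """
    schema_version = raw.get("json_schema_version")
    if schema_version == "v1":
        # Normalize to a list of datasets so callers need not change if a
        # future schema describes more than one dataset.
        return {
            "nextclade_version": raw["nextclade_version"],
            "nextclade_datasets": [
                {
                    "name": raw["nextclade_dataset_name"],
                    "version": raw["nextclade_dataset_version"],
                }
            ],
        }
    raise ValueError(f"Unsupported schema version: {schema_version!r}")


normalized = load_details(
    {
        "json_schema_version": "v1",
        "nextclade_version": "3.8.0",
        "nextclade_dataset_name": "sars-cov-2",
        "nextclade_dataset_version": "2024-07-03--08-29-55Z",
    }
)
print(normalized["nextclade_datasets"][0]["name"])
```

Raising on an unknown version means old consumers stop with a clear error instead of silently misreading a newer file.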

@corneliusroemer
Member

corneliusroemer commented Jul 23, 2024 via email

@genehack
Contributor

So maybe we at least need a schema version in the simple JSON like this?

+1 for explicit version numbers 🙌

maybe consider calling it schema_version instead, to avoid confusion with JSON Schema?

@joverlee521
Contributor Author

The public metadata version file is available at https://data.nextstrain.org/files/ncov/open/metadata_version.json
