Surface Nextclade version + Nextclade dataset version in final metadata output #458

joverlee521 · 2024-07-08T21:53:05Z

Prompted by discussion in blab/forecasting project

Naively, we could include the Nextclade version and Nextclade dataset version in the output of join-metadata-and-clades.

However, if #457 is implemented, then the metadata comes from a single Nextclade version/dataset version. Then these versions should be surfaced through the file name or file metadata.

huddlej · 2024-07-08T22:39:20Z

One specific example of the data we want to communicate would be:

Nextclade version: 3.8.0
Nextclade dataset name: sars-cov-2 (a shortcut for the full name of nextstrain/sars-cov-2/wuhan-hu-1/orfs)
Nextclade dataset version: 2024-07-03--08-29-55Z

This would allow users to install the specific Nextclade software (e.g., conda install -c bioconda nextclade=3.8.0) and then download the specific dataset (e.g., nextclade dataset get -n sars-cov-2 --tag "2024-07-03--08-29-55Z").

One specific implementation of that idea could be a JSON file named like metadata_details.json (or something better) stored alongside the metadata in the S3 bucket with contents like:

{
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
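As a rough sketch of how a consumer might use such a details file, the Python below turns the three fields into the install and download commands mentioned above. The function name reproduce_commands is made up for illustration, and the commands simply mirror the ones quoted in this comment; a real invocation may need extra flags (for example, an output location for the dataset).

```python
import json
import shlex


def reproduce_commands(details_json: str) -> list[str]:
    """Build shell commands that recreate the Nextclade setup in a details file.

    Assumes the three keys proposed in this thread; the function itself is
    hypothetical and not part of any existing tool.
    """
    details = json.loads(details_json)
    install = (
        "conda install -c bioconda "
        f"nextclade={shlex.quote(details['nextclade_version'])}"
    )
    download = (
        "nextclade dataset get"
        f" -n {shlex.quote(details['nextclade_dataset_name'])}"
        f" --tag {shlex.quote(details['nextclade_dataset_version'])}"
    )
    return [install, download]


example = """
{
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z"
}
"""

for command in reproduce_commands(example):
    print(command)
```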

Another implementation could be storing that information in the file name of the metadata like metadata_nextclade-3.8.0_sars-cov-2_2024-07-03--08-29-55Z.tsv.xz.

Another approach would be to nest the metadata in a directory structure with the information like nextclade-3.8.0/sars-cov-2/2024-07-03--08-29-55Z/metadata.tsv.xz. This nested approach is how the nextclade_data repo is organized (e.g., one of the SARS-CoV-2 datasets).

One nice aspect of using an additional details file is that the metadata URI and the details URI would be stable. However, decoupling the metadata from the details could leave the two files inconsistent for some period of time during updates.

Encoding the information in the filename nicely couples the metadata contents with its version details. The format is more ambiguous to parse, but it isn't complicated once you know what the underscore-delimited fields are.
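To illustrate the parsing concern, here is a minimal Python sketch that splits the example filename into its three fields. It assumes the exact layout shown above (metadata_nextclade-VERSION_DATASET_TAG.tsv.xz) and that neither the dataset name nor the tag contains an underscore; the names MetadataVersion and parse_metadata_filename are hypothetical.

```python
import re
from typing import NamedTuple


class MetadataVersion(NamedTuple):
    nextclade_version: str
    dataset_name: str
    dataset_version: str


# Assumed layout: metadata_nextclade-<version>_<dataset name>_<dataset tag>.tsv.xz
# Fields are underscore-delimited, so no field may itself contain "_".
FILENAME_PATTERN = re.compile(
    r"^metadata_nextclade-(?P<nextclade_version>[^_]+)"
    r"_(?P<dataset_name>[^_]+)"
    r"_(?P<dataset_version>[^_]+)\.tsv\.xz$"
)


def parse_metadata_filename(filename: str) -> MetadataVersion:
    """Extract the version fields from a versioned metadata filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized metadata filename: {filename!r}")
    return MetadataVersion(**match.groupdict())


print(parse_metadata_filename(
    "metadata_nextclade-3.8.0_sars-cov-2_2024-07-03--08-29-55Z.tsv.xz"
))
```

The ambiguity shows up as soon as a dataset name contains an underscore, which this pattern rejects rather than guessing at the field boundaries.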

corneliusroemer · 2024-07-09T13:22:07Z

Storing version details of the cache in a separate file makes sense: it makes it easy to determine quickly whether the cache should automatically be invalidated.
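A minimal sketch of that invalidation check, assuming the cached and current version details are both available as dictionaries with the three keys from the JSON example earlier in the thread. The function cache_is_valid is hypothetical and is not a description of how the pipeline actually decides this.

```python
# Keys taken from the example details file earlier in this thread.
VERSION_KEYS = (
    "nextclade_version",
    "nextclade_dataset_name",
    "nextclade_dataset_version",
)


def cache_is_valid(cached: dict, current: dict) -> bool:
    """Return True only if every version field is present and unchanged.

    A missing field in the cached details is treated as a mismatch, so a
    cache without version details is always invalidated.
    """
    return all(
        cached.get(key) is not None and cached.get(key) == current.get(key)
        for key in VERSION_KEYS
    )


current = {
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z",
}
stale = dict(current, nextclade_version="3.7.0")

print(cache_is_valid(current, current))  # unchanged versions
print(cache_is_valid(stale, current))    # Nextclade version differs
```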

Regarding putting version info in metadata:

We already store Nextclade dataset version in the intermediary metadata, as an extra column, which helps us debug cache invalidation issues. So outputting that should be easy.
We should add an additional Nextclade version column to also capture that. Or maybe this is even in there already as well.

Embedding in the path makes sense as well if we want to avoid adding two or three columns that are essentially always identical. Though the overall overhead is small given the size of our existing rows and their compressibility.
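The compressibility point can be checked on synthetic data. The sketch below builds made-up rows (not real metadata), appends two constant version columns to each, and compares raw and xz-compressed sizes using Python's lzma module. The row contents and the function name compressed_sizes are assumptions for illustration only.

```python
import lzma
import random

# Two constant columns appended to every row (values from this thread).
CONSTANT_COLUMNS = "\t3.8.0\t2024-07-03--08-29-55Z"


def compressed_sizes(n_rows: int = 5000, seed: int = 0) -> tuple[int, int, int, int]:
    """Return (raw, raw_with_versions, xz, xz_with_versions) sizes in bytes.

    Rows are synthetic: a strain label plus a random hex string standing in
    for the per-record content of a real metadata file.
    """
    rng = random.Random(seed)
    rows = [f"strain{i}\t{rng.getrandbits(64):016x}" for i in range(n_rows)]
    plain = "\n".join(rows).encode()
    with_versions = "\n".join(row + CONSTANT_COLUMNS for row in rows).encode()
    return (
        len(plain),
        len(with_versions),
        len(lzma.compress(plain)),
        len(lzma.compress(with_versions)),
    )


raw, raw_versions, xz, xz_versions = compressed_sizes()
print(f"raw bytes added by version columns: {raw_versions - raw}")
print(f"xz bytes added by version columns:  {xz_versions - xz}")
```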

huddlej · 2024-07-18T22:20:52Z

I shared the three main options discussed in this issue with Evan Ray and folks from the forecasting hub group as follows:

1. put the Nextclade versions in the file names of the metadata on S3 (so file names will change with each new version but the contents remain the same)
2. create a separate JSON file that contains the Nextclade versions and that lives alongside the metadata in S3 (file names remain the same through time but you have to fetch both metadata and Nextclade versions files)
3. keep the version information as columns in the metadata on S3 (file names remain the same through time, you only have to fetch one file, but the metadata contains two extra columns for every record).

Evan replied that:

...option 2 on your list seems cleanest to us: it seems nice to store the date/timestamp separately in json rather than adding it as columns within the data file or encoding the information in the file name. Noting here that we would potentially want/need to pull past versions of that file from S3, but that is possible. Options 1 and 3 also seem viable though, with perhaps a slight preference for 1 over 3.

jameshadfield · 2024-07-18T22:26:09Z

Re: option 2, can we include a hash of the metadata file in the meta-metadata JSON, so one can ensure the files are matched up?
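One way this check could look, sketched in Python with only the standard library. The key name metadata_sha256 is an assumption made for this example; nothing in the thread fixes the field name or the hash algorithm.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def details_match_metadata(details: dict, metadata_path: str) -> bool:
    """Check that the details JSON describes this exact metadata file.

    "metadata_sha256" is a hypothetical key; a details file without it is
    treated as not matching.
    """
    expected = details.get("metadata_sha256")
    return expected is not None and expected == sha256_of_file(metadata_path)
```

A consumer would download both files, call details_match_metadata, and refetch if the two were caught mid-update.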

joverlee521 · 2024-07-18T22:44:48Z

If we don't want to track it in a separate file, the other option is to add it to the AWS S3 user defined object metadata.

We already use this to store the sha256sum in upload-to-s3, which can be accessed with

aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text

joverlee521 · 2024-07-19T22:37:36Z

To slightly complicate things, we are technically running Nextclade twice with different Nextclade datasets. We use the clade + QC metrics from the general SARS-CoV-2 dataset, but we spike in immune_escape + ace2_binding from the 21L dataset.

This is probably not a huge issue since we only care about the clade assignments.
However, if we want to keep track of all data provenance, the version JSON file might need to look something like

{
    "nextclade_version": "3.8.0",
    "nextclade_datasets": [
        {
            "name": "sars-cov-2",
            "version": "2024-07-03--08-29-55Z",
            "columns": [
                "clade_nextstrain",
                "clade_who",
                "Nextclade_pango",
                "missing_data",
                "divergence",
                "nonACGTN",
                "coverage",
                "rare_mutations",
                "reversion_mutations",
                "potential_contaminants",
                "QC_missing_data",
                "QC_mixed_sites",
                "QC_rare_mutations",
                "QC_snp_clusters",
                "QC_frame_shifts",
                "QC_stop_codons",
                "QC_overall_score",
                "QC_overall_status",
                "frame_shifts",
                "deletions",
                "insertions",
                "substitutions",
                "aaSubstitutions"
            ]
        },
        {
            "name": "sars-cov-2-21L",
            "version": "2024-07-03--08-29-55Z",
            "columns": [
                "immune_escape",
                "ace2_binding"
            ]
        }
    ]
}
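If the file did take this nested shape, a consumer could invert it into a per-column lookup. The Python below is a sketch under that assumption; column_provenance is a hypothetical helper, and the example document is shortened to a few columns.

```python
def column_provenance(version_doc: dict) -> dict[str, dict[str, str]]:
    """Map each metadata column to the Nextclade dataset that produced it.

    Assumes the nested "nextclade_datasets" layout proposed above. Raises if
    two datasets claim the same column, since provenance would be ambiguous.
    """
    provenance: dict[str, dict[str, str]] = {}
    for dataset in version_doc["nextclade_datasets"]:
        for column in dataset["columns"]:
            if column in provenance:
                raise ValueError(f"Column {column!r} is listed under two datasets")
            provenance[column] = {
                "name": dataset["name"],
                "version": dataset["version"],
            }
    return provenance


# Shortened version of the example above.
version_doc = {
    "nextclade_version": "3.8.0",
    "nextclade_datasets": [
        {
            "name": "sars-cov-2",
            "version": "2024-07-03--08-29-55Z",
            "columns": ["clade_nextstrain", "clade_who", "Nextclade_pango"],
        },
        {
            "name": "sars-cov-2-21L",
            "version": "2024-07-03--08-29-55Z",
            "columns": ["immune_escape", "ace2_binding"],
        },
    ],
}

print(column_provenance(version_doc)["immune_escape"]["name"])
```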

huddlej · 2024-07-23T16:47:50Z

@joverlee521 Good call. It would be slick and probably helpful in the long term to know which columns came from which Nextclade dataset. Like you mentioned, we only care about the clade assignments in the short term. If we stick with the simpler JSON format example above, your example here shows the need for a way to migrate the JSON schema over time. So maybe we at least need a schema version in the simple JSON like this?

{
    "json_schema_version": "v1",
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z"
}

corneliusroemer · 2024-07-23T16:58:07Z

You technically only need a schema for version 2, as version 1 can be defined implicitly as the schema with no explicit version.
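A reader that follows this implicit-version idea might look like the Python sketch below. It uses the json_schema_version key from the example above and treats a missing key as "v1". The "v2" branch assumes the nested nextclade_datasets layout proposed earlier in the thread; that pairing of version label and layout is purely hypothetical.

```python
def schema_version(doc: dict) -> str:
    """Return the declared schema version, defaulting to "v1" when absent."""
    return doc.get("json_schema_version", "v1")


def datasets(doc: dict) -> list[dict]:
    """Normalize either layout to a list of {"name", "version"} entries.

    "v1" is the flat layout; "v2" is assumed (for illustration only) to be
    the nested layout with a "nextclade_datasets" list.
    """
    version = schema_version(doc)
    if version == "v1":
        return [{
            "name": doc["nextclade_dataset_name"],
            "version": doc["nextclade_dataset_version"],
        }]
    if version == "v2":
        return [
            {"name": entry["name"], "version": entry["version"]}
            for entry in doc["nextclade_datasets"]
        ]
    raise ValueError(f"Unsupported schema version: {version!r}")


flat_without_version = {
    "nextclade_version": "3.8.0",
    "nextclade_dataset_name": "sars-cov-2",
    "nextclade_dataset_version": "2024-07-03--08-29-55Z",
}
print(schema_version(flat_without_version))
print(datasets(flat_without_version))
```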


genehack · 2024-07-23T17:03:57Z

So maybe we at least need a schema version in the simple JSON like this?

+1 for explicit version numbers 🙌

maybe consider calling it schema_version instead, to avoid confusion with JSON Schema?

joverlee521 · 2024-08-01T17:31:15Z

The public metadata version file is available at https://data.nextstrain.org/files/ncov/open/metadata_version.json

joverlee521 added the enhancement New feature or request label Jul 8, 2024

joverlee521 mentioned this issue Jul 24, 2024

Ignore cache if Nextclade or dataset version is different #466

Merged

1 task

joverlee521 self-assigned this Jul 26, 2024

joverlee521 mentioned this issue Jul 27, 2024

Surface Nextclade versions #467

Merged

1 task

joverlee521 closed this as completed in #467 Jul 31, 2024
