Running issue nextstrain pipeline #149

Open
sekhwal opened this issue Feb 15, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@sekhwal

sekhwal commented Feb 15, 2024

I am trying to run the Nextstrain pipeline. I followed the steps below, but I am getting an error.

Step 1. I downloaded the seasonal-flu repo, created a directory called "data/h3n2", and placed two FASTA files in it: 1. raw_sequences_ha.fasta (downloaded from the GISAID database) and 2. raw_ha.fasta (my query H3N2 HA FASTA file).

Step 2. I built the "seasonal-flu/profiles/gisaid/builds.yaml" file.

Step 3. I installed the Nextstrain CLI via conda and ran the following command.

nextstrain build . --configfile profiles/gisaid/builds.yaml --use-conda --conda-frontend mamba

I do not have a lat-longs: "config/lat_longs.tsv" file, so I am using the default, and I do not have a metadata file in the data/h3n2 folder.

##error
Building DAG of jobs...
/home/mmk6053/.nextstrain/runtimes/conda/env/bin/bash: line 1: conda: command not found
Traceback (most recent call last):
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/__init__.py", line 792, in snakemake
    success = workflow.execute(
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/workflow.py", line 1078, in execute
    dag.create_conda_envs(
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/dag.py", line 359, in create_conda_envs
    env.create(dryrun)
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 393, in create
    pin_file = self.pin_file
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/__init__.py", line 217, in __get__
    value = self.method(instance)
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 103, in pin_file
    f".{self.conda.platform}.pin.txt"
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/common/__init__.py", line 217, in __get__
    value = self.method(instance)
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 96, in conda
    return Conda(
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/deployment/conda.py", line 653, in __init__
    shell.check_output(self._get_cmd("conda info --json"), text=True)
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/snakemake/shell.py", line 61, in check_output
    return sp.check_output(cmd, shell=True, executable=executable, **kwargs)
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/mmk6053/.nextstrain/runtimes/conda/env/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127.

###builds.yaml
custom_rules:
  - profiles/gisaid/prepare_data.smk

metadata_fields:
  - Identifier
  - Collection_Date
  - Clade

renamed_metadata_fields:
  - strain
  - date
  - clade

lat-longs: "config/lat_longs.tsv"

segments:
  - ha

submission_date_field: date

recency:
  date_bins: [7, 30, 90]
  date_bin_labels: ["last week", "last month", "last quarter"]
  upper_bin_label: older

builds:
  "h3n2":
    lineage: h3n2
    reference: "config/h3n2/{segment}/reference.fasta"
    annotation: "config/h3n2/{segment}/genemap.gff"
    tree_exclude_sites: "config/h3n2/{segment}/exclude-sites.txt"
    clades: "config/h3n2/ha/clades.tsv"
    subclades: "config/h3n2/ha/subclades.tsv"
    auspice_config: "config/h3n2/auspice_config.json"
    enable_lbi: true
    enable_glycosylation: true
    subsamples:
      global:
        filters: "--group-by region year month --subsample-max-sequences 100"

@sekhwal sekhwal added the bug Something isn't working label Feb 15, 2024
@joverlee521 joverlee521 transferred this issue from nextstrain/nextstrain.org Feb 15, 2024
@joverlee521
Contributor

Sorry you ran into this error @sekhwal! Looks like we need to update the instructions in the Quickstart with GISAID data.

The --use-conda flag should not be used with the Nextstrain managed environments.
As @victorlin previously stated in a separate issue:

--use-conda falls under the advanced use case of managing Conda environments, which we recommend the ambient runtime for.

Please try running the workflow again with:

nextstrain build . --configfile profiles/gisaid/builds.yaml

@sekhwal
Author

sekhwal commented Feb 16, 2024

When I run the following command in the seasonal-flu repo, it runs and generates files in the auspice folder. But it seems it does not include the FASTA and metadata files that I provided. Please let me know how to run the pipeline with the real data that I downloaded from GISAID plus one sequence of my own.

I am using the Nextstrain pipeline in a conda environment ("conda activate clade_env"). I followed all the steps of "https://github.com/nextstrain/seasonal-flu", created the folder data/h3n2, and added metadata.xls and raw_sequences_ha.fasta.

nextstrain shell .
nextstrain build . --configfile profiles/gisaid/builds.yaml

nextstrain view auspice/

@joverlee521
Contributor

Just to clarify, did you add your own sequence to metadata.xls and raw_sequences_ha.fasta? Or is it in separate files? The workflow currently does not support multiple inputs, so you would have to manually add your own data to the GISAID data.

@sekhwal
Author

sekhwal commented Feb 16, 2024

Yes, I added my own sequence information to the metadata.xls that I downloaded from GISAID and its FASTA sequence to raw_sequences_ha.fasta, and placed these files in the data/h3n2 folder. When I run the pipeline, it generates "ha.fasta" and metadata.tsv in the data/h3n2 folder, plus the "auspice", "benchmarks", "builds" and "logs" directories. Then I run "nextstrain view auspice/", which renders the phylogenetic tree, but I cannot find my own sequence in the tree.
If possible, can I email you the metadata.xls file and the raw_sequences_ha.fasta?

@sekhwal
Author

sekhwal commented Feb 16, 2024

Do you think that my own sequence belongs to clade "3C.2a1b.2a.2b", and since none of the other sequences in the metadata that I downloaded from GISAID are from this clade, it is not appearing in the tree?

joverlee521 added a commit that referenced this issue Feb 16, 2024
The `nextstrain build` command should be the same regardless of which
runtime the user has installed as the default.

Snakemake's `--use-conda` flag also causes an error when used with the
Nextstrain managed Conda runtime as described in
#149
@joverlee521
Contributor

Ah, I see. Your sequences may be randomly removed during subsampling.

Subsampling parameters are defined in the profiles/gisaid/builds.yaml file:

subsamples:
  global:
    filters: "--group-by region year month --subsample-max-sequences 100"

These parameters are being used by augur filter within the workflow. You can read more about augur filter to determine if you want to change the subsampling parameters.
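To build intuition for what those filter options do, here is a toy Python sketch of grouped subsampling with an `--include`-style escape hatch. This is not augur filter's actual implementation (augur's real sampling logic is more sophisticated); all strain names and records below are made up for illustration.

```python
import random
from collections import defaultdict

def subsample(records, group_keys, max_sequences, include=frozenset(), seed=0):
    """Toy analogue of `augur filter --group-by ... --subsample-max-sequences`:
    bucket records by group, draw roughly evenly across groups up to the cap,
    and always keep strains listed in `include` (like --include)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in group_keys)].append(rec)
    forced = [r for r in records if r["strain"] in include]
    forced_names = {r["strain"] for r in forced}
    per_group = max(1, max_sequences // max(1, len(groups)))
    sampled = []
    for members in groups.values():
        chosen = rng.sample(members, min(per_group, len(members)))
        sampled.extend(r for r in chosen if r["strain"] not in forced_names)
    return forced + sampled[: max(0, max_sequences - len(forced))]

# Fabricated example records: 2 regions x 2 years x 6 months = 24 strains.
records = [
    {"strain": f"A/strain/{i}", "region": region, "year": year, "month": month}
    for i, (region, year, month) in enumerate(
        (r, y, m) for r in ["Asia", "Europe"] for y in [2021, 2022] for m in range(1, 7)
    )
]
kept = subsample(records, ["region", "year", "month"], 10, include={"A/strain/0"})
print(len(kept))  # -> 10 (capped), with A/strain/0 always present
```

The key point the toy captures: without the forced-include list, any individual strain can be dropped at random when a group has more members than the per-group quota.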


For your specific case of wanting to include your own sequences in the tree, it is possible to force sequences to be included regardless of subsampling parameters by using the --include flag.

  1. Create a text file that lists the sequences that you want to force-include, with a single strain name per line.
     This file must be added within the seasonal-flu directory; for example, you can add it to profiles/gisaid/include.txt.
  2. Edit the parameters in the profiles/gisaid/builds.yaml file to use your include.txt file:
     filters: "--group-by region year month --subsample-max-sequences 100 --include profiles/gisaid/include.txt"
  3. Re-run the workflow with the --forceall flag:
     nextstrain build . --configfile profiles/gisaid/builds.yaml --forceall

@sekhwal
Author

sekhwal commented Feb 16, 2024

Just a little more clarification: what should I put in include.txt, the FASTA sequence of my own file or the ID of my sequence that I added to raw_sequences_ha.fasta and metadata.xls? Is there a template for include.txt?

@joverlee521
Contributor

The include.txt should have a single sequence ID per line, matching the sequence ID used in the metadata and FASTA file.

@sekhwal
Author

sekhwal commented Feb 16, 2024

I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml

filters: "--group-by region year month --subsample-max-sequences 1000 --include profiles/gisaid/include.txt"

However, it is not showing in the tree.

##Here is my sequence header

A/Pennsylvania/01/2022 | EPI_ISL_170 | A / H3N2 | |3C.2a1b.2a.2b | | 2022-01-01

##metadata file screenshot
[screenshot attachment: 2024-02-16 16-14-42]

@sekhwal
Author

sekhwal commented Feb 16, 2024

Here is the clades_h3n2.txt log file. I think there might be a clade-mismatch issue?

clades_h3n2.txt

@joverlee521
Contributor

I added a single seq ID "EPI_ISL_170" in include.txt and updated the builds.yaml

Sorry, when I said the sequence ID, I mean the first part of the FASTA header.
You should add the strain name "A/Pennsylvania/01/2022" to include.txt.
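In other words, with a pipe-delimited header like the one shown above, the strain name is everything before the first `|`. A small Python sketch (assuming that header format, which is an assumption based on the header pasted in this thread) of extracting it, e.g. when building include.txt:

```python
def strain_name(header):
    """Extract the strain name (the first pipe-delimited field) from a
    GISAID-style FASTA header; assumes the pipe-delimited format shown above."""
    return header.lstrip(">").split("|")[0].strip()

header = ">A/Pennsylvania/01/2022 | EPI_ISL_170 | A / H3N2 | |3C.2a1b.2a.2b | | 2022-01-01"
print(strain_name(header))  # -> A/Pennsylvania/01/2022
```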

@sekhwal
Author

sekhwal commented Feb 16, 2024

Sorry, even after I add the strain name "A/Pennsylvania/01/2022" to include.txt, it shows the following error.

##error

[Fri Feb 16 16:54:04 2024]
Error in rule annotate_epiweeks:
jobid: 17
input: builds/h3n2/metadata.tsv
output: builds/h3n2/epiweeks.json
log: logs/annotate_epiweeks_h3n2.txt (check log file(s) for error details)
conda-env: /home/mmk6053/Dropbox/influenza_data/seasonal-flu/.snakemake/conda/b2dc7dab0f55bf6c5699067912b18fca_
shell:

    python3 scripts/calculate_epiweek.py             --metadata builds/h3n2/metadata.tsv             --output-node-data builds/h3n2/epiweeks.json 2>&1 | tee logs/annotate_epiweeks_h3n2.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Fri Feb 16 16:54:04 2024]
Finished job 18.
5 of 26 steps (19%) done
[Fri Feb 16 16:54:04 2024]
Finished job 9.
6 of 26 steps (23%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-02-16T165400.558340.snakemake.log

@joverlee521
Contributor

There should be more error details in the log file. Can you check the logs/annotate_epiweeks_h3n2.txt file?
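While checking the log, it can also help to scan the metadata for date values a script might choke on, since date parsing is a common cause of failures at this step. Below is a stdlib-only sketch; the inline TSV, the strain names, and the `date` column name are assumptions, so point a real check at your actual builds/h3n2/metadata.tsv and adjust the column name if it differs.

```python
import csv
import io
import re

# Hypothetical inline stand-in for builds/h3n2/metadata.tsv.
metadata_tsv = """strain\tdate
A/Pennsylvania/01/2022\t2022-01-01
A/Texas/50/2022\t2022-XX-XX
A/Somewhere/9/2021\tJan 2021
"""

FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")                   # e.g. 2022-01-01
AMBIGUOUS = re.compile(r"^\d{4}(-(\d{2}|XX))?(-(\d{2}|XX))?$")   # e.g. 2022-XX-XX

statuses = {}
for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
    d = row["date"]
    if FULL_DATE.match(d):
        statuses[row["strain"]] = "complete"
    elif AMBIGUOUS.match(d):
        statuses[row["strain"]] = "ambiguous"
    else:
        statuses[row["strain"]] = "UNPARSEABLE"
    print(row["strain"], d, statuses[row["strain"]])
```

Any row flagged UNPARSEABLE is a good first suspect when a metadata-consuming rule fails.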

@sekhwal
Author

sekhwal commented Feb 20, 2024

I tried running the pipeline in different ways, but it does not include my sequence in the tree. Also, when I filter the GISAID-downloaded sequences (e.g., keeping one sequence per clade) to reduce the number of sequences, the pipeline does not accept the reformatted file and generates an error. I think it is strictly designed to run on GISAID-formatted data only.

I will instead use multiple-sequence-alignment software such as MAFFT and then run TreeTime and Auspice.

@joverlee521
Contributor

It was nice to meet you in office hours today @sekhwal! In case you lose the link that @huddlej had shared during office hours, here's the example of the more complex subsampling scheme:

subsamples: &subsampling-scheme
  regions_except_europe:
    filters: --query "(passage_category != 'egg') & (region != 'Europe') & (ha == True) & (na == True)" --group-by region year month --subsample-max-sequences 2700 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg
    priorities: "titers"
  europe:
    filters: --query "(passage_category != 'egg') & (region == 'Europe') & (ha == True) & (na == True)" --group-by country year month --subsample-max-sequences 300 --min-date {min_date} --exclude {exclude} --exclude-where passage=egg
    priorities: "titers"
  references:
    filters: --query "(is_reference == True)" --min-date {reference_min_date} --exclude {exclude} --exclude-where passage=egg

@sekhwal
Author

sekhwal commented Feb 23, 2024

Thank you for sharing the sub-sampling scheme; I would like to discuss it at our meeting next week to understand it better. When I download the data from 2018 to 2023 from GISAID, it is a very large dataset. How can I use the sub-sampling scheme to filter the data down to a reasonable number of sequences for building a tree?

@sekhwal
Author

sekhwal commented Feb 23, 2024

Also, I tried running the seasonal-flu repo with include.txt, but I am still not able to find my sequence in the analysis. I am using filters: "--group-by region year month --subsample-max-sequences 3300 --include profiles/gisaid/include.txt" in builds.yaml.

@huddlej
Contributor

huddlej commented Mar 1, 2024

@sekhwal and I met today to walk through the workflow and figure out where his local sequence was getting dropped. We found that it was dropped at the "flag outliers" step because its collection date was later than expected by the defined clock rate. We added an include entry to the GISAID build config to keep his strain from getting added to those outliers. After this change, the strain got pruned by augur refine's clock filter. We figured that the sequence was dropped because it was a 2022 sequence in a tree with sequences collected up to December 2020, so the sequence would appear to be an outlier. We also realized that the collection date had an ambiguous month and day, so we modified the starting metadata to have a collection date of 2022-XX-XX. This change should force augur refine to infer the likely date of the sequence from the rest of the sequences in the tree.

@sekhwal is going to try the analysis again with more sequences from 2021 and 2022 so his sequence has more appropriate temporal context. Still, we may want to add an option to the workflow that allows users to disable the clock filter in augur refine.
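The date-masking step described above (the actual edit was done by hand in the metadata file) can be sketched like this; a minimal illustration assuming ISO-style yyyy-mm-dd input:

```python
def mask_month_day(date_str):
    """Rewrite a date like 2022-01-01 to the ambiguous form 2022-XX-XX,
    keeping only the year, so that `augur refine` infers the month and day
    from the rest of the sequences in the tree."""
    year = date_str.split("-")[0]
    return f"{year}-XX-XX"

print(mask_month_day("2022-01-01"))  # -> 2022-XX-XX
```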
