Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download_ena fails with multiple r1 and r2 fqs per sample #69

Open
kriemo opened this issue Nov 11, 2022 · 2 comments
Open

download_ena fails with multiple r1 and r2 fqs per sample #69

kriemo opened this issue Nov 11, 2022 · 2 comments

Comments

@kriemo
Copy link
Member

kriemo commented Nov 11, 2022

If there are multiple r1 and r2 input fastqs per sample the download_ena rule will be called, even if they exist. This will attempt to generate a single r1 and r2 fastq by downloading the sample wildcard from ENA, which will fail.

e.g.

$ snakemake -npr --configfile config.yaml --config DATA=sample_data R1_SAMPLES=test R2_SAMPLES=test FASTQS=sample_fastqs.tsv
['results/counts/test_R1_counts.tsv.gz', 'results/test/test_R1_Aligned.sortedByCoord.out.bam', 'results/bed/test_R1.bed.gz', 'results/counts/test_R2_counts.tsv.gz', 'results/test/test_R2_Aligned.sortedByCoord.out.bam', 'results/bed/test_R2.bed.gz', 'sample_data/test_R1.fastq.gz', 'sample_data/test_R2.fastq.gz']

rule cutadapt_paired:
    output: results/cutadapt/test_trimmed.R1.fastq.gz, results/cutadapt/test_trimmed.R2.fastq.gz
    log: results/logs/test_cutadapt.out
    jobid: 9
    reason: Missing output files: results/cutadapt/test_trimmed.R1.fastq.gz, results/cutadapt/test_trimmed.R2.fastq.gz
    wildcards: results=results, sample=test


      if [ -z "" ] ; then
        echo "no additional trimming"
        cutadapt -j 24          -a TSO_R1=CCCATGTACTCTGCGTTGATACCACTGCTT          -a TruSeq_R1=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA          -A polya_R2=A{30}          -G TSO_R2=XAAGCAGTGGTATCAACGCAGAGTACATGGG          -n 4          -m 75:17          --nextseq-trim=20          -o results/cutadapt/test_trimmed.R1.fastq.gz          -p results/cutadapt/test_trimmed.R2.fastq.gz          <(zcat )          <(zcat $(echo  | sed 's/_R1/_R2/g'))
      else
        echo "removing internal seq..."
        cutadapt --discard-untrimmed --info-file results/cutadapt/temp_test.info -o /dev/null -p results/cutadapt/temp_test_trimmed.R2.fastq.gz -g   <(zcat $(echo  | sed 's/_R1/_R2/g'))
        echo "stitching R1 FASTQ back together..."
        python3 inst/scripts/cut_paste_fastq.py -i results/cutadapt/temp_test.info -o results/cutadapt/temp_test_trimmed.R1.fastq.gz
        echo "2nd round of trimming..."
        cutadapt -j 24          -a TSO_R1=CCCATGTACTCTGCGTTGATACCACTGCTT          -a TruSeq_R1=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA          -A polya_R2=A{30}          -G TSO_R2=XAAGCAGTGGTATCAACGCAGAGTACATGGG          -n 4          -m 75:17          --nextseq-trim=20          -o results/cutadapt/test_trimmed.R1.fastq.gz          -p results/cutadapt/test_trimmed.R2.fastq.gz          <(zcat results/cutadapt/temp_test_trimmed.R1.fastq.gz)          <(zcat results/cutadapt/temp_test_trimmed.R2.fastq.gz)
        rm results/cutadapt/temp_test.info
      fi


rule starsolo_R1:
    input: results/cutadapt/test_trimmed.R1.fastq.gz, results/cutadapt/test_trimmed.R2.fastq.gz
    output: results/test/test_R1_Aligned.sortedByCoord.out.bam
    log: results/logs/test_star_R1.out
    jobid: 4
    reason: Missing output files: results/test/test_R1_Aligned.sortedByCoord.out.bam; Input files updated by another job: results/cutadapt/test_trimmed.R1.fastq.gz, results/cutadapt/test_trimmed.R2.fastq.gz
    wildcards: results=results, sample=test
    resources: total_impact=5


      STAR --soloType CB_UMI_Simple       --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts       --soloUMIfiltering MultiGeneUMI       --soloUMIdedup 1MM_Directional       --readFilesCommand gunzip -c       --runThreadN 12       --soloCBwhitelist whitelist/737K-august-2016.txt --soloUMIlen 10 --clip5pNbases 56 0 --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17       --soloBarcodeMate 1       --outFilterMultimapNmax 1       --outFilterMismatchNmax 999       --outFilterMismatchNoverReadLmax 0.2       --genomeDir index/cr2020A_star       --soloFeatures Gene       --alignMatesGapMax 100000       --alignIntronMax 100000       --soloCellFilter None         --outFileNamePrefix results/test/test_R1_       --outSAMtype BAM SortedByCoordinate       --limitBAMsortRAM 48000000000       --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM       --readFilesIn results/cutadapt/test_trimmed.R1.fastq.gz results/cutadapt/test_trimmed.R2.fastq.gz

      rm -rf results/test/test_R1_Solo.out


rule starsolo_R2:
    output: results/test/test_R2_Aligned.sortedByCoord.out.bam
    log: results/logs/test_star_R2.out
    jobid: 3
    reason: Missing output files: results/test/test_R2_Aligned.sortedByCoord.out.bam
    wildcards: results=results, sample=test
    resources: total_impact=5


      STAR --soloType CB_UMI_Simple       --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts       --soloUMIfiltering MultiGeneUMI       --soloUMIdedup 1MM_Directional       --readFilesCommand gunzip -c       --runThreadN 12       --outFilterMultimapNmax 1       --soloBarcodeReadLength 0       --soloCBwhitelist whitelist/737K-august-2016.txt --soloUMIlen 10       --clipAdapterType CellRanger4       --genomeDir index/cr2020A_star       --soloFeatures Gene       --soloCellFilter None         --outFileNamePrefix results/test/test_R2_       --outSAMtype BAM SortedByCoordinate       --limitBAMsortRAM 48000000000       --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM       --readFilesIn $(echo  | sed 's/ /,/g;s/R1/R2/g') $(echo  | sed 's/ /,/g')

      rm -rf results/test/test_R2_Solo.out


rule filterR2:
    input: results/test/test_R2_Aligned.sortedByCoord.out.bam
    output: results/temp/test_R2.bam
    log: results/logs/filter_test_R2.txt
    jobid: 11
    reason: Missing output files: results/temp/test_R2.bam; Input files updated by another job: results/test/test_R2_Aligned.sortedByCoord.out.bam
    wildcards: results=results, sample=test


    samtools view -h results/test/test_R2_Aligned.sortedByCoord.out.bam | grep -v 'CB:Z:-\|UB:Z:-' | samtools view -b - > results/temp/test_R2_filter.bam
    samtools index results/temp/test_R2_filter.bam
    python3 inst/scripts/filter_bam.py -i results/temp/test_R2_filter.bam -o results/temp/test_R2.bam
    samtools index results/temp/test_R2.bam
    rm -rf results/temp/test_R2_filter.bam
    rm -rf results/temp/test_R2_filter.bam.bai


rule assign_sites_read2:
    input: results/temp/test_R2.bam
    output: results/temp/test_R2_assigned.bam
    log: results/logs/assign_sites_test_R2.txt
    jobid: 8
    reason: Missing output files: results/temp/test_R2_assigned.bam; Input files updated by another job: results/temp/test_R2.bam
    wildcards: results=results, sample=test


    if echo ref/polyadb32.hg38.saf.gz | tr '[:upper:]' '[:lower:]' | grep -q -e "\.saf$" -e "\.saf.gz$"; then
      polyaformat='SAF'; else
      polyaformat='GTF';
    fi

    featureCounts     -s 1 -Q 10 -O     --read2pos 3     -o results/temp/test_R2_assigned     -F $polyaformat     -a ref/polyadb32.hg38.saf.gz     -R BAM     -T 1     results/temp/test_R2.bam     2> results/logs/assign_sites_test_R2.txt

    samtools sort     results/temp/test_R2.bam.featureCounts.bam     -o results/temp/test_R2_assigned.bam

    samtools index results/temp/test_R2_assigned.bam

    rm -rf results/temp/test_R2.bam.featureCounts.bam


rule count:
    input: results/temp/test_R2_assigned.bam
    output: results/counts/test_R2_counts.tsv.gz
    log: results/logs/count_test_R2.txt
    jobid: 1
    reason: Missing output files: results/counts/test_R2_counts.tsv.gz; Input files updated by another job: results/temp/test_R2_assigned.bam
    wildcards: results=results, sample=test, read=R2


    umi_tools count       --per-gene       --gene-tag=XT       --assigned-status-tag=XS       --per-cell       --extract-umi-method=tag       --umi-tag=UB       --cell-tag=CB       -I results/temp/test_R2_assigned.bam       -S results/counts/test_R2_counts.tsv.gz       > results/logs/count_test_R2.txt


rule bedR2:
    input: results/temp/test_R2.bam
    output: results/bed/test_R2.bed.gz
    log: results/logs/bed_test_R2.txt
    jobid: 6
    reason: Missing output files: results/bed/test_R2.bed.gz; Input files updated by another job: results/temp/test_R2.bam
    wildcards: results=results, sample=test


    umi_tools dedup       --extract-umi-method=tag       --umi-tag=UB       --per-cell       --cell-tag=CB       --method=unique       -I results/temp/test_R2.bam       -S results/temp/test_R2_dedup.bam       > results/logs/bed_test_R2.txt

    bedtools genomecov -ibam results/temp/test_R2_dedup.bam -3 -dz | awk 'BEGIN { OFS = "	" } { print $1, $2 - 1, $2, $3 }' - | gzip > results/bed/test_R2.bed.gz


rule filterR1:
    input: results/test/test_R1_Aligned.sortedByCoord.out.bam
    output: results/temp/test_R1.bam
    log: results/logs/filter_test_R1.txt
    jobid: 12
    reason: Missing output files: results/temp/test_R1.bam; Input files updated by another job: results/test/test_R1_Aligned.sortedByCoord.out.bam
    wildcards: results=results, sample=test


    samtools view -h results/test/test_R1_Aligned.sortedByCoord.out.bam | grep -v 'CB:Z:-\|UB:Z:-' | samtools view -b - > results/temp/test_R1_filter.bam
    set +e
    python3 inst/scripts/filter_bam_correct.py -i results/temp/test_R1_filter.bam -o results/temp/test_R1_filter2.bam -l 56
    exitcode=$?
    if [ $exitcode -eq 1 ]
    then
      touch results/counts/test_R1_counts.tsv.gz
    fi
    samtools sort results/temp/test_R1_filter2.bam -o results/temp/test_R1.bam
    # mv results/temp/test_R1_filter.bam results/temp/test_R1.bam
    samtools index results/temp/test_R1.bam
    rm -rf results/temp/test_R1_filter.bam
    rm -rf results/temp/test_R1_filter2.bam


rule assign_sites_read1:
    input: results/temp/test_R1.bam
    output: results/temp/test_R1_assigned.bam
    log: results/logs/assign_sites_test_R1.txt
    jobid: 10
    reason: Missing output files: results/temp/test_R1_assigned.bam; Input files updated by another job: results/temp/test_R1.bam
    wildcards: results=results, sample=test


    if echo ref/polyadb32.hg38.saf.gz | tr '[:upper:]' '[:lower:]' | grep -q -e "\.saf$" -e "\.saf.gz$"; then
      polyaformat='SAF'; else
      polyaformat='GTF';
    fi

    featureCounts     -s 2 -Q 10 -O -p     --read2pos 5     -o results/temp/test_R1_assigned     -F $polyaformat     -a ref/polyadb32.hg38.saf.gz     -R BAM     -T 1     results/temp/test_R1.bam     2> results/logs/assign_sites_test_R1.txt

    samtools sort     results/temp/test_R1.bam.featureCounts.bam     -o results/temp/test_R1_assigned.bam

    samtools index results/temp/test_R1_assigned.bam
    rm -rf results/temp/test_R1.bam.featureCounts.bam


rule count:
    input: results/temp/test_R1_assigned.bam
    output: results/counts/test_R1_counts.tsv.gz
    log: results/logs/count_test_R1.txt
    jobid: 5
    reason: Missing output files: results/counts/test_R1_counts.tsv.gz; Input files updated by another job: results/temp/test_R1_assigned.bam
    wildcards: results=results, sample=test, read=R1


    umi_tools count       --per-gene       --gene-tag=XT       --assigned-status-tag=XS       --per-cell       --extract-umi-method=tag       --umi-tag=UB       --cell-tag=CB       -I results/temp/test_R1_assigned.bam       -S results/counts/test_R1_counts.tsv.gz       > results/logs/count_test_R1.txt


rule bedR1:
    input: results/temp/test_R1.bam
    output: results/bed/test_R1.bed.gz
    log: results/logs/bed_test_R1.txt
    jobid: 7
    reason: Missing output files: results/bed/test_R1.bed.gz; Input files updated by another job: results/temp/test_R1.bam
    wildcards: results=results, sample=test


    umi_tools dedup       --extract-umi-method=tag       --umi-tag=UB       --per-cell       --cell-tag=CB       --paired       --method=unique       -I results/temp/test_R1.bam       -S results/temp/test_R1_dedup.bam       > results/logs/bed_test_R1.txt
    # 0x82 for R2
    samtools view -f 0x42 results/temp/test_R1_dedup.bam -b > results/temp/test_R1_r.bam
    bedtools genomecov -ibam results/temp/test_R1_r.bam -5 -dz | awk 'BEGIN { OFS = "	" } { print $1, $2 - 1, $2, $3 }' - | gzip > results/bed/test_R1.bed.gz
    rm -rf results/temp/test_R1_r.bam


rule ena_download:
    input: config.yaml
    output: sample_data/test_R1.fastq.gz, sample_data/test_R2.fastq.gz
    jobid: 2
    reason: Missing output files: sample_data/test_R1.fastq.gz, sample_data/test_R2.fastq.gz
    wildcards: data=sample_data, sample=test


      sample=test
      part2=${sample//${sample:(-4):(-1)}//00}
      wget -c -O sample_data/test_R1.fastq.gz ftp.sra.ebi.ac.uk/vol1/fastq/${part2}/test/test_1.fastq.gz
      wget -c -O sample_data/test_R2.fastq.gz ftp.sra.ebi.ac.uk/vol1/fastq/${part2}/test/test_2.fastq.gz


localrule all:
    input: results/counts/test_R1_counts.tsv.gz, results/test/test_R1_Aligned.sortedByCoord.out.bam, results/bed/test_R1.bed.gz, results/counts/test_R2_counts.tsv.gz, results/test/test_R2_Aligned.sortedByCoord.out.bam, results/bed/test_R2.bed.gz, sample_data/test_R1.fastq.gz, sample_data/test_R2.fastq.gz
    jobid: 0
    reason: Input files updated by another job: results/counts/test_R2_counts.tsv.gz, sample_data/test_R2.fastq.gz, results/test/test_R2_Aligned.sortedByCoord.out.bam, results/test/test_R1_Aligned.sortedByCoord.out.bam, sample_data/test_R1.fastq.gz, results/counts/test_R1_counts.tsv.gz, results/bed/test_R2.bed.gz, results/bed/test_R1.bed.gz

Job counts:
	count	jobs
	1	all
	1	assign_sites_read1
	1	assign_sites_read2
	1	bedR1
	1	bedR2
	2	count
	1	cutadapt_paired
	1	ena_download
	1	filterR1
	1	filterR2
	1	starsolo_R1
	1	starsolo_R2
	13
@kriemo kriemo self-assigned this Nov 11, 2022
@johnchamberlin
Copy link

I am encountering this issue. Does it prevent the rest of the rules from executing properly, or can I ignore the download_ena errors?

@kriemo kriemo removed their assignment Aug 28, 2023
@kriemo
Copy link
Member Author

kriemo commented Aug 28, 2023

i don't have the bandwidth to fix this issue at this time. Perhaps @agillen or @raysinensis can help out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants