Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VarSim generated readnames too long for SAM/BAM #276

Open
gabeng opened this issue Apr 24, 2024 · 2 comments
Open

VarSim generated readnames too long for SAM/BAM #276

gabeng opened this issue Apr 24, 2024 · 2 comments

Comments

@gabeng
Copy link

gabeng commented Apr 24, 2024

VarSim may generate read names that cannot be written into SAM or BAM format. bwa-mem fails to perform alignment on these reads.

chr13_31453404_33401170-1248109-,chr13_31453404_33401170-1248192-I,chr13_31453404_33401170-1248107-,chr13_31453404_33401170-1248218-D,chr13_31453404_33401170-1248109-:chr13_31453404_33401170-1248218-D-,chr13_31453404_33401170-1248217--:::::::::::622992MjYyMDg=:0

samtools import can be used to check validity of read names.

I used seqkit to rename all reads.

seqkit replace -p .+ -r "seq_{nr}"
@yunfeiguo
Copy link
Contributor

Hi @gabeng what commands did you use? Also can you provide a small example to reproduce the issue?

@gabeng
Copy link
Author

gabeng commented Apr 25, 2024

Hi,

these are the steps to reproduce the issue:

Create a small contig from hg38.

This was my region.bed file.

chr13	31453404	33401170

I used the helper script to extract variants and sequences:

/usr/bin/python /opt/varsim/generate_small_test_ref.py 
  --regions region.bed 
  --vcfs hg001.vcf.gz 
  --outdir small_ref_out 
  --reference GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fa

Simulate reads

Then, reads were simulated with this command:

varsim.py \
	--reference small_ref_out/ref.fa \
	--id simulated-sample \
	--read_length 150 \
	--disable_rand_vcf \
	--disable_rand_dgv \
	--vcfs small_ref_out/hg001.vcf.gz \
	--mean_fragment_size 350 \
	--sd_fragment_size 50 \
	--nlanes 4 \
	--total_coverage 1000 \
	--java_max_mem 20g \
	--simulator_executable /opt/varsim/opt/ART/art_bin_VanillaIceCream/art_illumina \
	--out_dir ~/sim-1000 \
	--log_dir log \
	--work_dir work \
	--simulator art \
	--vc_prop_het 0 \
	--seed 200

I ran about 20 different simulations with different coverage. A single simulation did not contain invalid FASTQ names. Not every FASTQ name is invalid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants