Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with dbCreator #13

Open
gwajnberg opened this issue Apr 29, 2020 · 19 comments
Open

Error with dbCreator #13

gwajnberg opened this issue Apr 29, 2020 · 19 comments

Comments

@gwajnberg
Copy link

Hello ,

I just tried to use with my data PGA package and i had an error in dbCreator:

library(BSgenome.Hsapiens.UCSC.hg19)
dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation_path,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path)
Error in y[, z] : subscript out of bounds
this is the vcf header:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
#CHROM POS ID REF ALT QUAL
chr1 10043516 . A G 0.01
chr1 10044378 . A G 0.01

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/junctions.bed
chr1 12697 13221 JUNC_0 0 + 12697 13221 255,0,0 2 50,50 0,575
chr1 14829 14970 JUNC_1 1 - 14829 14970 255,0,0 2 50,50 0,192
chr1 14829 15021 JUNC_2 0 - 14829 15021 255,0,0 2 50,50 0,243
chr1 15012 25233 JUNC_3 0 + 15012 25233 255,0,0 2 50,50 0,10272
chr1 15038 15796 JUNC_4 3 - 15038 15796 255,0,0 2 50,50 0,809
chr1 15947 16607 JUNC_5 3 - 15947 16607 255,0,0 2 50,50 0,711
chr1 16765 16854 JUNC_6 0 - 16765 16854 255,0,0 2 50,50 0,140
chr1 16765 16858 JUNC_7 0 - 16765 16858 255,0,0 2 50,50 0,144
chr1 17055 17233 JUNC_8 0 - 17055 17233 255,0,0 2 50,50 0,229
chr1 17055 17606 JUNC_9 0 - 17055 17606 255,0,0 2 50,50 0,602

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/merged.gtf
chr1 Cufflinks exon 852260 857889 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 857991 858025 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 878110 878438 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 878633 878757 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 879078 879188 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 879288 880180 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 880422 880526 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "5"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 880898 881033 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "6"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 881527 881666 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "7"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 881782 881925 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "8"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";

annotation driectory prepared with PrepareAnnotationRefseq2

@Shawn-Xu
Copy link
Collaborator

Hi,

Some information/columns are missing from your VCF file. You need to add these columns (e.g. content in the red box) to ensure the integrity of the format.
image

@gwajnberg
Copy link
Author

ok my vcf was a product from a merging software that merged multiple vcf files of snps and indels...so i decided to use an vcf that was in this format generated by varscan. And now, I had another error:

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path)
Output novel junction peptides... Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
In range 1: 'end' must be >= 'start' - 1.

@Shawn-Xu
Copy link
Collaborator

Shawn-Xu commented May 1, 2020

In your *.bed file, the coordinates of the start point (12697+575) in your junction part 2 exceed the end point (13221). This indicates that your file is not a standard output file generated by Tophat. Each junction from junctions.bed reported by TopHat will consists of two connected BED blocks in the column 11 and 12.

In the *.gtf file. this function was designed for the cufflinks output. Only the transcript_id beginning with "Cuff." will be considered as novel sequences for the subsequent processing.
If you use the gffcompare to merge all the *.gtf, you need to manually add the prefix "Cuff." to the transcript_id of each entries.

image

@gwajnberg
Copy link
Author

ok... so what pipeline do you recommend to generate a dabase from 3 patients raw sequencing data? since pga accepts only one vcf, one bed and one gtf? cheers

@gwajnberg
Copy link
Author

and why you don't create an alternative for tophat, since it's being discontinued since 2016?

@Shawn-Xu
Copy link
Collaborator

Shawn-Xu commented May 5, 2020

and why you don't create an alternative for tophat, since it's being discontinued since 2016?

vcf and gtf are the common data formats for point mutations and transcripts. Creating a new format is even more likely to cause new misunderstandings.

Tophat2 has entered a low maintenance since 2016, and HISAT2 provides the same core functionality. So we developed an interface to hisat2's splicesites output file. You can use tabFile argument instead of bedFile to complete junction construction。
20200505120148

splicesites format demo:
20200505124637
This format only contains the left and right flanking genomic positions, which are more easier to understand.

@Jokendo-collab
Copy link

In your *.bed file, the coordinates of the start point (12697+575) in your junction part 2 exceed the end point (13221). This indicates that your file is not a standard output file generated by Tophat. Each junction from junctions.bed reported by TopHat will consists of two connected BED blocks in the column 11 and 12.

In the *.gtf file. this function was designed for the cufflinks output. Only the transcript_id beginning with "Cuff." will be considered as novel sequences for the subsequent processing.
If you use the gffcompare to merge all the *.gtf, you need to manually add the prefix "Cuff." to the transcript_id of each entries.

image

This information is missing in the tutorial. I think the tools used in the generation of the GTF and BED files needs to be mentioned briefly for ease of following the workflow. I have struggled and this information solved my problem

@gwajnberg
Copy link
Author

hello I started from the beggining then I have a new vcf, new gtf and new tabFile

@gwajnberg
Copy link
Author

image
here is the junction tab

@gwajnberg
Copy link
Author

image
the vcf

@gwajnberg
Copy link
Author

image
the gtf

@gwajnberg
Copy link
Author

gwajnberg commented Jun 1, 2020

and now I have got this error :

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,tabFile=tabfile,
annotation_path=annotation,outfile_name=outfile_name,
genome=Hsapiens,outdir=outfile_path)
Output novel junction peptides... Error in $<-.data.frame(*tmp*, "V4", value = "+") :
replacement has 1 row, data has 0
In addition: Warning messages:
1: In rbind(...) :
number of columns of result is not a multiple of vector length (arg 1)
2: In rbind(...) :
number of columns of result is not a multiple of vector length (arg 1)
3: In rbind(...) :
number of columns of result is not a multiple of vector length (arg 1)

@gwajnberg
Copy link
Author

any ideas?

@Jokendo-collab
Copy link

@gwajnberg I also had that problem and I was not able to solve it

@gwajnberg
Copy link
Author

Hello javanOkendo, I found the problem there is a function in the end of the script called Tab2Range, which tries to replace if you have a row into your tab File with a "." instead of + or - ... but instead of using a condition the function just try to create a tmp variable, which if you don't have any "." it will be empty...here is it

Tab2Range
function(tabfile)
{
options(stringsAsFactors=FALSE)
jun <- read.table(tabfile, sep='\t', header=F, stringsAsFactors=F)

#replace the strand "." with "+" and "-".
tmp<-jun[jun$V4==".",]
tmp_forward<-tmp
tmp_reverse<-tmp
tmp_forward$V4<-"+"
tmp_reverse$V4<-"-"
jun<-rbind(jun[jun$V4!=".",],tmp_forward,tmp_reverse)

part1_sta <- as.numeric(jun[,'V2'])+1-59
part1_end <- as.numeric(jun[,'V2'])+1
part2_sta <- as.numeric(jun[,'V3'])+1
part2_end <- as.numeric(jun[,'V3'])+1+59

junction <- data.frame(chr=jun[, 'V1'], id="NULL", start=jun[, 'V2'],
end=jun[,'V3'], cov=255, strand=jun$V4, part1_len=0,
part2_len=0, part1_sta, part1_end, part2_sta, part2_end)
if('chrM' %in% junction$chr) junction <- junction[-which(junction$chr=='chrM'), ]
if('MT' %in% junction$chr) junction <- junction[-which(junction$chr=='MT'), ]

junRange <- GRanges(seqnames=junction$chr, ranges=IRanges(start=junction$part1_end,
end=junction$part2_sta), strand=junction$strand, id=junction$id,
cov=junction$cov, part1_len=junction$part1_len, part2_len=junction$part2_len,
part1_sta=junction$part1_sta, part1_end=junction$part1_end,
part2_sta=junction$part2_sta, part2_end=junction$part2_end)

}

@Jokendo-collab
Copy link

@gwajnberg did this work with your data? if Yes can you share the workflow which you used so that I can give it a trial on my data? You can send the attachment to oknjav001@myuct.ac.za

@gwajnberg
Copy link
Author

Hello javan, I will send you for sure!! I will just clean the script lol, because i went through line by line of each PGA's functions

@Jokendo-collab
Copy link

Jokendo-collab commented Jun 4, 2020 via email

@Jokendo-collab
Copy link

@gwajnberg did you manage to make the script work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants