Error with dbCreator #13

gwajnberg · 2020-04-29T14:55:39Z

Hello ,

I just tried to use with my data PGA package and i had an error in dbCreator:

library(BSgenome.Hsapiens.UCSC.hg19)
dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation_path,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path)
Error in y[, z] : subscript out of bounds
this is the vcf header:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
#CHROM POS ID REF ALT QUAL
chr1 10043516 . A G 0.01
chr1 10044378 . A G 0.01

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/junctions.bed
chr1 12697 13221 JUNC_0 0 + 12697 13221 255,0,0 2 50,50 0,575
chr1 14829 14970 JUNC_1 1 - 14829 14970 255,0,0 2 50,50 0,192
chr1 14829 15021 JUNC_2 0 - 14829 15021 255,0,0 2 50,50 0,243
chr1 15012 25233 JUNC_3 0 + 15012 25233 255,0,0 2 50,50 0,10272
chr1 15038 15796 JUNC_4 3 - 15038 15796 255,0,0 2 50,50 0,809
chr1 15947 16607 JUNC_5 3 - 15947 16607 255,0,0 2 50,50 0,711
chr1 16765 16854 JUNC_6 0 - 16765 16854 255,0,0 2 50,50 0,140
chr1 16765 16858 JUNC_7 0 - 16765 16858 255,0,0 2 50,50 0,144
chr1 17055 17233 JUNC_8 0 - 17055 17233 255,0,0 2 50,50 0,229
chr1 17055 17606 JUNC_9 0 - 17055 17606 255,0,0 2 50,50 0,602

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/merged.gtf
chr1 Cufflinks exon 852260 857889 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 857991 858025 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 878110 878438 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 878633 878757 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 879078 879188 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 879288 880180 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 880422 880526 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "5"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 880898 881033 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "6"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 881527 881666 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "7"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 881782 881925 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "8"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";

annotation driectory prepared with PrepareAnnotationRefseq2

Shawn-Xu · 2020-04-30T02:19:34Z

Hi,

Some information/columns are missing from your VCF file. You need to add these columns (e.g. content in the red box) to ensure the integrity of the format.

gwajnberg · 2020-04-30T11:49:35Z

ok my vcf was a product from a merging software that merged multiple vcf files of snps and indels...so i decided to use an vcf that was in this format generated by varscan. And now, I had another error:

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path)
Output novel junction peptides... Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
In range 1: 'end' must be >= 'start' - 1.

Shawn-Xu · 2020-05-01T04:49:36Z

In your *.bed file, the coordinates of the start point (12697+575) in your junction part 2 exceed the end point (13221). This indicates that your file is not a standard output file generated by Tophat. Each junction from junctions.bed reported by TopHat will consists of two connected BED blocks in the column 11 and 12.

In the *.gtf file. this function was designed for the cufflinks output. Only the transcript_id beginning with "Cuff." will be considered as novel sequences for the subsequent processing.
If you use the gffcompare to merge all the *.gtf, you need to manually add the prefix "Cuff." to the transcript_id of each entries.

gwajnberg · 2020-05-04T11:48:26Z

ok... so what pipeline do you recommend to generate a dabase from 3 patients raw sequencing data? since pga accepts only one vcf, one bed and one gtf? cheers

gwajnberg · 2020-05-04T11:51:51Z

and why you don't create an alternative for tophat, since it's being discontinued since 2016?

Shawn-Xu · 2020-05-05T04:57:51Z

and why you don't create an alternative for tophat, since it's being discontinued since 2016?

vcf and gtf are the common data formats for point mutations and transcripts. Creating a new format is even more likely to cause new misunderstandings.

Tophat2 has entered a low maintenance since 2016, and HISAT2 provides the same core functionality. So we developed an interface to hisat2's splicesites output file. You can use tabFile argument instead of bedFile to complete junction construction。

splicesites format demo:

This format only contains the left and right flanking genomic positions, which are more easier to understand.

Jokendo-collab · 2020-05-10T20:55:38Z

In your *.bed file, the coordinates of the start point (12697+575) in your junction part 2 exceed the end point (13221). This indicates that your file is not a standard output file generated by Tophat. Each junction from junctions.bed reported by TopHat will consists of two connected BED blocks in the column 11 and 12.

In the *.gtf file. this function was designed for the cufflinks output. Only the transcript_id beginning with "Cuff." will be considered as novel sequences for the subsequent processing.
If you use the gffcompare to merge all the *.gtf, you need to manually add the prefix "Cuff." to the transcript_id of each entries.

This information is missing in the tutorial. I think the tools used in the generation of the GTF and BED files needs to be mentioned briefly for ease of following the workflow. I have struggled and this information solved my problem

gwajnberg · 2020-06-01T16:41:22Z

hello I started from the beggining then I have a new vcf, new gtf and new tabFile

gwajnberg · 2020-06-01T16:45:56Z

here is the junction tab

gwajnberg · 2020-06-01T16:46:58Z

the vcf

gwajnberg · 2020-06-01T16:47:53Z

the gtf

gwajnberg · 2020-06-01T16:48:18Z

and now I have got this error :

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,tabFile=tabfile,
annotation_path=annotation,outfile_name=outfile_name,
genome=Hsapiens,outdir=outfile_path)
Output novel junction peptides... Error in $<-.data.frame(*tmp*, "V4", value = "+") :
replacement has 1 row, data has 0
In addition: Warning messages:
1: In rbind(...) :
number of columns of result is not a multiple of vector length (arg 1)
2: In rbind(...) :
number of columns of result is not a multiple of vector length (arg 1)
3: In rbind(...) :
number of columns of result is not a multiple of vector length (arg 1)

gwajnberg · 2020-06-01T16:49:18Z

any ideas?

Jokendo-collab · 2020-06-01T17:10:26Z

@gwajnberg I also had that problem and I was not able to solve it

gwajnberg · 2020-06-02T13:41:56Z

Hello javanOkendo, I found the problem there is a function in the end of the script called Tab2Range, which tries to replace if you have a row into your tab File with a "." instead of + or - ... but instead of using a condition the function just try to create a tmp variable, which if you don't have any "." it will be empty...here is it

Tab2Range
function(tabfile)
{
options(stringsAsFactors=FALSE)
jun <- read.table(tabfile, sep='\t', header=F, stringsAsFactors=F)

#replace the strand "." with "+" and "-".
tmp<-jun[jun$V4==".",]
tmp_forward<-tmp
tmp_reverse<-tmp
tmp_forward$V4<-"+"
tmp_reverse$V4<-"-"
jun<-rbind(jun[jun$V4!=".",],tmp_forward,tmp_reverse)

part1_sta <- as.numeric(jun[,'V2'])+1-59
part1_end <- as.numeric(jun[,'V2'])+1
part2_sta <- as.numeric(jun[,'V3'])+1
part2_end <- as.numeric(jun[,'V3'])+1+59

junction <- data.frame(chr=jun[, 'V1'], id="NULL", start=jun[, 'V2'],
end=jun[,'V3'], cov=255, strand=jun$V4, part1_len=0,
part2_len=0, part1_sta, part1_end, part2_sta, part2_end)
if('chrM' %in% junction$chr) junction <- junction[-which(junction$chr=='chrM'), ]
if('MT' %in% junction$chr) junction <- junction[-which(junction$chr=='MT'), ]

junRange <- GRanges(seqnames=junction$chr, ranges=IRanges(start=junction$part1_end,
end=junction$part2_sta), strand=junction$strand, id=junction$id,
cov=junction$cov, part1_len=junction$part1_len, part2_len=junction$part2_len,
part1_sta=junction$part1_sta, part1_end=junction$part1_end,
part2_sta=junction$part2_sta, part2_end=junction$part2_end)

}

Jokendo-collab · 2020-06-03T08:36:55Z

@gwajnberg did this work with your data? if Yes can you share the workflow which you used so that I can give it a trial on my data? You can send the attachment to oknjav001@myuct.ac.za

gwajnberg · 2020-06-04T15:52:47Z

Hello javan, I will send you for sure!! I will just clean the script lol, because i went through line by line of each PGA's functions

Jokendo-collab · 2020-06-04T16:06:21Z

I will appreciate is and I look forward to it. Thanks for the good gesture

…

On Thu, Jun 4, 2020 at 5:53 PM gwajnberg ***@***.***> wrote: Hello javan, I will send you for sure!! I will just clean the script lol, because i went through line by line of each PGA's functions — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#13 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGJ34OY3A4OXC54YV7VQFVDRU67N7ANCNFSM4MTZLVYA> .

Jokendo-collab · 2020-06-17T17:18:54Z

@gwajnberg did you manage to make the script work?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error with dbCreator #13

Error with dbCreator #13

gwajnberg commented Apr 29, 2020

Shawn-Xu commented Apr 30, 2020

gwajnberg commented Apr 30, 2020

Shawn-Xu commented May 1, 2020 •

edited

Loading

gwajnberg commented May 4, 2020

gwajnberg commented May 4, 2020

Shawn-Xu commented May 5, 2020 •

edited

Loading

Jokendo-collab commented May 10, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020 •

edited

Loading

gwajnberg commented Jun 1, 2020

Jokendo-collab commented Jun 1, 2020

gwajnberg commented Jun 2, 2020

Jokendo-collab commented Jun 3, 2020

gwajnberg commented Jun 4, 2020

Jokendo-collab commented Jun 4, 2020 via email

Jokendo-collab commented Jun 17, 2020

Error with dbCreator #13

Error with dbCreator #13

Comments

gwajnberg commented Apr 29, 2020

Shawn-Xu commented Apr 30, 2020

gwajnberg commented Apr 30, 2020

Shawn-Xu commented May 1, 2020 • edited Loading

gwajnberg commented May 4, 2020

gwajnberg commented May 4, 2020

Shawn-Xu commented May 5, 2020 • edited Loading

Jokendo-collab commented May 10, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020

gwajnberg commented Jun 1, 2020 • edited Loading

gwajnberg commented Jun 1, 2020

Jokendo-collab commented Jun 1, 2020

gwajnberg commented Jun 2, 2020

Jokendo-collab commented Jun 3, 2020

gwajnberg commented Jun 4, 2020

Jokendo-collab commented Jun 4, 2020 via email

Jokendo-collab commented Jun 17, 2020

Shawn-Xu commented May 1, 2020 •

edited

Loading

Shawn-Xu commented May 5, 2020 •

edited

Loading

gwajnberg commented Jun 1, 2020 •

edited

Loading