twobit sizes do not match #160
Comments
Hi, can you pls check why this scaffold has different sizes in the 2bit and the chain file (twobit: 71766; size in chain: 71276)? I don't know what the --limit_to_ref_chrom parameter does, but maybe try removing the problematic scaffolds (if you don't need them) from both the 2bit and the chain file. Then it should work. |
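A minimal sketch of how one could list every scaffold whose size disagrees between a chain file and the two genomes before starting a run. It relies on the UCSC chain header layout (`chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`). The function name and the two size dictionaries are hypothetical, not part of TOGA; the dictionaries are assumed to have been built separately from a name/size table for each 2bit file.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Mismatch = Tuple[str, str, int, Optional[int]]


def chain_size_mismatches(
    chain_lines: Iterable[str],
    ref_sizes: Dict[str, int],
    query_sizes: Dict[str, int],
) -> List[Mismatch]:
    """Return (side, name, size_in_chain, size_in_2bit) for each disagreement.

    size_in_2bit is None when the sequence is missing from the size table.
    """
    seen = set()
    problems: List[Mismatch] = []
    for line in chain_lines:
        if not line.startswith("chain "):
            continue  # alignment block lines and blank separators
        fields = line.split()
        # fields[2], fields[3]: tName, tSize (the reference side for TOGA)
        # fields[7], fields[8]: qName, qSize (the query side)
        sides = (
            ("reference", fields[2], int(fields[3]), ref_sizes),
            ("query", fields[7], int(fields[8]), query_sizes),
        )
        for side, name, size_in_chain, sizes in sides:
            if (side, name) in seen:
                continue  # report each sequence only once
            seen.add((side, name))
            size_in_2bit = sizes.get(name)
            if size_in_2bit != size_in_chain:
                problems.append((side, name, size_in_chain, size_in_2bit))
    return problems
```

If almost every scaffold shows up as a mismatch on both sides, the two 2bit files may simply have been passed in the wrong order.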
Hello, Thank you very much for the response. The 2bit and chain files were generated from a HAL file of 61 genomes generated through Cactus. So the scaffolds should be the same length, but it's not clear why they are different lengths. The main chromosome scaffolds are all the same length. I already went ahead and removed the unplaced scaffolds from the fastas pulled from the HAL file and generating new chain files using However, I ran into another issue running TOGA on a genome where the 2bit size did match with the reference. I kept running into an issue with the CESAR jobs crashing and the run erroring out. I checked this issue here where there was a similar issue: (https://github.com/hillerlab/TOGA/issues/146) Following the suggestions in the issue, I added the
So there is an issue with the realign_exons function in
|
Hmm, 100 GB should be enough. Do all jobs error or only particular ones with very big genes (I think TTN needs more mem). The other thing that could be an issue, though I don't see it in these messages is the .1 in the transcript ID. |
I do not believe the dot in the transcript name would have an issue. I have issued jobs with the same reference that ran smoothly. Also, I realized that I switched the order of the reference 2bit and the target 2bit in the toga command (incredibly embarrassing on my end, but it would explain why the scaffold sizes were not matching between the 2bit and chain files). Regarding the jobs where the CESAR jobs have been crashing, could this be part of the issue? The target genomes here have the same scaffold sizes as the reference, so if I had switched the reference and target 2bits it would not get caught. |
I wanted to follow up, now that I have the results from more attempted jobs. Every run that I have attempted has failed with the same error mentioned in my second comment and reprinted here:
Again, the issue seems to lie with the realign_exons() function in the CESAR_wrapper.py script. This same error has occurred in all of my jobs so far, and there is no consistency in which transcripts the jobs fail on. Only 8-12 of the thousands of transcripts seem to be failing, but since the set is not consistent across genomes, I can't really go back and remove the problematic transcripts from the inputs. To provide some more context, all of the input files are derived from a HAL file of 61 genomes, with the exception of the bed file, which is derived from the reference genome in NCBI. I previously ran TOGA on genomes derived from a much smaller HAL file of four genomes to test the script, and all of those jobs succeeded. However, the chain files derived from that HAL file were noticeably smaller than the ones derived from the 61-genome HAL. For instance, the Nerodia clarkii genome that I present here was part of that smaller alignment, and that job finished without an issue, unlike this one. Is it possible to simply skip over the problematic transcripts when the CESAR jobs fail? Or is this indicative of a larger problem with my data? |
Interesting. Typically, such issues affect many scaffolds and thus many genes. But from what you describe, it seems like only a few small scaffolds are affected. There must be a way to ignore the crashed CESAR jobs (typically not recommended, but maybe OK here). @kirilenkobm can you pls advise how to continue downstream? |
Hi @lpnunez That sounds interesting. Could you please grep XM_032237228.1 and any other problematic transcripts from the reference BED file? Maybe I can spot something unusual. I'd recommend skipping such transcripts unless they are vital for downstream research and affect only a small fraction of the whole dataset. |
Sure, here are a few transcripts that keep crashing:
So I went through the list of failed jobs and I found that the transcripts that I list here are the same across most of the ingroup taxa. The outgroup has a different set of transcripts that failed, and there is at least one ingroup taxon that has a different set of problematic transcripts. I decided to remove these transcripts from the bed file and see if it will work for the rest of the taxa (they represent 8 transcripts out of tens of thousands, so I am not too concerned about them). However, while I may have found consistency across most ingroup taxa, it is a bit irksome to have to run a job and see if it fails or not to figure out which transcripts are problematic or not. I was wondering if it was possible to check if a CESAR job will fail or not beforehand, to save time. Or if it's possible to ignore those jobs and continue with the rest of the run. I would understand if you did not implement an option for something like this, though. |
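For anyone who needs the same workaround, here is a minimal sketch of dropping a known set of failing transcripts from the reference BED file before a run. It assumes a tab-separated BED file with the transcript ID in the fourth (name) column; the function name is hypothetical and not part of TOGA.

```python
from typing import Iterable, List, Set


def drop_transcripts(bed_lines: Iterable[str], failed_ids: Set[str]) -> List[str]:
    """Return the BED lines whose name column is not in failed_ids."""
    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 4 of a BED line holds the feature name (here: transcript ID).
        if len(fields) >= 4 and fields[3] in failed_ids:
            continue
        kept.append(line)
    return kept
```

Usage would look something like `drop_transcripts(open("reference.bed"), {"XM_032237228.1"})`, with the result written to a new file (the file name is a placeholder). Note that this only removes transcripts already known to fail; it does not predict which ones will.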
Thank you a lot @lpnunez Actually, some filters for problematic transcripts have already been implemented, but the number of potential edge cases is simply vast. I've tried to find anything suspicious about these particular transcripts so far, but I haven't had any luck. If it's OK with you, could you please send me the files used in this command?
I can see which line crashes, but I'm not sure what could be causing this condition. Despite running TOGA on hundreds of different genomes, this is something new. |
Sure, here is a Google Drive link for the input files: The files here are for a different taxon, but the condition is the same as the command you highlighted. Here is the command for these files:
I have been removing the problematic transcripts from the input bed files and it seems to have fixed the issue. But it would be nice to know why these transcripts keep failing, so there could potentially be a better fix. Thanks again for the help. |
Hello,
I am trying to run TOGA to transfer annotations over from my well-annotated reference to several different query genomes. However, while trying to run TOGA on certain genomes I would run into the following error:
Found 365 sequences in /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit
Error! 2bit file: /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit; chain_file: /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Natrix_natrix/CM020096/temp/genome_alignment.chain Chromosome: WNA01000062.1; Sizes don't match! Size in twobit: 71766; size in chain: 71276
Traceback (most recent call last):
File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 1600, in <module>
main()
File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 1595, in main
toga_manager = Toga(args)
File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 261, in __init__
self.__check_param_files()
File "/home/lnunez/mendel-nas1/TOGA/toga.py", line 338, in __check_param_files
TogaSanityChecker.check_2bit_file_completeness(self.t_2bit, t_chrom_to_size, self.chain_file)
File "/mendel-nas1/lnunez/TOGA/modules/toga_sanity_checks.py", line 105, in check_2bit_file_completeness
raise ValueError(err)
ValueError: Error! 2bit file: /home/lnunez/mendel-nas1/WGS/Cactus/Outputs/Diss/twobit/Thamnophis_elegans.2bit; chain_file: /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/Natrix_natrix/CM020096/temp/genome_alignment.chain Chromosome: WNA01000062.1; Sizes don't match! Size in twobit: 71766; size in chain: 71276
WNA01000062.1 refers to an unplaced scaffold in the reference, of which there are 347. However, I am only interested in the actual reference chromosomes, of which there are 18. At first, I used the --limit_to_ref_chrom option to limit the runs to these specific chromosomes, like so:
./toga.py "${path_to_chain}"/"${genome}.chain.gz" ${path_to_bed} "${path_to_2bit}"/"${ref}.2bit" "${path_to_2bit}"/"${genome}.2bit" --limit_to_ref_chrom ${chromosome} --kt --pn /home/lnunez/mendel-nas1/WGS/TOGA/Dissertation/"${genome}"/"${chromosome}" --nc ${path_to_nextflow_config_dir} --cb 10,100 --cjn 500
However, I still get the same error, despite limiting the run to that chromosome. Is there a way to bypass this particular check that I am not seeing? I am in a time crunch, so I would greatly prefer not to regenerate the input files from scratch.
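As a rough illustration of restricting a chain file to the reference chromosomes of interest without regenerating it, the sketch below keeps only the chains whose reference sequence name is in a given set. It assumes the standard chain layout (a header line starting with `chain`, followed by block lines and a blank line) and that the reference is the first sequence named in the header. The function name is hypothetical, not a TOGA option, and this does not touch the 2bit files.

```python
from typing import Iterable, List, Set


def keep_chains_for(chain_lines: Iterable[str], ref_chroms: Set[str]) -> List[str]:
    """Keep whole chains (header, blocks, separator) on the given reference names."""
    kept = []
    keep = False
    for line in chain_lines:
        if line.startswith("chain "):
            # Header fields: chain score tName tSize ...; tName is at index 2.
            keep = line.split()[2] in ref_chroms
        if keep:
            kept.append(line)
    return kept
```

Lines before the first header (such as comments) are dropped, and each kept chain retains its trailing blank separator line.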