
Some nextflow processes died #161

Open
molinfzlvvv opened this issue May 20, 2024 · 3 comments
Labels
nextflow Issues related to nextflow

Comments

molinfzlvvv commented May 20, 2024

Hi, I have a few problems and hope to get your help.

My command is:
./toga.py /home/TOGAInput/query/hg38.H.g.final.chain /home/TOGAInput/human_hg38/toga.transcripts.bed /home/TOGA/hg38.2bit /home/TOGA/query/H.g.2bit --kt --pn /opt/synData2/Hg -i /home/TOGAInput/human_hg38/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,100 --cjn 300 --u12 /home/TOGAInput/human_hg38/toga.U12introns.tsv --ms -q

When TOGA was running the CESAR jobs, the following error occurred:

Compiling C code...
Model found
CESAR installation found
Traceback (most recent call last):
  File "/home/TOGA/./toga.py", line 1600, in <module>
    main()
  File "/home/TOGA/./toga.py", line 1596, in main
    toga_manager.run()
  File "/home/TOGA/./toga.py", line 530, in run
    self.__check_cesar_completeness()
  File "/home/TOGA/./toga.py", line 1088, in __check_cesar_completeness
    monitor_jobs(jobs_managers, die_if_sc_1=True)
  File "/home/TOGA/modules/parallel_jobs_manager_helpers.py", line 36, in monitor_jobs
    raise AssertionError(err)
AssertionError: Error! Some para/nextflow processes died!

The relevant section of the log file is as follows:

Checking whether all CESAR results are complete
1 CESAR jobs crashed, trying to run again...
!!RERUN CESAR JOBS: Pushing 1 jobs into None GB queue
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /home/TOGA/execute_joblist.nf --joblist /opt/synData2/Hg/_cesar_rerun_batch_None -c /opt/synData2/Hg/temp/cesar_config_16_queue.nf
Monitoring CESAR jobs rerun
## Stated polling cluster jobs until they done
Polling iteration 0; already waiting 0 seconds.
Polling iteration 1; already waiting 60 seconds.
Polling iteration 2; already waiting 120 seconds.
Polling iteration 3; already waiting 180 seconds.
.......
Polling iteration 48; already waiting 2880 seconds.
Polling iteration 49; already waiting 2940 seconds.
### CESAR jobs done ###

It is worth noting that this error occurs frequently. Sometimes running the same command a second time works, but each run requires a significant time investment. Do you have any suggestions for addressing this issue?

Best regards!

kirilenkobm (Member) commented

Hi!
I am sorry about that; it feels like I implemented quite an aggressive strategy here.
After TOGA executes its CESAR jobs, it collects those that crashed (which may happen for a variety of reasons) and pushes them again.
If any of these rerun jobs dies, the TOGA process dies as well.
This will be disabled in the next commit (in a couple of minutes).
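
Roughly, the rerun monitoring does something like this (a simplified sketch, not the exact code in modules/parallel_jobs_manager_helpers.py; the jobs-manager objects and their returncode attribute are placeholders):

```python
import time

def monitor_jobs(jobs_managers, die_if_sc_1=False, poll_interval=60):
    """Poll the parallel job managers until all of them have finished.

    If die_if_sc_1 is True (the strict mode used for the CESAR rerun),
    any non-zero exit code aborts the whole TOGA run.
    """
    iteration = 0
    while any(m.returncode is None for m in jobs_managers):
        # matches the "Polling iteration N; already waiting M seconds." log lines
        print(f"Polling iteration {iteration}; already waiting {iteration * poll_interval} seconds.")
        time.sleep(poll_interval)
        iteration += 1
    crashed = [m for m in jobs_managers if m.returncode != 0]
    if die_if_sc_1 and crashed:
        err = "Error! Some para/nextflow processes died!"
        raise AssertionError(err)
```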

kirilenkobm added a commit that referenced this issue May 26, 2024
@molinfzlvvv
Copy link
Author

Hi!
Thanks for your response. Could you help me look at this problem again? It has bothered me for a long time.
#140 (comment)
My task has been running for ten days and is still in the step called "### STEP 7: Execute CESAR jobs: parallel step". Paradoxically, it seems to be working just fine, because the log file keeps growing.

In fact, I requested a node with 40 CPUs and changed the nextflow setting to process.cpus = 40 in the SLURM config file for the CESAR jobs, but it looks like only 2 CPUs are actually utilized. I do not know why it is not using all the resources; is that why it is so slow?
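
For reference, the change looks roughly like this (a sketch, not an exact copy of TOGA's config template; every line except process.cpus is a standard nextflow option shown only as an assumption about the setup):

```
// Sketch of a CESAR nextflow config; values other than process.cpus are assumed.
process.cpus = 40          // CPUs requested per task, not total CPUs for the whole run
process.memory = '16 GB'   // assumed per-task memory request
process.executor = 'slurm' // or 'local' when jobs are not submitted to a cluster
executor.queueSize = 300   // how many tasks nextflow keeps in flight at once
```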

If you can suggest any commands to speed up the process, I would really appreciate it.

kirilenkobm added the nextflow label May 27, 2024
molinfzlvvv (Author) commented Jun 4, 2024

Hi! @kirilenkobm

I am very sorry to bother you so many times; so far I have not managed to complete a single run. I tried a lot, but I could not submit it to the SLURM system, it kept reporting errors. So now I am running TOGA on a master node with 40 cores, divided into two buckets based on memory (--cb 10,100). I expected to be able to use all the CPUs during the CESAR step, but only two CPUs are used. It works, just too slowly, and it seems CESAR can only run one job before moving on to the next; it has been running for over a week. Do you have any suggestions? I would appreciate them very much.

In addition, I noticed that when CESAR runs in the 10 and 100 GB buckets, the jobs are not executed in the order they are listed, because the output does not match the order in the cesar_joblist_queue_10.txt file. What is the reason for this? If they ran in order, I could tell how far along the run is and how much longer it will take.
[screenshot: CESAR job output order]

By the way, my nextflow is 21.10.6.5660, and I cloned TOGA directly with git. Looking forward to your reply.

Best regards!
