
Some nextflow processes died #161

Open
molinfzlvvv opened this issue May 20, 2024 · 3 comments
Labels
nextflow Issues related to nextflow

Comments

molinfzlvvv commented May 20, 2024

Hi, I have a few problems and hope to get your help.

My command is:
./toga.py /home/TOGAInput/query/hg38.H.g.final.chain /home/TOGAInput/human_hg38/toga.transcripts.bed /home/TOGA/hg38.2bit /home/TOGA/query/H.g.2bit --kt --pn /opt/synData2/Hg -i /home/TOGAInput/human_hg38/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,100 --cjn 300 --u12 /home/TOGAInput/human_hg38/toga.U12introns.tsv --ms -q

When TOGA was running the CESAR jobs, the following error occurred:

Compiling C code...
Model found
CESAR installation found
Traceback (most recent call last):
  File "/home/TOGA/./toga.py", line 1600, in <module>
    main()
  File "/home/TOGA/./toga.py", line 1596, in main
    toga_manager.run()
  File "/home/TOGA/./toga.py", line 530, in run
    self.__check_cesar_completeness()
  File "/home/TOGA/./toga.py", line 1088, in __check_cesar_completeness
    monitor_jobs(jobs_managers, die_if_sc_1=True)
  File "/home/TOGA/modules/parallel_jobs_manager_helpers.py", line 36, in monitor_jobs
    raise AssertionError(err)
AssertionError: Error! Some para/nextflow processes died!

The relevant section of the log file is as follows:

Checking whether all CESAR results are complete
1 CESAR jobs crashed, trying to run again...
!!RERUN CESAR JOBS: Pushing 1 jobs into None GB queue
Selected parallelization strategy: nextflow
Parallel manager: pushing job nextflow /home/TOGA/execute_joblist.nf --joblist /opt/synData2/Hg/_cesar_rerun_batch_None -c /opt/synData2/Hg/temp/cesar_config_16_queue.nf
Monitoring CESAR jobs rerun
## Stated polling cluster jobs until they done
Polling iteration 0; already waiting 0 seconds.
Polling iteration 1; already waiting 60 seconds.
Polling iteration 2; already waiting 120 seconds.
Polling iteration 3; already waiting 180 seconds.
.......
Polling iteration 48; already waiting 2880 seconds.
Polling iteration 49; already waiting 2940 seconds.
### CESAR jobs done ###

It is worth noting that this error occurs frequently. Sometimes running the same command a second time works, but each run requires a significant time investment. Do you have any suggestions for addressing this issue?

Best regards!

kirilenkobm (Member) commented

Hi!
I am sorry about that; it feels like I implemented quite an aggressive strategy here.
After TOGA executes its CESAR jobs, it collects those that crashed (which may happen for a variety of reasons) and pushes them again.
If any of these rerun jobs dies, the TOGA process dies as well.
This will be disabled in the next commit (in a couple of minutes).
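
Roughly, the rerun monitoring does something like this (a simplified sketch, not the exact code in modules/parallel_jobs_manager_helpers.py; the jobs-manager objects and their returncode attribute are placeholders):

```python
import time

def monitor_jobs(jobs_managers, die_if_sc_1=False, poll_interval=60):
    """Poll the parallel job managers until all of them have finished.

    If die_if_sc_1 is True (the strict mode used for the CESAR rerun),
    any non-zero exit code aborts the whole TOGA run.
    """
    iteration = 0
    while any(m.returncode is None for m in jobs_managers):
        # matches the "Polling iteration N; already waiting M seconds." log lines
        print(f"Polling iteration {iteration}; already waiting {iteration * poll_interval} seconds.")
        time.sleep(poll_interval)
        iteration += 1
    crashed = [m for m in jobs_managers if m.returncode != 0]
    if die_if_sc_1 and crashed:
        err = "Error! Some para/nextflow processes died!"
        raise AssertionError(err)
```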

kirilenkobm added a commit that referenced this issue May 26, 2024
@molinfzlvvv
Copy link
Author

Hi!
Thanks for your response. Could you help me look at this problem again? It has bothered me for a long time.
#140 (comment)
My task has been running for ten days and is still in the step called "### STEP 7: Execute CESAR jobs: parallel step". Paradoxically, it seems to be working just fine, because the log file keeps growing.

In fact, I requested a node with 40 CPUs and changed the nextflow setting to process.cpus = 40 in the SLURM config file for the CESAR jobs, but it looks like only 2 CPUs are actually utilized. I do not know why it is not using all the resources; is that why it is so slow?
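
For reference, the change looks roughly like this (a sketch, not an exact copy of TOGA's config template; every line except process.cpus is a standard nextflow option shown only as an assumption about the setup):

```
// Sketch of a CESAR nextflow config; values other than process.cpus are assumed.
process.cpus = 40          // CPUs requested per task, not total CPUs for the whole run
process.memory = '16 GB'   // assumed per-task memory request
process.executor = 'slurm' // or 'local' when jobs are not submitted to a cluster
executor.queueSize = 300   // how many tasks nextflow keeps in flight at once
```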

If you can suggest any commands to speed up the process, I would really appreciate it.

kirilenkobm added the nextflow label May 27, 2024
molinfzlvvv (Author) commented Jun 4, 2024

Hi! @kirilenkobm

I am very sorry to bother you so many times; so far I have not managed to complete a single run. I tried a lot, but I could not submit it to the SLURM system, it kept reporting errors. So now I am running TOGA on a master node with 40 cores, divided into two buckets based on memory (--cb 10,100). I expected to be able to use all the CPUs during the CESAR step, but only two CPUs are used. It works, just too slowly, and it seems CESAR can only run one job before moving on to the next; it has been running for over a week. Do you have any suggestions? I would appreciate them very much.

In addition, I noticed that when CESAR runs in the 10 and 100 GB buckets, the jobs are not executed in the order they are listed, because the output does not match the order in the cesar_joblist_queue_10.txt file. What is the reason for this? If they ran in order, I could tell how far along the run is and how much longer it will take.
[screenshot: CESAR job output order]

By the way, my nextflow is 21.10.6.5660, and I cloned TOGA directly with git. Looking forward to your reply.

Best regards!
