Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stochastic bug causes themisto build to hang indefinitely (themisto_linux-v3.2.1) #40

Open
rohanmaddamsetti opened this issue Jun 17, 2024 · 3 comments

Comments

@rohanmaddamsetti
Copy link

Hello,

This bug only happens in ~1-10 out of ~5000 themisto build runs. I am running themisto on ~4500 genomes, calling themisto build on each of these genomes separately, using HPC to schedule them in parallel. Sometimes I see ~5 runs hang indefinitely, most recently I saw 1 run hang indefinitely. If I re-run a failed run, themisto finishes normally. So this bug does not seem to be caused by the specific data, and it occurs in ~ 0.02% of themisto build runs.

Pure speculation on my part, but perhaps caused by some kind of rare race condition?

Here is how I am calling themisto on HPC:

sbatch -p scavenger --mem=2G --cpus-per-task=4 --wrap="themisto build -k 31 -i ../results/themisto_replicon_references/GCF_017165095.1_ASM1716509v1_genomic/GCF_017165095.1_ASM1716509v1_genomic.txt --index-prefix ../results/themisto_replicon_indices/GCF_017165095.1_ASM1716509v1_genomic --temp-dir ../results/themisto_replicon_indices/temp --mem-gigas 2 --n-threads 4 --file-colors"

9 hours later, this is what the log file looked like:
failed-slurm-11363085.out.txt

When I rerun this command, the run finishes in 8 seconds, here is the log file:
rerun-slurm-11372781.out.txt

@iosfwd
Copy link
Collaborator

iosfwd commented Jun 18, 2024

Have you tried to allocate more memory to your batch job? 2G might not be cutting it, since there could be some memory overhead on top of what is passed as an option to Themisto, which in your case is "--mem-gigas 2".

@rohanmaddamsetti
Copy link
Author

Yes, I have tried this, allocating 6GB per job. I see another stochastic failure (1 out of ~5000 runs), where themisto hangs indefinitely. Here is the output (hung for 6 hours):

22.6940 Mon Jul 1 13:15:20 2024 Themisto-v3.2.1
22.8120 Mon Jul 1 13:15:20 2024 Maximum k-mer length (size of the de Bruijn graph node labels): 32
1031.4840 Mon Jul 1 13:15:21 2024 Build configuration:
Sequence file = ../results/themisto_replicon_references/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.txt
Index de Bruijn graph output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tdbg
Index coloring output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tcolors
Temporary directory = ../results/themisto_replicon_indices/temp
k = 31
Reverse complements = true
Number of threads = 4
Memory gigabytes = 6
Manual colors = false
Sequence colors = false
File colors = true
Load DBG = false
Handling of non-ACGT characters = delete
Coloring structure type: sdsl-hybrid
Verbosity = normal
1036.5070 Mon Jul 1 13:15:21 2024 Starting
1038.7190 Mon Jul 1 13:15:21 2024 Running GGCAT
Allocator initialized: mem: 6 GiB chunks: 24576 log2: 18
Started phase: reads bucketing prev stats:
Temp buckets files size: 2.01 MiB
Finished phase: reads bucketing. phase duration: 1.10s gtime: 1.39s
Started phase: kmers merge prev stats:
Total color subsets: 3
Finished phase: kmers merge. phase duration: 1.11s gtime: 2.50s
Started phase: hashes sorting prev stats:
Finished phase: hashes sorting. phase duration: 16.56s gtime: 19.06s
Started phase: links compaction prev stats:
Completed compaction with 30 iters
Finished phase: links compaction. phase duration: 612.27ms gtime: 19.67s
Started phase: reads reorganization prev stats:
Finished phase: reads reorganization. phase duration: 64.54ms gtime: 19.73s
Started phase: unitigs building prev stats:
Finished phase: unitigs building. phase duration: 215.84ms gtime: 19.95s
Started phase: maximal unitigs links building [step 1] prev stats:
Finished phase: maximal unitigs links building [step 1]. phase duration: 7.69ms gtime: 19.96s
Started phase: maximal unitigs links building [step 2] prev stats:
Finished phase: maximal unitigs links building [step 2]. phase duration: 8.58ms gtime: 19.97s
Started phase: maximal unitigs links building [step 3] prev stats:
Finished phase: maximal unitigs links building [step 3]. phase duration: 209.29ms gtime: 20.18s
Compacted De Bruijn graph construction completed.
TOTAL TIME: 20.18s
Final stats:
phase: reads bucketing => 1.10s
phase: kmers merge => 1.11s
phase: hashes sorting => 16.56s
phase: links compaction => 612.27ms
phase: reads reorganization => 64.54ms
phase: unitigs building => 215.84ms
phase: maximal unitigs links building [step 1] => 7.69ms
phase: maximal unitigs links building [step 2] => 8.58ms
phase: maximal unitigs links building [step 3] => 209.29ms
25856.8550 Mon Jul 1 13:15:46 2024 Building SBWT
25856.9900 Mon Jul 1 13:15:46 2024 Running KMC counter


Stage 1: 100%
Stage 2: 0%

(end of log, themisto got stuck here)

I then re-ran the failed job, with exactly the same parameters (6GB of memory), and it finished in 10 seconds or so:

28.1510 Mon Jul 1 20:20:21 2024 Themisto-v3.2.1
28.3550 Mon Jul 1 20:20:21 2024 Maximum k-mer length (size of the de Bruijn graph node labels): 32
83.0810 Mon Jul 1 20:20:21 2024 Build configuration:
Sequence file = ../results/themisto_replicon_references/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.txt
Index de Bruijn graph output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tdbg
Index coloring output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tcolors
Temporary directory = ../results/themisto_replicon_indices/temp
k = 31
Reverse complements = true
Number of threads = 4
Memory gigabytes = 6
Manual colors = false
Sequence colors = false
File colors = true
Load DBG = false
Handling of non-ACGT characters = delete
Coloring structure type: sdsl-hybrid
Verbosity = normal
84.0220 Mon Jul 1 20:20:21 2024 Starting
84.4360 Mon Jul 1 20:20:21 2024 Running GGCAT
Allocator initialized: mem: 6 GiB chunks: 24576 log2: 18
Started phase: reads bucketing prev stats:
Temp buckets files size: 2.01 MiB
Finished phase: reads bucketing. phase duration: 915.31ms gtime: 918.78ms
Started phase: kmers merge prev stats:
Total color subsets: 3
Finished phase: kmers merge. phase duration: 535.29ms gtime: 1.45s
Started phase: hashes sorting prev stats:
Finished phase: hashes sorting. phase duration: 641.58ms gtime: 2.10s
Started phase: links compaction prev stats:
Completed compaction with 30 iters
Finished phase: links compaction. phase duration: 615.56ms gtime: 2.71s
Started phase: reads reorganization prev stats:
Finished phase: reads reorganization. phase duration: 118.70ms gtime: 2.83s
Started phase: unitigs building prev stats:
Finished phase: unitigs building. phase duration: 36.97ms gtime: 2.87s
Started phase: maximal unitigs links building [step 1] prev stats:
Finished phase: maximal unitigs links building [step 1]. phase duration: 14.54ms gtime: 2.88s
Started phase: maximal unitigs links building [step 2] prev stats:
Finished phase: maximal unitigs links building [step 2]. phase duration: 16.37ms gtime: 2.90s
Started phase: maximal unitigs links building [step 3] prev stats:
Finished phase: maximal unitigs links building [step 3]. phase duration: 136.07ms gtime: 3.03s
Compacted De Bruijn graph construction completed.
TOTAL TIME: 3.03s
Final stats:
phase: reads bucketing => 915.31ms
phase: kmers merge => 535.29ms
phase: hashes sorting => 641.58ms
phase: links compaction => 615.56ms
phase: reads reorganization => 118.70ms
phase: unitigs building => 36.97ms
phase: maximal unitigs links building [step 1] => 14.54ms
phase: maximal unitigs links building [step 2] => 16.37ms
phase: maximal unitigs links building [step 3] => 136.07ms
3338.1230 Mon Jul 1 20:20:25 2024 Building SBWT
3338.2360 Mon Jul 1 20:20:25 2024 Running KMC counter


Stage 1: 100%
Stage 2: 100%

6634.8530 Mon Jul 1 20:20:28 2024 Sorting KMC database
in1: 100%
7524.0370 Mon Jul 1 20:20:29 2024 Writing nodes and dummies to disk
7608.0230 Mon Jul 1 20:20:29 2024 Uncompressing KMC database to disk
9071.2700 Mon Jul 1 20:20:30 2024 Streaming
9843.0130 Mon Jul 1 20:20:31 2024 Sorting dummies on disk
9938.6730 Mon Jul 1 20:20:31 2024 Merging sorted streams
10418.3910 Mon Jul 1 20:20:32 2024 Building SBWT structure
10443.9400 Mon Jul 1 20:20:32 2024 Building de Bruijn Graph finished (3283326 k-mers)
10444.0200 Mon Jul 1 20:20:32 2024 Building color structure
10444.0380 Mon Jul 1 20:20:32 2024 Building the color mapping
Started phase: unitigs reorganization prev stats:
Finished phase: unitigs reorganization. phase duration: 122.26ms gtime: 123.27ms
Started phase: colormap reading prev stats:
11124.7280 Mon Jul 1 20:20:33 2024 Serializing color structure
11129.8650 Mon Jul 1 20:20:33 2024 Done

@jnalanko
Copy link
Collaborator

jnalanko commented Jul 2, 2024

The part where it is hanging is very disk-heavy. My guess is that there is some hangup in the disk IO of the cluster, and Themisto is unable to recover from that. Unfortunately that part of the code is within the KMC library, so it's hard for me to fix.

In the long term I'm planning to get rid of the KMC dependency altogether, so this should be fixed then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants