docker index command stop #1357

Open
fbelleau opened this issue Jun 1, 2024 · 4 comments
Comments

@fbelleau

fbelleau commented Jun 1, 2024

I am able to index two .nt files individually, but when I concatenate them, indexing stops without any message.

Command: index

echo '{ "ascii-prefixes-only": false, "num-triples-per-batch": 1000 }' > olympics.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.olympics docker.io/adfreiburg/qlever:latest -c 'cat flymine-*.nt | IndexBuilderMain -F ttl -f - -i olympics -s olympics.settings.json --stxxl-memory 5G | tee olympics.index-log.txt'

2024-06-01 06:26:41.550 - INFO: QLever IndexBuilder, compiled on Tue Apr  2 19:02:03 UTC 2024 using git hash 25449d
2024-06-01 06:26:41.552 - INFO: You specified the input format: TTL
2024-06-01 06:26:41.552 - INFO: Processing input triples from /dev/stdin ...
2024-06-01 06:26:41.553 - INFO: Locale was not specified in settings file, default is en_US
2024-06-01 06:26:41.553 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-06-01 06:26:41.554 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-06-01 06:26:41.554 - INFO: You specified "num-triples-per-batch = 1,000", choose a lower value if the index builder runs out of memory
2024-06-01 06:26:41.554 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-06-01 06:28:17.973 - INFO: Done, total number of triples read: 11,359,551 [may contain duplicates]
2024-06-01 06:28:17.974 - INFO: Number of QLever-internal triples created: 11,359,551 [may contain duplicates]
2024-06-01 06:28:17.974 - INFO: Merging partial vocabularies ...

@hannahbast
Member

@fbelleau That is strange. Can you send a link to these two .nt files? (Here, or by mail if you don't want the link to appear on a public website.)

@fbelleau
Author

fbelleau commented Jun 2, 2024

Freed memory on the server, allowing the job to complete successfully. The job previously crashed due to insufficient memory allocation.
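
Since the log above simply ends with no error line, and the index builder's output is piped through `tee`, the shell reports the exit status of `tee` rather than that of the builder. A minimal sketch of how the real status could be surfaced is shown below. `run_logged` is a hypothetical helper, not part of QLever or Docker, and interpreting status 137 as an out-of-memory kill is an assumption: 137 only means the process received SIGKILL (128 + 9), which the kernel's OOM killer is one possible source of.

```shell
#!/usr/bin/env bash
# Sketch: run a command, tee its output to a log file, and report the
# command's own exit status instead of the status of tee.
# run_logged is a hypothetical helper name, not part of QLever or Docker.
run_logged() {
  local log_file=$1
  local status
  shift
  # Copy stdout and stderr of the command to the log file and to stdout.
  "$@" 2>&1 | tee "$log_file"
  # PIPESTATUS[0] holds the status of the command, not of tee.
  status=${PIPESTATUS[0]}
  if [ "$status" -eq 137 ]; then
    # 137 = 128 + 9 (SIGKILL); an OOM kill is one possible cause.
    echo "command was killed by SIGKILL (status 137)" >&2
  elif [ "$status" -ne 0 ]; then
    echo "command failed with status $status" >&2
  fi
  return "$status"
}
```

For a quick check without running an index build, `run_logged /tmp/demo.log bash -c 'kill -9 $$'` simulates a process that is killed with SIGKILL and should make the helper return 137.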

How much memory do you think is needed to index 50 GB of N-Triples files?

fbelleau closed this as completed Jun 6, 2024

@fbelleau
Author

fbelleau commented Jun 6, 2024

Adding memory solved the problem.

fbelleau reopened this Jun 6, 2024

@fbelleau
Author

fbelleau commented Jun 6, 2024

@hannahbast

Now I have the same problem with a larger file, and memory does not seem to be the problem.

echo '{ "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > flymine-object.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.flymine-object docker.io/adfreiburg/qlever:latest -c 'ulimit -Sn 1048576; cat ./data/xa* | IndexBuilderMain -F ttl -f - -i flymine-object -s flymine-object.settings.json --stxxl-memory 5G | tee flymine-object.index-log.txt'

2024-06-06 15:52:46.688 - INFO: QLever IndexBuilder, compiled on Tue Apr  2 19:02:03 UTC 2024 using git hash 25449d
2024-06-06 15:52:46.688 - INFO: You specified the input format: TTL
2024-06-06 15:52:46.688 - INFO: Processing input triples from /dev/stdin ...
2024-06-06 15:52:46.690 - INFO: Locale was not specified in settings file, default is en_US
2024-06-06 15:52:46.690 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-06-06 15:52:46.691 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files that don't include multiline literals with unescaped newline characters and that have newline characters after the end of triples.
2024-06-06 15:52:46.691 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-06-06 15:52:46.691 - INFO: Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2024-06-06 15:58:09.398 - INFO: Input triples processed: 100,000,000
2024-06-06 16:03:30.491 - INFO: Input triples processed: 200,000,000
2024-06-06 16:09:28.228 - INFO: Done, total number of triples read: 291,005,465 [may contain duplicates]
2024-06-06 16:09:28.230 - INFO: Number of QLever-internal triples created: 291,005,465 [may contain duplicates]
2024-06-06 16:09:28.230 - INFO: Merging partial vocabularies ...
2024-06-06 16:14:05.367 - INFO: Finished writing compressed external vocabulary, size = 0 B [uncompressed = 0 B, ratio = 100%]
2024-06-06 16:14:06.669 - INFO: Finished writing compressed internal vocabulary, size = 777.9 MB [uncompressed = 3.6 GB, ratio = 21%]
2024-06-06 16:14:06.733 - INFO: Number of words in external vocabulary: 80,929,190
2024-06-06 16:14:06.734 - INFO: Removing temporary files ...
2024-06-06 16:14:07.231 - INFO: Converting triples from local IDs to global IDs ...
2024-06-06 16:14:25.092 - INFO: Triples converted: 100,000,000
2024-06-06 16:14:40.650 - INFO: Triples converted: 200,000,000
2024-06-06 16:14:54.726 - INFO: Done, total number of triples converted: 291,005,465
2024-06-06 16:14:54.774 - INFO: Creating a pair of index permutations ...
2024-06-06 16:15:30.379 - INFO: Triples processed: 100,000,000
2024-06-06 16:15:59.980 - INFO: Triples processed: 200,000,000
2024-06-06 16:16:25.258 - INFO: Number of unique elements: 291,005,465
2024-06-06 16:16:27.963 - INFO: Statistics for SPO: #relations = 34,758,461, #blocks = 6,209, #triples = 291,005,465
2024-06-06 16:16:27.967 - INFO: Statistics for SOP: #relations = 34,758,461, #blocks = 6,209, #triples = 291,005,465
2024-06-06 16:16:27.968 - INFO: Writing meta data for SPO and SOP ...
2024-06-06 16:16:27.982 - INFO: Number of distinct patterns: 170
2024-06-06 16:16:27.982 - INFO: Number of subjects with pattern: 34,758,461 [all]
2024-06-06 16:16:27.982 - INFO: Total number of distinct subject-predicate pairs: 291,005,465
2024-06-06 16:16:27.982 - INFO: Average number of predicates per subject: 8.4
2024-06-06 16:16:27.984 - INFO: Average number of subjects per predicate: 2,852,995
2024-06-06 16:16:28.076 - INFO: Creating a pair of index permutations ...
2024-06-06 16:17:24.113 - INFO: Triples processed: 100,000,000

I am working on a server with 8 GB of RAM and 4 cores.
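
On a machine of this size, it may help to check how much memory is actually available before starting a build. The sketch below reads the `MemAvailable` field from `/proc/meminfo` on Linux. `mem_available_kib` and `has_headroom` are hypothetical helper names, and the 5 GiB threshold in the usage note merely mirrors the `--stxxl-memory 5G` value from the command above. It is an illustrative assumption, not a statement about how much memory QLever needs in total.

```shell
#!/usr/bin/env bash
# Sketch: compare available memory against a budget before starting a job.
# mem_available_kib and has_headroom are hypothetical helper names.
mem_available_kib() {
  # Print the MemAvailable value (in kB) from a meminfo-style file.
  local meminfo=${1:-/proc/meminfo}
  awk '/^MemAvailable:/ { print $2 }' "$meminfo"
}

has_headroom() {
  # Succeed if at least $1 GiB is available according to the file $2.
  local need_gib=$1
  local avail_kib
  avail_kib=$(mem_available_kib "${2:-/proc/meminfo}")
  [ -n "$avail_kib" ] && [ "$avail_kib" -ge $(( need_gib * 1024 * 1024 )) ]
}
```

For example, `has_headroom 5 || echo "less than 5 GiB available" >&2` could be run on the host before `docker run`. The second argument exists only so that the check can be pointed at a different meminfo-style file.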

The file I am indexing is here:

https://huggingface.co/datasets/bio2rdf/flymine_nt/tree/main

There is also a copy of my Qleverfile there. I use the qlever Python command.
