Write vocabulary files to separate directory #1237

hannahbast · 2024-01-21T18:19:02Z

The vocabulary files can now be written to (and read from) a separate directory than the other index files. This directory can be specified in both IndexBuilderMain and ServerMain via the new command-line option --vocabulary-basename or -v. The directory for the index files is specified like before via --index-basename or -i. If no directory for the vocabulary files is specified, the same directory as for the index files is taken (that was the status quo before this PR).

This is useful for datasets with a huge vocabulary. For example, the vocabulary files for UniProt have a total size of over 3 TB (mostly due to .vocabulary.external and .vocabulary.external.idsAndOffsets).

codecov · 2024-01-21T18:43:13Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (ea905cc) 87.46% compared to head (81f2e53) 87.48%.

Files	Patch %	Lines
src/index/IndexImpl.Text.cpp	91.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1237      +/-   ##
==========================================
+ Coverage   87.46%   87.48%   +0.02%     
==========================================
  Files         307      307              
  Lines       27771    27785      +14     
  Branches     3116     3116              
==========================================
+ Hits        24291    24309      +18     
+ Misses       2346     2340       -6     
- Partials     1134     1136       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hannahbast · 2024-01-22T10:57:38Z

@joka921 I have tried this with olympics and dblp and it worked fine. I am currently building an index with uniprot, using SSD for the index files and HDD for the vocabulary files. That works fine so far as well.

I think this is rather straightforward, in particular, since the default behavior (when the new --vocabulary-basename option is not used) is exactly like before.

joka921

Some suggestions to this very useful feature.

joka921 · 2024-01-22T11:37:18Z

src/ServerMain.cpp

 Server server(port, numSimultaneousQueries, memoryMaxSize,
 std::move(accessToken), !noPatternTrick);
- server.run(indexBasename, text, !noPatterns, !onlyPsoAndPosPermutations);
+ server.run(baseNameIndex, baseNameVocabulary, text, !noPatterns,
+ !onlyPsoAndPosPermutations);


I think one thing you could do (the outer interface that interacts with the control script is mostly you area)
is to set up a simple struct ServerConfig that is then passed to the constructor as well as the run() function where they grap their respectively needed arguments. That makes it much easier to add additional arguments.
(Probably we need a similar struct IndexConfig that then becomes part of the server config for exactly the same reason).
Then it will become much easier to add additional arguments.
Are you interested in setting this up as a separate PR, or should I do this?

joka921 · 2024-01-22T11:39:09Z

src/index/IndexBuilderMain.cpp

+ "The basename of the index files (required).");
+ add("vocabulary-basename,v", po::value(&baseNameVocabulary),
+ "The basename of the vocabulary files"
+ "(default: same as basename of the index fles).");


Suggested change

"(default: same as basename of the index fles).");

"(default: same as basename of the index files).");

joka921 · 2024-01-22T11:40:41Z

src/index/IndexBuilderMain.cpp

+ if (baseNameVocabulary.empty()) {
+ baseNameVocabulary = baseNameIndex;
+ }


If the baseNameVocabulary is an std::optional<string> then you can pass it as is all the way down, and only in a single place in the index (in setOnDiskBase) you have to resolve the default (or not even there but only in the functions that interact with the vocabulary).

joka921 · 2024-01-22T11:42:54Z

src/index/IndexImpl.Text.cpp

@@ -107,14 +107,14 @@ void IndexImpl::addTextFromContextFile(const string& contextFile,
 LOG(DEBUG) << "Reloading the RDF vocabulary ..." << std::endl;
 vocab_ = RdfsVocabulary{};
 readConfiguration();
- vocab_.readFromFile(onDiskBase_ + INTERNAL_VOCAB_SUFFIX,
- onDiskBase_ + EXTERNAL_VOCAB_SUFFIX);
+ vocab_.readFromFile(onDiskBaseVocabulary_ + INTERNAL_VOCAB_SUFFIX,


If you make the onDiskBaseVocabulary_ an optional (see my suggestion above), and then instead introduce the function onDiskBaseVocabulary() that returns something like onDiskBaseVocabulary_.value_or(onDiskBaseIndex_), then everything is clean nice and typesafe:)

joka921 · 2024-01-22T11:45:54Z

test/util/IndexTestHelpers.cpp

@@ -124,7 +123,7 @@ Index makeTestIndex(const std::string& indexBasename,
 // these tests.
 static std::ostringstream ignoreLogStream;
 ad_utility::setGlobalLoggingStream(&ignoreLogStream);
- std::string inputFilename = indexBasename + ".ttl";
+ std::string inputFilename = baseName + ".ttl";


If you here change the baseName vocabulary to something that is different but derived from the baseName, then you immediately have a test for your new feature.
(make sure to also pass then then vocabularyBasename to the getAllIndexFilenames function above).

The vocabulary files can now be written to (and read from) a separate directory than the other index files. This directory can be specified in both `IndexBuilderMain` and `ServerMain` via the new command-line option `--vocabulary-basename` or `-v`. The directory for the index files is specified like before via `--index-basename` or `-i`. If no directory for the vocabulary files is specified, the same directory as for the index files is taken (that was the status quo before this PR). This is useful for datasets with a huge vocabulary. For example, the vocabulary files for UniProt have a total size of over 3 TB (mostly due to `.vocabulary.external` and `.vocabulary.external.idsAndOffsets`).

Reason: The merge was very SLOW when these were in the vocabulary directory, which for our UniProt index builds is on HDD (because the external vocabulary is so larger). I first tried to only have the `.tmp.partial-vocabulary.words` files in the index directory, but that was still slow. Now also the `.tmp.partial-vocabulary.ids` files are in the index directory. Explanations concerning SLOW: The merging of the first few 100M triples is fast (30 seconds per 100M triples). Then it becomes slow and then very slow (half an hour from 700M triples to 800M triples). Not only is it slow, but doing other stuff on the machine (like wrting something in an editor with autosave on) becomes very slow to respond to, which is a clear sign that the random accesses to HDD are the problem. NOTE: With the partial solution, where `.tmp.partial-vocabulary.words` are on SSD and `.tmp.partial-vocabulary.ids` are on HDD, it is not as bad. There was a very significant slow-down from 700M to 1100M triples, but after that merging was as fast again (though not as fast as in the beginning). At the time of this writing, I only observed until 1700M, stay tuned for more information.

sonarcloud · 2024-02-02T05:30:52Z

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

3 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

hannahbast requested a review from joka921 January 22, 2024 10:52

joka921 requested changes Jan 22, 2024

View reviewed changes

Hannah Bast added 3 commits February 2, 2024 05:23

Remove log message used for debugging.

dffc75a

hannahbast force-pushed the write-vocabulary-files-to-separate-directory branch from 6799a37 to 81f2e53 Compare February 2, 2024 04:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write vocabulary files to separate directory #1237

Write vocabulary files to separate directory #1237

hannahbast commented Jan 21, 2024

codecov bot commented Jan 21, 2024 •

edited

Loading

hannahbast commented Jan 22, 2024

joka921 left a comment

joka921 Jan 22, 2024

joka921 Jan 22, 2024

joka921 Jan 22, 2024

joka921 Jan 22, 2024

joka921 Jan 22, 2024

sonarcloud bot commented Feb 2, 2024

	"(default: same as basename of the index fles).");
	"(default: same as basename of the index files).");

Write vocabulary files to separate directory #1237

Are you sure you want to change the base?

Write vocabulary files to separate directory #1237

Conversation

hannahbast commented Jan 21, 2024

codecov bot commented Jan 21, 2024 • edited Loading

Codecov Report

hannahbast commented Jan 22, 2024

joka921 left a comment

Choose a reason for hiding this comment

joka921 Jan 22, 2024

Choose a reason for hiding this comment

joka921 Jan 22, 2024

Choose a reason for hiding this comment

joka921 Jan 22, 2024

Choose a reason for hiding this comment

joka921 Jan 22, 2024

Choose a reason for hiding this comment

joka921 Jan 22, 2024

Choose a reason for hiding this comment

sonarcloud bot commented Feb 2, 2024

Quality Gate passed

codecov bot commented Jan 21, 2024 •

edited

Loading