Improve citation cache memory efficiency #211
Draft
Background
The citation "cache" is an in-memory graph with every document as a node and citations/references as edges. It speeds up second-order operators that rely on graph information.
Recently we've been running into out-of-memory errors: the full citation graph for ~22 million documents was roughly 5GB. The graph is also duplicated in memory while the cache is being warmed by another Solr searcher, so it can take >10GB of the 24GB allocated to Solr.
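A decent amount of this space goes to "boxing" primitive values: when you box a primitive value, it's wrapped inside another object (the "box"). Each wrapper adds per-object overhead (on a typical 64-bit HotSpot JVM a boxed Integer occupies about 16 bytes, versus 4 bytes for a primitive int, plus the reference that points to it), which is substantial over 20 million values.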
Purpose
This PR reduces the size of the citation graph by using specialized primitive type collections that don't require boxing.
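To illustrate the idea, here is a minimal sketch of a growable list backed by a primitive int[], so no Integer wrappers are allocated. The class name IntList and its methods are hypothetical and are not taken from this PR; the actual change may well use an existing primitive-collections library instead.

```java
// Hypothetical sketch: a growable list of ints backed by a primitive int[].
// Not the PR's actual collection class; it only illustrates storing values unboxed.
final class IntList {
    private int[] values = new int[4];
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == values.length) {
            values = java.util.Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    /** Returns the element at the given index. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    /** Number of stored elements (not the capacity of the backing array). */
    int size() {
        return size;
    }

    /** Returns a copy trimmed to the number of stored elements. */
    int[] toArray() {
        return java.util.Arrays.copyOf(values, size);
    }
}
```

Compared with a List<Integer>, each element here costs one 4-byte slot in the backing array, with no separate wrapper object per value.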
It also reduces the size of the graph by using maps instead of arrays to store relationships; previously we kept an array of size N, half of whose entries were `null`.
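As a rough sketch of the sparse-storage idea, the class below keeps entries only for documents that actually have relationships and looks them up by document id. SparseCitationMap is a hypothetical name, and the sorted-keys layout is an assumption made for illustration; the PR does not say which map implementation it uses. How much memory this saves depends on how sparse the data is and on the map's own overhead.

```java
// Hypothetical sketch: a read-only sparse map from document id to citation targets.
// Keys are kept sorted so lookups use binary search; absent documents occupy no slot.
final class SparseCitationMap {
    private static final int[] EMPTY = new int[0];

    private final int[] docIds;     // sorted ascending, unique
    private final int[][] targets;  // targets[i] belongs to docIds[i]

    /** Builds the map from a dense array indexed by document id, skipping null entries. */
    SparseCitationMap(int[][] dense) {
        int present = 0;
        for (int[] row : dense) {
            if (row != null) {
                present++;
            }
        }
        docIds = new int[present];
        targets = new int[present][];
        int next = 0;
        for (int doc = 0; doc < dense.length; doc++) {
            if (dense[doc] != null) {
                docIds[next] = doc;
                targets[next] = dense[doc];
                next++;
            }
        }
    }

    /** Returns the targets for a document, or an empty array if it has none. */
    int[] get(int docId) {
        int pos = java.util.Arrays.binarySearch(docIds, docId);
        return pos >= 0 ? targets[pos] : EMPTY;
    }

    /** Number of documents that have at least one stored relationship. */
    int size() {
        return docIds.length;
    }
}
```

Lookups cost O(log n) here instead of the O(1) of a dense array, which is the usual trade-off for dropping the empty slots.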