Improve citation cache memory efficiency #211

JCRPaquin · 2024-01-16T07:14:04Z

Background

The citation "cache" is an in-memory graph with every document as a node and citations/references as edges. It's used to speed up second-order operators that rely on graph information.

Recently, we've been running into out-of-memory errors: the full citation graph for ~22 million documents was roughly 5GB in size. The graph is also duplicated in memory while the cache is being warmed by another Solr searcher, so it can take >10GB of the 24GB allocated to Solr.

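A decent amount of this space goes to "boxing" primitive values: when you box a primitive value, it's wrapped inside another object (the "box"). The wrapper costs far more than the value itself: on a typical 64-bit HotSpot JVM a boxed Integer takes roughly 16 bytes, plus a reference to it, versus 4 bytes for a primitive int. That is substantial over 20 million values.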
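To put rough numbers on the boxing overhead, here is a back-of-the-envelope sketch. The sizes are assumptions, not measurements from this codebase: a 64-bit HotSpot JVM with compressed references, where a boxed `Integer` is commonly about 16 bytes (12-byte header plus the 4-byte value) and each reference to it is 4 bytes. The class name `BoxingEstimate` is made up for illustration, and storing one value per document is a simplification, since the real graph stores edges as well.

```java
/** Rough heap estimate for boxed vs. primitive int storage (assumed sizes, not measured). */
final class BoxingEstimate {
    // Assumption: 64-bit HotSpot with compressed references.
    static final long BOXED_INTEGER_BYTES = 16; // 12-byte object header + 4-byte int value
    static final long REFERENCE_BYTES = 4;      // reference held by the collection
    static final long PRIMITIVE_INT_BYTES = 4;  // a plain int stored in an int[]

    /** Estimated bytes to hold `count` ints as boxed Integer objects plus references. */
    static long boxedBytes(long count) {
        return count * (BOXED_INTEGER_BYTES + REFERENCE_BYTES);
    }

    /** Estimated bytes to hold `count` ints in a primitive array (array header ignored). */
    static long primitiveBytes(long count) {
        return count * PRIMITIVE_INT_BYTES;
    }

    public static void main(String[] args) {
        long docs = 22_000_000L; // document count quoted in the Background section
        System.out.println("boxed:     " + boxedBytes(docs) / 1_000_000 + " MB");
        System.out.println("primitive: " + primitiveBytes(docs) / 1_000_000 + " MB");
    }
}
```

Under these assumptions, a single int per document comes to about 440 MB boxed versus 88 MB primitive. Actual numbers depend on the JVM and its flags.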

Purpose

This PR reduces the size of the citation graph by using specialized primitive type collections that don't require boxing.
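For readers unfamiliar with primitive collections, the idea can be sketched with a minimal growable list backed by an `int[]`. This is a hypothetical stand-in written for illustration, not the library class this PR adds as a dependency. The point is only that values live directly in the array, so no `Integer` objects are allocated.

```java
/** Minimal growable list of primitive ints; a hypothetical stand-in for a library class. */
final class IntList {
    private int[] values = new int[8]; // backing storage holds raw ints, no boxes
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == values.length) {
            values = java.util.Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    /** Returns the value at `index`, rejecting indexes outside the filled range. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }

    /** Returns a copy trimmed to the filled range. */
    int[] toArray() {
        return java.util.Arrays.copyOf(values, size);
    }
}
```

Compared with `ArrayList<Integer>`, each element here costs 4 bytes in the array rather than a reference plus a separate heap object.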

It also reduces the size of the graph by using maps instead of arrays to store relationships; previously we kept an array of size N, half of whose entries were null.
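The array-to-map change can be sketched as a small open-addressing map from a primitive `int` document id to an `int[]` of related ids. Everything here is an assumption for illustration: the class name `CitationMap`, the non-negative id requirement, and the 0.5 load factor are made up, and the PR itself relies on a library map rather than hand-written code like this.

```java
/** Sketch: open-addressing map from int document id to int[] of related ids (hypothetical). */
final class CitationMap {
    private static final int EMPTY = -1; // sentinel; assumes document ids are non-negative
    private int[] keys;
    private int[][] values;
    private int size;

    CitationMap(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2L) capacity <<= 1; // power of two, load factor <= 0.5
        keys = new int[capacity];
        values = new int[capacity][];
        java.util.Arrays.fill(keys, EMPTY);
    }

    /** Stores (or overwrites) the relationship array for a document. */
    void put(int docId, int[] related) {
        if (docId < 0) throw new IllegalArgumentException("docId must be non-negative");
        if ((size + 1) * 2L > keys.length) resize();
        int slot = findSlot(docId);
        if (keys[slot] == EMPTY) {
            keys[slot] = docId;
            size++;
        }
        values[slot] = related;
    }

    /** Returns the stored array, or null when the document has no entry. */
    int[] get(int docId) {
        if (docId < 0) return null;
        return values[findSlot(docId)];
    }

    int size() {
        return size;
    }

    /** Linear probing: returns the slot holding docId, or the first empty slot on its probe path. */
    private int findSlot(int docId) {
        int mask = keys.length - 1;
        int h = docId * 0x9E3779B9; // multiplicative mixing; int overflow is intentional
        int slot = (h ^ (h >>> 16)) & mask;
        while (keys[slot] != EMPTY && keys[slot] != docId) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    /** Doubles the table and reinserts every existing entry. */
    private void resize() {
        int[] oldKeys = keys;
        int[][] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new int[keys.length][];
        java.util.Arrays.fill(keys, EMPTY);
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                int slot = findSlot(oldKeys[i]);
                keys[slot] = oldKeys[i];
                values[slot] = oldValues[i];
            }
        }
    }
}
```

Only documents that actually have relationships get an entry, and keys stay unboxed. How much this saves relative to a dense array depends on how sparse the ids are and on the map's load factor, so it is worth measuring on real data.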

JCRPaquin added 10 commits January 15, 2024 20:36

Add new dependencies

e984b65

Remove dead code and code comments

fed04ec

Resolve flaky test

01626a1

Remove dead code

4101a09

Replace custom int arraylist with library version

69e822b

Test inner relationship map

01cc384

Sort data before comparison

10d6fae

Sort data before comparison

342ac51

Use a map instead of an array to store citations

1f10655

Remove some more boxing where easy

ad57af3

JCRPaquin marked this pull request as draft August 12, 2024 09:39
