OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616

nfsantos · 2024-07-31T14:18:31Z

Cache in a HashMap in the heap interned to avoid calling String.intern() for every String that should be interned. Calls to String.intern() are much more expensive than calls to HashMap.get().

In a test reading 10 million paths from a file and calling SortKey.genSortKeyPathElements(path) for each of them, the times are the following:

calling String.intern for every String: 25 seconds
caching interned Strings in a map (this PR): 19.5 seconds
no interning: 19 seconds.

The HashMap cache removed almost all of the overhead of interning the strings.

…n() fo every String that should be interned. Calls to String.intern() are much more expensive than calls to HashMap.get().

thomasmueller

(Same as before), this could cause out-of-memory.

What about, to avoid possible OOME, and I think it would be easier to understand:

Instead of a hash map, use an array (fixed array of eg. 16 K entries)
Calculate the index from the hash (using bitwise and)
Store the entry in the array if it doesn't match
Use the entry in the array if it does

That would be even faster I'm pretty sure.

The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 MB).

And we could store all the (small) path elements, and no longer have to worry about "startsWith" etc.

nfsantos · 2024-08-02T06:54:15Z

(Same as before), this could cause out-of-memory.

What about, to avoid possible OOME, and I think it would be easier to understand:

Instead of a hash map, use an array (fixed array of eg. 16 K entries)

Calculate the index from the hash (using bitwise and)

Store the entry in the array if it doesn't match

Use the entry in the array if it does

That would be even faster I'm pretty sure.

The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 MB).

And we could store all the (small) path elements, and no longer have to worry about "startsWith" etc.

I believe this strategy is already implemented in the oak-store-document module, here:
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/StringCache.java

I think in this particular case that strategy would not work as well. The frequency of Strings of the path segments is extremely skewed, with a few words appearing in the majority of paths. These are the words of the top level paths (for instance, /content/dam/<companyname>) or the paths created by JCR which contain segments like jcr:content. To get the most of interning, these Strings must all be cached.

The algorithm you suggest does not guarantee that these very common Strings will all be cached. For instance, in an extreme case where all common Strings hash into the same slot in the array, they would all be competing with each other. And another issue is that if we remove the guard that limits the Strings that can be cached, then the large number of infrequent Strings that appear on paths would displace from the table the very common Strings.

I have bounded the HashMap to a max of 1024 entries, to avoid OOME. It could still happen if those 1024 Strings are huge, but if that was the case, Oak would run out of memory much before, because there are many other parts of the code that assume it's possible to store in memory more than 1024 full paths.

In the tests that I ran against two medium-sized Flat File Stores taken from real OAK repositories, when scanning 10 million paths and storing in memory an array with all SortKeys, at the end there were around 50 Strings in the HashMap. And interning these Strings reduced the size of the heap at the end of the scanning by around 800MB. Interestingly, sorting the array of 10 million SortKeys was faster with the interned Strings, it took around 3s instead of 5s. I'm assuming that it is because String.equals() first checks if the objects compared are the same (==) before comparing the value.

thomasmueller · 2024-08-02T07:32:51Z

It could still happen if those 1024 Strings are huge

It would make sense to protect against that case, by limiting the string length we admit. (It is very unlikely that paths elements are longer, but then we have seen some weird edge cases... and the additional check is fast.)

But other than that, I'm OK with the current patch.

nfsantos · 2024-08-02T08:13:52Z

It could still happen if those 1024 Strings are huge

It would make sense to protect against that case, by limiting the string length we admit. (It is very unlikely that paths elements are longer, but then we have seen some weird edge cases... and the additional check is fast.)

But other than that, I'm OK with the current patch.

I added a check on the size of the strings, just to be extra safe.
It's unlikely that there will ever be a very large path segment considered for interning because the condition is only considering the segments at the top 3 levels. These levels are very visible to application developers, so it's less likely that there would be a unreasonably large segment. But the extra check is indeed cheap, so no harm done.

nit0906 · 2024-08-05T07:47:01Z

...c/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/SortKey.java

@@ -60,8 +59,9 @@ public static String[] genSortKeyPathElements(String path) {
 // Interning these strings should provide a big reduction in memory usage.
 // It is not worth to intern all levels because at lower levels the names are more likely to be less diverse,
 // often even unique, so interning them would fill up the interned string hashtable with useless entries.
- if (i < 3 || part.length() == 1 || COMMON_PATH_WORDS.contains(part)) {
- pathElements[i] = part.intern();
+ if ((i < 3 || part.length() == 1 || part.startsWith("dam:") || part.startsWith("jcr:") || COMMON_PATH_WORDS.contains(part)) &&


dam is not something known to oak - so I don't think we should base our logic on that. We should possibly pass a list of strings that is configurable and needs to be checked here in this condition.

I removed the reference to dam:. For the time being, I will not make the list of strings to be checked configurable, as this would be a bigger change. The main goal of PR is to avoid an expensive call to String.intern() every time we parse line, so we can leave further optimizations for future work, if that is deemed necessary.

nit0906

change requested.

Cache in a HashMap in the heap interned to avoid calling String.inter…

f2ddf3b

…n() fo every String that should be interned. Calls to String.intern() are much more expensive than calls to HashMap.get().

thomasmueller reviewed Jul 31, 2024

View reviewed changes

Add limit on the maximum size of String that can be interned.

6f682ac

nfsantos requested a review from thomasmueller August 2, 2024 11:55

nit0906 reviewed Aug 5, 2024

View reviewed changes

nit0906 requested changes Aug 5, 2024

View reviewed changes

Remove reference to node name prefix that is not part of Oak.

3c3cd8d

nfsantos requested a review from nit0906 August 5, 2024 11:52

thomasmueller approved these changes Aug 5, 2024

View reviewed changes

nit0906 approved these changes Aug 6, 2024

View reviewed changes

nfsantos merged commit 7f29f92 into apache:trunk Aug 6, 2024
1 of 2 checks passed

nfsantos deleted the OAK-10996 branch August 6, 2024 06:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616

OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616

nfsantos commented Jul 31, 2024 •

edited

Loading

thomasmueller left a comment

nfsantos commented Aug 2, 2024

thomasmueller commented Aug 2, 2024

nfsantos commented Aug 2, 2024

nit0906 Aug 5, 2024

nfsantos Aug 5, 2024

nit0906 left a comment

OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616

OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616

Conversation

nfsantos commented Jul 31, 2024 • edited Loading

thomasmueller left a comment

Choose a reason for hiding this comment

nfsantos commented Aug 2, 2024

thomasmueller commented Aug 2, 2024

nfsantos commented Aug 2, 2024

nit0906 Aug 5, 2024

Choose a reason for hiding this comment

nfsantos Aug 5, 2024

Choose a reason for hiding this comment

nit0906 left a comment

Choose a reason for hiding this comment

nfsantos commented Jul 31, 2024 •

edited

Loading