Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616

Merged
merged 3 commits into from
Aug 6, 2024

Conversation

nfsantos
Copy link
Contributor

@nfsantos nfsantos commented Jul 31, 2024

Cache in a HashMap in the heap interned to avoid calling String.intern() for every String that should be interned. Calls to String.intern() are much more expensive than calls to HashMap.get().

In a test reading 10 million paths from a file and calling SortKey.genSortKeyPathElements(path) for each of them, the times are the following:

  • calling String.intern for every String: 25 seconds
  • caching interned Strings in a map (this PR): 19.5 seconds
  • no interning: 19 seconds.

The HashMap cache removed almost all of the overhead of interning the strings.

…n() fo every String that should be interned. Calls to String.intern() are much more expensive than calls to HashMap.get().
Copy link
Member

@thomasmueller thomasmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same as before), this could cause out-of-memory.

What about, to avoid possible OOME, and I think it would be easier to understand:

  • Instead of a hash map, use an array (fixed array of eg. 16 K entries)
  • Calculate the index from the hash (using bitwise and)
  • Store the entry in the array if it doesn't match
  • Use the entry in the array if it does

That would be even faster I'm pretty sure.

The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 MB).

And we could store all the (small) path elements, and no longer have to worry about "startsWith" etc.

@nfsantos
Copy link
Contributor Author

nfsantos commented Aug 2, 2024

(Same as before), this could cause out-of-memory.

What about, to avoid possible OOME, and I think it would be easier to understand:

  • Instead of a hash map, use an array (fixed array of eg. 16 K entries)
  • Calculate the index from the hash (using bitwise and)
  • Store the entry in the array if it doesn't match
  • Use the entry in the array if it does

That would be even faster I'm pretty sure.

The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 MB).

And we could store all the (small) path elements, and no longer have to worry about "startsWith" etc.

I believe this strategy is already implemented in the oak-store-document module, here:
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/StringCache.java

I think in this particular case that strategy would not work as well. The frequency of Strings of the path segments is extremely skewed, with a few words appearing in the majority of paths. These are the words of the top level paths (for instance, /content/dam/<companyname>) or the paths created by JCR which contain segments like jcr:content. To get the most of interning, these Strings must all be cached.

The algorithm you suggest does not guarantee that these very common Strings will all be cached. For instance, in an extreme case where all common Strings hash into the same slot in the array, they would all be competing with each other. And another issue is that if we remove the guard that limits the Strings that can be cached, then the large number of infrequent Strings that appear on paths would displace from the table the very common Strings.

I have bounded the HashMap to a max of 1024 entries, to avoid OOME. It could still happen if those 1024 Strings are huge, but if that was the case, Oak would run out of memory much before, because there are many other parts of the code that assume it's possible to store in memory more than 1024 full paths.

In the tests that I ran against two medium-sized Flat File Stores taken from real OAK repositories, when scanning 10 million paths and storing in memory an array with all SortKeys, at the end there were around 50 Strings in the HashMap. And interning these Strings reduced the size of the heap at the end of the scanning by around 800MB. Interestingly, sorting the array of 10 million SortKeys was faster with the interned Strings, it took around 3s instead of 5s. I'm assuming that it is because String.equals() first checks if the objects compared are the same (==) before comparing the value.

@thomasmueller
Copy link
Member

It could still happen if those 1024 Strings are huge

It would make sense to protect against that case, by limiting the string length we admit. (It is very unlikely that paths elements are longer, but then we have seen some weird edge cases... and the additional check is fast.)

But other than that, I'm OK with the current patch.

@nfsantos
Copy link
Contributor Author

nfsantos commented Aug 2, 2024

It could still happen if those 1024 Strings are huge

It would make sense to protect against that case, by limiting the string length we admit. (It is very unlikely that paths elements are longer, but then we have seen some weird edge cases... and the additional check is fast.)

But other than that, I'm OK with the current patch.

I added a check on the size of the strings, just to be extra safe.
It's unlikely that there will ever be a very large path segment considered for interning because the condition is only considering the segments at the top 3 levels. These levels are very visible to application developers, so it's less likely that there would be a unreasonably large segment. But the extra check is indeed cheap, so no harm done.

@@ -60,8 +59,9 @@ public static String[] genSortKeyPathElements(String path) {
// Interning these strings should provide a big reduction in memory usage.
// It is not worth to intern all levels because at lower levels the names are more likely to be less diverse,
// often even unique, so interning them would fill up the interned string hashtable with useless entries.
if (i < 3 || part.length() == 1 || COMMON_PATH_WORDS.contains(part)) {
pathElements[i] = part.intern();
if ((i < 3 || part.length() == 1 || part.startsWith("dam:") || part.startsWith("jcr:") || COMMON_PATH_WORDS.contains(part)) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dam is not something known to oak - so I don't think we should base our logic on that. We should possibly pass a list of strings that is configurable and needs to be checked here in this condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the reference to dam:. For the time being, I will not make the list of strings to be checked configurable, as this would be a bigger change. The main goal of PR is to avoid an expensive call to String.intern() every time we parse line, so we can leave further optimizations for future work, if that is deemed necessary.

Copy link
Contributor

@nit0906 nit0906 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change requested.

@nfsantos nfsantos requested a review from nit0906 August 5, 2024 11:52
@nfsantos nfsantos merged commit 7f29f92 into apache:trunk Aug 6, 2024
1 of 2 checks passed
@nfsantos nfsantos deleted the OAK-10996 branch August 6, 2024 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants