-
Notifications
You must be signed in to change notification settings - Fork 406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OAK-10996 - indexing-job: cache interned strings in a local hashmap to avoid calling String.intern too frequently #1616
Conversation
…n() fo every String that should be interned. Calls to String.intern() are much more expensive than calls to HashMap.get().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Same as before), this could cause out-of-memory.
What about, to avoid possible OOME, and I think it would be easier to understand:
- Instead of a hash map, use an array (fixed array of eg. 16 K entries)
- Calculate the index from the hash (using bitwise and)
- Store the entry in the array if it doesn't match
- Use the entry in the array if it does
That would be even faster I'm pretty sure.
The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 MB).
And we could store all the (small) path elements, and no longer have to worry about "startsWith" etc.
I believe this strategy is already implemented in the I think in this particular case that strategy would not work as well. The frequency of Strings of the path segments is extremely skewed, with a few words appearing in the majority of paths. These are the words of the top level paths (for instance, The algorithm you suggest does not guarantee that these very common Strings will all be cached. For instance, in an extreme case where all common Strings hash into the same slot in the array, they would all be competing with each other. And another issue is that if we remove the guard that limits the Strings that can be cached, then the large number of infrequent Strings that appear on paths would displace from the table the very common Strings. I have bounded the HashMap to a max of 1024 entries, to avoid OOME. It could still happen if those 1024 Strings are huge, but if that was the case, Oak would run out of memory much before, because there are many other parts of the code that assume it's possible to store in memory more than 1024 full paths. In the tests that I ran against two medium-sized Flat File Stores taken from real OAK repositories, when scanning 10 million paths and storing in memory an array with all SortKeys, at the end there were around 50 Strings in the HashMap. And interning these Strings reduced the size of the heap at the end of the scanning by around 800MB. Interestingly, sorting the array of 10 million SortKeys was faster with the interned Strings, it took around 3s instead of 5s. I'm assuming that it is because |
It would make sense to protect against that case, by limiting the string length we admit. (It is very unlikely that paths elements are longer, but then we have seen some weird edge cases... and the additional check is fast.) But other than that, I'm OK with the current patch. |
I added a check on the size of the strings, just to be extra safe. |
@@ -60,8 +59,9 @@ public static String[] genSortKeyPathElements(String path) { | |||
// Interning these strings should provide a big reduction in memory usage. | |||
// It is not worth to intern all levels because at lower levels the names are more likely to be less diverse, | |||
// often even unique, so interning them would fill up the interned string hashtable with useless entries. | |||
if (i < 3 || part.length() == 1 || COMMON_PATH_WORDS.contains(part)) { | |||
pathElements[i] = part.intern(); | |||
if ((i < 3 || part.length() == 1 || part.startsWith("dam:") || part.startsWith("jcr:") || COMMON_PATH_WORDS.contains(part)) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dam is not something known to oak - so I don't think we should base our logic on that. We should possibly pass a list of strings that is configurable and needs to be checked here in this condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the reference to dam:
. For the time being, I will not make the list of strings to be checked configurable, as this would be a bigger change. The main goal of PR is to avoid an expensive call to String.intern()
every time we parse line, so we can leave further optimizations for future work, if that is deemed necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change requested.
Cache in a HashMap in the heap interned to avoid calling
String.intern()
for every String that should be interned. Calls toString.intern()
are much more expensive than calls toHashMap.get()
.In a test reading 10 million paths from a file and calling
SortKey.genSortKeyPathElements(path)
for each of them, the times are the following:The HashMap cache removed almost all of the overhead of interning the strings.