Tolgee runs out of heap space when importing a larger number of translations #2019
We (@GeertZondervan and I) did some additional troubleshooting and thought it might be helpful to share our findings here.

The first thing I noticed is that the queue implementation always retrieves all jobs, instead of a limited batch as a simple PostgreSQL-backed queue would.

The main reason a high volume of jobs crashed the system (by running out of heap space) is that Hibernate retrieved far too many fields (i.e. not just a job ID and job status) from the database and then tried to stuff everything into a result set. This is the point where the application runs out of heap space.

We thought we could easily fix this by having Hibernate fetch only the fields it needed. While this did fix the heap space issue, it brought another problem to light: the application now locks up after boot. I think this is because the queue execution is manually triggered on application startup here:

tolgee-platform/backend/data/src/main/kotlin/io/tolgee/batch/BatchJobActionService.kt Lines 56 to 61 in fa9d124
This blocks the Spring Boot thread that fires events (I don't think this thread is suitable for such long-running work). This also has another side effect: the job scheduler doesn't know about it, meaning that in 60 seconds the …
Hey! Thanks a lot for the investigation! So if I understand it correctly, the issue is in the populateQueue() method, right? What if we didn't return an entity, but some kind of view containing only the fields we need? That might save some memory...
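For illustration, a DTO projection along those lines might look like the following sketch. The entity, field, and query names here are hypothetical stand-ins, not Tolgee's actual ones. With a JPQL constructor expression, Hibernate hydrates the lightweight view directly from the selected columns and never materializes the wide entity (including its large JSONB parameters):

```java
// Hypothetical lightweight view of a chunk execution: just the fields the
// queue logic needs, instead of the full entity graph.
public class ExecutionView {
    public final long id;
    public final long batchJobId;
    public final String status;

    public ExecutionView(long id, long batchJobId, String status) {
        this.id = id;
        this.batchJobId = batchJobId;
        this.status = status;
    }

    // Illustrative JPQL: the "select new" constructor expression makes
    // Hibernate build the DTO from three columns, so the wide entity row
    // (including the JSONB parameters) is never loaded into the session.
    public static final String PENDING_QUERY =
        "select new ExecutionView(e.id, e.batchJob.id, e.status) "
      + "from BatchJobChunkExecution e where e.status = 'PENDING'";

    public static void main(String[] args) {
        System.out.println(PENDING_QUERY);
    }
}
```

Because the DTO is not a managed entity, it also never enters the persistence context, which keeps per-row memory overhead low.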
Yep!
Yes.
Yes, that would probably prevent Tolgee from running out of heap space. The issue is that Hibernate currently joins in the table with the JSONB parameter field for every entry in the …

Combined with not blocking the Spring Boot event thread by using …
Wouldn't it be better to revise the task population so that the Tolgee server would both avoid retrieving unnecessary data (easily doable with a …) and …

The server would, for instance, pick N pending jobs, mark them as "taken" (so other instances in a distributed context would not pick them too), and run these N jobs. Whenever the local job queue drops to M jobs, with M < N, the server would fetch more jobs to execute and start running them (M can be non-zero so the server starts fetching jobs before it runs out, ensuring there is never job exhaustion and it keeps crunching through jobs). The "taken" status should automatically expire after a certain time (e.g. if a job could not be executed because the node crashed), so it can be picked up again later.

Unless there's something I'm missing, this should let the jobs be nicely distributed without requiring explicit node communication and/or awareness of each other's existence (except when the number of jobs is < N, which should be rare and can be worked around by having the server that enqueues a job immediately mark it as taken if its local queue is not full, relying on load distribution to act as a distribution layer). This would also make the memory footprint of the job queue mostly nonexistent, even for a virtually infinite queue.
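The claim-N/refill-at-M control flow proposed above can be sketched with a small in-memory model. All names here are hypothetical; a real implementation would claim rows in the database and give the "taken" mark an expiry:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// In-memory sketch of the proposed scheme: claim up to N pending jobs at a
// time, and refill whenever the local queue drops to M (< N) remaining jobs.
// A plain list stands in for the jobs table; this only models the control flow.
public class ClaimingQueue {
    private final List<Long> pendingJobs;       // stands in for the jobs table
    private final Deque<Long> localQueue = new ArrayDeque<>();
    private final int claimSize;                // N: jobs claimed per fetch
    private final int refillThreshold;          // M: refill when queue <= M

    public ClaimingQueue(List<Long> pendingJobs, int n, int m) {
        this.pendingJobs = new ArrayList<>(pendingJobs);
        this.claimSize = n;
        this.refillThreshold = m;
    }

    // "Take" up to N jobs; in SQL this would mark the rows as taken
    // so other nodes skip them.
    private void refill() {
        int take = Math.min(claimSize, pendingJobs.size());
        for (int i = 0; i < take; i++) {
            localQueue.add(pendingJobs.remove(0));
        }
    }

    // Pop the next job, refilling whenever the local queue runs low.
    public Long next() {
        if (localQueue.size() <= refillThreshold) {
            refill();
        }
        return localQueue.poll(); // null when everything is processed
    }
}
```

The key property is that memory usage is bounded by N regardless of how many jobs are pending in the table.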
Sounds like a solid idea. It's what a lot of (Postgre)SQL job queue implementations do, and what the …

The article "How to implement a database job queue using SKIP LOCKED" offers an example of how to implement such a thing with Java and Hibernate using …
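A sketch of that pattern: the claim query locks a small batch of pending rows, and `FOR UPDATE SKIP LOCKED` makes competing workers skip rows already locked by someone else instead of blocking on them. The table and column names below are illustrative, and the JDBC helper is shown only for shape; it is not Tolgee's actual code and would have to run inside a transaction against PostgreSQL:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the SKIP LOCKED claim pattern. Each worker locks a batch of
// pending rows; rows locked by another worker are silently skipped, so
// workers never contend for the same jobs and never block each other.
public class SkipLockedClaim {
    public static final String CLAIM_SQL =
        "select id from batch_job_chunk_execution "  // illustrative table name
      + "where status = 'PENDING' "
      + "order by id "
      + "limit ? "
      + "for update skip locked";

    // Shape of the JDBC call; must be executed inside an open transaction,
    // which holds the row locks until commit.
    public static List<Long> claim(Connection conn, int batchSize) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(CLAIM_SQL)) {
            ps.setInt(1, batchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(CLAIM_SQL);
    }
}
```

Note that the locks only live as long as the claiming transaction, so a real queue usually combines this with a status update in the same transaction.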
When implementing the feature, I was inspired by the skip-locked approach, but I also combined it with a fast state cache in Redis and Redis pub/sub. So it's not very probable that instances would try to work on the same chunks, since when something is removed from the local queue, the removal is propagated to other instances via pub/sub. That said, I think fetching only the required data and removing the manual trigger on app start should be enough. Right?
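The pub/sub coordination described here can be modeled in memory (Redis is replaced by a plain peer list; names are illustrative, not Tolgee's implementation): when one instance takes a chunk, it "publishes" the removal so its peers drop the chunk from their local queues:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// In-memory sketch of pub/sub-synced local queues: taking a chunk on one
// instance broadcasts the removal to all peers, so no peer tries to work
// on the same chunk. A listener list stands in for the Redis channel.
public class SyncedQueue {
    private final Set<Long> localQueue = new LinkedHashSet<>();
    private final List<SyncedQueue> peers = new ArrayList<>();

    public void connect(SyncedQueue peer) {
        peers.add(peer);
        peer.peers.add(this);
    }

    public void enqueue(long chunkId) {
        localQueue.add(chunkId);
    }

    // Take the next chunk locally and "publish" the removal to all peers,
    // as the Redis pub/sub message would.
    public Long take() {
        if (localQueue.isEmpty()) {
            return null;
        }
        Long id = localQueue.iterator().next();
        localQueue.remove(id);
        for (SyncedQueue peer : peers) {
            peer.localQueue.remove(id);
        }
        return id;
    }

    public int size() {
        return localQueue.size();
    }
}
```

In the real system the propagation is asynchronous, so duplicate pickups are made improbable rather than impossible, which matches the wording above.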
Yes, I think so!
Are there any volunteers who would like to fix it? 😅
We already have a prototype fix, so @GeertZondervan will create a PR!
@JanCizmar We created a prototype which prevents loading the entire object and ensures the job is only started from the scheduler (not manually). While this fixes the initial problems, it uncovers more. For example, while the initial job retrieval (from the DB into the local cache) is now fast (it takes milliseconds to complete), a second retrieval takes about 6 minutes to complete. This is caused by Hibernate somehow, not by the DB. At this point the entire application freezes (the app works normally before this second retrieval). So it seems something else is also kicking off this code, and then it suddenly becomes long-running work blocking some main thread (some Kotlin coroutine executor thread, perhaps?).
Could you explain what the benefit of the local cache is? It complicates the queue implementation a lot. The way I see it, no query is actually prevented, because when running multiple Tolgee nodes each node still needs to check that no other node is executing the task, meaning it needs to run a query at the start of the task anyway? So if that's the case, then why not simply …

What makes the issue(s) hard for us to troubleshoot is that there is almost no documentation (neither of the conceptual design nor of its implementation). Without documentation or hands-on knowledge of the queue implementation, this rabbit hole seems a bit too deep for us.
Hey! Thanks for the feedback. I'll try to provide some docs. To fix this fast, can you provide the prototype with some example data, so I can reproduce it quickly and possibly fix it? The benefit of the implementation is saving database resources. With pure skip locked it would either be unbalanced (a single node would handle a lot of chunks while the other nodes sleep) or it would execute too many queries against the database. Syncing the queues using Redis pub/sub and using Redis as storage (in a multiple-node setup) seems like the optimal compromise. At the same time, I understand it's hard to read and complicated.
Hello @JanCizmar. The fix I've been working on can be found here. We have created a project with multiple languages and the following language settings. Note the Auto Translate Imported Items option, which causes the creation of the jobs. Now, when we import one of our XLIFF files with about 5000 trans-units, it creates about 5000 rows in the …
OK! Thanks a lot! I'll try to create a sample large XLIFF and will provide a fix soon!
Created this PR including your changes. I've generated a large XLIFF and now it works correctly for me; can you please check? I'll try to go a bit further and optimize the import & batch operations a bit more.
Hey! I've started to optimize the import process even more, which apparently led to a lot of issues. However, at this commit it's stable. So if you would like to test it with your data, use this one: |
So I enjoyed a pretty nice weekend Hibernate adventure. I went from trying to optimize Hibernate operations by keeping the persistence context small, to moving to StatelessSession and ignoring the context entirely, but what I finally found fastest was executing the queries directly via the SQL connection.
I will continue by providing some reliable loading feedback in the UI and optimizing the UI so it's not so stressful for users to import data. |
Hello!
For some reason the initial scheduled run does not cause any problems (the queue is filled quickly), but the next calls to …
Sounds great! Looking forward to it. |
Does it also happen with my XLIFF? Can you also share your language settings? And your configuration? EDIT: Sorry, lots of questions, but I really need to know 😅
I don't think there is much going on in our settings. This is our …
Notice that we don't use Redis. This is how we start Tolgee:
Both your example XLIFF and our larger XLIFF still trigger the issue: the web interface becomes unresponsive, and Tolgee takes about 7 minutes to load the data into Hibernate. We tested with Tolgee 6005b8c.
Thanks for the information. Anyway, I cannot reproduce it locally, so I really need some minimal steps to reproduce it on a fresh Tolgee run.
Alright let me get my ducks in a row first, I'll get back to you! |
@JanCizmar Quick question: did you test with real machine translation and multiple languages (9+) in a project? Or are you testing with the mock machine translation?
I am testing it with mock. Base and 2 languages. |
Alright, here we go. Bear with me, because there are quite a few steps ;-) This assumes you already have a running PostgreSQL instance; this test was done with PostgreSQL 13. Create a clean test database:
Next, we will build and run Tolgee, with a small modification to the Google machine translation provider: the provider no longer actually calls the Google API, it just sleeps for a second and returns the original string. This makes it a more realistic test than the mock one, since responses over the network aren't instant:
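The described modification can be sketched like this (the provider interface is a hypothetical stand-in, not Tolgee's actual provider API):

```java
// Sketch of the test modification described above: a translation "provider"
// that skips the real Google API call, sleeps for a second to mimic network
// latency, and returns the source string unchanged.
public class FakeSlowTranslator {
    interface TranslationProvider {
        String translate(String text, String targetLanguage);
    }

    public static final TranslationProvider SLEEPY = (text, targetLanguage) -> {
        try {
            Thread.sleep(1000); // simulate a network round-trip
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return text; // the "translation" is just the original string
    };

    public static void main(String[] args) {
        System.out.println(SLEEPY.translate("Hello", "nl"));
    }
}
```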
Tolgee now runs. Navigate to … Create a project called "Sandbox" and add the following languages:
Remove … Go to "Languages". Open the settings for machine translations and ensure the following options are enabled:
Go to "Import" and select the large XLIFF test file (…
Import the file. Wait a bit (a couple of minutes). The application will become unresponsive; for example, logging in no longer works.
Interestingly enough, with the above reproducer instructions, Tolgee 3.41.3 eventually also runs out of heap space (even when running with …
This is what it looks like in Eclipse MAT:

This is the stack trace from Eclipse MAT:
OK, I've got the fix. There were multiple places where we were fetching chunk executions including the fetched batch job. Since the batch job has a large target, fetching it 5000 times makes the response quite big. Thanks for reporting, guys! 🙏 I hope this really, finally fixes it. 🤘
@JanCizmar Thanks for your efforts in fixing this issue and happy holidays! 🎆 |
@JanCizmar Just wanted to report that we do not encounter any problems with the latest release. |
Great news! |
Tolgee runs out of heap space when importing a couple of thousand translation keys. I think this is the same issue as described in #1930.
I think this is caused by Tolgee loading its entire job queue into memory, including entries it is not currently processing (i.e. the ones that are merely queued). I haven't investigated further yet, but for some reason these job queue entries are fairly large in memory (1 MB+). This means that when importing a translation file with, for example, 3000 entries, Tolgee runs out of heap space even with a 2 GB heap. Once the job queue is saturated, Tolgee will never recover without manual intervention, because on every reboot it loads the job queue and runs out of heap space again.
During testing we did not have these issues with the Docker image. I have a hunch this is related to the machine translation (which is not enabled in the Docker image), but I haven't investigated this yet.
After making a heap dump and importing it into the Eclipse Memory Analyzer, one can see that the heap consists almost solely of PostgreSQL tuples:
This is the stacktrace associated with these objects:
This is what the job queue looks like:
This originates from the Tolgee job queue code:
tolgee-platform/backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt
Lines 48 to 68 in fa9d124
A possible fix would be to have Tolgee retrieve batches (of for example 10 or 20 entries) instead of retrieving the entire queue.
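As an in-memory sketch of that suggestion (all names hypothetical), fetching the queue in fixed-size pages keeps only one page of executions resident at a time:

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of the suggested fix: instead of loading the whole queue,
// fetch it in small pages so only one page of executions is in memory at a
// time. A plain list stands in for the database table.
public class PagedQueueLoader {
    public static List<List<Long>> loadInPages(List<Long> table, int pageSize) {
        List<List<Long>> pages = new ArrayList<>();
        for (int offset = 0; offset < table.size(); offset += pageSize) {
            int end = Math.min(offset + pageSize, table.size());
            // In SQL each page would be fetched with a query along the lines
            // of "... order by id limit :pageSize offset :offset", or better,
            // keyset pagination on the last seen id.
            pages.add(new ArrayList<>(table.subList(offset, end)));
        }
        return pages;
    }
}
```

In a real queue, each page would be fetched, processed, and released before the next query, so peak heap usage is bounded by the page size rather than the queue length.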