Investigate memory issues #134
Comments
Do you mind sharing the memory dump when the heap is at its highest? Did you try analyzing the dump using the Eclipse MAT Analyzer? It usually calls out the leak suspects pretty well.
Hey @sudorock, I have actually taken a heap dump at its highest and then opened it using Eclipse MAT Analyzer. It could not detect any leaks, but I had taken a screenshot, so here's that screenshot: Is there something in here which helps figure out the problem? Sorry, I don't think I can share the entire heap dump as there might be some API keys in there we can't share! Sorry about that, but if you do have some ideas on what else can be done, I'd be happy to follow through on it! Super appreciate it!! ^_^ I've also added a screenshot of jmap/visualvm in the original post.
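In case it helps for grabbing future dumps right before things fall over: a minimal sketch of triggering an .hprof dump from inside the running JVM, so it can be called from a REPL attached to the app or from a scheduled check. The namespace and file path below are just placeholders.

```clojure
(ns app.heap-dump
  (:import [java.lang.management ManagementFactory]
           [com.sun.management HotSpotDiagnosticMXBean]))

(defn dump-heap!
  "Write an .hprof heap dump to `path`. Passing `live?` as true keeps only
   reachable objects (like `jmap -dump:live`), which is usually what you want
   when hunting a leak."
  [path live?]
  (let [bean (ManagementFactory/newPlatformMXBeanProxy
               (ManagementFactory/getPlatformMBeanServer)
               "com.sun.management:type=HotSpotDiagnostic"
               HotSpotDiagnosticMXBean)]
    (.dumpHeap ^HotSpotDiagnosticMXBean bean path live?)))

;; e.g. from a REPL attached to the running app:
;; (dump-heap! "/tmp/clojurians-log.hprof" true)
```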
Hey 👋, a coworker of mine sent me a link to this thread. With the chance of being superfluous: the Datomic peer maintains a cache locally (referred to as the ObjectCache). By default its size is bound to 50% of the memory that the VM has, but it can be configured (via the datomic.objectCacheMax system property). What I wonder is whether it's Datomic that is causing the OOM here, or something else, and Datomic is just being a memory hog of 50% of the VM's memory. What you could do is look at the Datomic metric reports using its monitoring features; IIRC it will even output to your normal logs when it has gathered a metric report. There are some values that are recorded but not specified on the mentioned page, one of which may be interesting to you. Anyhow, hope this is helpful ^^.
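For reference, a rough sketch of what wiring this up could look like, assuming the on-prem peer's datomic.objectCacheMax and datomic.metricsCallback system properties (worth double-checking against the docs for the Datomic version in use); the app.metrics namespace and the 1g value are just placeholders.

```clojure
(ns app.metrics)

;; JVM options for the peer process (values here are examples only):
;;   -Ddatomic.objectCacheMax=1g
;;   -Ddatomic.metricsCallback=app.metrics/report!

(defn report!
  "Invoked periodically by the Datomic peer with a map of metric data.
   Logging it alongside the app logs makes it easy to correlate with the
   OOM timeline."
  [metric-map]
  (println "datomic-metrics" (pr-str metric-map)))
```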
Thanks a lot @lennartbuit, this is really helpful! I've been suspecting the Datomic peer too but did not have a concrete step to try and figure out how. The monitoring features look like they will help me debug this further. This problem seems to occur after a few days, so the most likely suspects are (1) the daily batch import, or (2) our index growing too large.

For (2), I tried inspecting our index on the running application and it doesn't seem big enough to cause us issues.
Regarding the daily batch import (1), I came across a relevant note in the Datomic docs about bulk imports, and right now we do a bulk import once a day. This imports a lot of messages at the same time. So I checked the Postgres data on disk, which is around 9 GB; that seems about right for the amount of data we have over the years (4 GB worth of JSON files), but maybe transacting that much in one daily batch is part of the problem (see the sketch below). We do want to move to a more continuous approach using an open socket with Slack and update the logs in real time (I'm testing it locally now and it does show some promise), but that needs more testing and the resolution of this issue first, as we absolutely do not want to miss any events from Slack.
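A minimal sketch of what a more memory-friendly batch import could look like: transacting one bounded batch at a time instead of building the whole day's transaction data up front. The namespace and batch size are placeholders, and this assumes the incoming tx-data is a lazy seq.

```clojure
(ns app.import
  (:require [datomic.api :as d]))

(defn import-batches!
  "Transact tx-data in fixed-size batches, waiting for each transaction to
   complete before realizing the next batch, so only one batch's worth of
   data is held in memory at a time."
  [conn tx-data]
  (doseq [batch (partition-all 1000 tx-data)]
    @(d/transact-async conn batch)))
```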
Also, one thing I noticed while rummaging through the heap was that all the individual Slack messages and their rendered hiccup were present in memory. Slack messages being in RAM due to the Datomic index makes sense, but the rendered hiccup too? Those should have been GC'd to free up space for new stuff instead of causing an OOM. 🤔 I also found some logging events in the heap as strings, not a lot though, so maybe GC just hasn't run yet to clear those.
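Purely speculative, but if the rendered hiccup is being kept alive by some memoization or render cache rather than an accidental reference, bounding that cache would at least cap its share of the heap. A sketch using clojure.core.cache; the namespace, key, and threshold here are all made up.

```clojure
(ns app.render-cache
  (:require [clojure.core.cache.wrapped :as cache]))

(def rendered
  ;; keep at most 512 rendered entries; older ones are evicted instead of
  ;; accumulating until the heap fills up
  (cache/lru-cache-factory {} :threshold 512))

(defn render-cached
  "Look up the rendered hiccup for message-id, computing it with render-fn
   (which receives the id) on a miss."
  [message-id render-fn]
  (cache/lookup-or-miss rendered message-id render-fn))
```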
Our app process for https://clojurians-log.clojureverse.org/ keeps running out of memory.
I've already done some investigation with jmap and visualvm to find some places where memory might be leaking. We need to figure out what exactly is causing this, or whether we simply have too little RAM.
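To complement one-off jmap/visualvm snapshots, it might be worth logging heap usage on a schedule so the growth curve over the days leading up to the OOM is visible. A small sketch (the scheduling itself is left out, and the namespace is a placeholder):

```clojure
(ns app.heap-log
  (:import [java.lang.management ManagementFactory]))

(defn heap-snapshot
  "Return current heap usage in MB; call periodically (e.g. every few minutes)
   and log the result to see whether usage climbs steadily or jumps at import time."
  []
  (let [u  (.getHeapMemoryUsage (ManagementFactory/getMemoryMXBean))
        mb #(quot % (* 1024 1024))]
    {:used-mb      (mb (.getUsed u))
     :committed-mb (mb (.getCommitted u))
     :max-mb       (mb (.getMax u))}))
```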
A last-ditch effort would be to move away from Datomic to PostgreSQL, but since Datomic handles so many production instances well, I think we can figure out the problem and solve it.
Temporary fix
Simply restarting the app server resolves the issue for a couple of days (a week or so) before the OOM error happens again. But this causes some problems with Cloudflare caching, so we don't want to rely on it for long.