Investigate memory issues #134
Comments
Do you mind sharing the memory dump when the heap is at its highest? Did you try analyzing the dump using the Eclipse MAT Analyzer? It usually calls out the leak suspects pretty well.
Hey @sudorock, I have actually taken a heap dump at its highest and then opened it using Eclipse MAT Analyzer. It could not detect any leaks, but I had taken a screenshot, so here's that screenshot: Is there something in here which helps figure out the problem? Sorry, I don't think I can share the entire heap dump as there might be some API keys in there we can't share! Sorry about that, but if you do have some ideas on what else can be done, I'd be happy to follow through on it! Super appreciate it!! ^_^ I've also added a screenshot of jmap/visualvm in the original post.
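In case it helps for grabbing future dumps right before things fall over: a minimal sketch of triggering an .hprof dump from inside the running JVM, so it can be called from a REPL attached to the app or from a scheduled check. The namespace and file path below are just placeholders.

```clojure
(ns app.heap-dump
  (:import [java.lang.management ManagementFactory]
           [com.sun.management HotSpotDiagnosticMXBean]))

(defn dump-heap!
  "Write an .hprof heap dump to `path`. Passing `live?` as true keeps only
   reachable objects (like `jmap -dump:live`), which is usually what you want
   when hunting a leak."
  [path live?]
  (let [bean (ManagementFactory/newPlatformMXBeanProxy
               (ManagementFactory/getPlatformMBeanServer)
               "com.sun.management:type=HotSpotDiagnostic"
               HotSpotDiagnosticMXBean)]
    (.dumpHeap ^HotSpotDiagnosticMXBean bean path live?)))

;; e.g. from a REPL attached to the running app:
;; (dump-heap! "/tmp/clojurians-log.hprof" true)
```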
Hey 👋, a coworker of mine sent me a link to this thread. With the chance of being superfluous: the Datomic peer maintains a cache locally (referred to as the ObjectCache). By default its size is bound to 50% of the memory that the VM has, but it can be configured (via the datomic.objectCacheMax system property). What I wonder is whether it's Datomic that is causing the OOM here, or something else, and Datomic is just being a memory hog of 50% of the VM's memory. What you could do is look at the Datomic metric reports using its monitoring features; IIRC it will even output to your normal logs when it has gathered a metric report. There are some values that are recorded but not specified on the mentioned page, one of which may be interesting to you. Anyhow, hope this is helpful ^^.
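For reference, a rough sketch of what wiring this up could look like, assuming the on-prem peer's datomic.objectCacheMax and datomic.metricsCallback system properties (worth double-checking against the docs for the Datomic version in use); the app.metrics namespace and the 1g value are just placeholders.

```clojure
(ns app.metrics)

;; JVM options for the peer process (values here are examples only):
;;   -Ddatomic.objectCacheMax=1g
;;   -Ddatomic.metricsCallback=app.metrics/report!

(defn report!
  "Invoked periodically by the Datomic peer with a map of metric data.
   Logging it alongside the app logs makes it easy to correlate with the
   OOM timeline."
  [metric-map]
  (println "datomic-metrics" (pr-str metric-map)))
```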
Thanks a lot @lennartbuit, this is really helpful! I've been suspecting the Datomic peer too but did not have a concrete step to try and figure out how. The monitoring features look like they will help me debug this further. This problem seems to occur after a few days, so the most likely suspects are (1) the daily batch import, or (2) our index growing too large.

For (2), I tried inspecting our index on the running application and it doesn't seem big enough to cause us issues.
Regarding the daily batch import (1), I came across a relevant note in the Datomic docs about bulk imports, and right now we do a bulk import once a day. This imports a lot of messages at the same time. So I checked the Postgres data on disk, which is around 9 GB; that seems about right for the amount of data we have over the years (4 GB worth of JSON files), but maybe transacting that much in one daily batch is part of the problem (see the sketch below). We do want to move to a more continuous approach using an open socket with Slack and update the logs in real time (I'm testing it locally now and it does show some promise), but that needs more testing and the resolution of this issue first, as we absolutely do not want to miss any events from Slack.
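A minimal sketch of what a more memory-friendly batch import could look like: transacting one bounded batch at a time instead of building the whole day's transaction data up front. The namespace and batch size are placeholders, and this assumes the incoming tx-data is a lazy seq.

```clojure
(ns app.import
  (:require [datomic.api :as d]))

(defn import-batches!
  "Transact tx-data in fixed-size batches, waiting for each transaction to
   complete before realizing the next batch, so only one batch's worth of
   data is held in memory at a time."
  [conn tx-data]
  (doseq [batch (partition-all 1000 tx-data)]
    @(d/transact-async conn batch)))
```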
Also, one thing I noticed while rummaging through the heap was that all the individual Slack messages and their rendered hiccup were present in memory. Slack messages being in RAM due to the Datomic index makes sense, but the rendered hiccup too? Those should have been GC'd to free up space for new stuff instead of causing an OOM. 🤔 I also found some logging events in the heap as strings, not a lot though, so maybe GC just hasn't run yet to clear those.
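Purely speculative, but if the rendered hiccup is being kept alive by some memoization or render cache rather than an accidental reference, bounding that cache would at least cap its share of the heap. A sketch using clojure.core.cache; the namespace, key, and threshold here are all made up.

```clojure
(ns app.render-cache
  (:require [clojure.core.cache.wrapped :as cache]))

(def rendered
  ;; keep at most 512 rendered entries; older ones are evicted instead of
  ;; accumulating until the heap fills up
  (cache/lru-cache-factory {} :threshold 512))

(defn render-cached
  "Look up the rendered hiccup for message-id, computing it with render-fn
   (which receives the id) on a miss."
  [message-id render-fn]
  (cache/lookup-or-miss rendered message-id render-fn))
```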
Our app process for https://clojurians-log.clojureverse.org/ keeps running out of memory.
I've already done some investigation with jmap and visualvm to find some places where memory might be leaking. We need to figure out what exactly is causing this, or whether we simply have too little RAM.
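To complement one-off jmap/visualvm snapshots, it might be worth logging heap usage on a schedule so the growth curve over the days leading up to the OOM is visible. A small sketch (the scheduling itself is left out, and the namespace is a placeholder):

```clojure
(ns app.heap-log
  (:import [java.lang.management ManagementFactory]))

(defn heap-snapshot
  "Return current heap usage in MB; call periodically (e.g. every few minutes)
   and log the result to see whether usage climbs steadily or jumps at import time."
  []
  (let [u  (.getHeapMemoryUsage (ManagementFactory/getMemoryMXBean))
        mb #(quot % (* 1024 1024))]
    {:used-mb      (mb (.getUsed u))
     :committed-mb (mb (.getCommitted u))
     :max-mb       (mb (.getMax u))}))
```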
A last-ditch effort would be to move away from Datomic to PostgreSQL, but since Datomic handles so many production instances well, I think we can figure out the problem and solve it.
Temporary fix
Simply restarting the app server resolves the issue for a couple of days (a week or so) before the OOM error happens again. But this causes some problems with Cloudflare caching, so we don't want to rely on it for long.