
Cleaning nodestore_node table #1808

Open

aminvakil opened this issue Nov 14, 2022 · 13 comments

@aminvakil (Collaborator) commented Nov 14, 2022

Problem Statement

Over time, the nodestore_node table gets bigger and bigger, and currently there is no procedure to clean it up.

A forum comment by @untitaker explaining what nodestore_node is:

To answer the question about purpose of nodestore:

nodestore stores the raw, unindexed parts of the data of an event. For example all event breadcrumbs and contexts are stored there. Clickhouse only stores the data you can search and aggregate by. It only has indices, so to speak.

nodestore has multiple backends. On sentry.io it is not postgres, but google bigtable. We use the postgres backend for local development and for onpremise, but for large-scale installations it is suboptimal.

We don’t have a solution right now for the storage problem in any case, sorry. One thing you should ensure actually runs is sentry cleanup, but it’s definitely possible that nodestore takes this amount of disk space.

https://forum.sentry.io/t/postgres-nodestore-node-table-124gb/12753/3

Solution Brainstorm

There was an idea suggested on the forum which worked for me, but I lost all event details.
Something like this would work:

DELETE FROM public.nodestore_node WHERE "timestamp" < NOW() - INTERVAL '1 day';
VACUUM FULL public.nodestore_node;

Change the 1 day interval according to your needs.

Maybe we could put this in a cron container that runs every night. We should think about its performance, though: this took a long time to execute on our instance, maybe because it had never been run before, but I'm not sure.
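
If such a cron job is added, one way to keep each run's lock time short would be to delete in bounded batches instead of one huge statement. A minimal sketch, assuming a 90-day retention and that id is the table's primary key (both assumptions, not taken from the actual schema here):

-- Hypothetical nightly job: delete expired rows in small batches and re-run
-- the statement until it affects 0 rows, so each transaction stays short.
DELETE FROM public.nodestore_node
WHERE id IN (
    SELECT id
    FROM public.nodestore_node
    WHERE "timestamp" < NOW() - INTERVAL '90 days'
    LIMIT 10000
);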

@aminvakil (Collaborator, Author)

cc @BYK @chadwhitacre @hubertdeng123

@BYK (Member) commented Nov 14, 2022

There are a few interesting and important things:

  1. We can never add the VACUUM FULL public.nodestore_node; part in any kind of production setup as it will lock the whole table for an indefinite time. It also requires additional space on disk so it has a high potential to put the system in an inoperable and hard-to-recover state.
  2. The sentry cleanup command that we trigger via cron does run nodestore.cleanup():
    https://github.com/getsentry/sentry/blob/1d24b160a74e10759d20fa323f5251bffee3f7b5/src/sentry/runner/commands/cleanup.py#L260-L268
    which is defined at:
    https://github.com/getsentry/sentry/blob/1d24b160a74e10759d20fa323f5251bffee3f7b5/src/sentry/nodestore/django/backend.py#L54-L62

So technically, this should work if you set SENTRY_RETENTION_DAYS to 90 (the default) or less. The only missing piece I see here is an explicit VACUUM call, but autovacuum should handle that already: https://www.postgresql.org/docs/12/routine-vacuuming.html#AUTOVACUUM
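
For reference, a plain (non-FULL) VACUUM does not rewrite the table, needs no extra disk space, and does not block reads or writes, so something like the following could in principle be issued after a cleanup run (a sketch, not something the cleanup command is known to do today):

-- Plain VACUUM: marks dead tuples as reusable in place; unlike VACUUM FULL it
-- does not rewrite the table, double disk usage, or take an exclusive lock.
VACUUM (VERBOSE, ANALYZE) public.nodestore_node;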

Now we need to figure out which part of the puzzle is broken:

  1. Does the nodestore.cleanup() call actually run successfully?
  2. If so, does it actually run the DELETE commands?
  3. If so, does the autovacuum thingy work in the Postgres container, or do we need to enable it? (See the query after this list.)
  4. Even then, should we issue a manual VACUUM command after a cleanup operation?
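
For point 3, a quick check from inside the Postgres container would be something like the following (assuming the default table name; not verified against a live install here):

-- Has (auto)vacuum ever run on nodestore_node, and how many dead rows remain?
SELECT relname, last_vacuum, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'nodestore_node';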

@untitaker (Member)

I have nothing to add to what BYK said. This should work out of the box, assuming you run sentry cleanup.

@chadwhitacre (Member)

I lost all event details.

Did you lose all event details because you ran the command without substituting a real value for SENTRY_RETENTION_DAYS, is that what you mean? Does Postgres silently accept INTERVAL 'SENTRY_RETENTION_DAYS' as equivalent to zero, something like that?
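
For what it's worth, Postgres should reject an unparseable interval literal outright rather than silently treating it as zero; a quick way to confirm (a sketch, not tested against any particular version here):

-- Expected to fail with "invalid input syntax for type interval" instead of
-- evaluating to a zero interval.
SELECT NOW() - INTERVAL 'SENTRY_RETENTION_DAYS';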


@aminvakil (Collaborator, Author)

I lost all event details.

Did you lose all event details because you ran the command without substituting a real value for SENTRY_RETENTION_DAYS, is that what you mean? Does Postgres silently accept INTERVAL 'SENTRY_RETENTION_DAYS' as equivalent to zero, something like that?

Sorry for the late response. No, I ran this exact command: DELETE FROM public.nodestore_node WHERE "timestamp" < NOW() - INTERVAL '1 day';


@mrmachine

We hit this problem as well. nodestore_node was 42GB and the table had never been vacuumed.

We ran the sentry-cleanup service, which did its thing but did not reduce disk space. We ran vacuum analyze, which did not reduce disk space either. We then ran vacuum full nodestore_node, and that did reduce disk usage to 3GB. Postgres also then reported that vacuum had been executed one time:

postgres=# select relname, vacuum_count, n_tup_del, n_live_tup, n_dead_tup 
from pg_stat_sys_tables 
where relname='pg_toast_20247';
    relname     | vacuum_count | n_tup_del | n_live_tup | n_dead_tup 
----------------+--------------+-----------+------------+------------
 pg_toast_20247 |            1 |  62889277 |    1840323 |          0
(1 row)

So it seems that (3) is true. The autovacuum thingy inside the postgres container is not working.

However, this seems to indicate that it is in fact enabled:

postgres=# SHOW autovacuum;
 autovacuum 
------------
 on
(1 row)

Someone hinted on a forum that if Postgres is under load it may not get a chance to autovacuum. I don't think our installation is especially high-load, but if Postgres never gets to run autovacuum even under nominal load, then maybe Sentry self-hosted does need a way to explicitly trigger vacuum on some kind of schedule.
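
A hedged way to dig further would be to look at per-table autovacuum activity and the global thresholds that decide when it triggers (the TOAST relation name below is the one from the query above and will differ per installation):

-- When did autovacuum last touch the table and its TOAST relation?
SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
FROM pg_stat_all_tables
WHERE relname IN ('nodestore_node', 'pg_toast_20247');

-- Global settings controlling when autovacuum kicks in.
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum', 'autovacuum_naptime',
               'autovacuum_vacuum_threshold', 'autovacuum_vacuum_scale_factor');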

While it is true that vacuum full locks the table and uses extra disk space, and temporarily doubling our 42GB table during the vacuum might have caused some issues, it only took a couple of minutes to complete and it significantly dropped the disk usage.

If vacuum were run regularly, the size should also hover at a much lower level (e.g. 3GB instead of 42GB) and should never get large enough to cause significant disk usage issues during the vacuum itself.

We would be happy to accept a periodic two-minute complete Sentry outage to avoid unbounded disk usage, but it seems that Sentry did not suffer a complete outage during the vacuum anyway. We're also willing to accept a partial outage, or some events being dropped or missing data.

We could also schedule vacuum to occur during a specified maintenance window to further reduce the impact on production systems.

I also tried the pg_repack option mentioned in the docs, but the current command did not run at all (it failed to install pg_repack inside the container), and an older version of the command I found in a GitHub issue, matching our version of Postgres, also failed to install inside the container.

So I think a setting to schedule a vacuum full nodestore_node in Postgres is the best way forward. It could be disabled by default, but should be easy to enable via a setting that allows a specific schedule to be set.

@mrmachine

I've been reading more about vacuum and vacuum full. The latter rewrites the whole table and returns disk space to the operating system; the former frees up deleted rows for reuse but does not return disk space to the operating system. So if sentry cleanup could run vacuum immediately after deleting rows from nodestore_node, and the cleanup frequency were high enough (or configurable) for the volume of new writes vs. deletions of old data, then there should be no more unbounded growth, without requiring a full table lock or any downtime. This should be easier to implement and more reliable than trying to configure the autovacuum daemon to do it for us?

https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY
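
If plain vacuum were run right after each cleanup, one way to verify that growth is actually bounded would be to track the table's on-disk footprint over time (a sketch):

-- Total on-disk size of nodestore_node including TOAST data and indexes;
-- with regular cleanup + vacuum this should plateau instead of growing unbounded.
SELECT pg_size_pretty(pg_total_relation_size('public.nodestore_node'));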

@chadwhitacre (Member)

Thanks for the deep dive, @mrmachine.

@aminvakil (Collaborator, Author)

In our environment we have not yet upgraded to 23.4.0, as our disk is almost 80% full and we cannot upgrade our Postgres.

Therefore we have not yet tested whether upgrading to Postgres 14.5 would have an effect on this matter.

@lcd1232 commented Aug 13, 2024

It's possible to free up space without downtime using pg_repack.
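
A quick way to check whether the extension is even available inside the Postgres container before attempting this (pg_repack also needs its separate command-line client):

-- Lists pg_repack if its server-side extension files are installed.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_repack';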

@klboke (Contributor) commented Aug 16, 2024

  • We also encountered this issue, but it was more severe for us: our nodestore_node table expanded to 3.6TB, so we had to focus on resolving this problem. We spun up a new instance from backup data to verify how long a VACUUM FULL nodestore_node would take and found that it would take at least 8 hours. This means that if we performed this operation in production, Sentry would be down for 8 hours, which is unacceptable.

  • However, nodestore_node is really just an abstract key-value (KV) store that uses id as the key to store and query event JSON, with each JSON being approximately 6-70 KB (the data stored in nodestore_node is already compressed). So I naturally thought of using an object storage product like S3, which can solve the nodestore_node storage problem very well: use id as the file path, with each event JSON stored as a file.

  • Now I have managed to use S3 to store the nodestore_node data and have submitted a PR: "feat(nodestore): A file system-based node storage backend, with added support for S3 and GCS storage" (sentry#76250). I hope others facing similar issues can provide some suggestions.

[Image: postgresql size]

[Image: oss files]
