Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(nodestore): A file system-based node storage backend, with added support for S3 and GCS storage #76250

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

klboke
Copy link
Contributor

@klboke klboke commented Aug 15, 2024

A file system-based node storage backend, with added support for S3 and GCS storage.

My starting point is as follows:

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Aug 15, 2024
@PMExtra
Copy link
Contributor

PMExtra commented Aug 16, 2024

Great job! But I worry a little about the performance while reading multiple objects as a batch. (I didn't dig it, just an intuitive guess)

@PMExtra
Copy link
Contributor

PMExtra commented Aug 16, 2024

I'm not sure how frequently the node storage is accessed or how large the stored files are.

I don't think object storage services (such as S3 or GCS) are designed for frequent reads and writes of very small files.

This may lead to performance issues or higher costs.

@klboke
Copy link
Contributor Author

klboke commented Aug 16, 2024

@PMExtra

I'm not sure how frequently the node storage is accessed or how large the stored files are.

Based on our recorded data, each node data is approximately between 15~76KB in size, with the majority being around 15KB

This may lead to performance issues or higher costs.

I investigated the places where nodestore is written to and read from and found that there are only two scenarios that trigger these actions:

    1. Writing to nodestore occurs in the Sentry worker when processing Kafka event messages.
    1. Reading from nodestore happens in sentry-web when someone views detailed information.

Writing is essentially offline data stream processing, so a slower speed is acceptable. The QPS (queries per second) during reading is very low, so it is not a significant issue either. Additionally, OSS storage costs are definitely the lowest, which is why many projects in recent years (like Paimon and OpenObserve) support the separation of storage (such as S3 and other object storage) and computation.

@PMExtra
Copy link
Contributor

PMExtra commented Aug 16, 2024

@klboke Awesome. I am concerned that reading a large number of tiny files from object storage services might incur higher costs than from table store services. However, according to your analysis, the files are not that small, and the access frequency is not that high. Therefore, my concern may be unnecessary.

@PMExtra
Copy link
Contributor

PMExtra commented Aug 16, 2024

@klboke

Another point worth noting is that Sentry stores a large amount of highly repetitive ASCII text, and table store services can achieve high compression rates for this, thereby reducing storage costs.

However, object storage services usually do not support transparent compression. While each piece of data can be compressed separately, the additional computational overhead will be incurred, and the compression rate may be difficult to match compared to table store services.

Anyway, having one more option is always good. I’m just pointing out some potential drawbacks for discussion, not to undermine your work. Thank you for your contribution.

@klboke
Copy link
Contributor Author

klboke commented Aug 16, 2024

@PMExtra

However, object storage services usually do not support transparent compression. While each piece of data can be compressed separately, the additional computational overhead will be incurred, and the compression rate may be difficult to match compared to table store services.

Thank you very much for your reminder and suggestions. If we switch to S3, considering our 3TB nodestore (which actually is more than most people have), cost won't be an issue at all. I would prefer storing plain JSON files directly, as it would be more convenient for observing and debugging some issues.

Comment on lines +26 to +27
if not settings.DEBUG and options_store.get("filestore.backend") == "filesystem":
raise ValueError("Local fileSystem should only be used in development!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would break existing self-hosted users that don't use (or don't have access to) S3 compatible services. This should just log a warning instead of throwing a ValueError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The default NodeStore is sentry.nodestore.django.DjangoNodeStorage rather than this. So it won't break any existing users. If you want to configure this FileSystemNodeStorage in non-debug environment, you must configure FileStore also.

@getsantry
Copy link
Contributor

getsantry bot commented Sep 19, 2024

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added Stale and removed Stale labels Sep 19, 2024
@getsantry
Copy link
Contributor

getsantry bot commented Oct 12, 2024

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added Stale and removed Stale labels Oct 12, 2024
@getsantry
Copy link
Contributor

getsantry bot commented Nov 4, 2024

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added Stale and removed Stale labels Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants