Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(spans): Batch insert span data to spanstore #55980

Closed
wants to merge 4 commits into from

Conversation

phacops
Copy link
Contributor

@phacops phacops commented Sep 11, 2023

We want to store some span data in a cheaper datastore. Given the access patterns, Bigtable was chosen.

This PR will:

  • add support for our Bigtable KV store to batch inserts
  • create a new store we'll call spanstore based on the nodestore implementation
  • store span data in the span consumer

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Sep 11, 2023
@codecov
Copy link

codecov bot commented Sep 11, 2023

Codecov Report

Merging #55980 (ffe53dc) into master (9679974) will decrease coverage by 17.97%.
Report is 4 commits behind head on master.
The diff coverage is 31.25%.

@@             Coverage Diff             @@
##           master   #55980       +/-   ##
===========================================
- Coverage   79.95%   61.98%   -17.97%     
===========================================
  Files        5060     5050       -10     
  Lines      217568   217346      -222     
  Branches    36831    36800       -31     
===========================================
- Hits       173952   134718    -39234     
- Misses      38287    78240    +39953     
+ Partials     5329     4388      -941     
Files Changed Coverage
src/sentry/utils/kvstore/bigtable.py 15.78%
src/sentry/spans/consumers/process/processor.py 27.55%
src/sentry/spans/consumers/process/factory.py 75.00%
src/sentry/conf/server.py 100.00%
src/sentry/spans/store/__init__.py 100.00%

@@ -2051,6 +2051,10 @@ def SOCIAL_AUTH_DEFAULT_USERNAME() -> str:
SENTRY_TAGSTORE = os.environ.get("SENTRY_TAGSTORE", "sentry.tagstore.snuba.SnubaTagStorage")
SENTRY_TAGSTORE_OPTIONS: dict[str, Any] = {}

# Node storage backend used for ArtifactBundle indexing (aka FlatFileIndex aka BundleIndex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Copy/paste error

Comment on lines +154 to +156
bigtable_project_id: str,
bigtable_instance_id: str,
bigtable_table_id: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably not provide these parameters here. Instead we can rely on the SENTRY_SPANSTORE setting and the corresponding SENTRY_SPANSTORE_OPTIONS to populate spanstore specific configurations. That way the solution works for self hosted as well.

Comment on lines +124 to +127
if batcher_options := self.client_options.get("batcher"):
self.__batcher = self.__table.mutations_batcher(**batcher_options)
self._mutate_rows = self.__batcher.mutate_rows
self._mutate = self.__batcher.mutate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does the batcher gets flushed? The default configuration seems to suggest 1 second.

One possible idea to explore is making the batch flush explicit rather than implicit. So there could be a batcher step as the next step of the RunTaskWithMultiprocessing. The batcher would explicitly fill in bigtable's batcher until a threshold. Once the batch finishes, it will perform a flush to bigtable. The next step after batcher would be unfold which would produce each individual message on the snuba-spans topic.

There are a few advantages of the above approach.

  1. Explicit writes instead of implicit time based writes which gives stronger data durability guarantees.
  2. In case of the consumers crashing, there would not be a possibility of losing context/tags data from bigtable and having the indexed spans data in clickhouse. This could currently happen because of asynchronous writes.

@getsantry
Copy link
Contributor

getsantry bot commented Oct 19, 2023

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Oct 19, 2023
@getsantry getsantry bot closed this Oct 27, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Nov 11, 2023
@phacops phacops reopened this Nov 27, 2023
@phacops phacops closed this Nov 27, 2023
@phacops phacops deleted the pierre/spans-batch-insert-to-nodestore branch January 10, 2024 20:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants