================
Snuba Data Model
================

This section explains how data is organized in Snuba and how user facing
data is mapped to the underlying database (Clickhouse in this case).

The Snuba data model is divided horizontally into a **logical model** and
a **physical model**. The logical data model is what is visible to the Snuba
clients through the Snuba query language. Elements in this model may or may
not map 1:1 to tables in the database. The physical model, instead, maps 1:1
to database concepts (like tables and views).

The reasoning behind this division is that it allows Snuba to expose a
stable interface through the logical data model and perform complex mappings
internally to execute a query on different tables (part of the physical
model) to improve performance in a way that is transparent to the client.

The rest of this section outlines the concepts that compose the two models
and how they are connected to each other.

The main concepts, described below, are Datasets, Entities and Storages.

.. image:: /_static/architecture/datamodel.png

Datasets
========

A Dataset is a namespace over Snuba data. It provides its own schema and
is independent from other datasets both in terms of logical model and
physical model.

Examples of datasets are discover, outcomes and sessions. There is no
relationship between them.

A Dataset can be seen as a container for the components that define its
abstract data model and its concrete data model, which are described below.

In terms of the query language, every Snuba query targets one and only one
Dataset, and the Dataset can provide extensions to the query language.

Entities and Entity Types
=========================

The fundamental block of the logical data model Snuba exposes to the client
is the Entity. In the logical model an entity represents an instance of an
abstract concept (like a transaction or an error). In practice an *Entity*
corresponds to a row in a table in the database. The *Entity Type* is the
class of the Entity (like *Errors* or *Transactions*).

The logical data model is composed of a set of *Entity Types* and of their
relationships.

Each *Entity Type* has a schema which is defined by a list of fields with
their associated abstract data types. The schemas of all the *Entity Types*
of a Dataset (there can be several) compose the logical data model that is
visible to the Snuba client and against which Snuba Queries are validated.
No lower level concept is supposed to be exposed.

Entity Types are unequivocally contained in a Dataset. An Entity Type cannot
be present in multiple Datasets.
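As an illustration, an Entity Type's schema can be thought of as a named list
of typed fields, and query validation as a membership check against that list.
The sketch below models this with plain Python dataclasses; the class names,
field names and types are hypothetical and do not reflect Snuba's actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the logical model; names are illustrative only.
@dataclass(frozen=True)
class EntityType:
    name: str
    # field name -> abstract data type, e.g. "UUID", "String", "DateTime"
    schema: Dict[str, str] = field(default_factory=dict)

@dataclass
class Dataset:
    name: str
    entity_types: List[EntityType] = field(default_factory=list)

errors = EntityType("errors", {"event_id": "UUID", "timestamp": "DateTime"})
discover = Dataset("discover", [errors])

# Validating a query against the logical model means, at minimum, checking
# that it only references fields present in the Entity Type's schema.
def validate_fields(entity: EntityType, referenced: List[str]) -> bool:
    return all(f in entity.schema for f in referenced)

print(validate_fields(errors, ["event_id"]))   # True
print(validate_fields(errors, ["message"]))    # False
```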

Relationships between Entity Types
----------------------------------

Entity Types in a Dataset are logically related. There are two types of
relationships we support:

- Entity Set Relationship. This mimics foreign keys. This relationship is
  meant to allow joins between Entity Types. It only supports one-to-one
  and one-to-many relationships at this point in time.
- Inheritance Relationship. This mimics nominal subtyping. A group of Entity
  Types can share a parent Entity Type. Subtypes inherit the schema from the
  parent type. Semantically the parent Entity Type must represent the union
  of all the Entities whose types inherit from it. It also must be possible
  to query the parent Entity Type. This cannot be just a logical relationship.

Entity Type and consistency
---------------------------

The Entity Type is the largest unit where Snuba **can** provide some strong
data consistency guarantees. Specifically it is possible to query an Entity
Type expecting Serializable Consistency (please don't use that. Seriously,
if you think you need that, you probably don't). This does not extend to
any query that spans multiple Entity Types where, at best, we will have
eventual consistency.

This also has an impact on Subscription queries. These can only work on one
Entity Type at a time since, otherwise, they would require consistency between
Entity Types, which we do not support.

.. ATTENTION::
    To be precise, the unit of consistency (depending on the Entity Type)
    can be even smaller and depend on how the data ingestion topics
    are partitioned (by project_id for example); the Entity Type is the
    maximum unit Snuba allows. More details are (ok, will be) provided in
    the Ingestion section of this guide.

Storage
=======

Storages represent and define the physical data model of a Dataset. Each
Storage is materialized in a physical database concept like a table
or a materialized view. As a consequence each Storage has a schema defined
by fields with their types that reflects the physical schema of the DB
table/view the Storage maps to, and it is able to provide all the details
needed to generate DDL statements to build the tables on the database.

Storages map the logical concepts in the logical model discussed
above to the physical concepts of the database, thus each Storage needs to be
related with an Entity Type. Specifically:

- Each Entity Type must be backed by at least one Readable Storage (a Storage
  we can run queries on), but can be backed by multiple Storages (for example
  a pre-aggregated materialized view). Multiple Storages per Entity Type are
  meant to allow query optimizations.
- Each Entity Type must be backed by one and only one Writable
  Storage that is used to ingest data and fill in the database tables.
- Each Storage backs exclusively one Entity Type.
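The three rules above can be sketched as a small validation over a
hypothetical entity-to-storage mapping. The storage names and flags below are
invented for illustration and are not Snuba's actual configuration:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Storage:
    name: str
    readable: bool
    writable: bool

def validate_entity_storages(storages: List[Storage]) -> None:
    # Rule 1: at least one readable storage must back the Entity Type.
    assert any(s.readable for s in storages), "need at least one readable storage"
    # Rule 2: exactly one writable storage is used for ingestion.
    assert sum(s.writable for s in storages) == 1, "need exactly one writable storage"

# Example: a raw table plus a pre-aggregated materialized view (names made up).
errors_storages = [
    Storage("errors_raw", readable=True, writable=True),
    Storage("errors_hourly_mv", readable=True, writable=False),
]
validate_entity_storages(errors_storages)  # passes silently
```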
Examples
========

This section provides some examples of how the Snuba data model can represent
some real world models.

These case studies do not necessarily reflect the current Sentry production
model, nor are they part of the same deployment. They have to be considered
as examples taken in isolation.

Single Entity Dataset
---------------------

This looks like the Outcomes dataset used by Sentry. This does not actually
reflect Outcomes as of April 2020; it is, though, the design Outcomes should
move towards.

.. image:: /_static/architecture/singleentity.png

This Dataset has one Entity Type only, which represents an individual Outcome
ingested by the Dataset. Querying raw Outcomes is painfully slow so we have
two Storages. One is the Raw storage that reflects the data we ingest, the
other is a materialized view that computes hourly aggregations that are much
more efficient to query. The Query Planner picks the storage depending on
whether the query can be executed on the aggregated data or not.
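A planner of this kind could, in a simplified sketch, pick the aggregated
storage only when every column the query references exists in the hourly
view. The column sets below are made up for illustration:

```python
# Hypothetical column sets: the raw table carries every column, the hourly
# materialized view only carries the pre-aggregated ones.
RAW_COLUMNS = {"org_id", "project_id", "outcome", "reason", "timestamp", "event_id"}
HOURLY_COLUMNS = {"org_id", "project_id", "outcome", "reason", "timestamp"}

def pick_storage(referenced_columns: set) -> str:
    # Prefer the cheaper aggregated view whenever it can answer the query.
    if referenced_columns <= HOURLY_COLUMNS:
        return "outcomes_hourly"
    return "outcomes_raw"

print(pick_storage({"org_id", "outcome"}))    # outcomes_hourly
print(pick_storage({"org_id", "event_id"}))   # outcomes_raw
```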

Multi Entity Type Dataset
-------------------------

The canonical example of this Dataset is the Discover dataset.

.. image:: /_static/architecture/multientity.png

This has three Entity Types: Errors, Transactions and Events; the first two
inherit from Events. These form the logical data model, thus querying the
Events Entity Type gives the union of Transactions and Errors, but it only
allows fields common to the two to be present in the query.

The Errors Entity Type is backed by two Storages for performance reasons.
One is the main Errors Storage that is used to ingest data, the other is a
read only view that puts less load on Clickhouse when querying but
offers lower consistency guarantees. Transactions only have one storage
and there is a Merge Table to serve Events (which is essentially a view over
the union of the two tables).
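The "only common fields" rule on the parent Entity Type can be sketched as a
set intersection over the subtype schemas. The field names here are invented
for illustration:

```python
# Invented subtype schemas, for illustration only.
ERRORS_FIELDS = {"event_id", "project_id", "timestamp", "exception_stacks"}
TRANSACTIONS_FIELDS = {"event_id", "project_id", "timestamp", "duration"}

# The Events parent entity only exposes fields shared by every subtype.
EVENTS_FIELDS = ERRORS_FIELDS & TRANSACTIONS_FIELDS

def allowed_on_events(referenced: set) -> bool:
    return referenced <= EVENTS_FIELDS

print(sorted(EVENTS_FIELDS))            # ['event_id', 'project_id', 'timestamp']
print(allowed_on_events({"duration"}))  # False: transaction-only field
```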

Joining Entity Types
--------------------

This is a simple example of a dataset that includes multiple Entity Types
that can be joined together in a query.

.. image:: /_static/architecture/joins.png

GroupedMessage and GroupAssignee can be part of a left join query with Errors.
The rest is similar to what was discussed in the previous examples.

===========================
Snuba Architecture Overview
===========================

Snuba is a time series oriented data store backed by
`Clickhouse <https://clickhouse.tech/>`_, which is a columnar
distributed database well suited for the kind of queries Snuba serves.

Data is fully stored in Clickhouse tables and materialized views;
it is ingested through input streams (only Kafka topics today)
and can be queried either through point in time queries or through
streaming queries (subscriptions).

.. image:: /_static/architecture/overview.png

Storage
=======

Clickhouse was chosen as backing storage because it provides a good balance
of the real time performance Snuba needs, a distributed and replicated
architecture, flexibility in terms of storage engines, and consistency
guarantees.

Snuba data is stored in Clickhouse tables and Clickhouse materialized views.
Multiple Clickhouse `storage engines <https://clickhouse.tech/docs/en/engines/table-engines/>`_
are used depending on the goal of the table.

Snuba data is organized in multiple Datasets which represent independent
partitions of the data model. More details in the :doc:`/architecture/datamodel`
section.

Ingestion
=========

Snuba does not provide an API endpoint to insert rows (except when running
in debug mode). Data is loaded from multiple input streams, processed by
a series of consumers and written to Clickhouse tables.

A consumer consumes one or multiple topics and writes to one or multiple
tables. No table is written to by multiple consumers as of today. This
allows the consistency guarantees discussed below.

Data ingestion is most effective in batches (both for Kafka but especially
for Clickhouse). Our consumers support batching and guarantee that one batch
of events taken from Kafka is passed to Clickhouse at least once. By properly
selecting a Clickhouse table engine that deduplicates rows we can achieve
exactly once semantics if we accept eventual consistency.
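The batching behaviour described above can be sketched as a generator that
groups messages before a single bulk insert. The Kafka and Clickhouse calls
are stubbed out here; the real consumers are considerably more involved (for
example, they also flush partial batches on a timer):

```python
from typing import Iterable, Iterator, List

def batches(messages: Iterable[dict], max_batch_size: int) -> Iterator[List[dict]]:
    """Group messages into batches of at most max_batch_size."""
    batch: List[dict] = []
    for message in messages:
        batch.append(message)
        if len(batch) >= max_batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the trailing partial batch

# Stubbed pipeline: consume -> batch -> bulk write -> commit offsets.
stream = [{"event_id": i} for i in range(7)]
for batch in batches(stream, max_batch_size=3):
    # write_to_clickhouse(batch)  # at-least-once: commit only after the write
    print(len(batch))  # prints 3, then 3, then 1
```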

Query
=====

The simplest query system is point in time. Queries are expressed in
the SnQL language (:doc:`/language/snql`) and are sent as HTTP POST calls.
The query engine processes the query (a process described in
:doc:`/architecture/queryprocessing`) and transforms it into a ClickHouse
query.

Streaming queries (done through the Subscription Engine) allow the client
to receive query results in a push fashion. In this case an HTTP endpoint
allows the client to register a streaming query. The Subscription Consumer
then consumes the topic that is used to fill the relevant Clickhouse table,
in order to be notified of updates, periodically runs the query through the
Query Engine and produces the result on the subscriptions Kafka topic.
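A point in time query is, in essence, a POST body carrying the SnQL text. The
endpoint path, port and payload shape in the sketch below are assumptions for
illustration only; check the Snuba API documentation for the exact contract:

```python
import json

# Assumed endpoint layout and payload shape -- illustrative, not authoritative.
def build_snql_request(dataset: str, snql: str) -> tuple:
    url = f"http://localhost:1218/{dataset}/snql"  # assumed local Snuba port
    body = json.dumps({"query": snql})
    return url, body

url, body = build_snql_request(
    "events",
    "MATCH (events) SELECT count() AS c WHERE project_id = 1",
)
print(url)  # http://localhost:1218/events/snql
```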

Data Consistency
================

Different consistency models coexist in Snuba to provide different guarantees.

By default Snuba is eventually consistent. When running a query, by default,
there is no guarantee of monotonic reads, since Clickhouse is multi-leader,
a query can hit any replica, and there is no guarantee the replicas will
be up to date. Also, by default, there is no guarantee Clickhouse will have
reached a consistent state on its own.

It is possible to achieve strong consistency on specific queries by forcing
Clickhouse to reach consistency before the query is executed (the FINAL
keyword), and by forcing queries to hit the specific replica the consumer
writes to. This essentially uses Clickhouse as if it were a single leader
system and allows Sequential consistency.
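Forcing consistency can be sketched as a rewrite of the table reference in
the generated SQL. The helper below is illustrative, not Snuba's actual query
builder; FINAL itself is a real ClickHouse SELECT modifier that collapses
pending merges before reading, trading query speed for a consistent view:

```python
def clickhouse_from_clause(table: str, consistent: bool) -> str:
    # FINAL makes ClickHouse fully merge the data before returning rows,
    # which is slower but avoids reading not-yet-merged duplicates.
    return f"FROM {table} FINAL" if consistent else f"FROM {table}"

print(clickhouse_from_clause("errors_local", consistent=True))
# FROM errors_local FINAL
```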

================================
Snuba within a Sentry Deployment
================================

This section explains the role Snuba plays within a Sentry deployment,
showing the main data flows. If you are deploying Snuba stand alone, this
won't be useful for you.

Legend:

.. image:: /_static/architecture/deployment_legend.png

Deployments:

Errors and transactions:

.. image:: /_static/architecture/errors_transactions_deployment.png

Sessions:

.. image:: /_static/architecture/sessions_deployment.png

Outcomes:

.. image:: /_static/architecture/outcomes_deployment.png

Errors and Transactions data flow
=================================

The main section at the top of the diagram illustrates the ingestion process
for the ``Events`` and ``Transactions`` Entities. These two entities serve
most issue/error related features in Sentry and the whole Performance
product.

There is only one Kafka topic (``events``) shared between errors and
transactions that feeds this pipeline. This topic contains both error
messages and transaction messages.

The Errors consumer consumes the ``events`` topic and writes messages to the
Clickhouse ``errors`` table. Upon commit it also produces a record on the
``snuba-commit-log`` topic.

Alerts on Errors are generated by the Errors Subscription Consumer. This is a
synchronized consumer that consumes both the main ``events`` topic and the
``snuba-commit-log`` topic so it can proceed in lockstep with the main
consumer.

The synchronized consumer then produces alerts by querying Clickhouse and
publishes the result on the result topic.

An identical but independent pipeline exists for transactions.

The Errors pipeline has an additional step: writing to the ``replacements``
topic. Error mutations (merge/unmerge/reprocessing/etc.) are produced by
Sentry on the ``events`` topic. They are then forwarded to the
``replacements`` topic by the Errors Consumer and executed by the Replacement
Consumer.

The ``events`` topic must be partitioned semantically by Sentry project id to
allow in order processing of the events within a project. This, as of today,
is a requirement for alerts and replacements.
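Semantic partitioning by project id can be sketched with a stable hash, so
that every event of a given project lands on the same partition and is
therefore processed in order. The choice of CRC32 here is illustrative, not
necessarily the partitioner Sentry uses:

```python
import zlib

def partition_for(project_id: int, num_partitions: int) -> int:
    # A stable hash keeps all events of one project on one partition,
    # preserving per-project ordering across producer and consumer restarts.
    key = str(project_id).encode("utf-8")
    return zlib.crc32(key) % num_partitions

assert partition_for(42, 16) == partition_for(42, 16)  # deterministic
print(partition_for(42, 16))
```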

Sessions and Outcomes
=====================

``Sessions`` and ``Outcomes`` work in a very similar but simpler way.
Specifically ``Sessions`` powers Release Health features, while ``Outcomes``
mainly provides data to the Sentry ``stats`` page.

Both pipelines have their own Kafka topic and Kafka consumer, and they write
to their own tables in Clickhouse.

Change Data Capture pipeline
============================

This pipeline is still under construction. It consumes the ``cdc`` topic and
fills two independent tables in Clickhouse.