Commit 965c2e7: 129 changed files with 23,708 additions and 0 deletions.
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 516b136fe673602ffad261b5e36d4c20 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |

===============
Snuba Consumers
===============

Summary
-------

* ``snuba consumer`` is still the easiest way to deploy a consumer for a new
  dataset, as the message processor can be written in Python.

* ``snuba rust-consumer --use-rust-processor`` can be used to serve the needs
  of high-throughput datasets, as it sidesteps all problems with Python
  concurrency. For that, the message processor needs to be written in Rust.

* ``snuba rust-consumer --use-python-processor`` is an experiment that attempts
  to consolidate both consumers' logic, so that we can eventually remove the
  code beneath ``snuba consumer``. It is not deployed anywhere in prod.

* In order to port a dataset fully to Rust, register a new message processor in
  ``rust_snuba/src/processors/mod.rs`` and add it to
  ``tests/consumers/test_message_processors.py``. Later, use
  ``RustCompatProcessor`` to get rid of the Python implementation.

Message processors
------------------

Each storage in Snuba defines a conversion function mapping the layout of a
Kafka message to a list of ClickHouse rows.

Message processors are defined in :py:mod:`snuba.dataset.processors`, and
need to subclass from
:py:class:`snuba.dataset.processors.DatasetMessageProcessor`. Just by
subclassing, their name becomes available for reference in
``snuba/datasets/configuration/*/storages/*.yaml``.
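
For orientation, here is a minimal sketch of what such a processor can look
like. The class name, the row fields, and the exact ``process_message``
signature and return type are illustrative assumptions (the import path is the
one mentioned above), not a copy of a real Snuba processor:

.. code-block:: python

    from typing import Any, Mapping, Optional, Sequence

    from snuba.dataset.processors import DatasetMessageProcessor  # path as documented above

    class MyThingProcessor(DatasetMessageProcessor):
        """Hypothetical processor: turns one Kafka payload into ClickHouse rows."""

        def process_message(
            self, message: Mapping[str, Any], metadata: Any
        ) -> Optional[Sequence[Mapping[str, Any]]]:
            # Map the incoming payload onto the columns of the storage's table.
            row = {
                "project_id": message["project_id"],
                "timestamp": message["timestamp"],
                "value": message.get("value", 0),
            }
            # The consumer batches these rows into larger INSERT statements.
            return [row]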

Python consumers
----------------

``snuba consumer`` runs the Python consumer. It uses the Python message
processor mentioned above to convert Kafka messages into rows, batches those
rows up into larger ``INSERT`` statements, and sends them off to the cluster
defined in ``settings.py``.
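
The batching behaviour can be pictured with a small, self-contained sketch.
This is not Snuba's actual consumer code (which is built on arroyo); it only
illustrates the convert, batch, and insert flow described above, with made-up
names and a fixed batch size:

.. code-block:: python

    from typing import Any, Callable, Dict, Iterable, List

    Row = Dict[str, Any]

    def run_batched_inserts(
        messages: Iterable[Any],
        process: Callable[[Any], List[Row]],   # the message processor
        insert: Callable[[List[Row]], None],   # sends one INSERT to ClickHouse
        batch_size: int = 1000,
    ) -> None:
        batch: List[Row] = []
        for message in messages:
            # Convert one Kafka payload into zero or more ClickHouse rows.
            batch.extend(process(message))
            if len(batch) >= batch_size:
                insert(batch)  # one large INSERT instead of many small ones
                batch = []
        if batch:
            insert(batch)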

Test endpoints
--------------

In Sentry we have a lot of tests that want to insert into ClickHouse. Tests
have certain requirements that our Kafka consumers can't meet and which don't
apply to production:

* They require strong consistency, as they want to run a handful of queries +
  assertions, then insert a few rows, wait for them to be inserted, then run
  some more assertions depending on that new data.

* Because tests wait for every insert synchronously, insertion latency is
  really important, while throughput isn't.

* Basically, people want to write e2e tests involving Snuba similarly to how
  tests involving relational databases are written with the Django ORM.

Every storage can be inserted into and wiped using HTTP as well, using the
endpoints defined in ``snuba.web.views`` prefixed with ``/tests/``. Those
endpoints use the same message processors as the Python consumer, but there's
no batching at all. One HTTP request gets directly translated into a blocking
``INSERT`` statement against ClickHouse.
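
As a rough illustration, a test might talk to these endpoints roughly like
this. The host/port, URL layout, and payload shape are assumptions made for
the example, not a documented contract; check ``snuba.web.views`` for the real
routes:

.. code-block:: python

    import json
    import urllib.request

    SNUBA = "http://127.0.0.1:1218"  # assumed local Snuba test instance

    def post(path: str, payload: object) -> None:
        req = urllib.request.Request(
            SNUBA + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    # Hypothetical routes: insert a row synchronously, then wipe the storage.
    post("/tests/outcomes/insert", [{"org_id": 1, "outcome": 0}])
    post("/tests/outcomes/drop", {})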

Rust consumers
--------------

``snuba rust-consumer`` runs the Rust consumer. It comes in two flavors, "pure
Rust" and "hybrid":

* ``--use-rust-processor`` ("pure Rust") will attempt to find and load a Rust
  version of the message processor. There is a mapping from Python class names
  like ``QuerylogProcessor`` to the relevant function, defined in
  ``rust_snuba/src/processors/mod.rs``. If that function exists, it is used.
  The resulting consumer is sometimes 20x faster than the Python version.

  If a Rust port of the message processor can't be found, the consumer silently
  falls back to the second flavor:

* ``--use-python-processor`` ("hybrid") will use the Python message processor from
  within Rust. For this mode, no dataset-specific logic has to be ported to
  Rust, but at the same time the performance benefits of using a Rust consumer
  are negligible.

Python message processor compat shims
-------------------------------------

Even when a dataset is processed with 100% Rust in prod (i.e. the pure-Rust
consumer is being used), we still have those ``/tests/`` API endpoints, as
there is a need for testing, and so there still needs to be a Python
implementation of the same message processor. For this purpose
``RustCompatProcessor`` can be used as a base class that delegates all logic
back into Rust (see the sketch after the list below). This means that:

* ``snuba consumer`` will connect to Kafka using Python, but process messages
  in Rust (using Python's multiprocessing).
* ``snuba rust-consumer --use-python-processor`` makes no sense to deploy
  anywhere, as it will connect to Kafka using Rust, then perform a roundtrip
  through Python only to call Rust business logic again.
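
A minimal sketch of such a shim is below. The import path and the constructor
argument (the name under which the Rust implementation is registered) are
assumptions about the API; the point is only that the Python class keeps its
name, so YAML configs and tests keep working while the actual logic runs in
Rust:

.. code-block:: python

    from snuba.dataset.processors.rust_compat import RustCompatProcessor  # import path assumed

    class QuerylogProcessor(RustCompatProcessor):
        """Python-side stand-in; processing is delegated to the Rust port."""

        def __init__(self) -> None:
            # Assumed: the base class looks up the Rust processor registered
            # under this name in rust_snuba/src/processors/mod.rs.
            super().__init__("QuerylogProcessor")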

Testing
-------

When porting a message processor to Rust, we validate equivalence to Python by:

1. Registering the Python class in
   ``tests/consumers/test_message_processors.py``, where it will run all
   payloads from ``sentry-kafka-schemas`` against both message processors and
   assert the same rows come back. If there is missing test coverage, it's
   preferred to add more payloads to ``sentry-kafka-schemas`` than to write
   custom tests.

2. Removing the Python message processor and replacing it with
   ``RustCompatProcessor``, in which case the existing Python tests (of both
   Snuba and Sentry) will be run directly against the Rust message processor.
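
The registration in step 1 is typically just adding the class to a
parametrized list. The file layout below is a guess at the shape of that test
module; the names and the import location are assumptions:

.. code-block:: python

    # tests/consumers/test_message_processors.py -- illustrative shape only.
    import pytest

    from snuba.dataset.processors import QuerylogProcessor  # import path assumed

    PROCESSORS = [
        QuerylogProcessor,
        # Add the newly ported processor class here.
    ]

    @pytest.mark.parametrize("processor_cls", PROCESSORS)
    def test_python_and_rust_agree(processor_cls):
        # The real test runs every example payload from sentry-kafka-schemas
        # through both the Python and the Rust implementation and asserts that
        # the resulting rows are identical.
        ...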

Architecture
------------

Python
~~~~~~

In order to get around the GIL, Python consumers use arroyo's multiprocessing
support so that they can use multiple cores. This comes with significant
serialization overhead and an amount of complexity that is out of scope for
this document.

Pure Rust
~~~~~~~~~

Despite the name, this consumer is still launched from the Python CLI. The way
this works is that all Rust code is compiled into a shared library exposing a
``consumer()`` function, and packaged using ``maturin`` into a Python wheel.
The Python CLI for ``snuba rust-consumer``:

1. Parses CLI arguments
2. Resolves config (loads storages, ClickHouse settings)
3. Builds a new config JSON payload containing only information relevant to the
   Rust consumer (name of message processor, name of physical Kafka topic, name
   of ClickHouse table, and connection settings)
4. Calls ``rust_snuba.consumer(config)``, at which point Rust takes over the
   process entirely.
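
A stripped-down sketch of that handoff is shown below. The exact keys of the
config payload and whether ``rust_snuba.consumer`` takes a JSON string are
assumptions; the point is that Python only prepares a blob of configuration
and then hands control to the compiled extension:

.. code-block:: python

    import json

    import rust_snuba  # the maturin-built wheel exposing the Rust consumer

    def run_rust_consumer() -> None:
        # Keys below are illustrative; the real payload is built from the
        # resolved storage configuration.
        config = {
            "message_processor": "QuerylogProcessor",
            "kafka_topic": "snuba-queries",
            "clickhouse_table": "querylog_local",
            "clickhouse_host": "127.0.0.1",
            "clickhouse_port": 9000,
        }
        # From here on, Rust owns the process (threads, signals, shutdown).
        rust_snuba.consumer(json.dumps(config))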

The concurrency model in pure Rust is very simple: the message processors run
on a ``tokio::Runtime``, which means that we're using regular OS threads in
order to use multiple cores. The GIL is irrelevant since no Python code runs.

Hybrid
~~~~~~

The hybrid consumer is mostly the same as pure Rust. The main difference is
that it calls back into Python message processors. How that works is
work-in-progress, but fundamentally it is subject to the same concurrency
problems as the regular pure-Python consumer, and is therefore forced to spawn
subprocesses and perform IPC one way or another.

Since the consumer is launched from the Python CLI, it will find the Python
interpreter already initialized, and does not have to re-import Snuba
(except in subprocesses).

Signal handling is a bit tricky. Since no Python code runs for the majority of
the consumer's lifetime, Python's signal handlers cannot run. This also means
that the Rust consumer has to register its own handler for ``Ctrl-C``, but
doing so also means that Python's own signal handlers are completely ignored.
This is fine for the pure-Rust case, but in the hybrid case we have some Python
code still running. For that Python code, ``KeyboardInterrupt`` does
not work.
================
Snuba Data Model
================

This section explains how data is organized in Snuba and how user-facing
data is mapped to the underlying database (ClickHouse in this case).

The Snuba data model is divided horizontally into a **logical model** and
a **physical model**. The logical data model is what is visible to the Snuba
clients through the Snuba query language. Elements in this model may or may
not map 1:1 to tables in the database. The physical model, instead, maps 1:1
to database concepts (like tables and views).

The reasoning behind this division is that it allows Snuba to expose a
stable interface through the logical data model and perform complex mapping
internally to execute a query on different tables (part of the physical
model) to improve performance in a way that is transparent to the client.

The rest of this section outlines the concepts that compose the two models
and how they are connected to each other.

The main concepts, described below, are datasets, entities, and storages.

.. image:: /_static/architecture/datamodel.png

Datasets
========

A Dataset is a namespace over Snuba data. It provides its own schema and
is independent from other datasets in terms of both the logical model and
the physical model.

Examples of datasets are discover, outcomes, and sessions. There is no
relationship between them.

A Dataset can be seen as a container for the components that define its
abstract data model and its concrete data model, which are described below.

In terms of the query language, every Snuba query targets one and only one
Dataset, and the Dataset can provide extensions to the query language.

Entities and Entity Types
=========================

The fundamental block of the logical data model Snuba exposes to the client
is the Entity. In the logical model an entity represents an instance of an
abstract concept (like a transaction or an error). In practice an *Entity*
corresponds to a row in a table in the database. The *Entity Type* is the
class of the Entity (like Error**s** or Transaction**s**).

The logical data model is composed of a set of *Entity Types* and of their
relationships.

Each *Entity Type* has a schema which is defined by a list of fields with
their associated abstract data types. The schemas of all the *Entity Types*
of a Dataset (there can be several) compose the logical data model that is
visible to the Snuba client and against which Snuba Queries are validated.
No lower-level concept is supposed to be exposed.

Entity Types belong unequivocally to a single Dataset. An Entity Type cannot
be present in multiple Datasets.

Relationships between Entity Types
----------------------------------

Entity Types in a Dataset are logically related. There are two types of
relationships we support:

- Entity Set Relationship. This mimics foreign keys. This relationship is
  meant to allow joins between Entity Types. It only supports one-to-one
  and one-to-many relationships at this point in time.
- Inheritance Relationship. This mimics nominal subtyping. A group of Entity
  Types can share a parent Entity Type. Subtypes inherit the schema from the
  parent type. Semantically the parent Entity Type must represent the union
  of all the Entities whose types inherit from it. It also must be possible
  to query the parent Entity Type. This cannot be just a logical relationship.

Entity Type and consistency
---------------------------

The Entity Type is the largest unit where Snuba **can** provide some strong
data consistency guarantees. Specifically it is possible to query an Entity
Type expecting Serializable Consistency (please don't use that. Seriously,
if you think you need that, you probably don't). This does not extend to
any query that spans multiple Entity Types where, at best, we will have
eventual consistency.

This also has an impact on Subscription queries. These can only work on one
Entity Type at a time since, otherwise, they would require consistency between
Entity Types, which we do not support.

.. ATTENTION::
    To be precise, the unit of consistency (depending on the Entity Type)
    can be even smaller and depend on how the data ingestion topics
    are partitioned (by project_id, for example); the Entity Type is the
    maximum Snuba allows. More details are (ok, will be) provided in
    the Ingestion section of this guide.

Storage
=======

Storages represent and define the physical data model of a Dataset. Each
Storage is materialized in a physical database concept like a table
or a materialized view. As a consequence, each Storage has a schema defined
by fields with their types that reflects the physical schema of the DB
table/view the Storage maps to, and it is able to provide all the details
needed to generate DDL statements to build the tables on the database.

Storages are able to map the logical concepts in the logical model discussed
above to the physical concepts of the database, thus each Storage needs to be
related to an Entity Type. Specifically:

- Each Entity Type must be backed by at least one Readable Storage (a Storage
  we can run queries on), but can be backed by multiple Storages (for example
  a pre-aggregate materialized view). Multiple Storages per Entity Type are
  meant to allow query optimizations.
- Each Entity Type must be backed by one and only one Writable
  Storage that is used to ingest data and fill in the database tables.
- Each Storage backs exactly one Entity Type. These rules are made concrete
  in the sketch below.
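
The following is not Snuba's actual API; it is only a small, self-invented
illustration of the constraints above, showing one Entity Type with one
writable Storage and an additional read-only aggregate Storage (the table
names are made up):

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Storage:
        table: str              # physical ClickHouse table or view
        writable: bool = False

    @dataclass
    class EntityType:
        name: str
        readable_storages: List[Storage] = field(default_factory=list)
        writable_storage: Optional[Storage] = None  # exactly one per Entity Type

    raw = Storage(table="outcomes_raw_local", writable=True)
    hourly = Storage(table="outcomes_hourly_mv")    # pre-aggregated, read-only

    outcomes = EntityType(
        name="outcomes",
        readable_storages=[raw, hourly],  # at least one Readable Storage
        writable_storage=raw,             # the one and only Writable Storage
    )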

Examples
========

This section provides some examples of how the Snuba data model can represent
some real-world models.

These case studies do not necessarily reflect the current Sentry production
model, nor are they part of the same deployment. They should be considered
examples taken in isolation.

Single Entity Dataset
---------------------

This looks like the Outcomes dataset used by Sentry. This actually does not
reflect Outcomes as of April 2020. It is, though, the design Outcomes should
move towards.

.. image:: /_static/architecture/singleentity.png

This Dataset has only one Entity Type, which represents an individual Outcome
ingested by the Dataset. Querying raw Outcomes is painfully slow, so we have
two Storages. One is the Raw storage that reflects the data we ingest; the
other is a materialized view that computes hourly aggregations that are much
more efficient to query. The Query Planner would pick the storage depending
on whether the query can be executed on the aggregated data or not.

Multi Entity Type Dataset
-------------------------

The canonical example of this Dataset is the Discover dataset.

.. image:: /_static/architecture/multientity.png

This has three Entity Types: Errors, Transactions, and Events, where the
first two both inherit from Events. These form the logical data model, thus
querying the Events Entity Type gives the union of Transactions and Errors,
but it only allows the fields common to the two to be present in the query.

The Errors Entity Type is backed by two Storages for performance reasons.
One is the main Errors Storage that is used to ingest data; the other is a
read-only view that puts less load on ClickHouse when querying but
offers lower consistency guarantees. Transactions only have one Storage,
and there is a Merge Table to serve Events (which is essentially a view over
the union of the two tables).

Joining Entity Types
--------------------

This is a simple example of a dataset that includes multiple Entity Types
that can be joined together in a query.

.. image:: /_static/architecture/joins.png

GroupedMessage and GroupAssignee can be part of a left join query with Errors.
The rest is similar to what was discussed in the previous examples.