Commit deploy: c3b10a7

davidtsuk committed Nov 12, 2024
0 parents  commit 965c2e7

Showing 129 changed files with 23,708 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 516b136fe673602ffad261b5e36d4c20
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added .doctrees/architecture/consumer.doctree
Binary file added .doctrees/architecture/datamodel.doctree
Binary file added .doctrees/architecture/overview.doctree
Binary file added .doctrees/architecture/queryprocessing.doctree
Binary file added .doctrees/architecture/slicing.doctree
Binary file added .doctrees/clickhouse/death_queries.doctree
Binary file added .doctrees/clickhouse/schema_design.doctree
Binary file added .doctrees/clickhouse/supported_versions.doctree
Binary file added .doctrees/clickhouse/topology.doctree
Binary file added .doctrees/configuration/dataset.doctree
Binary file added .doctrees/configuration/entity.doctree
Binary file added .doctrees/configuration/intro.doctree
Binary file added .doctrees/configuration/migration_group.doctree
Binary file added .doctrees/configuration/overview.doctree
Binary file added .doctrees/configuration/writable_storage.doctree
Binary file added .doctrees/contributing/environment.doctree
Binary file added .doctrees/environment.pickle
Binary file added .doctrees/getstarted.doctree
Binary file added .doctrees/index.doctree
Binary file added .doctrees/intro.doctree
Binary file added .doctrees/language/mql.doctree
Binary file added .doctrees/language/snql.doctree
Binary file added .doctrees/migrations/modes.doctree
Binary file added .doctrees/profiler.doctree
Binary file added .doctrees/query/overview.doctree
Empty file added .nojekyll
Binary file added _images/clickhouse_nodes.png
Binary file added _images/compositeprocessing.png
Binary file added _images/datamodel.png
Binary file added _images/deployment_legend.png
Binary file added _images/errors_transactions_deployment.png
Binary file added _images/joins.png
Binary file added _images/multientity.png
Binary file added _images/outcomes_deployment.png
Binary file added _images/overview.png
Binary file added _images/queryprocessing.png
Binary file added _images/sessions_deployment.png
Binary file added _images/singleentity.png
Binary file added _images/snubaUI.png
171 changes: 171 additions & 0 deletions _sources/architecture/consumer.rst.txt
@@ -0,0 +1,171 @@
===============
Snuba Consumers
===============

Summary
-------

* ``snuba consumer`` is still the easiest way to deploy a consumer for a new
dataset, as the message processor can be written in Python.

* ``snuba rust-consumer --use-rust-processor`` can be used to serve the needs
  of high-throughput datasets, as it sidesteps all problems with Python
  concurrency. For that, the message processor needs to be written in Rust.

* ``snuba rust-consumer --use-python-processor`` is an experiment that attempts
to consolidate both consumers' logic, so that we can eventually remove the
code beneath ``snuba consumer``. It is not deployed anywhere in prod.

* In order to port a dataset fully to Rust, register a new message processor in
``rust_snuba/src/processors/mod.rs`` and add it to
``tests/consumers/test_message_processors.py``. Later, use
``RustCompatProcessor`` to get rid of the Python implementation.

Message processors
------------------

Each storage in Snuba defines a conversion function mapping the layout of a
Kafka message to a list of ClickHouse rows.

Message processors are defined in :py:mod:`snuba.datasets.processors`, and
need to subclass
:py:class:`snuba.datasets.processors.DatasetMessageProcessor`. Simply by
subclassing, their names become available for reference in
``snuba/datasets/configuration/*/storages/*.yaml``.
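
A minimal sketch of such a processor follows. The base class comes from the
module above, but the exact method name, signature, and return type are not
spelled out in this document, so treat them as illustrative assumptions rather
than the real interface.

.. code-block:: python

    from typing import Any, Mapping, Optional, Sequence

    # Base class as referenced above; the processing-method signature below is
    # an assumption made for illustration.
    from snuba.datasets.processors import DatasetMessageProcessor


    class MyThingProcessor(DatasetMessageProcessor):
        """Maps one Kafka payload of a hypothetical "my_thing" topic to
        ClickHouse rows."""

        def process_message(
            self, message: Mapping[str, Any], metadata: Any
        ) -> Optional[Sequence[Mapping[str, Any]]]:
            # Messages we are not interested in produce no rows.
            if message.get("type") != "my_thing":
                return None
            # One Kafka message may expand into several ClickHouse rows.
            return [
                {
                    "project_id": message["project_id"],
                    "timestamp": message["timestamp"],
                    "value": item["value"],
                }
                for item in message.get("items", [])
            ]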

Python consumers
----------------

``snuba consumer`` runs the Python consumer. It uses the Python message
processor mentioned above to convert Kafka messages into rows, batches those
rows into larger ``INSERT`` statements, and sends those statements to the
cluster defined in ``settings.py``.
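
The batching logic itself is conceptually simple. The following self-contained
sketch shows the shape of that loop (it is not the actual arroyo-based
implementation); ``process_message`` stands in for the storage's message
processor and ``execute_insert`` for the ClickHouse client call:

.. code-block:: python

    from typing import Any, Callable, Iterable, Mapping, Sequence

    Row = Mapping[str, Any]


    def run_batches(
        messages: Iterable[bytes],
        process_message: Callable[[bytes], Sequence[Row]],
        execute_insert: Callable[[Sequence[Row]], None],
        max_batch_size: int = 1000,
    ) -> None:
        """Convert messages into rows and flush them in batches."""
        batch: list = []
        for payload in messages:
            # Each Kafka message may produce zero or more ClickHouse rows.
            batch.extend(process_message(payload))
            if len(batch) >= max_batch_size:
                execute_insert(batch)  # one INSERT per batch, not per message
                batch = []
        if batch:
            execute_insert(batch)

A production consumer typically also flushes on a time limit and commits Kafka
offsets only after the corresponding batch has been written; those details are
omitted here.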

Test endpoints
--------------

In Sentry we have a lot of tests that want to insert into ClickHouse. Tests
have certain requirements that our Kafka consumers can't meet and which don't
apply to production:


* They require strong consistency, as they want to run a handful of queries +
assertions, then insert a few rows, wait for them to be inserted, then run
some more assertions depending on that new data.

* Because tests wait for every insert synchronously, insertion latency is
really important, while throughput isn't.

* Basically, people want to write end-to-end tests involving Snuba similarly to
  how tests involving relational databases are written with the Django ORM.

Every storage can also be inserted into and wiped over HTTP, using the
endpoints defined in ``snuba.web.views`` prefixed with ``/tests/``. Those
endpoints use the same message processors as the Python consumer, but there is
no batching at all: one HTTP request gets directly translated into a blocking
``INSERT`` statement against ClickHouse.
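
As an illustration, a test helper driving such an endpoint could look roughly
like the following. Only the ``/tests/`` prefix and the one-request,
one-blocking-insert behaviour are taken from the description above; the exact
path and payload format are placeholders, so check ``snuba.web.views`` for the
real routes.

.. code-block:: python

    import json
    import urllib.request


    def insert_rows_for_test(snuba_url: str, storage: str, rows: list) -> None:
        """Synchronously insert rows through a hypothetical /tests/ endpoint."""
        # Placeholder route: the real path is defined in snuba.web.views.
        request = urllib.request.Request(
            f"{snuba_url}/tests/{storage}/insert",
            data=json.dumps(rows).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # The call blocks until ClickHouse has executed the INSERT, which is
        # what gives tests their read-after-write guarantee.
        with urllib.request.urlopen(request) as response:
            assert response.status == 200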

Rust consumers
--------------

``snuba rust-consumer`` runs the Rust consumer. It comes in two flavors, "pure
Rust" and "hybrid":

* ``--use-rust-processor`` ("pure Rust") will attempt to find and load a Rust
  version of the message processor. There is a mapping from Python class names
  like ``QuerylogProcessor`` to the relevant function, defined in
  ``rust_snuba/src/processors/mod.rs``. If that function exists, it is used.
  The resulting consumer is sometimes 20x faster than the Python version.

If a Rust port of the message processor can't be found, the consumer silently
falls back to the second flavor:

* ``--use-python-processor`` ("hybrid") will use the Python message processor from
within Rust. For this mode, no dataset-specific logic has to be ported to
Rust, but at the same time the performance benefits of using a Rust consumer
are negligible.

Python message processor compat shims
-------------------------------------

Even when a dataset is processed entirely in Rust in prod (i.e. the pure-Rust
consumer is used), we still have those ``/tests/`` API endpoints, so a Python
implementation of the same message processor still needs to exist. For this
purpose ``RustCompatProcessor`` can be used as a base class that delegates all
logic back into Rust (see the sketch after the list below). This means that:

* ``snuba consumer`` will connect to Kafka using Python, but process messages
in Rust (using Python's multiprocessing)
* ``snuba rust-consumer --use-python-processor`` makes no sense to deploy
anywhere, as it will connect to Kafka using Rust, then perform a roundtrip
through Python only to call Rust business logic again.
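
A sketch of such a shim is shown below. Only the class name
``RustCompatProcessor`` and its purpose come from this document; the import
path and constructor arguments are assumptions made for illustration.

.. code-block:: python

    # Hypothetical import path, for illustration only.
    from snuba.datasets.processors.rust_compat_processor import RustCompatProcessor


    class QuerylogProcessor(RustCompatProcessor):
        """Python-facing shim: all processing is delegated to the Rust
        implementation registered under the same name in
        rust_snuba/src/processors/mod.rs."""

        def __init__(self) -> None:
            # Assumed: the base class is told which Rust processor to call.
            super().__init__("QuerylogProcessor")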

Testing
-------

When porting a message processor to Rust, we validate equivalence to Python by:

1. Registering the Python class in
``tests/consumers/test_message_processors.py``, where it will run all
payloads from ``sentry-kafka-schemas`` against both message processors and
assert the same rows come back. If there is missing test coverage, it's
preferred to add more payloads to ``sentry-kafka-schemas`` than to write
custom tests.

2. Removing the Python message processor and replacing it with
   ``RustCompatProcessor``, in which case the existing Python tests (of both
   Snuba and Sentry) are run directly against the Rust message processor.
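
Conceptually, the check in step 1 above boils down to something like the
following sketch; the helpers here are placeholders for the real harness in
``tests/consumers/test_message_processors.py``.

.. code-block:: python

    import pytest

    # Placeholder stand-ins: in the real harness these invoke the actual
    # Python and Rust message processors.
    def process_with_python(payload: bytes) -> list:
        raise NotImplementedError

    def process_with_rust(payload: bytes) -> list:
        raise NotImplementedError

    # In the real test the payloads come from sentry-kafka-schemas rather
    # than being hand-written.
    PAYLOADS: list = []

    @pytest.mark.parametrize("payload", PAYLOADS)
    def test_python_and_rust_agree(payload: bytes) -> None:
        # Both implementations must emit exactly the same ClickHouse rows.
        assert process_with_python(payload) == process_with_rust(payload)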

Architecture
------------

Python
~~~~~~

In order to get around the GIL, Python consumers use arroyo's multiprocessing
support to be able to use multiple cores. This comes with significant
serialization overhead and an amount of complexity that is out of scope for
this document.

Pure Rust
~~~~~~~~~

Despite the name, this consumer is still launched from the Python CLI. The way
this works is that all the Rust code is compiled into a shared library exposing
a ``consumer()`` function and packaged into a Python wheel using ``maturin``.
The Python CLI for ``snuba rust-consumer``:

1. Parses CLI arguments.
2. Resolves config (loads storages, ClickHouse settings).
3. Builds a new config JSON payload containing only information relevant to the
   Rust consumer (name of message processor, name of physical Kafka topic, name
   of ClickHouse table, and connection settings).
4. Calls ``rust_snuba.consumer(config)``, at which point Rust takes over the
   process entirely (sketched below).
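
A rough sketch of that handoff: only the ``rust_snuba.consumer(config)`` call
and the kind of information passed are taken from the list above; the key names
in the payload (and the shape of the resolved settings) are hypothetical.

.. code-block:: python

    import json

    import rust_snuba  # the maturin-built wheel exposing the Rust consumer


    def launch_rust_consumer(resolved_settings: dict) -> None:
        # Reduce the Python-side config to the handful of fields the Rust
        # consumer needs; the exact key names here are made up.
        config = {
            "message_processor": resolved_settings["processor_class_name"],
            "kafka_topic": resolved_settings["physical_topic"],
            "clickhouse_table": resolved_settings["table_name"],
            "clickhouse_connection": resolved_settings["connection_settings"],
        }
        # From this call on, Rust owns the process.
        rust_snuba.consumer(json.dumps(config))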

The concurrency model in pure Rust is very simple: the message processors run
on a ``tokio::Runtime``, which means that regular OS threads are used to take
advantage of multiple cores. The GIL is irrelevant since no Python code runs.

Hybrid
~~~~~~

The hybrid consumer is mostly the same as pure Rust. The main difference is
that it calls back into Python message processors. How that works is a work in
progress, but fundamentally it is subject to the same concurrency problems as
the regular pure-Python consumer, and is therefore forced to spawn subprocesses
and perform IPC one way or another.

Since the consumer is launched from the Python CLI, it will find the Python
interpreter already initialized, and does not have to re-import Snuba
(except in subprocesses).

Signal handling is a bit tricky. Since no Python code runs for the majority of
the consumer's lifetime, Python's signal handlers cannot run. This also means
that the Rust consumer has to register its own handler for ``Ctrl-C``, but
doing so also means that Python's own signal handlers are completely ignored.
This is fine for the pure-Rust case, but in the hybrid case some Python code
still runs. For that Python code, ``KeyboardInterrupt`` does not work.
177 changes: 177 additions & 0 deletions _sources/architecture/datamodel.rst.txt
@@ -0,0 +1,177 @@
================
Snuba Data Model
================

This section explains how data is organized in Snuba and how user-facing
data is mapped to the underlying database (ClickHouse in this case).

The Snuba data model is divided horizontally into a **logical model** and
a **physical model**. The logical data model is what is visible to the Snuba
clients through the Snuba query language. Elements in this model may or may
not map 1:1 to tables in the database. The physical model, instead, maps 1:1
to database concepts (like tables and views).

The reasoning behind this division is that it allows Snuba to expose a
stable interface through the logical data model and perform complex mapping
internally to execute a query on different tables (part of the physical
model) to improve performance in a way that is transparent to the client.

The rest of this section outlines the concepts that compose the two models
and how they are connected to each other.

The main concepts, described below, are datasets, entities, and storages.

.. image:: /_static/architecture/datamodel.png

Datasets
========

A Dataset is a namespace over Snuba data. It provides its own schema and
is independent from other datasets in terms of both its logical and its
physical model.

Examples of datasets are Discover, Outcomes, and Sessions. There is no
relationship between them.

A Dataset can be seen as a container for the components that define its
abstract data model and its concrete data model, both described below.

In terms of the query language, every Snuba query targets one and only one
Dataset, and the Dataset can provide extensions to the query language.

Entities and Entity Types
=========================

The fundamental building block of the logical data model Snuba exposes to the
client is the Entity. In the logical model an Entity represents an instance of
an abstract concept (like a transaction or an error). In practice an *Entity*
corresponds to a row in a table in the database. The *Entity Type* is the
class of the Entity (like Error**s** or Transaction**s**).

The logical data model is composed of a set of *Entity Types* and of their
relationships.

Each *Entity Type* has a schema, which is defined by a list of fields with
their associated abstract data types. The schemas of all the *Entity Types*
of a Dataset (there can be several) compose the logical data model that is
visible to the Snuba client and against which Snuba queries are validated.
No lower-level concept is supposed to be exposed.

Entity Types belong exclusively to one Dataset; an Entity Type cannot
be present in multiple Datasets.

Relationships between Entity Types
----------------------------------

Entity Types in a Dataset are logically related. There are two types of
relationships we support:

- Entity Set Relationship. This mimics foreign keys. This relationship is
meant to allow joins between Entity Types. It only supports one-to-one
and one-to-many relationships at this point in time.
- Inheritance Relationship. This mimics nominal subtyping. A group of Entity
Types can share a parent Entity Type. Subtypes inherit the schema from the
parent type. Semantically the parent Entity Type must represent the union
of all the Entities whose type inherit from it. It also must be possible
to query the parent Entity Type. This cannot be just a logical relationship.

Entity Type and consistency
---------------------------

The Entity Type is the largest unit where Snuba **can** provide some strong
data consistency guarantees. Specifically, it is possible to query an Entity
Type expecting Serializable Consistency (please don't use that. Seriously,
if you think you need that, you probably don't). This does not extend to
any query that spans multiple Entity Types, where, at best, we will have
eventual consistency.

This also has an impact on Subscription queries. These can only work on one
Entity Type at a time since, otherwise, they would require consistency between
Entity Types, which we do not support.

.. ATTENTION::
    To be precise, the unit of consistency (depending on the Entity Type)
    can be even smaller and depend on how the data ingestion topics
    are partitioned (by ``project_id`` for example); the Entity Type is the
    maximum unit Snuba allows. More details are (ok, will be) provided in
    the Ingestion section of this guide.

Storage
=======

Storages represent and define the physical data model of a Dataset. Each
Storage is materialized as a physical database concept like a table or a
materialized view. As a consequence, each Storage has a schema, defined by
fields with their types, that reflects the physical schema of the DB
table/view the Storage maps to, and it is able to provide all the details
needed to generate DDL statements to build the tables on the database.

Storages are able to map the logical concepts in the logical model discussed
above to the physical concepts of the database, thus each Storage needs to be
related to an Entity Type. Specifically (a sketch of these rules follows the
list):

- Each Entity Type must be backed by at least one Readable Storage (a Storage
  we can run queries on), but can be backed by multiple Storages (for example a
  pre-aggregate materialized view). Multiple Storages per Entity Type are meant
  to allow query optimizations.
- Each Entity Type must be backed by one and only one Writable
  Storage that is used to ingest data and fill in the database tables.
- Each Storage backs exclusively one Entity Type.
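
The cardinality rules above can be illustrated with a small, purely explanatory
model; this is not Snuba's actual code, just a sketch of the constraints:

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class Storage:
        """Physical model: maps 1:1 to a ClickHouse table or view, and backs
        exactly one Entity Type."""
        table_name: str
        writable: bool = False


    @dataclass
    class EntityType:
        """Logical model: what clients query against."""
        name: str
        # At least one readable Storage; exactly one of them is writable.
        storages: list = field(default_factory=list)

        def writable_storage(self) -> Storage:
            writable = [s for s in self.storages if s.writable]
            assert len(writable) == 1, "one and only one Writable Storage"
            return writable[0]


    @dataclass
    class Dataset:
        """Namespace: owns its Entity Types exclusively."""
        name: str
        entity_types: list = field(default_factory=list)

In the Outcomes example below, for instance, the single Entity Type would carry
two Storages: a writable raw table and a read-only aggregating materialized
view.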



Examples
========

This section provides some examples of how the Snuba data model can represent
some real-world models.

These case studies do not necessarily reflect the current Sentry production
model, nor are they part of the same deployment. They should be considered
examples taken in isolation.

Single Entity Dataset
---------------------

This looks like the Outcomes dataset used by Sentry. It does not actually
reflect Outcomes as of April 2020; it is, though, the design Outcomes should
move towards.

.. image:: /_static/architecture/singleentity.png

This Dataset has only one Entity Type, which represents an individual Outcome
ingested by the Dataset. Querying raw Outcomes is painfully slow, so we have
two Storages. One is the raw Storage that reflects the data we ingest; the
other is a materialized view that computes hourly aggregations that are much
more efficient to query. The Query Planner picks the Storage depending on
whether the query can be executed on the aggregated data or not.

Multi Entity Type Dataset
-------------------------

The canonical example of this Dataset is the Discover dataset.

.. image:: /_static/architecture/multientity.png

This has three Entity Types: Errors, Transactions, and Events, with the first
two inheriting from Events. These form the logical data model; thus querying
the Events Entity Type gives the union of Transactions and Errors, but it only
allows the fields common to the two to be present in the query.

The Errors Entity Type is backed by two Storages for performance reasons.
One is the main Errors Storage that is used to ingest data; the other is a
read-only view that puts less load on ClickHouse when querying but
offers lower consistency guarantees. Transactions only have one Storage,
and there is a Merge Table to serve Events (which is essentially a view over
the union of the two tables).

Joining Entity types
--------------------

This is a simple example of a dataset that includes multiple Entity Types
that can be joined together in a query.

.. image:: /_static/architecture/joins.png

GroupedMessage and GroupAssignee can be part of a left-join query with Errors.
The rest is similar to what was discussed in the previous examples.