This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Kafka tiered storage concept content #2142

Merged
merged 19 commits into from
Nov 3, 2023
Changes from 18 commits
13 changes: 13 additions & 0 deletions _toc.yml
@@ -318,6 +318,19 @@ entries:
- file: docs/products/kafka/concepts/monitor-consumer-group
- file: docs/products/kafka/concepts/kafka-quotas
title: Quotas
- file: docs/products/kafka/concepts/list-kafka-tiered-storage
title: Tiered storage
entries:
- file: docs/products/kafka/concepts/kafka-tiered-storage
title: Overview
- file: docs/products/kafka/concepts/tiered-storage-how-it-works
title: How it works
- file: docs/products/kafka/concepts/tiered-storage-guarantees
title: Guarantees
- file: docs/products/kafka/concepts/tiered-storage-limitations
title: Limitations


- file: docs/products/kafka/howto
title: HowTo
entries:
51 changes: 51 additions & 0 deletions docs/products/kafka/concepts/kafka-tiered-storage.rst
@@ -0,0 +1,51 @@
Tiered storage in Aiven for Apache Kafka® overview
=====================================================

Tiered storage in Aiven for Apache Kafka® enables more effective data management by utilizing two different storage types—local disk and remote cloud storage solutions such as AWS S3, Google Cloud Storage, and Azure Blob Storage. This feature offers a tailored approach to data storage, allowing you to allocate frequently accessed data to high-speed local disks while offloading less critical or infrequently accessed data to more cost-effective remote storage solutions. Tiered storage enables you to indefinitely store data on specific topics without running out of space. Once enabled, it is configured per topic, giving you granular control over data storage needs.

.. important::

    Aiven for Apache Kafka® tiered storage is an early availability feature, which means it has some restrictions on the functionality and service level agreement. It is intended for non-production environments, but you can test it with production-like workloads to assess the performance. To enable this feature, navigate to the :doc:`Feature preview </docs/platform/howto/feature-preview>` page within your user profile.


.. note::
    - Tiered storage for Aiven for Apache Kafka® is supported starting from Apache Kafka® version 3.6.
    - Tiered storage for Aiven for Apache Kafka® is not available for startup-2 plans.


Benefits of tiered storage
----------------------------
Tiered storage offers multiple benefits, including:

* **Scalability:** With tiered storage in Aiven for Apache Kafka, storage and compute are decoupled and can scale independently. Storage capacity can grow almost without limit in cloud object storage, while compute resources can be adjusted to demand, which reduces the risk of hitting storage or processing limits.
* **Cost efficiency:** By moving less frequently accessed data to a cost-effective storage tier, you can achieve significant financial savings.
* **Operational speed:** With the bulk of data offloaded to remote storage, service rebalancing in Aiven for Apache Kafka becomes faster, making for a smoother operational experience.
* **Infinite data retention:** With the scalability of cloud storage, you can achieve unlimited data retention, valuable for analytics and compliance.
* **Transparency:** Even older Kafka clients can benefit from tiered storage without needing to be explicitly aware of it.

When and why to use it
------------------------

Understanding when and why to use tiered storage in Aiven for Apache Kafka will help you maximize its benefits, particularly around cost savings and system performance.

**Scenarios for use:**

* **Long-term data retention**: Many organizations require large-scale data storage for extended periods, either for regulatory compliance or historical data analysis. Cloud services provide an almost limitless storage capacity, making it possible to keep data accessible for as long as required at a reasonable cost. This is where tiered storage becomes especially valuable.
* **High-speed data ingestion**: Tiered storage can offer a solution when dealing with unpredictable or sudden influxes of data. By supplementing the local disks with cloud storage, sudden increases in incoming data can be managed, ensuring optimum system performance.
* **Unlock unexplored opportunities:** Tiered storage in Aiven for Apache Kafka addresses existing storage challenges and opens the door to new and innovative use cases that were once unfeasible or cost-prohibitive. By eliminating traditional storage limitations, organizations gain the flexibility to support a wide range of applications and workflows, even those where Apache Kafka might have been considered impractical before. We encourage users to leverage this newfound flexibility to think creatively and redefine their experience with Apache Kafka.



Pricing
-------
Tiered storage costs are determined by the amount of remote storage used, measured in GB/hour. The highest usage level within each hour is the basis for calculating charges.
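
As a rough illustration of the billing rule above, the following sketch takes the peak remote storage usage in each hour and sums the resulting GB-hours. The usage samples and the price per GB-hour are made-up values for illustration, not Aiven's actual rates, and ``billable_gb_hours`` is a hypothetical helper name.

```python
from collections import defaultdict


def billable_gb_hours(samples):
    """Sum the peak usage (in GB) observed within each hour.

    `samples` is an iterable of (hour_index, used_gb) pairs; several
    samples may fall into the same hour.
    """
    peak_per_hour = defaultdict(float)
    for hour, used_gb in samples:
        peak_per_hour[hour] = max(peak_per_hour[hour], used_gb)
    return sum(peak_per_hour.values())


# Hypothetical usage samples: (hour, GB held in remote storage).
samples = [(0, 100.0), (0, 120.0), (1, 110.0), (1, 90.0), (2, 130.0)]
PRICE_PER_GB_HOUR = 0.0001  # hypothetical rate, for illustration only

gb_hours = billable_gb_hours(samples)  # peaks are 120, 110 and 130 GB
cost = gb_hours * PRICE_PER_GB_HOUR
print(f"{gb_hours:.0f} GB-hours billed, cost {cost:.4f}")
```

Only the highest sample in each hour contributes, so a short spike within an hour is billed for the whole hour.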


Related reading
----------------

* :doc:`How tiered storage works in Aiven for Apache Kafka® </docs/products/kafka/concepts/tiered-storage-how-it-works>`

* :doc:`Guarantees </docs/products/kafka/concepts/tiered-storage-guarantees>`

* :doc:`Limitations </docs/products/kafka/concepts/tiered-storage-limitations>`

* Enable tiered storage for Aiven for Apache Kafka® service

9 changes: 9 additions & 0 deletions docs/products/kafka/concepts/list-kafka-tiered-storage.rst
@@ -0,0 +1,9 @@
Tiered storage in Aiven for Apache Kafka®
===========================================

Discover how tiered storage works in Aiven for Apache Kafka®, explore its use cases, and learn why you might need it and what benefits it offers.



.. tableofcontents::

25 changes: 25 additions & 0 deletions docs/products/kafka/concepts/tiered-storage-guarantees.rst
@@ -0,0 +1,25 @@
Guarantees
============
With Aiven for Apache Kafka®'s tiered storage, there are two primary types of data retention guarantees: *total retention* and *local retention*.

**Total retention**: Tiered storage ensures that your data remains available up to the limit defined by the total retention threshold, regardless of whether it is stored locally or remotely. Data is not deleted until the total retention threshold, which counts data in both local and remote storage, is reached.

**Local retention**: Log segments are only removed from local storage after successfully being uploaded to remote storage, even if the data exceeds the local retention threshold.


Example
--------

Let's say you have a topic with a **total retention threshold** of **1000 GB** and a **local retention threshold** of **200 GB**. This means that:

* All data for the topic will be retained, regardless of whether it is stored locally or remotely, as long as the total size of the data does not exceed 1000 GB.
* If tiered storage is enabled for the topic, older segments are uploaded to remote storage immediately, regardless of whether the local retention threshold of 200 GB has been exceeded. Data is deleted from local storage only after it has been safely transferred to remote storage.
* If the total size of the data exceeds 1000 GB, Apache Kafka will begin deleting the oldest data from remote storage.
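
The example can be modeled with a small simulation. This is a simplified sketch of size-based retention only: the ``Segment`` class and the ``apply_retention`` function are hypothetical names, and the actual broker logic also covers time-based retention, the active segment, and upload failures.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    size_gb: int
    uploaded: bool  # True once the segment has been copied to remote storage


def apply_retention(segments, total_gb=1000, local_gb=200):
    """Return (local, remote) segment names for segments ordered oldest first.

    Simplified model: only the two size thresholds are considered.
    """
    kept = list(segments)
    # Total retention: delete the oldest data once the total exceeds the limit.
    while kept and sum(s.size_gb for s in kept) > total_gb:
        kept.pop(0)

    # Local retention: walk from newest to oldest and keep segments locally
    # until the local threshold is used up. A segment that has not been
    # uploaded yet is never removed from local storage.
    local, local_size, over_budget = [], 0, False
    for seg in reversed(kept):
        if local_size + seg.size_gb > local_gb:
            over_budget = True
        if not over_budget or not seg.uploaded:
            local.append(seg.name)
            local_size += seg.size_gb
    local.reverse()

    remote = [s.name for s in kept if s.uploaded]
    return local, remote


# Twelve 100 GB segments; the two newest have not been uploaded yet.
segments = [Segment(f"seg-{i:02d}", 100, uploaded=i < 10) for i in range(12)]
local, remote = apply_retention(segments)
print("local: ", local)
print("remote:", remote)
```

In this model the two oldest segments are dropped to get back to 1000 GB, and a segment that has not been uploaded stays on local disk even when the 200 GB local threshold is exceeded.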


Related reading
----------------

* :doc:`Tiered storage in Aiven for Apache Kafka® overview </docs/products/kafka/concepts/kafka-tiered-storage>`

* :doc:`How tiered storage works in Aiven for Apache Kafka® </docs/products/kafka/concepts/tiered-storage-how-it-works>`

* Enable tiered storage for Aiven for Apache Kafka® service
57 changes: 57 additions & 0 deletions docs/products/kafka/concepts/tiered-storage-how-it-works.rst
@@ -0,0 +1,57 @@
How tiered storage works in Aiven for Apache Kafka®
===================================================

Aiven for Apache Kafka® tiered storage is a feature that optimizes data management across two distinct storage tiers:

* **Local tier**: Primarily consists of faster and typically more expensive storage solutions like solid-state drives (SSDs).
* **Remote tier**: Relies on slower, cost-effective options like cloud object storage.

In Aiven for Apache Kafka's tiered storage architecture, **remote storage** refers to storage options external to the Kafka broker's local disk. This typically includes cloud-based or self-hosted object storage solutions like AWS S3 and Google Cloud Storage. Although network-attached block storage solutions like AWS EBS are technically external to the broker machine, Apache Kafka considers them local storage within its tiered storage architecture.

Tiered storage operates in a way that is seamless for both Apache Kafka producers and consumers. This means that producers and consumers interact with Apache Kafka in the same way, regardless of whether tiered storage is enabled or not.

Administrators can configure tiered storage per topic by defining the retention period and retention bytes to specify how much data should be retained on the local disk instead of remote storage.
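
For illustration, the per-topic settings might look like the following. The configuration keys are the topic-level settings that Apache Kafka 3.6 provides for tiered storage; the values and the ``validate_tiered_config`` helper are hypothetical, and how you apply the settings to a topic is outside the scope of this sketch.

```python
GB = 1024 ** 3
DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical values for a single topic.
topic_config = {
    "remote.storage.enable": "true",
    "retention.ms": str(30 * DAY_MS),        # total retention: 30 days
    "retention.bytes": str(1000 * GB),       # total retention: 1000 GB
    "local.retention.ms": str(1 * DAY_MS),   # keep 1 day on local disk
    "local.retention.bytes": str(200 * GB),  # keep 200 GB on local disk
}


def validate_tiered_config(config):
    """Check that the local thresholds do not exceed the total thresholds."""
    problems = []
    for unit in ("ms", "bytes"):
        total = int(config[f"retention.{unit}"])
        local = int(config[f"local.retention.{unit}"])
        # -1 means "unlimited" for the total retention settings.
        if total != -1 and local > total:
            problems.append(f"local.retention.{unit} exceeds retention.{unit}")
    return problems


print(validate_tiered_config(topic_config))
```

The check reflects the idea that local retention is a subset of total retention: data kept on local disk should never be asked to outlive the topic's overall retention.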


Local vs. remote data retention
---------------------------------
When tiered storage is enabled, data is initially stored on the local disk of the Kafka broker. Data is then asynchronously transferred to remote storage based on the pre-defined local retention threshold. During periods of high data ingestion or transient errors, such as network connectivity issues, the local storage might temporarily hold more data than specified by the local retention threshold.

.. image:: /images/products/kafka/tiered-storage/data-retention.png
    :alt: Diagram depicting the concept of local vs. remote data retention in a tiered storage system.

Segment management
-------------------
Data is organized into segments, which are uploaded to remote storage individually. The active (newest) segment remains in local storage, which means that the segment size can also influence local data retention. For instance, if the local retention threshold is 1 GB, but the segment size is 2 GB, the local storage will exceed the 1 GB limit until the active segment is rolled over and uploaded to remote storage.
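
The effect of segment size on local disk usage can be estimated with simple arithmetic. The functions below are a back-of-the-envelope sketch that assumes closed segments are uploaded promptly; the function names are hypothetical and not part of any Apache Kafka API.

```python
def exceeds_threshold_before_roll(local_retention_gb, segment_size_gb):
    """True if the active segment alone can outgrow the local threshold."""
    return segment_size_gb > local_retention_gb


def estimated_peak_local_gb(local_retention_gb, segment_size_gb):
    """Rough estimate of peak local data for one partition replica.

    Closed segments are trimmed to the local retention threshold once they
    are uploaded, while the active segment stays local until it rolls over.
    """
    return local_retention_gb + segment_size_gb


# The case from the text: 1 GB local retention, 2 GB segments.
print(exceeds_threshold_before_roll(1, 2))
print(estimated_peak_local_gb(1, 2))
```

Choosing a segment size well below the local retention threshold keeps local usage closer to the configured value.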


Asynchronous uploads and replication
--------------------------------------
Data is transferred to remote storage asynchronously and does not interfere with producer activity. While the broker aims to move data as swiftly as possible, certain conditions, such as high throughput or connectivity issues, may cause local storage to hold more data than the local retention policy specifies.

Any data exceeding the local retention threshold will not be purged by the log cleaner until it is successfully uploaded to remote storage.
The replication factor is not considered during the upload process, and only one copy of each segment is uploaded to the remote storage. Most remote storage options have their own measures, including data replication, to ensure data durability.
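
Because only one copy of each segment is uploaded, the replication factor multiplies local storage but not remote storage. The following sketch shows the arithmetic under that assumption; ``storage_footprint_gb`` and the numbers are hypothetical.

```python
def storage_footprint_gb(local_gb, uploaded_gb, replication_factor):
    """Estimate cluster-wide storage used by one topic.

    Data kept on local disk is stored once per replica, while remote
    storage holds a single copy of each uploaded segment.
    """
    return {
        "local_gb": local_gb * replication_factor,
        "remote_gb": uploaded_gb,
    }


# 200 GB kept locally, 800 GB uploaded, replication factor 3.
print(storage_footprint_gb(200, 800, 3))
```

For comparison, keeping the same 1000 GB entirely on local disks with a replication factor of 3 would take 3000 GB of broker storage.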


Data retrieval
-----------------
When consumers fetch records stored in remote storage, the broker downloads and caches these records locally. This allows for quicker access in subsequent retrieval operations.

Aiven allocates a small amount of disk space for the temporary storage of fetched records: 5% of the broker's total available disk, within a range of 2 GB to 16 GB.
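
One way to read this sizing rule is 5% of the broker's disk, bounded below by 2 GB and above by 16 GB. The sketch below encodes that interpretation; the function name is hypothetical, and the exact rule that Aiven applies may differ.

```python
def fetch_cache_size_gb(broker_disk_gb, share=0.05, minimum_gb=2.0, maximum_gb=16.0):
    """Return the assumed cache size: a share of the disk, clamped to a range."""
    return min(max(broker_disk_gb * share, minimum_gb), maximum_gb)


for disk_gb in (20, 100, 1000):
    print(f"{disk_gb} GB disk: {fetch_cache_size_gb(disk_gb):.1f} GB cache")
```

Under this reading, small brokers get the 2 GB minimum and brokers with more than 320 GB of disk are capped at 16 GB.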

Security
--------
Segments are encrypted with 256-bit AES encryption before being uploaded to the remote storage. The encryption keys are not shared with the cloud storage provider and generally do not leave Aiven machines.




Related reading
----------------

* :doc:`Tiered storage in Aiven for Apache Kafka® overview </docs/products/kafka/concepts/kafka-tiered-storage>`

* :doc:`Guarantees </docs/products/kafka/concepts/tiered-storage-guarantees>`

* :doc:`Limitations </docs/products/kafka/concepts/tiered-storage-limitations>`

* Enable tiered storage for Aiven for Apache Kafka® service


20 changes: 20 additions & 0 deletions docs/products/kafka/concepts/tiered-storage-limitations.rst
@@ -0,0 +1,20 @@
Trade-offs and limitations
============================

The main trade-off of tiered storage is higher latency when reading data from remote storage compared to local disk storage. Local caching can partially mitigate this, but it cannot eliminate the added latency completely.

Limitations
-------------

* Tiered storage currently does not support compacted topics.
* If you enable tiered storage for a topic, you cannot deactivate it without losing data in the remote storage. To deactivate tiered storage, contact `Aiven support <mailto:support@aiven.io>`_.
* Increasing the local retention threshold won't move segments already uploaded to remote storage back to local storage. This change only affects new data segments.
* If you enable tiered storage on a service, you can't migrate the service to a different region or cloud, except for moving to a virtual cloud in the same region. For migration to a different region or cloud, contact `Aiven support <mailto:support@aiven.io>`_.


Related reading
----------------

* :doc:`Tiered storage in Aiven for Apache Kafka® overview </docs/products/kafka/concepts/kafka-tiered-storage>`

* :doc:`How tiered storage works in Aiven for Apache Kafka® </docs/products/kafka/concepts/tiered-storage-how-it-works>`

* Enable tiered storage for Aiven for Apache Kafka® service