Time Series - Pre Development Stage #1180

lvca · 2023-07-21T16:36:24Z

lvca
Jul 21, 2023
Maintainer

Some of our users are already using ArcadeDB for time series, even though it is not an official model. We collected many use cases and came up with a design that should be easy to implement and, most importantly, blazing fast and space efficient.

The idea is simple: when you create a time series type, you define the following special attributes:

timestamp property name
file aggregation unit (year, month, day, hour)
page aggregation unit (day, hour, minute, second)
page size (default 64k)

You can find some of these concepts in clustered tables.

For example, if you have sensor data with millisecond precision, let's say around 1-10K measurements per minute, you could create a type "Sensor" with the following settings:

timestamp property name = "timestamp"
file aggregation unit = "day"
page aggregation unit = "minute"
page size = 32K

This means ArcadeDB will create a new file every day (with a name such as "Sensor_20230721") and it will start storing sensor data from this day in that file.
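As a rough illustration of the file naming described above, the sketch below derives a file name from a record's timestamp. The `file_name_for` helper, the suffix formats for units other than "day", and the use of UTC are assumptions made for this example; only the "Sensor_20230721" form appears in the proposal.

```python
from datetime import datetime, timezone

# strftime pattern per file aggregation unit (hypothetical mapping; only the
# "day" form is shown in the proposal).
FILE_SUFFIX_FORMATS = {
    "year": "%Y",
    "month": "%Y%m",
    "day": "%Y%m%d",
    "hour": "%Y%m%d%H",
}


def file_name_for(type_name: str, timestamp_ms: int, unit: str = "day") -> str:
    """Return the name of the file that would hold a record with this timestamp."""
    # Assumption: file boundaries are computed in UTC.
    moment = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return f"{type_name}_{moment.strftime(FILE_SUFFIX_FORMATS[unit])}"


print(file_name_for("Sensor", 1689956195339))  # Sensor_20230721
```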

In this case, each page stores only one minute of data. The page size is configurable. Let's say the following data arrives from a measurement:

{
  "timestamp": 1689956195339,
  "sensorId": 2321,
  "temperature": 40.5
}

ArcadeDB will then store the record in the file Sensor_20230721, in the page that covers the minute containing the timestamp 1689956195339, that is, Friday, July 21, 2023 4:16:35.339 PM (GMT).
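To make the page selection concrete, here is a minimal sketch that truncates a timestamp to the start of its page slice. The `page_timestamp` function and the `PAGE_UNIT_MS` table are hypothetical names, and the sketch assumes slices are aligned to the Unix epoch in UTC.

```python
# Width in milliseconds of each page aggregation unit listed in the proposal.
PAGE_UNIT_MS = {
    "second": 1_000,
    "minute": 60_000,
    "hour": 3_600_000,
    "day": 86_400_000,
}


def page_timestamp(timestamp_ms: int, unit: str = "minute") -> int:
    """Return the start (in ms) of the slice that contains timestamp_ms."""
    width = PAGE_UNIT_MS[unit]
    return timestamp_ms - (timestamp_ms % width)


page_ts = page_timestamp(1689956195339)
print(page_ts)                  # 1689956160000, i.e. 16:16:00.000 GMT
print(1689956195339 - page_ts)  # 35339 ms into the minute
```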

A page has the following header (8,210 bytes total):

bucket header (2 bytes + 4,096 pointers to the records inside the page, 2 bytes each, enough to address a page of up to 64 KB)
page timestamp (long - 8 bytes)
previous page id (int - 4 bytes)
next page id (int - 4 bytes)

The record above will be stored in the following way:

the record's timestamp is stored as the delta in milliseconds between the actual timestamp and the page timestamp, encoded as a varint. This reduces the number of bytes required: many timestamps can be stored in 1-2 bytes instead of the 8 bytes required for a fixed-size long.
the rest of the record is stored with the normal serializer, stripped of the timestamp property (because it is already stored as above)
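The delta encoding can be sketched as follows. This uses a plain unsigned base-128 varint (7 payload bits per byte, high bit set on every byte except the last), which is an assumption: the proposal does not say which varint flavor ArcadeDB would use, and `encode_varint` / `decode_varint` are illustrative names.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned base-128 varint."""
    if value < 0:
        raise ValueError("delta must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single unsigned base-128 varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


page_ts = 1689956160000         # start of the minute
delta = 1689956195339 - page_ts  # 35339 ms
encoded = encode_varint(delta)
print(len(encoded), decode_varint(encoded) + page_ts)
```

With this particular encoding, deltas below 128 ms take 1 byte, deltas below 16,384 ms take 2 bytes, and any delta within a one-minute page fits in 3 bytes.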

Under the most favorable conditions, the record in the example above could be stored in only 3 bytes. A 64K page minus the 8,210-byte header leaves 57,326 bytes for content, which is just under 14 bytes per record on average when all 4,096 slots are used.
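The header and capacity figures can be checked with a few lines of arithmetic. The constant names are made up for this sketch; the sizes are the ones listed in the header description.

```python
SLOT_COUNT = 4096        # record pointers in the bucket header
SLOT_POINTER_BYTES = 2   # each pointer is 2 bytes
PAGE_SIZE = 64 * 1024    # default page size, 65,536 bytes

bucket_header = 2 + SLOT_COUNT * SLOT_POINTER_BYTES  # 8,194 bytes
header = bucket_header + 8 + 4 + 4                    # + page timestamp, previous id, next id
content = PAGE_SIZE - header                          # bytes left for records

print(header)                # 8210
print(content)               # 57326
print(content / SLOT_COUNT)  # roughly 13.996 bytes per record with all slots used
```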

The other page attributes (previous page id and next page id) form a linked list. In the ideal scenario, where sensor data arrives ordered by timestamp, a binary (dichotomic) search finds the right page very efficiently during a query. When some records arrive late, they are added to the page they belong to as long as it has space; otherwise a new page is appended and linked to the previous one.
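One way to picture the page lookup and the handling of late records is the in-memory sketch below. It is only an interpretation: `Page`, `TimeSeriesFile`, and the `NO_PAGE` sentinel are hypothetical, a record count stands in for "space left in the page", and the sketch assumes that an overflow page is chained to the full page of the same time slice.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

NO_PAGE = -1  # hypothetical sentinel for "no linked page"


@dataclass
class Page:
    page_id: int
    timestamp: int  # start of the slice (e.g. minute) this page covers
    records: list = field(default_factory=list)
    previous_page_id: int = NO_PAGE
    next_page_id: int = NO_PAGE


class TimeSeriesFile:
    def __init__(self, slice_ms: int = 60_000, records_per_page: int = 4096):
        self.slice_ms = slice_ms
        self.records_per_page = records_per_page
        self.pages = []         # every page, indexed by page id
        self.slice_starts = []  # sorted slice timestamps
        self.first_pages = []   # first page id of each slice, parallel to slice_starts

    def _append_page(self, slice_start: int) -> Page:
        page = Page(page_id=len(self.pages), timestamp=slice_start)
        self.pages.append(page)
        return page

    def insert(self, timestamp_ms: int, payload: dict) -> int:
        """Store a record and return the id of the page that received it."""
        slice_start = timestamp_ms - timestamp_ms % self.slice_ms
        i = bisect_left(self.slice_starts, slice_start)  # binary search
        if i < len(self.slice_starts) and self.slice_starts[i] == slice_start:
            page = self.pages[self.first_pages[i]]
            # Follow the chain to the last page of this slice.
            while page.next_page_id != NO_PAGE:
                page = self.pages[page.next_page_id]
            if len(page.records) >= self.records_per_page:
                # No space left: append a new page and link it to the full one.
                overflow = self._append_page(slice_start)
                overflow.previous_page_id = page.page_id
                page.next_page_id = overflow.page_id
                page = overflow
        else:
            page = self._append_page(slice_start)
            self.slice_starts.insert(i, slice_start)
            self.first_pages.insert(i, page.page_id)
        page.records.append((timestamp_ms - slice_start, payload))
        return page.page_id
```

In this sketch, a late record for an earlier minute lands in that minute's page while it has room, and in a linked overflow page afterwards.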

If you're looking for a sensor's data in a particular time range, you will be able to issue this query:

SELECT FROM Sensor
WHERE timestamp >= 1689956195339 AND timestamp <= 1689956999999
AND sensorId = 2321

ArcadeDB will use this special search to look for the records in this range. The layout works as a clustered index, so there is no need to create an index on timestamp for efficient retrieval. It also keeps storage minimal, which makes both searches and inserts very fast.
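A range query over such pages could look roughly like the sketch below. It is not ArcadeDB's query engine: `range_query` is a hypothetical function, pages are modeled as two parallel lists, and the sketch assumes one page per slice with sorted, unique page timestamps. The bounds and sample records are illustrative.

```python
from bisect import bisect_right


def range_query(page_starts, page_records, start_ms, end_ms, predicate=None):
    """Yield records whose timestamp lies in [start_ms, end_ms].

    page_starts: sorted, unique page timestamps (ms).
    page_records: parallel list; each entry is a list of (delta_ms, record).
    """
    # Last page starting at or before start_ms (clamped to the first page).
    first = max(bisect_right(page_starts, start_ms) - 1, 0)
    # One past the last page starting at or before end_ms.
    last = bisect_right(page_starts, end_ms)
    for index in range(first, last):
        page_ts = page_starts[index]
        for delta, record in page_records[index]:
            timestamp = page_ts + delta  # rebuild the full timestamp
            if start_ms <= timestamp <= end_ms and (predicate is None or predicate(record)):
                yield {"timestamp": timestamp, **record}


page_starts = [1689956100000, 1689956160000, 1689956220000]
page_records = [
    [(5000, {"sensorId": 2321, "temperature": 39.9})],
    [(35339, {"sensorId": 2321, "temperature": 40.5}),
     (40000, {"sensorId": 7, "temperature": 12.0})],
    [(1000, {"sensorId": 2321, "temperature": 41.0})],
]
hits = list(range_query(page_starts, page_records,
                        1689956195339, 1689956999999,
                        lambda r: r["sensorId"] == 2321))
print(hits)
```

Only the pages that can overlap the range are scanned; the filter on sensorId is then applied record by record.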

We ran some internal benchmarks that simulate this structure with the current buckets, and we measured more than 3M inserts per second on a MacBook Pro 2019 using 7 parallel threads (!)

Another topic is a configurable pre-aggregation of data. In the example above, you could specify to aggregate the temperature by minute, using the AVERAGE function. In this way, during the insertion, the aggregated value would be updated and ready to be returned without any calculation. This is meant for phase 2 of the time series module.
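The pre-aggregation idea can be illustrated with a running average that is updated on every insert. `MinuteAverage` is a hypothetical class written for this example; the proposal does not specify how or where the aggregated values would be stored.

```python
class MinuteAverage:
    """Keep a running average of a value per time bucket (one minute by default)."""

    def __init__(self, bucket_ms: int = 60_000):
        self.bucket_ms = bucket_ms
        self._sums = {}    # bucket start (ms) -> sum of values
        self._counts = {}  # bucket start (ms) -> number of values

    def _bucket(self, timestamp_ms: int) -> int:
        return timestamp_ms - timestamp_ms % self.bucket_ms

    def add(self, timestamp_ms: int, value: float) -> None:
        """Update the aggregate at insertion time."""
        bucket = self._bucket(timestamp_ms)
        self._sums[bucket] = self._sums.get(bucket, 0.0) + value
        self._counts[bucket] = self._counts.get(bucket, 0) + 1

    def average(self, timestamp_ms: int):
        """Return the average for the bucket containing timestamp_ms, or None."""
        bucket = self._bucket(timestamp_ms)
        count = self._counts.get(bucket, 0)
        return self._sums[bucket] / count if count else None


temperature = MinuteAverage()
temperature.add(1689956195339, 40.5)
temperature.add(1689956199000, 41.5)
print(temperature.average(1689956160000))  # 41.0
```

Reading the average is then a lookup rather than a scan over the minute's records.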

WDYT? Any feedback about this?

topofocus · 2023-08-15T08:16:07Z

topofocus
Aug 15, 2023

I appreciate your thoughts on establishing effective time series in ArcadeDB.

For me it's not clear how a time series differs from an ordinary embedded Hash with limited update functionality.
If I understood correctly, the enhancement is an optimized index which is applied automatically.
Because it's a time series, the index has to be a timestamp.

I am asking myself: why not enhance the already implemented embedded List?
Then it might be just as simple as implementing indexes on embedded documents, and the support of time series would just be a special case of the document-database usage.

1 reply

lvca Aug 15, 2023
Maintainer Author

The idea above is to avoid having an index and instead rely on the unique properties of time series structures:

order by timestamp
mostly append-only (but with support for delayed data)

Instead of creating an index, the data itself could be partitioned so that retrieving a range of data points is O(1) or close to it.

gazillion101 · 2024-05-10T21:14:59Z

gazillion101
May 10, 2024

Hello! Any update on the time series model?

1 reply

lvca May 11, 2024
Maintainer Author

Hi @gazillion101, we've been busy with other priorities and nobody has sponsored this feature yet.
