Clustered index #354

andrei-ionescu · 2021-02-10T16:11:06Z

andrei-ionescu
Feb 10, 2021

Does Hyperspace support clustered indexes?

This feature is important for fast select-all (SELECT *) lookups, for example when you need to extract a very small number of full rows from a dataset.

SELECT * FROM ds WHERE col2 > 10 AND col2 < 12

Can Hyperspace help in this respect?

Answered by imback82

Feb 12, 2021

Yes, we are exploring this as a non-covering index where we store indexed columns + pointers back to the source files (we will start with file-level granularity). We will update this thread as we make progress. Related issue: #342

sezruby · 2021-02-11T23:40:16Z

sezruby
Feb 11, 2021

Since the index data is sorted and bucketed, pushed-down filters can skip reading Parquet data by using the min/max statistics in the file footer.
We can exclude the data file paths from the input file paths by creating a custom FileScanExec.

Otherwise, we could consider a Z-order index for this.
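
A minimal sketch of the min/max skipping idea described above, written in plain Scala rather than against the Parquet or Spark APIs. `RowGroupStats`, `candidates`, and the file names are hypothetical, made up for illustration; real Parquet footers store per-column statistics for each row group, which is what this models.

```scala
// Hypothetical model of footer statistics; not the Parquet or Spark API.
object MinMaxSkipping {
  // Min/max of the filtered column for one row group, as recorded in a file footer.
  final case class RowGroupStats(file: String, rowGroup: Int, min: Int, max: Int)

  // Keep only row groups whose [min, max] range can overlap the open interval (lo, hi).
  // Everything else can be skipped without reading its data pages.
  def candidates(stats: Seq[RowGroupStats], lo: Int, hi: Int): Seq[RowGroupStats] =
    stats.filter(s => s.max > lo && s.min < hi)

  // Sorted data yields narrow, non-overlapping ranges, so most row groups are skipped.
  val sortedStats: Seq[RowGroupStats] = Seq(
    RowGroupStats("part-0.parquet", 0, 1, 5),
    RowGroupStats("part-0.parquet", 1, 6, 10),
    RowGroupStats("part-1.parquet", 0, 11, 15),
    RowGroupStats("part-1.parquet", 1, 16, 20)
  )
}
```

For the filter `col2 > 10 AND col2 < 12`, `MinMaxSkipping.candidates(MinMaxSkipping.sortedStats, 10, 12)` keeps a single row group. On unsorted data the ranges would be wide and overlapping, and far fewer row groups could be skipped.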

andrei-ionescu · 2021-02-12T16:24:10Z

andrei-ionescu
Feb 12, 2021
Author

I don't know if I was clear enough when I used the term "clustered index". A clustered index is an index whose leaves point to the position in the table (like a C++ pointer) where the data (i.e., the full record) can be found.

In my opinion, a Hyperspace clustered index should positionally map each indexed column value to one or more files, plus the positions within those files, of the full rows matching that value. I say "files" (plural) because in columnar formats a row may land in multiple places.

Let me give you the following example...

In the Hyperspace case, consider a table with 50 columns, 10 of them nested, similar to the following schema:

root
 |-- timestamp: timestamp (nullable = true)
 |-- _id: string (nullable = true)
 |-- ps: array (nullable = true)
 |    |-- el: struct (containsNull = true)
 |    |    |-- sku: string (nullable = true)
 |    |    |-- q: integer (nullable = true)
 |    |    |-- total: double (nullable = true)
 |-- comm: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- value: double (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- value: double (nullable = true)
 |    |-- l: struct (nullable = true)
 |    |    |-- value: double (nullable = true)
 |    |-- o: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |-- v: struct (nullable = true)
 |    |    |-- value: double (nullable = true)
 |-- rTs: timestamp (nullable = true)
 |-- other: struct (nullable = true)
 |    |-- a: struct (nullable = true)
 |    |    |-- e_1_100: struct (nullable = true)
 |    |    |    |-- e_1: struct (nullable = true)
 |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    ...
 |    |    |    |-- e_100: struct (nullable = true)
 |    |    |    |    |-- value: double (nullable = true)
 |    |    |-- e_101_200: struct (nullable = true)
 |    |    |    |-- e_101: struct (nullable = true)
 |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    ...
 |    |    |    |-- e_200: struct (nullable = true)
 |    |    |    |    |-- value: double (nullable = true)
 |    |    |-- e_201_300: struct (nullable = true)
 |    |    |    |-- e_201: struct (nullable = true)
 |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    ...
 |    |    |    |-- e_300: struct (nullable = true)
 |    |    |    |    |-- value: double (nullable = true)
 |    |    |    ...
 |-- d: struct (nullable = true)
 |    |-- tid: string (nullable = true)
 |    |-- w: integer (nullable = true)
 |    |-- h: integer (nullable = true)
 |    |-- c: integer (nullable = true)
 |-- s: struct (nullable = true)
 |    |-- se: string (nullable = true)
 |    |-- ks: string (nullable = true)
 |-- _ns: string (nullable = true)
 |-- f_1: integer (nullable = true)
 |-- f_2: integer (nullable = true)
 |-- f_3: integer (nullable = true)
 ...
 |-- f_50: integer (nullable = true)

I have created the following index:

hs.createIndex(ds, IndexConfig("idx1", indexedColumns = Seq("f_1"), includedColumns = Seq("timestamp", "rTs", "_ns", "_id")))

When I do

SELECT * FROM ds WHERE f_1 > 10 AND f_1 <= 20

The index will not be used, because the projected fields (all columns, for SELECT *) are not all contained in indexedColumns + includedColumns.
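
A minimal sketch of that coverage condition, in plain Scala. This is a simplification for illustration, not Hyperspace's actual rule code: `IndexColumns` and `covers` are hypothetical names, and the real optimizer rules check more than this.

```scala
object CoveringCheck {
  // Hypothetical stand-in for an index definition; mirrors the two column lists of IndexConfig.
  final case class IndexColumns(indexed: Seq[String], included: Seq[String])

  // A covering index can answer a query only if every column the query
  // references is stored in the index (indexed or included).
  def covers(index: IndexColumns, queryColumns: Set[String]): Boolean =
    queryColumns.subsetOf((index.indexed ++ index.included).toSet)

  // Same columns as idx1 above.
  val idx1: IndexColumns = IndexColumns(Seq("f_1"), Seq("timestamp", "rTs", "_ns", "_id"))
}
```

A query touching only `f_1` and `_id` passes the check, while `SELECT *` references all 50 top-level columns (for example `f_2`), so `CoveringCheck.covers(CoveringCheck.idx1, Set("f_1", "f_2"))` is false.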

Adding all 50 columns to includedColumns is not a good idea, because that would mean storing the same dataset a second time, just sorted and bucketed differently, which would double the storage cost.

I would expect Hyperspace to go to the index, find the real data files that contain those 10 f_1 values (which were previously indexed), and apply the search only to those files.

imback82 · 2021-02-12T17:04:11Z

imback82
Feb 12, 2021

Yes, we are exploring this as a non-covering index where we store indexed columns + pointers back to the source files (we will start with file-level granularity). We will update this thread as we make progress. Related issue: #342
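
A minimal sketch of what such a non-covering index with file-level pointers could look like, in plain Scala. Everything here (`Entry`, `build`, `filesFor`, the file names and values) is hypothetical and only illustrates the idea; it is not the design tracked in the related issue.

```scala
object FilePointerIndex {
  // One index entry: an indexed value and a source file containing at least one row with it.
  final case class Entry(value: Int, file: String)

  // Build entries from (file -> values found in that file), deduplicated and sorted by value.
  def build(files: Map[String, Seq[Int]]): Vector[Entry] =
    files.toVector
      .flatMap { case (file, values) => values.distinct.map(Entry(_, file)) }
      .sortBy(e => (e.value, e.file))

  // Source files that may hold rows with lo < value <= hi; only these need to be scanned.
  def filesFor(index: Vector[Entry], lo: Int, hi: Int): Set[String] =
    index.filter(e => e.value > lo && e.value <= hi).map(_.file).toSet

  // Made-up sample data: values of the indexed column per source file.
  val source: Map[String, Seq[Int]] = Map(
    "a.parquet" -> Seq(3, 12, 12),
    "b.parquet" -> Seq(25, 40),
    "c.parquet" -> Seq(18, 99)
  )
}
```

For a filter such as `f_1 > 10 AND f_1 <= 20`, `FilePointerIndex.filesFor(FilePointerIndex.build(FilePointerIndex.source), 10, 20)` narrows the scan to `a.parquet` and `c.parquet`. The full rows are then read from those source files, so the index itself stays small.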

rapoth · 2021-02-17T20:05:59Z

rapoth
Feb 17, 2021

@andrei-ionescu I was thinking more about this and want to make sure I'm not missing any scenarios. In your question, were you referring to a clustered index or will the non-clustered non-covering index that @imback82 was suggesting work for you?

From a traditional database standpoint, there can be only one clustered index, because when we build a clustered index on a particular column (or set of columns), the underlying table is physically reorganized according to the clustering key. While this kind of index gives a large performance boost for certain classes of queries, it may also require Hyperspace to manage the underlying data, which it does not do today.

On the other hand, if we go with a non-clustered index, the index will still contain pointers to the underlying data (and those pointers can be really anything, e.g., URIs to files, partitions, etc.), but Hyperspace does not need to own the management of the underlying dataset.

1 reply

andrei-ionescu Mar 4, 2021
Author

@rapoth I'm thinking of the SELECT ALL use case.

Non-clustered index support is the best option for use cases where you need to collect full rows (i.e., SELECT *). It has the best ratio of performance to cost in terms of storage and management. A fully covering index (covering all columns) may also perform very well, but its storage and management costs are higher.

Currently, many of my use cases are of this select-all kind. For example: dataset A, with 100 columns, is joined with dataset B, with a few columns, for the sole purpose of adding a new column to dataset A. Both dataset A and dataset B keep changing as new data lands in them.

Regarding the clustered index, you're right: in traditional databases there can be only one, because when the clustered index is applied the data gets re-sorted, and the clustered index is in fact the data itself with some additional information. Any other index added besides the clustered index in a traditional RDBMS is a non-clustered index with pointers to the clustered index.

In the Hyperspace case, the current type of index is the covering index, which brings great performance improvements for well-defined queries that are "covered" by the index.

A non-clustered index will still need some sort of management on the Hyperspace side, because datasets evolve over time: files are added and others are deleted. The index will therefore need to add new entries for each file that gets added, and delete the entries that refer to a deleted file.
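
A minimal sketch of that kind of maintenance, in plain Scala, assuming an index that maps each indexed value to the set of source files containing it. The `Index` shape, `addFile`, and `removeFile` are hypothetical and only illustrate the bookkeeping; they are not Hyperspace APIs.

```scala
object IndexMaintenance {
  // Hypothetical index shape: indexed value -> source files containing that value.
  type Index = Map[Int, Set[String]]

  // A new file arrived: add a pointer to it under every value it contains.
  def addFile(index: Index, file: String, values: Seq[Int]): Index =
    values.distinct.foldLeft(index) { (acc, v) =>
      acc.updated(v, acc.getOrElse(v, Set.empty[String]) + file)
    }

  // A file was deleted: drop its pointers, then drop values left with no files.
  def removeFile(index: Index, file: String): Index =
    index
      .map { case (v, files) => v -> (files - file) }
      .filter { case (_, files) => files.nonEmpty }
}
```

Both operations touch only the entries of the affected file, so the index can follow the dataset incrementally instead of being rebuilt from scratch each time.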
