The RapidMiner Belt library implements a labeled two-dimensional data table with support for columns of different types. It provides utility methods for creating, reading, and modifying such tables.
This copy of RapidMiner Belt is licensed under the Affero General Public License version 3. See the attached LICENSE file for details.
The Belt library aims at being efficient both in terms of runtime and memory usage while ensuring data consistency on the API level. This is achieved by building the library around the following concepts:
- Column-oriented design: a column-oriented data layout allows for using compact representations for the different column types.
- Immutability: all columns and tables are immutable. This not only guarantees data integrity but also allows for safely reusing components, e.g., multiple tables can safely reference the same column.
- Thread-safety: all public data structures are thread-safe and designed to perform well when used concurrently.
- Implicit parallelism: Much of Belt's built-in functionality, such as the transformations shown in the example below, automatically scales out to multiple cores.
Belt provides multiple mechanisms to construct new tables. For example, one can specify cell values via lambda functions:
Table table = Builders.newTableBuilder(10)
.addInt53Bit("id", i -> i)
.addNominal("sensor", i -> String.format("Sensor #%03d", rng.nextInt(3)))
.addReal("value_a", i -> rng.nextDouble())
.addReal("value_b", i -> rng.nextDouble())
.build(context);
Table (4x10)
id | sensor | value_a | value_b
Integer | Nominal | Real | Real
0 | Sensor #002 | 0.957 | 0.262
1 | Sensor #002 | 0.938 | 0.475
2 | Sensor #001 | 0.515 | 0.585
... | ... | ... | ...
9 | Sensor #001 | 0.197 | 0.753
Different transformations are supported.
The code below creates a new column containing the maximum of the two value columns using one of the provided apply(...) methods.
The result, a buffer of numeric values, can easily be added to the table as an additional column:
NumericBuffer max = table.transform("value_a", "value_b")
.applyNumericToReal(Double::max, context);
table = Builders.newTableBuilder(table)
.add("max_value", max.toColumn())
.build(context);
Table (5x10)
id | sensor | value_a | value_b | max_value
Integer | Nominal | Real | Real | Real
0 | Sensor #002 | 0.957 | 0.262 | 0.957
1 | Sensor #002 | 0.938 | 0.475 | 0.938
2 | Sensor #001 | 0.515 | 0.585 | 0.585
... | ... | ... | ... | ...
9 | Sensor #001 | 0.197 | 0.753 | 0.753
Another supported transformation is the reduction of one or multiple columns to a single value. For example, the following code computes the column sum for the column created above:
double sumOfMax = table.transform("max_value")
.reduceNumeric(0, Double::sum, context);
7.562
The following sections will discuss the different means to create and work with Belt tables in more detail.
Alternatively, you can jump right into the Javadoc.
To learn more about the implicit parallelism and the context used in the examples above, visit the concurrency section.
The Belt library supports both column-wise and row-wise construction of tables. Furthermore, tables can be derived from existing tables, e.g., by selecting row or column subsets.
When creating a new table, the column-wise construction is preferred: it is usually more efficient due to the column-oriented design of the library. The purpose of the row-wise construction is mostly to support use cases such as reading a row-oriented file format.
To create a new table from scratch, the Builders
class provides the method newTableBuilder(int)
.
The returned builder supports multiple methods to add columns to the new table.
The first method used below add(String, Column)
expects an existing column of matching height.
In this example, this column is created from a temporary numeric buffer in which most values are undefined.
The second method addReal(String, IntToDoubleFunction)
constructs a new column of type real by invoking the given operator for values i = 0, 1, ..., numberOfRows - 1
.
Similar methods exist for the other column types supported by Belt.
int height = 10;
NumericBuffer buffer = Buffers.realBuffer(height);
buffer.set(1, Math.PI);
Table table = Builders.newTableBuilder(height)
.add("From buffer", buffer.toColumn())
.addReal("From operator", Math::sqrt)
.build(context);
Table (2x10)
From buffer | From operator
Real | Real
? | 0.000
3.142 | 1.000
? | 1.414
... | ...
? | 3.000
The NumericBuffer
used above is one of many buffer implementations provided by the Buffers
class.
Buffers differ from columns in that they are mutable and can basically be used like an array.
However, they become immutable once toColumn()
is invoked:
subsequent calls to set(...)
will fail.
This allows for buffer implementations that do not need to duplicate data in order to guarantee the immutability of the resulting column. Thus, it is safe to assume that using buffers is as memory-efficient as constructing columns via lambda functions.
For more information about different types of buffers visit the buffer appendix.
The Writers
class provides multiple writer implementations.
There are specialized writers that support only a single column type and generic writers that allow combining different types.
To recreate the example above, we can use the specialized real row writer. Unlike with the column-wise construction, you need to specify all columns upfront. On the flip side, specifying the number of rows is optional:
NumericRowWriter writer = Writers.realRowWriter(Arrays.asList("first", "second"), true);
for (int i = 0; i < 10; i++) {
writer.move();
if (i == 1) writer.set(0, Math.PI);
writer.set(1, Math.sqrt(i));
}
Table table = writer.create();
Table (2x10)
first | second
Real | Real
? | 0.000
3.142 | 1.000
? | 1.414
... | ...
? | 3.000
All writers provide two methods:
the move() method, which creates a new row,
and the set(int, ...) method, which sets the values of individual cells of the current row.
Cells are referenced by their index, which is the same as the corresponding column index.
To recreate the table from the introductory example, we will need to use a MixedRowWriter which supports combining different types.
All writers need to know the table schema in advance.
When working with mixed writers, this is accomplished by providing a separate list of column types:
MixedRowWriter writer = Writers.mixedRowWriter(Arrays.asList("id", "sensor", "value_a", "value_b"),
Arrays.asList(TypeId.INTEGER_53_BIT, TypeId.NOMINAL, TypeId.REAL, TypeId.REAL),
false);
for (int i = 0; i < 10; i++) {
writer.move();
writer.set(0, i);
writer.set(1, String.format("Sensor #%03d", rng.nextInt(3)));
writer.set(2, rng.nextDouble());
writer.set(3, rng.nextDouble());
}
Table table = writer.create();
Table (4x10)
id | sensor | value_a | value_b
Integer | Nominal | Real | Real
0 | Sensor #002 | 0.805 | 0.124
1 | Sensor #001 | 0.744 | 0.896
2 | Sensor #001 | 0.657 | 0.610
... | ... | ... | ...
9 | Sensor #000 | 0.378 | 0.282
You might have noticed that the factory methods for writers require an additional Boolean parameter which is set to true
and false
in the first and second example above respectively.
This parameter indicates whether the writer should initialize rows with missing values (shown as ? above).
It is set to true in the first example, since we do not always set all cell values in the loop.
In the second example, however, we set all values in the loop, so no initialization is necessary.
Finally, you might have noticed that unlike with the column-wise construction, the row-wise construction does not require the use of an execution context. The reason for this is that all work is done inside the loop. In contrast, when constructing columns using lambda functions, the work is done by the builder which will use the context for that purpose.
The Belt library provides three main mechanisms to derive a table from an existing one.
- Column set selection: creates a new table consisting only of the specified columns.
- Row set selection: creates a new table consisting only of the specified rows.
- The table builder: starts the table building process outlined above from an existing table.
To specify a column set, we can use the Table#columns(List<String>)
method which takes a list of column labels as input.
The selected columns will be reordered according to the order in the list.
The following example takes the table generated at the end of the last section, drops the value_b
column, and moves the sensor
column to the end of the table:
Table reorderedColumns = table.columns(Arrays.asList("id", "value_a", "sensor"));
Table (3x10)
id | value_a | sensor
Integer | Real | Nominal
0 | 0.805 | Sensor #002
1 | 0.744 | Sensor #001
2 | 0.657 | Sensor #001
... | ... | ...
9 | 0.378 | Sensor #000
Similarly, we can select a set of rows using the Table#rows(int[], Context)
method.
The following example selects three rows and changes their order (see the id
column in the output):
Table threeRows = table.rows(new int[]{2, 0, 7}, context);
Table (4x3)
id | sensor | value_a | value_b
Integer | Nominal | Real | Real
2 | Sensor #001 | 0.657 | 0.610
0 | Sensor #002 | 0.805 | 0.124
7 | Sensor #001 | 0.787 | 0.200
This method also supports duplicate row ids.
For example, passing in the index array {0, 0, 0} instead would create a 4×3 table containing the first row three times.
As a special case of row selection there are filter methods that only select the rows satisfying a certain condition, and sorting methods to reorder the rows.
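As a hedged sketch of the sorting case (the exact filter and sort signatures are not shown here; Table#sort and the Order enum are assumptions in this snippet and should be verified in the Javadoc), sorting by a column could look like this:
// Hypothetical: sort the rows by the "value_a" column in ascending order.
// The sort(String, Order, Context) signature and Order.ASCENDING are assumed here.
Table sorted = table.sort("value_a", Order.ASCENDING, context);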
The third mechanism restarts the table building process (without modifying the input table).
It provides all of the methods discussed in the section on column-wise table creation and additional methods that allow replacing or removing columns.
For example, the following code replaces the column value_a
with one scaled by a factor of 500
, renames the sensor
column, and again drops the value_b
column:
NumericBuffer valueAScaled = table.transform("value_a")
.applyNumericToReal(v -> v * 500, context);
Table complexChanges = Builders.newTableBuilder(table)
.rename("sensor", "sensor_id")
.replace("value_a", valueAScaled.toColumn())
.remove("value_b")
.build(context);
Table (3x10)
id | sensor_id | value_a
Integer | Nominal | Real
0 | Sensor #002 | 402.686
1 | Sensor #001 | 371.767
2 | Sensor #001 | 328.636
... | ... | ...
9 | Sensor #000 | 189.064
For more details about what information a table holds, visit the Table properties appendix.
Tables can be read either column-wise or row-wise. If possible, column-wise reading is preferred since it allows choosing the most appropriate reader for each column. Due to the column-oriented design of the library, column-wise reading usually also performs better than row-wise reading.
The entry point for both column-wise and row-wise reading is the Readers
class.
There are three possible ways to read columns:
A column can be read as a sequence of numbers (i.e. double values) with
NumericReader reader = Readers.numericReader(column);
while(reader.hasRemaining()){
double value = reader.read();
}
Alternatively, a column can be read as a sequence of category indices with
CategoricalReader reader = Readers.categoricalReader(column);
while(reader.hasRemaining()){
int index = reader.read();
}
As a third option, a column can be read as a sequence of objects of a specified class with
ObjectReader<String> reader = Readers.objectReader(column, String.class);
while(reader.hasRemaining()){
String value = reader.read();
}
In every case a reader is constructed and the next value is read via reader.read().
If reading should start from a certain position, reader.setPosition(desiredNextRead - 1) can be used to move the reader in front of the desired position so that the next value is read at that position.
NumericReader and ObjectReader instances can be used for any column with capability Capability.NUMERIC_READABLE and Capability.OBJECT_READABLE, respectively.
For instance, to check whether a column is numeric-readable, we can call column.type().hasCapability(Capability.NUMERIC_READABLE)
.
Please note that a column does not have to be of category Category.NUMERIC to be numeric-readable.
In contrast, a CategoricalReader
instance requires the columns to be of category Category.CATEGORICAL
, which can be checked with column.type().category() == Category.CATEGORICAL
.
For further information on different types of columns visit the section on Column types.
When reading categorical columns with a CategoricalReader
as in the following example, missing values are represented by CategoricalReader.MISSING_CATEGORY
.
Column column = getColumn();
if (column.type().category() == Column.Category.CATEGORICAL) {
int firstMissing = -1;
CategoricalReader reader = Readers.categoricalReader(column);
for (int i = 0; i < column.size(); i++) {
if (reader.read() == CategoricalReader.MISSING_CATEGORY) {
firstMissing = i;
break;
}
}
}
When reading with a NumericReader
missing values are Double.NaN
and with an ObjectReader
they are null
.
In the following code example, we read columns either as numbers or as objects.
// All columns are numeric-readable, object-readable, or both
if (column.type().hasCapability(Capability.NUMERIC_READABLE)) {
NumericReader reader = Readers.numericReader(column);
while (reader.hasRemaining()) {
double value = reader.read();
handle(value);
}
} else {
ObjectReader<Object> reader = Readers.objectReader(column, Object.class);
while (reader.hasRemaining()) {
Object value = reader.read();
handle(value);
}
}
It is important to know that every column is numeric-readable, object-readable or both.
All object-readable columns can be read with an ObjectReader<Object>
but if you know more about the column type you can also use a more specialized class.
Note that when reading a categorical column with a categorical reader, the associated String value (as obtained with an object reader) can be retrieved via column.getDictionary().get(categoryIndex)
.
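For example, a categorical read and the dictionary lookup can be combined as in the following sketch, which uses only the calls shown in this document:
CategoricalReader reader = Readers.categoricalReader(column);
Dictionary dictionary = column.getDictionary();
while (reader.hasRemaining()) {
    int categoryIndex = reader.read();
    // returns null for unassigned indices such as the missing category
    String value = dictionary.get(categoryIndex);
}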
For more information about dictionaries visit the dictionary appendix.
Similar to column-wise reading, there are three standard ways to read row-wise:
A list of columns can be read as numbers (i.e. double values) with
NumericRowReader reader = Readers.numericRowReader(columns);
while (reader.hasRemaining()) {
reader.move();
for (int i = 0; i < reader.width(); i++) {
double value = reader.get(i);
}
}
Alternatively, categorical columns can be read as category indices with
CategoricalRowReader reader = Readers.categoricalRowReader(columns);
while (reader.hasRemaining()) {
reader.move();
for (int i = 0; i < reader.width(); i++) {
int value = reader.get(i);
}
}
As third option, columns can be read as objects of a specified class with
ObjectRowReader<String> reader = Readers.objectRowReader(columns, String.class);
while (reader.hasRemaining()) {
reader.move();
for (int i = 0; i < reader.width(); i++) {
String value = reader.get(i);
}
}
where now the class must be compatible with all the columns.
For all readers, the reader.move()
method advances the reader by one row.
After that, row values can be accessed via reader.get(index)
.
As for column-wise reading, if reading should start from a certain row, reader.setPosition(desiredNextRow - 1) can be used to move the reader in front of the desired row so that the next values are read at that row position.
As before, missing values in NumericRowReader
are Double.NaN
,
in CategoricalRowReader
they are CategoricalReader.MISSING_CATEGORY
, and in ObjectRowReader
they are null
.
Analogous to column-wise reading, all columns to be read by a NumericRowReader
must be numeric-readable
and all columns for ObjectRowReader
must be object-readable.
In contrast, CategoricalRowReader
requires the columns to be categorical.
The readability and category can be checked as described for column-wise reading.
For further information on different types of columns visit the section on Column types.
To find all columns in a table with a certain readability property one can use the table.select()
method as in the following example:
List<Column> numericReadableColumns =
table.select().withCapability(Column.Capability.NUMERIC_READABLE).columns();
for (NumericRowReader reader = Readers.numericRowReader(numericReadableColumns); reader.hasRemaining(); ) {
reader.move();
for (int i = 0; i < reader.width(); i++) {
double value = reader.get(i);
handle(value);
}
}
Note that when reading categorical columns with a numeric reader, the associated String values (as obtained with an object reader) can be retrieved via column.getDictionary().get((int) value)
.
For more information about dictionaries visit the dictionary appendix.
The next example uses the column selector to find all nominal columns which are readable as String objects.
ColumnSelector columnSelector = table.select().ofTypeId(Column.TypeId.NOMINAL);
List<Column> nominalColumns = columnSelector.columns();
List<String> nominalLabels = columnSelector.labels();
List<String> found = new ArrayList<>();
for (int i = 0; i < nominalColumns.size(); i++) {
ObjectReader<String> reader = Readers.objectReader(nominalColumns.get(i), String.class);
while (reader.hasRemaining()) {
String read = reader.read();
if (read != null && read.contains("toFind")) {
found.add(nominalLabels.get(i));
break;
}
}
}
Besides the specialized readers there is one other reader that can read all types of columns.
MixedRowReader reader = Readers.mixedRowReader(table);
while (reader.hasRemaining()) {
reader.move();
for (int i = 0; i < table.width(); i++) {
ColumnType<?> type = table.column(i).type();
if (type.category() == Column.Category.CATEGORICAL) {
int index = reader.getIndex(i);
} else if (type.hasCapability(Column.Capability.OBJECT_READABLE)) {
Object value = reader.getObject(i);
} else {
double value = reader.getNumeric(i);
}
}
}
The method reader.getNumeric(index)
returns as double
the numeric value of the index-th column or Double.NaN
if the column is not numeric-readable.
The method reader.getIndex(index)
returns the category index of a categorical column or some undefined value if the index-th column is not categorical.
The method reader.getObject(index)
returns the object value of an object-readable column or null
if the column is not object-readable.
If a more specialized class of an object-readable column is known, the method reader.getObject(index, type)
can be used,
e.g. reader.getObject(5, String.class)
returns the String value.
Note that a table containing different kinds of columns does not necessarily require a mixed row reader for row-wise reading.
Keep in mind that every column is numeric-readable, object-readable, or both.
Thus, reading the columns with capability Capability.NUMERIC_READABLE using a numeric row reader and the remaining columns using an object row reader always covers all columns.
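A minimal sketch of this, using only calls shown in this section plus table.columnList() (described under Table properties) and assuming both groups are non-empty, could partition the columns by capability and read each group with the matching row reader:
// Split the columns into numeric-readable ones and the rest (which must be object-readable).
List<Column> numericReadable = new ArrayList<>();
List<Column> objectOnly = new ArrayList<>();
for (Column column : table.columnList()) {
    if (column.type().hasCapability(Column.Capability.NUMERIC_READABLE)) {
        numericReadable.add(column);
    } else {
        objectOnly.add(column);
    }
}
// One numeric and one object row reader together cover all columns.
NumericRowReader numericReader = Readers.numericRowReader(numericReadable);
ObjectRowReader<Object> objectReader = Readers.objectRowReader(objectOnly, Object.class);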
Even though the row readers implement a row interface, these rows are only views and should never be stored. The following example will not allow future access to all rows but only the last:
// Don't do this!
List<NumericRow> rows = new ArrayList<>();
NumericRowReader reader = Readers.numericRowReader(column1, column2);
while (reader.hasRemaining()) {
reader.move();
rows.add(reader);
}
Much of the built-in functionality of the Belt library can scale out to multiple CPU cores.
In particular, most data transformations work concurrently.
It is easy to identify whether a particular API supports parallel processing or not.
If the API requires an execution Context
, it can scale the operation out to multiple cores.
If it does not, it will run the code single-threaded unless you manually parallelize it, e.g.,
by directly using the functionality provided by the context.
The Context
class provides methods similar to the ones provided by Java's thread pools.
For instance, the method Context#getParallelism()
specifies the desired parallelism level (number of worker threads used)
and the method Context#call(List<Callable>)
executes a list of functions concurrently.
For a complete list of context methods, please refer to the Javadoc.
The Context
class differs from thread pool implementations in that it does not necessarily maintain its own pool of worker threads.
The idea is that multiple short-lived Context
instances can share the same long-lived worker pool.
A single Context
instance is intended to be used for a single complex computation such as an analytical workflow:
the context manages the resources for that workflow and is typically marked as inactive as soon as the workflow is completed (or aborted).
This does not mean that a new context should be used for each interaction with the Belt library.
A workflow might consist of many such interactions.
Most of the time all you need to do is to pass a context to the library functions.
The library will then automatically decide on a suitable scheduling strategy.
How to use the Context
class directly for use cases not covered by these functions will be described later on this page.
An example for functions that automatically scale out to multiple cores are the table transformations. For instance, the following example might be executed concurrently:
NumericBuffer addOne = table.transform("data")
.applyNumericToReal(v -> v + 1, context);
Whether the code runs multi-threaded and with how many worker threads depends on:
- The parallelism level of the context. At most Context#getParallelism() worker threads will be used. If the parallelism level is one, the code will run single-threaded.
- The size of the data table. Very small data sets might be processed using a single worker thread, since the scheduling overhead for the multi-threaded code might outweigh the performance increase due to the parallelization.
- The expected workload per data point. What is considered a very small data set also depends on the expected complexity of the given operator. By default, the Belt library assumes a medium workload such as a string manipulation.
The expected workload per data point can also be specified explicitly using the workload(Workload)
method.
For example, there are only a few operations faster than the simple addition in the example above.
Thus, we can explicitly set the workload Workload.SMALL
:
NumericBuffer addOne = table.transform("data")
.workload(Workload.SMALL)
.applyNumericToReal(v -> v + 1, context);
Whereas for expensive arithmetic operations such as Math#sin(double)
, a larger workload might be more appropriate:
NumericBuffer sin = table.transform("data")
.workload(Workload.MEDIUM)
.applyNumericToReal(Math::sin, context);
The workload is not only a factor when deciding whether to run the code in parallel, but also when deciding on the actual scheduling strategy. Thus, it is recommended to specify the workload whenever it is known in advance. More information on the different workload settings can be found in the Javadoc.
Finally, if you want to keep track of the progress of a computation handed off to the execution context,
you can use the callback(DoubleConsumer)
method.
For example, the following code would write the progress to the standard output:
NumericBuffer sin = table.transform("data")
.workload(Workload.MEDIUM)
.callback(p -> System.out.println(String.format("Progress: %.0f%%", p * 100)))
.applyNumericToReal(Math::sin, context);
The progress is either in the range [0, 1] or of value Double.NaN for indeterminate states.
Please note that invocations of the callback function are not synchronized.
For instance, if two threads complete the 4th and 5th batch of a workload at the same time, the callbacks with values 0.4
and 0.5
will be triggered simultaneously as well.
Thus, in the example above, the progress might be printed out of order.
While it would be possible to extract column statistics, e.g. the number of missing values, from a column using the Transformer#reduce* methods,
there is built-in functionality for this and for more sophisticated statistics, accessible via the Statistics class.
When you encounter a use case that is not covered by the auto-scaling functions described above, you can still use the Belt library in a concurrent way. However, you will have to implement your own scheduling logic.
The library's core data structures, such as buffers, columns, and tables are all thread safe and designed to allow efficient parallel use. However, temporary constructs such as readers or writers are usually not. For example, it is safe to read from one table with two threads, but each thread will require its own reader.
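For example, the following sketch reads the same column from two threads; the table and column are shared safely, but each thread constructs its own reader (the "value_a" column from the earlier examples is assumed):
Runnable readTask = () -> {
    // Each execution creates its own reader; readers must not be shared between threads.
    NumericReader reader = Readers.numericReader(table.column("value_a"));
    while (reader.hasRemaining()) {
        double value = reader.read();
        // ... process value ...
    }
};
new Thread(readTask).start();
new Thread(readTask).start();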
While you can use the core data structures of the Belt library with any threading framework,
it is recommended to use its Context
implementation.
Using the same context and thus worker pool for all concurrent code will keep all resource management at a central place.
Furthermore, the Context
implementation allows for advanced use cases such as nested submissions.
Let us, as a simple example, recreate the computations above as single-threaded loops.
In order to run them using a context, we will have to wrap these loops in Callables.
The addOne
implementation could look like this:
Callable<NumericBuffer> addOne = () -> {
NumericBuffer buffer = Buffers.realBuffer(table.column("data"));
for (int i = 0; i < buffer.size(); i++) {
buffer.set(i, buffer.get(i) + 1);
}
return buffer;
};
The code applying the sine function can be implemented the same way:
Callable<NumericBuffer> sin = () -> {
NumericBuffer buffer = Buffers.realBuffer(table.column("data"));
for (int i = 0; i < buffer.size(); i++) {
buffer.set(i, Math.sin(buffer.get(i)));
}
return buffer;
};
To run the two loops concurrently, we can simply use the Context#call(List<Callable>)
method:
List<NumericBuffer> buffers = context.call(Arrays.asList(addOne, sin));
You might wonder why this method is blocking, i.e., does not return until all computations have completed.
Usually, such a design can cause issues when invoking call(List<Callable>) from inside a worker thread:
the worker thread has to wait for other worker threads to complete their computations.
This might prevent the worker pool from being fully utilized or even cause deadlocks.
The implementation of the context prevents this problem by checking whether or not the invoking thread is a worker thread. If it is, the thread will take part in the computation. In particular, no deadlocks can occur due to nested submissions. To summarize, writing code like this using the Belt library is safe.
Callable<List<NumericBuffer>> nested = () -> context.call(Arrays.asList(addOne, sin));
List<List<NumericBuffer>> buffers = context.call(Arrays.asList(nested));
The example above makes little sense on its own. However, the same mechanism allows mixing single-threaded computations in the foreground and auto-scaling methods, while using only the resources managed by the context. Consider the following method that does some single-threaded work in the foreground and then calls one of the auto-scaling methods of the Belt library:
public static Table mixed(Table table, Context context) {
// Run in foreground
NumericBuffer buffer = Buffers.realBuffer(table.column("column_a"));
for (int i = 0; i < buffer.size(); i++) {
// Single threaded operation
}
// Use auto-scaling methods
buffer = new RowTransformer(Arrays.asList(buffer.toColumn(), table.column("column_b")))
.applyNumericToReal((a, b) -> { /* Concurrent operation */ }, context);
// Add result to input table
return Builders.newTableBuilder(table)
.add("result", buffer.toColumn())
.build(context);
}
By invoking the code as follows, all operations are executed inside the worker pool managed by the context.
Due to Belt's scheduling logic, the nested call to RowTransformer#applyNumericToReal(Column, Column)
in the mixed(Table, Context)
method shown above will not block any of the worker threads:
context.call(Arrays.asList(() -> mixed(table, context)));
With the ColumnIO class, Belt allows writing and reading columns with a primitive representation using ByteBuffers.
Writing a small numeric-readable column into a file can be done like this:
Path path = Paths.get("myfile");
try (FileChannel channel = FileChannel.open(path, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
ByteBuffer buffer = ByteBuffer.allocate(column.size() * 8);
ColumnIO.putNumericDoubles(column, 0, buffer);
buffer.flip();
channel.write(buffer);
} catch (IOException e) {
e.printStackTrace();
}
There are different put-methods for the different column types, see the table at the end of this section.
If the column is too big for one ByteBuffer
, writing in a loop is recommended, as here for a time column:
int written = 0;
while(written < column.size()) {
buffer.clear();
written += ColumnIO.putTimeLongs(column, written, buffer);
buffer.flip();
channel.write(buffer);
}
The ColumnIO
class provides read methods like ColumnIO.readReal(length)
that return a column builder which can be
fed with ByteBuffer
s. Creating a small integer column from a file works as follows:
Path path = Paths.get("myfile");
try (FileChannel channel = FileChannel.open(path)) {
ByteBuffer buffer = ByteBuffer.allocate(1024);
int read = channel.read(buffer);
buffer.flip();
Column readColumn = ColumnIO.readInteger53Bit(read/8).put(buffer).toColumn();
System.out.println(readColumn);
} catch (IOException e) {
e.printStackTrace();
}
Note that when building integer columns, the provided double values are rounded, the same as when using Buffers.integer53BitBuffer(length).
In the common case that the column is too big to fit into a single ByteBuffer
, reading can be done in a loop:
NominalColumnBuilder builder = ColumnIO.readNominal(dictionaryValues, size);
while (builder.position() < size) {
buffer.clear();
channel.read(buffer);
buffer.flip();
builder.putIntegers(buffer);
}
Column column = builder.toColumn();
Note that for nominal columns the category indices can be written and read but the dictionary values are not stored in
ByteBuffer
s automatically. When using ColumnIO.readNominal(dictionaryValues, size)
it is required that the
dictionary values are known as a LinkedHashSet
where the order of the set aligns with the category indices, starting
with null
which is matched to the category index 0.
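As a sketch, and under the assumption that the written column's category indices are the sequential values 1, 2, ... (with 0 reserved for missing values), the required LinkedHashSet can be rebuilt from the column's dictionary using only calls shown in this document:
// Rebuild the dictionary values in category-index order, starting with null for index 0.
LinkedHashSet<String> dictionaryValues = new LinkedHashSet<>();
dictionaryValues.add(null);
for (Dictionary.Entry entry : column.getDictionary()) {
    // assumes the iteration order follows the (gap-free) category indices
    dictionaryValues.add(entry.getValue());
}
// size: the number of rows that were written
NominalColumnBuilder builder = ColumnIO.readNominal(dictionaryValues, size);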
This table gives an overview over the ColumnIO
methods to use for writing and reading different column types.
(For information on types of columns visit the section on Column types).
Additional information can be found in the Javadoc.
type | writing | reading | format | missing value |
---|---|---|---|---|
REAL | putNumericDoubles | readReal(length).put | double values | Double.NaN |
INTEGER_53_BIT | putNumericDoubles | readInteger53Bit(length).put | double values | Double.NaN |
NOMINAL | putCategoricalIntegers | readNominal(length).putIntegers | int category indices | 0 |
NOMINAL with maximal dictionary index 32767 | putCategoricalShorts | readNominal(length).putShorts | short category indices | 0 |
NOMINAL with maximal dictionary index 127 | putCategoricalBytes | readNominal(length).putBytes | byte category indices | 0 |
DATE_TIME | putDateTimeLongs | readDateTime(length).putSeconds | long values representing seconds since epoch | Long.MAX_VALUE |
DATE_TIME with subsecond precision (additionally) | putDateTimeNanoInts | readDateTime(length).putNanos | int values between 0 and 999,999,999 representing nanoseconds | undefined |
TIME | putTimeLongs | readTime(length).put | long values representing nanoseconds of the day | Long.MAX_VALUE |
Tables consist of labelled columns and optional meta data, e.g., for annotations or to assign roles.
The list of column labels or column names can be accessed with table.labels()
and contains as many elements as
table.width()
. To check if a column with a certain label is part of a table, use table.contains(label)
. If the label
is part of the table, its position in the table can be ascertained with table.index(label)
.
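A short sketch using the accessors above (the column label "sensor" from the earlier examples is assumed):
List<String> labels = table.labels();   // one label per column, labels.size() == table.width()
if (table.contains("sensor")) {
    int position = table.index("sensor");
}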
Finding all column labels with certain properties can be done with the table.select()
method, for example
List<String> numericColumnsWithoutRole = table.select()
.withoutMetaData(ColumnRole.class)
.ofCategory(Column.Category.NUMERIC)
.labels();
For labels contained in the table, the associated column is accessible via table.column(label)
and has the
size table.height()
. Alternatively, columns can be accessed by index via table.column(index)
where the index is
between 0 (inclusive) and table.width()
(exclusive).
To obtain an (immutable) list of all columns, the method table.columnList()
can be used. A list of columns with
certain properties can be obtained with the select method
List<Column> numericColumnsWithoutRole = table.select()
.withoutMetaData(ColumnRole.class)
.ofCategory(Column.Category.NUMERIC)
.columns();
There are three built-in types of ColumnMetaData:
- ColumnRole, which contains a fixed set of roles for columns
- ColumnAnnotation, which can be used to associate text, e.g. descriptions, with a column
- ColumnReference, which can be used to indicate a relationship between columns
Further classes of ColumnMetaData
can be constructed but they must be immutable.
The list of meta data associated with a column can be obtained via table.getMetaData(label)
.
If one is only interested in meta data of a certain type, e.g. column annotations, one can use table.getMetaData(label, ColumnAnnotation.class)
to get the list of annotations associated with a column with a certain label.
If a meta datum is unique for each column,
one can use table.getFirstMetaData(label, ColumnRole.class)
to get that unique meta datum or null
.
Column roles and column references are examples of unique column meta data.
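For example, the unique role of a column (if any) can be looked up as in the following sketch; ColumnRole.OUTLIER is used only because it also appears in the example below:
// Returns the role attached to the column or null if it has none.
ColumnRole role = table.getFirstMetaData("sensor", ColumnRole.class);
if (role == ColumnRole.OUTLIER) {
    // ... handle outlier column ...
}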
To find columns that have certain meta data, the select method can be used
List<String> columnsWithRoleButNotOutlier = table.select()
.withMetaData(ColumnRole.class)
.withoutMetaData(ColumnRole.OUTLIER)
.labels();
To change the meta data associated with a column in a certain table, a new table must be constructed via a table builder.
Inside the builder, the methods TableBuilder#addMetaData
and TableBuilder#clearMetaData
can be used.
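As a sketch, assuming addMetaData takes a column label and a meta datum (verify the exact signature in the Javadoc), a role could be attached like this:
// Derive a new table in which the "sensor" column carries the OUTLIER role
// (ColumnRole.OUTLIER is used here only because it appears in the example above).
Table withRole = Builders.newTableBuilder(table)
        .addMetaData("sensor", ColumnRole.OUTLIER)
        .build(context);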
For more details about creating new tables from old ones using a table builder visit the table creation section.
Each column has a type, accessible by column.type()
.
Types are grouped into categories and have different capabilities such as being numeric-readable or sortable.
Columns are divided into three categories: numeric, categorical, and object columns.
To determine to which category a column belongs, use column.type().category().
- NUMERIC columns contain numeric data, such as 64 bit real or 53 bit integer numbers.
- CATEGORICAL columns contain non-unique integer indices paired with an index mapping to a String. The index mapping is called a dictionary. A dictionary of two or fewer values can additionally know if a value is positive or negative. It is important to note that not all values in the dictionary are required to appear in the data. For more information about dictionaries refer to Dictionaries.
- OBJECT columns contain complex types such as instants of time. In contrast to categorical columns, object columns contain data that usually does not define a set of categories. For example, date-time data usually contains so many different values that building a dictionary is not feasible in general.
Columns can have the capabilities NUMERIC_READABLE
, OBJECT_READABLE
and SORTABLE
.
To check if a column has a certain capability use column.type().hasCapability(capability)
.
Every column is either numeric-readable or object-readable or both. Numeric columns are numeric-readable, object columns are object-readable, and categorical columns are both. Columns can be readable in different ways, as shown by the time columns in the following section.
Checking the capabilities of one or multiple columns is important for picking the correct reader. See Reading tables for details.
While all current column types are sortable, sorting might not make sense for some cell types.
A column must be sortable when it is used as a sort criterion in the table.sort methods.
All columns that have a non-null Comparator accessible via column.type().comparator() are sortable, but there are columns that are sortable without supplying a comparator, for example real columns.
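For example, sortability can be checked before relying on the comparator:
if (column.type().hasCapability(Column.Capability.SORTABLE)) {
    // may still be null, e.g., for real columns, which sort without an explicit comparator
    Comparator<?> comparator = column.type().comparator();
}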
Every one of the types described in detail in the next section is associated with one of the type ids REAL
, INTEGER_53_BIT
, NOMINAL
, DATE_TIME
, TIME
, TEXT
, TEXT_LIST
or TEXT_SET
.
For object and categorical columns, the type of their elements can be accessed via column.type().elementType()
.
For instance, nominal columns have the element type String.class
.
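For example, the element type can be used to pick a matching specialized reader:
if (column.type().elementType() == String.class
        && column.type().hasCapability(Column.Capability.OBJECT_READABLE)) {
    ObjectReader<String> reader = Readers.objectReader(column, String.class);
}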
Belt has the following column types.
Type Id | Description | Category | Capabilities | Element Type |
---|---|---|---|---|
REAL | 64 bit double values | NUMERIC | NUMERIC_READABLE, SORTABLE | Void.class |
INTEGER_53_BIT | 64 bit double values without fractional digits | NUMERIC | NUMERIC_READABLE, SORTABLE | Void.class |
NOMINAL | int category indices together with a dictionary of String values | CATEGORICAL | NUMERIC_READABLE, OBJECT_READABLE, SORTABLE | String.class |
DATE_TIME | Java Instant objects | OBJECT | OBJECT_READABLE, SORTABLE | Instant.class |
TIME | Java LocalTime objects | OBJECT | OBJECT_READABLE, NUMERIC_READABLE, SORTABLE | LocalTime.class |
TEXT | Java String objects | OBJECT | OBJECT_READABLE, SORTABLE | String.class |
TEXT_LIST | Custom StringList objects | OBJECT | OBJECT_READABLE, SORTABLE | StringList.class |
TEXT_SET | Custom StringSet objects | OBJECT | OBJECT_READABLE, SORTABLE | StringSet.class |
Text-list columns have elements that are immutable List<String>s, defined by the StringList class, and text-set columns have elements that are immutable Set<String>s, defined by the StringSet class.
Time columns are numeric-readable as nanoseconds since 00:00.
Real, 53 bit integer, nominal, and time columns are numeric-readable; the others are only object-readable.
Please note that the 53 Bit Integer columns internally store rounded 64 bit double
values and not int
values.
Therefore, the maximum / minimum integer value that can be stored without loss of information is +/- 2^53-1
,
or +/- 9,007,199,254,740,991
.
This range is a lot bigger than that of Java Integers.
Consequently, casting to int
may lead to loss of information for large numbers and should be avoided.
Instead, cast to long
if necessary (but note that the full range of long
cannot be stored in a 53 bit integer column).
While both nominal and text columns can be read as String
values, the text columns have no underlying category indices
and are meant for the case where the String
values of the column are (mostly) different.
All categorical columns (see Column types) have dictionaries.
Dictionaries are a mapping from category indices to String values.
Every assigned category index is associated with a different String value.
If a category index is not assigned to an object value, it is assumed to be assigned to null
.
The category index CategoricalReader.MISSING_CATEGORY
is always unassigned and is the category index that stands for a missing value.
For example, the following nominal column
Nominal Column (5)
(green, red, ?, red, ?)
could have the underlying category indices
[1, 2, 0, 2, 0]
and the dictionary
{1 -> green, 2 -> red}
To access a dictionary of a categorical column, use the method column.getDictionary()
.
If you have a category index and want to find out the associated object value, use dictionary.get(index)
.
This method returns null
for unassigned indices such as CategoricalReader.MISSING_CATEGORY
.
If you require the reverse mapping from object values to category indices, you can create one by using dictionary.createInverse()
.
To iterate through all assigned String values together with their category indices, the dictionary iterator can be used. Continuing with the example above we get:
Dictionary dictionary = column.getDictionary();
for (Dictionary.Entry entry : dictionary) {
System.out.println(entry.getIndex() + " -> " + entry.getValue());
}
1 -> green
2 -> red
Not every object value in a dictionary must appear in the data. But the objects in the dictionary give an upper bound for the different object values in the column. For example, the nominal column above could have the following underlying category indices and dictionary:
[1, 3, 0, 3, 0]
{1 -> green, 2 -> blue, 3 -> red}
To get rid of unused dictionary entries, one can use the method Columns#removeUnusedDictionaryValues
.
There are two options: either just remove the unused values, or additionally compact the dictionary so that the remaining category indices are sequential.
Column newColumn = Columns.removeUnusedDictionaryValues(column, Columns.CleanupOption.REMOVE, context);
creates a new column with the same underlying category indices and the dictionary
{1 -> green, 3 -> red}
Second,
Column newCompactColumn = Columns.removeUnusedDictionaryValues(column, Columns.CleanupOption.COMPACT, context);
results in a new column with the dictionary
{1 -> green, 2 -> red}
and adjusted category indices.
If gaps between used category indices cause no problems, the first option should be preferred.
In case the gaps need to be removed later on, Columns.compactDictionary(newColumn) can be used to obtain the newCompactColumn from above.
If the category indices of two similar categorical columns with different dictionaries should be compared, these dictionaries need to be aligned.
The methods Columns#changeDictionary
and Columns#mergeDictionary
help with that.
Assume we have columnA
from the beginning of the section with the following dictionary:
Nominal Column (5)
(green, red, ?, red, ?)
{1 -> green, 2 -> red}
Furthermore, assume there is a columnB
that has similar entries and the dictionary below:
Nominal Column (6)
(?, red, yellow, green, ?, green)
{1 -> red, 2 -> yellow, 3 -> green}
Then the following code results in a new columnB with the dictionary shown below:
Column newColumnB = Columns.mergeDictionary(columnB, columnA);
Nominal Column (6)
(?, red, yellow, green, ?, green)
{1 -> green, 2 -> red, 3 -> yellow}
The dictionary starts with the same values as the dictionary of columnA
.
In contrast, the following code results in another new columnB with the dictionary shown below:
Column newColumnB = Columns.changeDictionary(columnB, columnA);
Nominal Column (6)
(?, red, ?, green, ?, green)
{1 -> green, 2 -> red}
The dictionary is exactly the same as the dictionary for columnA
.
Since the dictionary from columnA
does not contain yellow
, the new columnB has a missing value instead of yellow
.
Note that the columns do not have to be the same size.
When dictionaries are boolean, they know which values are positive and which are negative.
Whether or not a dictionary is boolean can be checked via Dictionary#isBoolean
.
Only dictionaries with at most two values can be boolean.
Using Columns#toBooleanColumn
, a column can be made into a boolean categorical column if it satisfies the requirements which can be checked via Columns#isAtMostBicategoricalColumn
.
For our running example
Nominal Column (5)
(green, red, ?, red, ?)
the following code leads to a boolean nominal column with the additional information that green
is positive:
Column booleanColumn = Columns.toBoolean(column, "green");
If there is only one value in the dictionary and it is supposed to be negative, Columns.toBoolean(column, null)
can be called.
Boolean information can be accessed as in the following example
Dictionary dictionary = booleanColumn.getDictionary();
if(dictionary.isBoolean()){
if(dictionary.hasPositive()){
String positiveValue = dictionary.get(dictionary.getPositiveIndex());
System.out.println("positive: " + positiveValue);
}
if(dictionary.hasNegative()){
String negativeValue = dictionary.get(dictionary.getNegativeIndex());
System.out.println("negative: " + negativeValue);
}
}
Note that even if a value is positive or negative in the dictionary, it could still be absent in the data.
For example, if the dictionary in the example above has green as its positive and red as its negative value, it is still possible that the data only contains red and missing values.
Column buffers are temporary data structures that help with the construction of new columns. One can think of them as mutable arrays that can be frozen and converted into immutable columns.
Buffers are created with a fixed size, and afterwards one can set or access values at certain indices.
NumericBuffer buffer = Buffers.realBuffer(10);
for (int i = 0; i < 10; i++) {
buffer.set(i, i + 0.123);
}
Column column = buffer.toColumn();
Real Column (10)
(0.123, 1.123, 2.123, 3.123, 4.123, 5.123, 6.123, 7.123, 8.123, 9.123)
To create a column from a buffer, buffer.toColumn
is called.
After that, subsequent calls to the set
-method will fail.
buffer.set(4, 2.71);
java.lang.IllegalStateException: Buffer cannot be changed after used to create column
at com.rapidminer.belt.buffer.RealBuffer.set(RealBuffer.java:72)
Instead of creating an empty buffer, a buffer can be created from a column. This is helpful in cases where only a few values of a column need to be changed. Of course, the column type needs to be compatible with the buffer type.
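For example, a numeric-readable column can be copied into a buffer, a single value changed, and the result frozen into a new column (the "value_a" column from the earlier examples is assumed):
NumericBuffer buffer = Buffers.realBuffer(table.column("value_a"));
buffer.set(0, 0.5); // change only the first value
Column changed = buffer.toColumn();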
The different types of buffers have various special properties described below.
A real buffer can be thought of as a double
array.
NumericBuffer buffer = Buffers.realBuffer(10, false);
Random random = new Random(123);
for (int i = 0; i < buffer.size(); i++) {
buffer.set(i, random.nextInt(100) + Math.PI);
}
buffer.set(2, Double.NaN);
buffer.set(7, Double.POSITIVE_INFINITY);
buffer.set(1, Double.NEGATIVE_INFINITY);
Column column = buffer.toColumn();
Real Column (10)
(85.142, -Infinity, ?, 92.142, 98.142, 60.142, 37.142, Infinity, 88.142, 56.142)
In the example above, the method Buffers.realBuffer(size, false)
is used.
This turns off the initialization of the buffer and should be done in every case where the set(index, value)
method will be called for every index.
Buffers.realBuffer(size, true)
or just Buffers.realBuffer(size)
initializes the buffer to missing values so that the value at every unset index is a missing value.
Missing values can be set via buffer.set(index, Double.NaN)
and it is also possible to set infinite values.
When creating a real buffer from a column via Buffers.realBuffer(column)
, the column must have the capability NUMERIC_READABLE
(see Column types).
If a value should be changed depending on the current value, the method buffer.get(index)
which returns a double value can be used.
A 53 bit integer buffer is similar to a real buffer but in order to ensure double
values without fractional digits
required for INTEGER_53_BIT
columns (see Column types), the input is rounded by the buffer.
Please note that the buffer internally stores rounded 64 bit double
values and not int
values.
Therefore, the maximum / minimum integer value that can be stored without loss of information is +/- 2^53-1
,
or +/- 9,007,199,254,740,991
.
NumericBuffer buffer = Buffers.integer53BitBuffer(10, true);
buffer.set(2, 3.0);
buffer.set(1, 4.0);
buffer.set(9, 3.14);
buffer.set(5, 2.718);
buffer.set(2, Double.NaN);
buffer.set(7, Double.POSITIVE_INFINITY);
buffer.set(6, Double.NEGATIVE_INFINITY);
Column column = buffer.toColumn();
Integer Column (10)
(?, 4, ?, ?, ?, 3, -Infinity, Infinity, ?, 3)
Here, both 3.14
and 2.718
are rounded to 3
.
As for real buffers, it is possible to set infinite and missing values via Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY, and Double.NaN respectively.
Since Buffers.integer53BitBuffer(size, true)
is used, the buffer is initialized and all unset values, e.g., at index 0, are missing.
Buffers.integer53BitBuffer(size, false) should be used in case every value will be set.
To ascertain if a NumericBuffer
is a real or 53 bit integer buffer, the method buffer.type()
can be used.
When creating a 53 bit integer buffer from a column via Buffers.integer53BitBuffer(column)
, the column must have the capability NUMERIC_READABLE
(see Column types), and all values will be rounded.
If a value should be changed depending on the current value, the method buffer.get(index)
which returns a double value can be used.
Nominal buffers are for creating nominal columns.
NominalBuffer buffer = Buffers.nominalBuffer(10, 3);
buffer.set(0, "red");
buffer.set(2, "blue");
buffer.set(4, "green");
buffer.set(5, "blue");
buffer.set(9, "red");
buffer.set(7, "blue");
buffer.set(6, "green");
buffer.set(7, null);
Column column = buffer.toColumn();
Nominal Column (10)
(red, ?, blue, ?, green, blue, green, ?, ?, red)
Nominal buffers are created with Buffers.nominalBuffer(size) or, as in this example, Buffers.nominalBuffer(size, categories).
To obtain better compression, the method with the categories parameter should be used whenever an upper bound on the number of different (non-null) values is known.
When using Buffers.nominalBuffer(size) there is no limit on the number of different values, but often a bound is known.
If more different values are set than the maximum allowed by the compression, the set
method will throw an exception.
If a small number of categories is chosen and the goal is to stop setting values when this number is reached, the setSave(index, value) method can be used instead, which returns false if the limit is reached.
Looking up the number of different values already encountered is possible with buffer.differentValues().
As in the example above, all values at indices that are not set are automatically missing.
To explicitly set a missing value, buffer.set(index, null)
can be used.
In the case where at most two (non-null) values are set and they have a positive/negative assignment, buffer.toBooleanColumn(positiveValue)
can be called instead of buffer.toColumn()
.
For example, if we change our code above to never use "blue"
then we could use buffer.toBooleanColumn("green")
to have "green"
as positive value and "red"
as negative one.
If only one negative value is set, buffer.toBooleanColumn(null)
can be used.
When creating a nominal buffer from a column via Buffers.nominalBuffer(column)
the column must be nominal.
As for empty buffers, if the maximum number of categories is known, the method Buffers.nominalBuffer(column, categories)
should be used instead.
However, the categories of the original column must be considered as the starting point for the counting.
As for the other buffers, checking the current value at an index is possible with buffer.get(index)
.
Time buffers are used to create time columns, see Column types.
TimeBuffer buffer = Buffers.timeBuffer(10, true);
buffer.set(0, LocalTime.NOON);
buffer.set(2, LocalTime.MIDNIGHT);
buffer.set(2, null);
buffer.set(7, 45200100003005L);
buffer.set(8, LocalTime.ofNanoOfDay(45200100003005L));
buffer.set(5, LocalTime.MIDNIGHT);
Column column = buffer.toColumn();
Time Column (10)
(12:00, ?, ?, ?, ?, 00:00, ?, 12:33:20.100003005, 12:33:20.100003005, ?)
As for the 53 bit integer buffer, Buffers.timeBuffer(size, true) or Buffers.timeBuffer(size) creates a buffer with all values initially missing.
If every value will be set, Buffers.timeBuffer(size, false) is preferred.
The example above shows how it is possible to set values in a time buffer either by setting LocalTime objects or by setting the value directly as a nanoseconds-of-the-day long value.
If the long value is already at hand, the latter version should be preferred over creating a LocalTime object just to set the value.
In order to set missing values, buffer.set(index, null)
should be used.
To create a column from the buffer, buffer.toColumn
is called.
To create a buffer from a column with Buffers.timeBuffer(column)
the column must be a time column.
Values can be accessed via buffer.get(index)
but only as LocalTime
objects and not as raw nanoseconds.
Date-time buffers are for creating date-time columns, see Column types.
DateTimeBuffer buffer = Buffers.dateTimeBuffer(10, false);
buffer.set(1, Instant.ofEpochMilli(1549494662279L));
buffer.set(2, Instant.EPOCH);
buffer.set(2, null);
buffer.set(5, Instant.ofEpochSecond(1549454518));
buffer.set(7, 1549454518L);
buffer.set(8, 1549454518L, 254167070);
Column column = buffer.toColumn();
Date-Time Column (10)
(?, 2019-02-06T23:11:02Z, ?, ?, ?, 2019-02-06T12:01:58Z, ?, 2019-02-06T12:01:58Z, 2019-02-06T12:01:58Z, ?)
Date-time buffers (and columns) can have two different levels of precision, either precision to the nearest second or to the nearest nanosecond.
The precision needs to be specified when the buffer is created.
In the example above, Buffers.dateTimeBuffer(size, false) creates a buffer without nanosecond precision while Buffers.dateTimeBuffer(size, true) is precise to the nearest nanosecond.
If no nanosecond precision is requested, the buffer will disregard all nanosecond information.
Data can be set either by setting Instant
objects, like in the example at indices 1, 2 and 5,
by setting the epoch seconds directly like at index 7,
or by setting the epoch seconds and additional nanoseconds like at index 8.
In the example above, the values at indices 7 and 8 are the same, even though nanoseconds were specified with buffer.set(index, epochSeconds, nanoseconds)
,
because only second-precision was requested.
Similarly, only the seconds data of the input at index 1 is stored.
If the code of the example is changed to Buffers.dateTimeBuffer(10, true)
the resulting column is
Date-Time Column (10)
(?, 2019-02-06T23:11:02.279Z, ?, ?, ?, 2019-02-06T12:01:58Z, ?, 2019-02-06T12:01:58Z, 2019-02-06T12:01:58.254167070Z, ?)
which differs from the previous result at indices 1 and 8.
If the data is not expected to have millisecond or nanosecond precision then the date-time buffer should be created without sub-second-precision to save memory.
In cases where the data is present as primitive values, set(index, long)
or set(index, long, int)
should be preferred over creating an instant first like at index 5 in the example above.
Missing values are again set by buffer.set(index, null)
.
As for real buffers, date-time buffers are initially filled with missing values unless Buffers.dateTimeBuffer(size, precision, false) is used.
The initialization should be skipped in cases where every value is set anyway.
When creating a buffer from a column with Buffers.dateTimeBuffer(column)
the column must be a date-time column.
The precision of the buffer will match the precision of the data in the column.
Values can be accessed with buffer.get(index)
but only as Instant
objects and not as raw seconds or nanoseconds data.
Object buffers are used to create object columns.
Note that the standard column types time and date-time have their own buffer.
So object buffers are only of interest for the other object column types like ColumnType.TEXT
, ColumnType.TEXT_LIST
or ColumnType.TEXT_SET
.
ObjectBuffer<String> buffer = Buffers.textBuffer(10);
for (int i = 0; i < buffer.size(); i++) {
buffer.set(i, "value_" + i);
}
buffer.set(3, null);
Column column = buffer.toColumn();
Column (10)
(value_0, value_1, value_2, ?, value_4, value_5, value_6, value_7, value_8, value_9)
In the example, the textBuffer method creates a buffer of column type ColumnType.TEXT
which is a column type with String
element type.
Missing values are again set via null
.
When creating an object buffer from a column via Buffers.textBuffer(column)
, the column must be object-readable and the column's element type must be String
.
Analogously, buffers for ColumnType.TEXT_SET
can be created using the Buffers.textsetBuffer
methods,
and for ColumnType.TEXT_LIST
using the Buffers.textlistBuffer
methods.
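As a sketch, assuming StringSet can be constructed from a collection of strings (check the Javadoc for its exact constructors), a text-set buffer could be filled like this:
// Hypothetical sketch: build a small text-set column.
ObjectBuffer<StringSet> setBuffer = Buffers.textsetBuffer(3);
setBuffer.set(0, new StringSet(Arrays.asList("red", "green"))); // assumed constructor
setBuffer.set(1, null); // missing value
Column textSetColumn = setBuffer.toColumn();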
In this section we show how to take advantage of sparse data. By sparse data we mean data columns in which most rows (>= 50%) take on one and the same value. We call this value the default value.
The buffers we have presented so far already take advantage of your data being sparse in some sense.
They will automatically check the data for sparsity in their toColumn()
method and return
what we call a sparse column if our heuristic decides that the data is sufficiently sparse for that.
These sparse columns come with the same interface as the default dense columns.
But under the hood they have been heavily optimized to be more time and memory efficient when working
with sparse data.
Sparse columns come with a memory efficient sparse data format that will allow you
to import a lot more data without running out of memory.
Sounds good? Even better: If you already know that your data is sparse, you can use one of our sparse buffer implementations. Sparse buffers are, similar to sparse columns, optimized to be time and memory efficient when handling sparse data. This avoids the buffers themselves becoming the new memory bottleneck. All of the presented buffer types (except for object buffers) are available in this sparse buffer format. Please note that sparse buffers and columns come with an overhead for their sparse format, so they should not be used with very small data (less than 1024 rows) or data with sparsity < 50%. If you are unsure whether the data is sparse, we advise you to use the default dense buffers.
Sparse buffers come with a slightly different interface compared to the default dense buffers that we have been explaining so far:
RealBufferSparse buffer = Buffers.sparseRealBuffer(0, 10);
buffer.setNext(2, 0);
buffer.setNext(7, 0.54);
buffer.setNext(9, 0.72);
Column column = buffer.toColumn();
In the above example sparseRealBuffer(double, int)
is used to create a new sparse buffer with a
default value of 0
and a length of 10
.
All buffer positions are initially set to the default value.
setNext(int, double)
is then used to set some of the buffer positions to non-default values.
In the example we do this for the positions 7 and 9, setting them to the non-default values 0.54 and 0.72.
Setting position 2
to 0
has no effect as 0
is the default value anyway.
The resulting column looks like this:
Real Column (10)
(0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.540, 0.000, 0.720)
Please note that the buffer has to be filled monotonically, starting at the lower positions and moving to higher positions. Trying to set a position with a smaller index than another index that has already been set will lead to an exception:
RealBufferSparse buffer = Buffers.sparseRealBuffer(0, 10);
buffer.setNext(2, 0);
buffer.setNext(0, 0.54);
java.lang.IllegalArgumentException: Cannot modify index 0. An equal or larger index has already been set.
Sparse buffers are write-only: they do not have get methods like their dense counterparts.
Also, please note that while sparse buffers are thread-safe, filling a buffer from multiple threads will potentially be slow.
Another way to fill a sparse buffer is to call setNext(double)
.
This method fills the given value into the next higher buffer position.
RealBufferSparse buffer = Buffers.sparseRealBuffer(0, 10);
buffer.setNext(2, 0.72);
buffer.setNext(0.54);
Column column = buffer.toColumn();
Real Column (10)
(0.000, 0.000, 0.720, 0.540, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000)
For detailed information on the individual sparse buffers and their performance, please take a look at their Javadoc.