mkdir build && cd build && cmake .. && make
Building blocks:
- Transformer: Input -> Output.
- Pipeline: two transformers in sequence; a pipeline is itself a transformer.
- Combiner: two transformers in parallel; a combiner is itself a transformer.
A transformer is a mathematical function that maps an input to an output.
A transformer may, however, have to learn its parameters from data. This
learning is done in `step`, which is called once for each observed sample.
For example, to scale a numerical feature linearly to the range 0-1, the
transformer has to know the lower bound (min) and upper bound (max) of
the given feature. Each call to `step` updates these parameters from the
sample it receives.
After all samples have been observed with `step`, `finalize` should be
called to do any finishing work that may be needed.
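The scaling example above can be sketched as a stand-alone class. This is only an illustration of the `step` / `finalize` / `transform` life cycle: the class name `MinMaxScalerSketch` and its members are hypothetical and are not taken from `transformer.hpp`.

```cpp
#include <algorithm>
#include <limits>

// Hypothetical stand-alone scaler; not the library's actual class.
class MinMaxScalerSketch {
public:
    // Observe one sample and widen the known [min, max] range.
    void step(double x) {
        min_ = std::min(min_, x);
        max_ = std::max(max_, x);
    }

    // Called once after all samples: cache the range used by transform().
    void finalize() { range_ = max_ - min_; }

    // Map x linearly into [0, 1]; a constant feature maps to 0.
    double transform(double x) const {
        return range_ > 0.0 ? (x - min_) / range_ : 0.0;
    }

private:
    double min_ = std::numeric_limits<double>::infinity();
    double max_ = -std::numeric_limits<double>::infinity();
    double range_ = 0.0;
};
```

Calling `transform` before `finalize` in this sketch would use a zero range, which is why the finishing step matters even for a simple scaler.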
Two transformers in sequence, for example:
- Transformer A categorizes a feature.
- Transformer B binarizes a categorical variable (1-of-K binary encoding).
To make a pipeline out of them, we overloaded `+`, so one just needs to write:
A + B
The output type of the first transformer must match the input type of the second.
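As a rough illustration of how an overloaded `+` can chain two transformers and enforce the type match at compile time, here is a minimal value-based sketch. The names `TransformerTag`, `PipelineSketch`, `Categorize` and `OneHot` are assumptions made for this example; the library itself works through pointers (`pipe->step(...)`) and its real implementation may differ. Learning via `step` and `finalize` is left out to keep the sketch short.

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical marker base so that operator+ only applies to transformers.
struct TransformerTag {};

// Hypothetical pipeline: applies `first`, then feeds its output to `second`.
template <typename First, typename Second>
struct PipelineSketch : TransformerTag {
    First first;
    Second second;
    PipelineSketch(First f, Second s)
        : first(std::move(f)), second(std::move(s)) {}

    // Fails to compile if first's output cannot be passed to second.
    template <typename Input>
    auto transform(const Input& x) const {
        return second.transform(first.transform(x));
    }
};

template <typename A, typename B,
          typename = std::enable_if_t<
              std::is_base_of<TransformerTag, A>::value &&
              std::is_base_of<TransformerTag, B>::value>>
PipelineSketch<A, B> operator+(A a, B b) {
    return PipelineSketch<A, B>(std::move(a), std::move(b));
}

// Transformer A: categorize a numeric feature (int -> std::string).
struct Categorize : TransformerTag {
    std::string transform(int age) const {
        return age < 18 ? "minor" : "adult";
    }
};

// Transformer B: 1-of-2 encoding of the category (std::string -> vector).
struct OneHot : TransformerTag {
    std::vector<double> transform(const std::string& label) const {
        return label == "minor" ? std::vector<double>{1.0, 0.0}
                                : std::vector<double>{0.0, 1.0};
    }
};
```

With these definitions, `Categorize{} + OneHot{}` compiles because `Categorize` returns a `std::string` and `OneHot` accepts one. Because `PipelineSketch` also derives from `TransformerTag`, the result can be chained further with `+`.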
Two transformers in parallel; their outputs are combined into a `std::tuple`.
They must have the same input type, but their output types may differ.
**Special rule: when the outputs of both transformers are of the same `std::vector`
type, they are combined into a single concatenated `std::vector`.**
Combiners work well with pipelines. For example, suppose you want to binarize (transformer C) the combination of two features (transformers A and B), e.g. a 2-gram.
We overloaded `|`, so one just needs to write:
(A | B) + C
`LazyTransformer` exists to simplify a particular kind of transformer:
one that does not need to learn parameters, so `step` and `finalize`
do nothing useful for it. Examples include extracting some features
as they are, or taking the `Log` of a given feature.
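The idea of a parameter-free transformer can be sketched as a thin wrapper around a plain function whose `step` and `finalize` are no-ops. The class `LazySketch` and the helper `make_log_sketch` below are hypothetical names used for illustration. They are not the library's `LazyTransformer` or `make_lazy_transformer`.

```cpp
#include <cmath>
#include <functional>
#include <utility>

// Hypothetical stateless transformer: it only forwards to a stored function.
template <typename Input, typename Output>
class LazySketch {
public:
    explicit LazySketch(std::function<Output(const Input&)> fn)
        : fn_(std::move(fn)) {}

    void step(const Input&) {}  // nothing to learn from samples
    void finalize() {}          // nothing to finish
    Output transform(const Input& x) const { return fn_(x); }

private:
    std::function<Output(const Input&)> fn_;
};

// Example: take the natural logarithm of a numerical feature.
inline LazySketch<double, double> make_log_sketch() {
    return LazySketch<double, double>(
        [](const double& x) { return std::log(x); });
}
```

Keeping the no-op `step` and `finalize` means such a transformer can still be driven by the same training loop as the learning ones.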
#include <functional>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
#include "transformer.hpp"

using transformer::make_transformer;
using transformer::make_lazy_transformer;
using transformer::Binarizer;

struct Data {
    std::string firstname;
    std::string lastname;
};

int main(int argc, char *argv[])
{
    // Lazy transformers that simply extract one field from a sample.
    std::function<std::string(const Data& sample)> get_firstname_lambda =
        [](const Data& sample) -> std::string {
            return sample.firstname;
        };
    std::function<std::string(const Data& sample)> get_lastname_lambda =
        [](const Data& sample) -> std::string {
            return sample.lastname;
        };
    auto get_firstname = make_lazy_transformer(get_firstname_lambda);
    auto get_lastname = make_lazy_transformer(get_lastname_lambda);

    // Binarizer over the (firstname, lastname) combination.
    auto binarizer = make_transformer<Binarizer<std::tuple<std::string, std::string>>>();

    Data data1{"Mike", "Jordan"};
    Data data2{"Mike", "James"};
    Data data3{"Bill", "Jordan"};
    Data data4{"Bill", "James"};

    // Combine the two extractors in parallel, then feed the tuple to the binarizer.
    auto pipe = (get_firstname | get_lastname) + binarizer;
    auto dataset = {data1, data2, data3, data4};

    // Learning phase: observe every sample, then finalize.
    for (const auto& data : dataset) {
        pipe->step(data);
    }
    pipe->finalize();

    // Transform phase: print the encoded vector of each sample.
    for (const auto& data : dataset) {
        std::vector<double> out = pipe->transform(data);
        for (const auto& item : out) {
            std::cout << item << " ";
        }
        std::cout << std::endl;
    }
    // Output will be:
    // 1 0 0 0
    // 0 1 0 0
    // 0 0 1 0
    // 0 0 0 1
    return 0;
}