This example code base can be used as a template or inspiration to create your own Feature Catalog. It contains example code to define your features and expose them to users via a simple API. We assume that the size of your data justifies the use of Spark.
Below is a simple example (see also `example_usage.ipynb`).
```python
all_avatars = spark.read.parquet("data/wow.parquet").select("avatarId").distinct()
features = compute_features(
    spark=spark,
    scope=all_avatars,
    feature_groups=[
        Zone(
            features_of_interest=["darkshore_count", "total_count"],
            aggregation_level="avatarId",
        ),
    ],
)
```
Below you will find more information about what a Feature Catalog is, why you should create one, and how it differs from a Feature Store.
A Feature Catalog is a place where you define and document your features such that they can be easily created via a simple API.
Note that this does not (yet) include storing the features in a Feature Store. A Feature Catalog already gives you a lot of benefits without the complexity of a Feature Store or a full Feature Platform. See also the Architecture section on this difference.
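To make the idea concrete, here is a minimal sketch of what a feature group definition could look like behind the API used above. All names and the implementation are illustrative assumptions (the actual `Zone` class in this repository may be structured differently); the point is that each feature is defined once, as a named aggregation expression, and callers only pick features by name.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureGroup:
    """Hypothetical base class: documents a set of features and how to compute them."""
    features_of_interest: list
    aggregation_level: str
    # Mapping from feature name to a Spark SQL aggregation expression (assumed design).
    expressions: dict = field(default_factory=dict)

    def selected_expressions(self):
        """Return SQL expressions for the requested features, e.g. to pass to selectExpr/agg."""
        return [
            f"{self.expressions[name]} AS {name}"
            for name in self.features_of_interest
        ]

class Zone(FeatureGroup):
    """Illustrative feature group: per-avatar activity counts by zone."""
    def __init__(self, features_of_interest, aggregation_level):
        super().__init__(
            features_of_interest=features_of_interest,
            aggregation_level=aggregation_level,
            expressions={
                "darkshore_count": "sum(case when zone = 'Darkshore' then 1 else 0 end)",
                "total_count": "count(*)",
            },
        )

zone = Zone(["darkshore_count", "total_count"], "avatarId")
print(zone.selected_expressions())
```

Because the expressions live in one place, two teams requesting `total_count` are guaranteed to get the same definition.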
By creating a Feature Catalog you:
- define your features once (single source of truth)
- allow re-use of features across teams and projects (better collaboration)
By increasing collaboration you will get the following benefits:
- increase in speed of development (easy to re-use code)
- increase in reliability / quality of code (more contributors)
The full architecture can be found in the docs folder, but here is the main overview:
When including a Feature Store it could look somewhat like this:
Install using `poetry install`.
To create a graph of the dependencies between all feature groups you also need to install graphviz: https://graphviz.org/download/
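Such a dependency graph is just a directed graph rendered by Graphviz. As a rough sketch of the idea (the helper below and its input format are hypothetical, not this repository's actual graphing code), one can emit Graphviz DOT source from a mapping of feature groups to their upstream dependencies:

```python
def to_dot(dependencies):
    """Render a {group: [upstream groups]} mapping as Graphviz DOT source.

    Hypothetical helper for illustration; the project may generate its graph differently.
    """
    lines = ["digraph feature_groups {"]
    for group, upstreams in dependencies.items():
        lines.append(f'  "{group}";')
        for upstream in upstreams:
            # An edge from upstream to group means: group is computed from upstream.
            lines.append(f'  "{upstream}" -> "{group}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot({"Zone": ["RawEvents"], "Session": ["RawEvents"]}))
```

The resulting DOT text can be rendered to an image with the installed `dot` binary.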
Download the example data:

```shell
curl https://storage.googleapis.com/shareddatasets/wow.parquet -o data/wow.parquet
```
After downloading the example data and installing the package, you can run the example notebook `example_usage.ipynb`.