- Python library for end-to-end data analytics, from reading data through transformation, analysis, and visualization.
- Support for reading and writing multiple data formats on the local machine or S3.
- Functional programming style APIs.
- Advanced APIs for doing join, aggregation, sampling, and processing time series data.
- Schema evolution.
- Visualization APIs that provide a simple interface to matplotlib, seaborn, and other popular libraries.
- Data exploration phase, when we don't know what we are looking for or what might work.
- Wide datasets with 100s or 1000s of columns.
- Workflows that involve complex business logic.
- TSV is used as the in-memory data format for simplicity.
- Input data can be in different formats and can be read either locally or from external sources such as S3 or the web.
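To make the "TSV as the in-memory format" idea concrete, here is a minimal sketch that loads a plain or gzip-compressed TSV file into a header plus a list of rows using only the standard library. The function name `read_tsv` and the row representation are assumptions for illustration; this is not how omigo implements `tsv.read`, and it does not cover S3 or web sources.

```python
import csv
import gzip


def read_tsv(path):
    """Read a plain or gzip-compressed TSV file into (header, rows).

    Hypothetical helper for illustration only; every value stays a string,
    which keeps the in-memory representation simple.
    """
    # Pick the opener based on the file extension
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Each row becomes a dict keyed by column name
        rows = [dict(zip(header, fields)) for fields in reader]
    return header, rows
```

Calling `read_tsv("data/iris.tsv.gz")` would return the column names and one dict per data row, assuming the file exists at that path.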
docker build -t omigo-data-analytics -f deploy/Dockerfile .
docker run --rm -p 8888:8888 -it -v $PWD:/code omigo-data-analytics
There are three packages: core, extensions and hydra.
The core package is built using core Python with minimal external dependencies to keep it stable. The extensions package contains libraries for advanced functionality such as visualization, and can have a lot of dependencies. The hydra package contains experimental code for distributed execution and is not used at the moment.
pip3 install omigo-core omigo-ext omigo-hydra --upgrade
APIs are provided to create new extension packages for custom needs and plug them easily into the existing code (see extend-class).
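One way such a plug-in mechanism could work is sketched below. The method name `extend_class` comes from this README, but the classes `Table` and `StatsTable` and the implementation itself are assumptions made for illustration; the library's actual mechanism may differ.

```python
class Table:
    """Toy stand-in for a TSV object (hypothetical, not the omigo class)."""

    def __init__(self, header, rows):
        self.header = list(header)
        self.rows = [list(r) for r in rows]

    def extend_class(self, new_class, *args, **kwargs):
        # Re-wrap the same header and rows in the requested subclass,
        # so the extension's methods become available on the result
        return new_class(self.header, self.rows, *args, **kwargs)


class StatsTable(Table):
    """Hypothetical extension that adds one custom method."""

    def mean(self, col):
        idx = self.header.index(col)
        values = [float(r[idx]) for r in self.rows]
        return sum(values) / len(values)


table = Table(
    ["sepal_length", "class"],
    [["5.0", "Iris-setosa"], ["7.0", "Iris-virginica"]],
)
print(table.extend_class(StatsTable).mean("sepal_length"))  # prints 6.0
```

The base class stays small in this design, and each extension only needs to accept the same constructor arguments.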
Note: Some working examples are in the Jupyter example-notebooks directory. Here is a simple example to run from the command line.
python3
>>> from omigo_core import tsv
>>> x = tsv.read("data/iris.tsv.gz")
#
# other possible options
#
# x = tsv.read("data/iris.tsv")
# x = tsv.read("data/iris.tsv.zip")
# x = tsv.read("s3://bucket/path_to_file/data.tsv.gz")
# x = tsv.read("https://github.com/CrowdStrike/omigo-data-analytics/raw/main/data/iris.tsv")
>>> print(x.num_rows())
150
>>> x.to_df(10)
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
>>> y = x \
.eq_str("class", "Iris-setosa") \
.gt_float("sepal_width", 3.1) \
.select(["sepal_width", "sepal_length"])
>>> y.show(5)
sepal_width sepal_length
3.5 5.1
3.2 4.7
3.6 5.0
>>> from omigo_ext import graph_ext
>>> x.extend_class(graph_ext.VisualTSV).histogram("sepal_length", "class", yfigsize = 8)
>>> x.extend_class(graph_ext.VisualTSV).pairplot(["sepal_length", "sepal_width"], kind = "kde", diag_kind = "auto")
>>> tsv.write(y, "output.tsv.gz")
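The chained calls in the session above follow a functional style in which each step returns a new object. The sketch below is a toy reimplementation of that idea in plain Python. The class `MiniTSV` is hypothetical, and although the method names mirror those used above, their behavior here is an assumption and not the library's code.

```python
class MiniTSV:
    """Toy chainable table; every operation returns a new MiniTSV."""

    def __init__(self, header, rows):
        self.header = list(header)
        self.rows = [list(r) for r in rows]

    def _filter(self, col, predicate):
        idx = self.header.index(col)
        return MiniTSV(self.header, [r for r in self.rows if predicate(r[idx])])

    def eq_str(self, col, value):
        # Keep rows whose column equals the given string
        return self._filter(col, lambda v: v == value)

    def gt_float(self, col, value):
        # Keep rows whose column, parsed as float, exceeds the value
        return self._filter(col, lambda v: float(v) > value)

    def select(self, cols):
        idxs = [self.header.index(c) for c in cols]
        return MiniTSV(cols, [[r[i] for i in idxs] for r in self.rows])

    def num_rows(self):
        return len(self.rows)


# A few sample rows in the iris layout
header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
rows = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"],
    ["4.7", "3.2", "1.3", "0.2", "Iris-setosa"],
    ["4.6", "3.1", "1.5", "0.2", "Iris-setosa"],
    ["5.0", "3.6", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
]

result = (
    MiniTSV(header, rows)
    .eq_str("class", "Iris-setosa")
    .gt_float("sepal_width", 3.1)
    .select(["sepal_width", "sepal_length"])
)
print(result.num_rows())  # prints 3
```

Because no step mutates its input, intermediate results can be reused or inspected without affecting later steps.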
There are a lot of extensions that add advanced functionality.
This extension provides visualization APIs such as linechart and barchart.
This extension provides APIs to call an external web service for every row in the data. All web service parameters, including the URL, query parameters, headers, and payload, can be templatized and mapped to individual columns. The extension supports multithreading.
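The per-row templating described above can be sketched with the standard library as follows. Everything here is illustrative: `build_requests`, `call_all`, and `fake_call` are hypothetical names, the `$column` placeholder syntax is an assumption borrowed from `string.Template`, and `fake_call` stands in for a real HTTP client so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from string import Template


def build_requests(rows, url_template, params_template):
    """Fill URL and query-parameter templates from each row's columns."""
    url_t = Template(url_template)
    prepared = []
    for row in rows:
        params = {k: Template(v).substitute(row) for k, v in params_template.items()}
        prepared.append({"url": url_t.substitute(row), "params": params})
    return prepared


def call_all(prepared, call_fn, num_threads=4):
    """Run call_fn over all prepared requests in a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map returns results in the same order as the inputs
        return list(pool.map(call_fn, prepared))


def fake_call(request):
    # Stand-in for a real HTTP call; it only renders the request as text
    query = "&".join(f"{k}={v}" for k, v in sorted(request["params"].items()))
    return f"GET {request['url']}?{query}"


rows = [
    {"host": "a.example.com", "user": "u1"},
    {"host": "b.example.com", "user": "u2"},
]
prepared = build_requests(rows, "https://$host/api/users", {"id": "$user"})
responses = call_all(prepared, fake_call)
print(responses[0])  # prints GET https://a.example.com/api/users?id=u1
```

Since the results come back in input order, each response can be attached to its originating row as a new column.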
This extension provides a simple wrapper to call different APIs within a thread pool. Usually used inside other extensions.
This extension allows reading data through Kafka and returning it as a TSV object. Many custom parameters are provided to simplify parsing of the data.
This extension is a placeholder for wrapping any interesting pandas APIs, such as reading a Parquet file (local or S3).
This extension provides APIs to read data that is stored in some ETL format. It is useful for reading time series data stored in a partitioned manner.
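As a rough illustration of reading partitioned time series data, the sketch below enumerates one partition path per day in a date range. The `dt=YYYYMMDD` directory layout, the function name `partition_paths`, and the bucket path are all assumptions; the ETL format this extension actually expects is not described here.

```python
from datetime import date, timedelta


def partition_paths(base_path, start, end, pattern="dt={:%Y%m%d}"):
    """List one partition path per day from start to end, inclusive.

    Hypothetical helper: assumes a base_path/dt=YYYYMMDD layout.
    """
    paths = []
    current = start
    while current <= end:
        paths.append(base_path.rstrip("/") + "/" + pattern.format(current))
        current += timedelta(days=1)
    return paths


paths = partition_paths("s3://bucket/events", date(2024, 1, 30), date(2024, 2, 1))
for path in paths:
    print(path)
```

A reader could then load each path in turn and concatenate the results, touching only the days that a query needs.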
- README: Good starting point to get a basic overview of the library.
- API Documentation: Detailed API docs with simple examples to illustrate the usage.
- example-notebooks: Working examples to show different use cases.
- This library is built for simplicity, functionality, and robustness. Good engineering practices are being adopted gradually.
- More examples with real-life use cases are in progress. Feel free to reach out with any questions.
- This project is in an active research phase and should not be deployed in production.