awesome-modern-bigdata

A list of awesome modern big data libraries, frameworks and platforms.

Awesome Modern Big Data

Computing

Flink Apache Flink is a framework for stateful computations over data streams.
Spark Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Orchestration

NiFi Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
StreamPipes A self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams.

Ingestion

Debezium Debezium is an open source distributed platform for change data capture.
Flink CDC CDC Connectors for Apache Flink® is a set of source connectors for Apache Flink® that ingest changes from different databases using change data capture (CDC).

File Storage

MinIO MinIO offers high-performance, S3-compatible object storage.
JuiceFS JuiceFS is a high-performance shared file system designed for cloud-native use and released under the Apache License 2.0. It provides full POSIX compatibility, allowing almost all kinds of object storage to be used locally as massive local disks and to be mounted and read on different cross-platform and cross-region hosts at the same time.
Fluid Fluid is an open source Kubernetes-native Distributed Dataset Orchestrator and Accelerator for data-intensive applications, such as big data and AI applications.
Alluxio Alluxio provides data orchestration for analytics and machine learning in the cloud.

OLAP Query Engine

Presto Presto is a distributed SQL query engine for big data.

Messaging

Kafka Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Pulsar Apache Pulsar is a cloud-native, distributed messaging and streaming platform originally created at Yahoo! and now a top-level Apache Software Foundation project.

Database

ClickHouse ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).
StarRocks StarRocks is a next-gen sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics and ad-hoc query.
TiKV TiKV is an open-source, distributed, and transactional key-value database. Unlike other traditional NoSQL systems, TiKV not only provides classical key-value APIs, but also transactional APIs with ACID compliance.

Data Lake

Iceberg Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to safely work with the same tables, at the same time.
Hudi Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database layer, while being optimized for lake engines and regular batch processing.
Delta Lake Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for Scala, Java, Rust, Ruby, and Python.
Flink Table Store Flink Table Store is a unified streaming and batch store for building dynamic tables on Apache Flink.

Metadata

DataHub DataHub is an open-source metadata platform for the modern data stack.
Amundsen Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Data Quality

Network

Monitoring

Data Analytics

Zeppelin Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make data-driven, interactive and collaborative documents with SQL, Scala and more.

Data Visualization

Superset Apache Superset is a modern data exploration and visualization platform.
Davinci Davinci is a data visualization platform oriented towards product managers, business people, data engineers, data analysts, and data scientists.
DataEase DataEase is an open source data visualization analysis tool that helps users quickly analyze data and gain insight into business trends, so as to achieve business improvement and optimization.
