edcdict

A small dictionary of data engineering tools

Apache Druid

https://druid.apache.org/
Apache Druid is a high-performance, real-time analytics database.

Apache Pinot

https://pinot.apache.org/
A real-time distributed OLAP datastore designed to answer OLAP queries with low latency.

Apache Spark

https://spark.apache.org/
https://cloud.google.com/learn/what-is-apache-spark
Apache Spark™ is a unified analytics engine for large-scale data processing.

Apache Kafka

https://kafka.apache.org/
https://www.confluent.io/what-is-apache-kafka
Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform.
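The "distributed commit log" abstraction mentioned above can be illustrated with a toy, single-process sketch. This is not Kafka's API: the `CommitLog` class and its `append`/`read` methods are hypothetical names chosen for illustration. It only shows the core idea that records are appended in order, each gets an offset, and reading does not remove anything, so several consumers can read the same log independently.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy append-only log; a hypothetical stand-in for one partition."""

    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record and return the offset it was stored at."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset, without removing them."""
        return self.records[offset:offset + max_records]


log = CommitLog()
first = log.append("user-signed-up")
second = log.append("order-placed")

# Two independent consumers read the same log from their own offsets.
consumer_a = log.read(0)
consumer_b = log.read(1)
```

In a real cluster the log is partitioned and replicated across brokers; the sketch leaves all of that out.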

Apache Samza

http://samza.apache.org/
A distributed stream processing framework.
Samza allows you to build stateful applications that process data in real time from multiple sources, including Apache Kafka.

Apache Kafka Connect

Kafka Connect is a free, open-source component of Apache Kafka® that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.

Apache Flink

https://flink.apache.org/
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
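To make "stateful computations over unbounded and bounded data streams" concrete, here is a minimal plain-Python sketch. It does not use Flink's API; `running_counts` is a hypothetical helper that keeps one counter per key as state and emits an updated result for every incoming event, which works the same way whether the input ever ends or not.

```python
from collections import defaultdict
from itertools import cycle, islice


def running_counts(events):
    """Emit (key, count so far) for every event, keeping per-key state."""
    state = defaultdict(int)  # the state that survives between events
    for key in events:
        state[key] += 1
        yield key, state[key]


# Bounded stream: a finite list of events.
bounded = list(running_counts(["click", "click", "view"]))

# Unbounded stream: an endless source, so we only take the first five results.
unbounded_source = cycle(["click", "view"])
first_five = list(islice(running_counts(unbounded_source), 5))
```

A real engine would additionally distribute the state across machines and checkpoint it for fault tolerance; the sketch only shows the processing model.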

Apache Storm

https://storm.apache.org/
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!
Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Apache Beam

https://beam.apache.org/
An advanced unified programming model.
Implement batch and streaming data processing jobs that run on any execution engine.

Apache Superset

https://superset.apache.org
Apache Superset is a modern data exploration and visualization platform.

Apache Hive

https://hive.apache.org/
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

Strimzi

https://strimzi.io/
https://strimzi.io/documentation/
Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations.

AWS EMR

https://aws.amazon.com/pt/emr
Easily run and scale Apache Spark, Hive, Presto, and other big data frameworks.
Amazon EMR is the industry-leading cloud big data platform for processing large amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

AWS Glue

https://aws.amazon.com/pt/glue/
Simple, scalable, and serverless data integration.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It provides all the capabilities needed for data integration, so you can start analyzing and using your data in minutes instead of months.

AWS QuickSight

https://aws.amazon.com/pt/Quicksight/
QuickSight lets you easily create and publish interactive dashboards that include Machine Learning Insights.

AWS Athena

https://docs.aws.amazon.com/pt_br/athena/index.html
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless: there is no infrastructure to set up or manage, and you pay only for the queries you run. To get started, point to your data in S3, define the schema, and start querying with standard SQL.

AWS S3

https://aws.amazon.com/pt/s3/
Object storage built to store and retrieve any amount of data from anywhere.
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

AWS RDS

https://aws.amazon.com/pt/rds/
Set up, operate, and scale a relational database in the cloud with just a few clicks. Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale relational databases in the cloud.

Hadoop

https://hadoop.apache.org/
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
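One of the "simple programming models" associated with Hadoop is MapReduce. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process to count words. It is only an illustration of the model, not Hadoop code: the function names are hypothetical, and a real job would run the phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big clusters"])
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across many machines without changing the logic.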

ksqlDB

https://ksqldb.io/
The database purpose-built for stream processing applications.
Seamlessly leverage your existing Apache Kafka® infrastructure to deploy stream-processing workloads and bring powerful new capabilities to your applications.

Kubernetes

https://kubernetes.io/pt-br/
Production-grade container orchestration.
Kubernetes (K8s) is an open source system for automating the deployment, scaling, and management of containerized applications.
It groups the containers that make up an application into logical units for easier management and service discovery. Kubernetes builds on 15 years of experience running containers in production at Google, combined with the best ideas and practices from the community.

Kubectl

https://kubernetes.io/docs/reference/kubectl/overview/
The kubectl command-line tool lets you control Kubernetes clusters.

Kubectx

https://ahmet.im/blog/kubectx/
kubectx: a tool to switch between Kubernetes contexts

EKS

https://aws.amazon.com/pt/eks/
Amazon Elastic Kubernetes Service
Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service.

Python

https://www.python.org/
Python is a programming language that lets you work quickly and integrate systems more effectively.

Git

https://git-scm.com/
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

Postgres

https://www.postgresql.org/
PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.

Confluent

https://confluent.io
You love Apache Kafka, but not managing it. Our fully managed service means your best people can now focus on delivering value to your customers.

Docker

https://www.docker.com
A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.

Elasticsearch

https://www.elastic.co/pt/what-is/elasticsearch
Elasticsearch is a distributed, free, and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V.

Articles

Arquitetura de Software: Explicando Stream Processing, Event Source e Data Streaming

https://oieduardorabelo.medium.com/

Big Data File Formats Explained

https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33

Publishing with Apache Kafka at The New York Times

https://www.confluent.io/blog/publishing-apache-kafka-new-york-times

Kubectl cheatsheet

https://kubernetes.io/pt-br/docs/reference/kubectl/cheatsheet/

Kafka-Python explained in 10 lines of code

https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1

CATALYST ANALYST: A DEEP DIVE INTO SPARK’S OPTIMIZER

https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/
