- Spark, or Apache Spark
- open-source distributed computing system
- installed on top of the operating system on each machine in a cluster
- allows users to process large datasets in parallel across a cluster of computers
- provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
- supports a variety of programming languages including Java, Scala, Python, and R
- its main programming abstraction is the Resilient Distributed Dataset (RDD): a fault-tolerant collection of data that can be processed in parallel across a cluster
- includes higher-level libraries such as Spark SQL for SQL queries, Spark Streaming for real-time processing of data streams, MLlib for machine learning, and GraphX for graph processing
- key advantage is speed (achieved through in-memory processing and optimized execution plans)
- a fast and general engine for large-scale data processing
- platform-agnostic: can run on top of Hadoop YARN (Hadoop's resource manager)
- can also read from and write to Hadoop ecosystem components such as HBase and Hive, as well as other data stores such as Cassandra
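- up to 100x faster than Hadoop MapReduce for in-memory workloads (a commonly cited project figure; the actual speedup depends on the workload)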
- cluster computing platform designed to be fast and general-purpose
- also runs on Apache Mesos, on Kubernetes, in standalone mode, or in the cloud
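
The RDD idea from the notes above (a partitioned collection, processed in parallel, that can be rebuilt after a failure) can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark's actual API or implementation: the class `ToyRDD` and the method `compute_partition` are hypothetical names invented for this example, and threads on one machine stand in for the machines of a cluster. It assumes that fault tolerance works by recording the chain of transformations (the lineage) and replaying it on the source data.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark class).

    Holds the source data split into partitions, plus the lineage:
    the ordered list of transformations to apply to each partition.
    """

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # list of lists, one per partition
        self._lineage = lineage           # tuple of partition -> partition functions

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the input round-robin into num_partitions chunks.
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Lazy: only records the step, nothing is computed yet.
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        # Lazy as well: extends the lineage with a filtering step.
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def compute_partition(self, index):
        # Replay the lineage on one source partition. A lost partition
        # can be rebuilt this way without touching the others.
        part = self._source[index]
        for step in self._lineage:
            part = step(part)
        return part

    def collect(self):
        # Action: compute every partition in parallel (threads stand in
        # for cluster workers) and concatenate the results in order.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self.compute_partition, range(len(self._source))))
        return [x for part in parts for x in part]

    def reduce(self, fn):
        # Action: combine all elements into a single value.
        return reduce(fn, self.collect())


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

total = even_squares.reduce(lambda a, b: a + b)
print(total)  # 220

# Simulate losing partition 1 and rebuilding it from the lineage alone.
recovered = even_squares.compute_partition(1)
print(recovered)  # [4, 64]
```

In this sketch, `map` and `filter` do no work until an action such as `collect` or `reduce` is called, which loosely mirrors how Spark defers transformations until a result is requested. Real Spark additionally distributes partitions across machines, keeps intermediate data in memory, and optimizes the execution plan, none of which this toy attempts.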