
Introduction

  • spark or apache spark
  • open-source distributed computing system
  • installed on top of the operating system on a cluster of machines
  • allows users to process large datasets in parallel across a cluster of computers
  • provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
  • supports a variety of programming languages including java, scala, python, and R
  • its main programming abstraction is the Resilient Distributed Dataset (RDD): a fault-tolerant collection of data that can be processed in parallel across a cluster (see the RDD sketch after this list)
  • includes higher-level libraries such as Spark SQL for SQL queries, Spark Streaming for real-time processing of data streams, MLlib for machine learning, and GraphX for graph processing (a Spark SQL sketch also follows this list)
  • key advantage is speed (achieved through in-memory processing and optimized execution plans)
  • a fast and general engine for large-scale data processing
  • platform agnostic (can run on top of Hadoop YARN, Hadoop's resource manager)
  • can also read/write data from Hadoop ecosystem components such as HBase and Hive, as well as external stores like Cassandra
  • up to 100x faster than traditional MapReduce for workloads that fit in memory
  • cluster computing platform designed to be fast and general-purpose
  • also runs on Apache Mesos, Kubernetes, standalone or in the cloud
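
A minimal RDD sketch in PySpark (assumed here only as an illustration; the names `appName="rdd-sketch"` and the sample data are hypothetical, not from these notes). It shows the core idea behind the RDD abstraction: distribute a collection across the cluster, apply lazy transformations, and trigger computation with an action.

```python
# sketch: create an RDD and process it in parallel
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")  # hypothetical app name

# parallelize() distributes the local collection across the cluster as an RDD
numbers = sc.parallelize(range(1, 1001))

# transformations (filter, map) are lazy; the action (sum) triggers execution
total = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n).sum()
print(total)

sc.stop()
```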
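
And a minimal Spark SQL sketch (again an assumption-laden illustration; the `people` view and sample rows are made up). It shows how the higher-level Spark SQL library lets you run SQL queries over distributed data.

```python
# sketch: register a DataFrame as a temporary view and query it with SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# hypothetical in-memory sample data
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```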