
Introduction

  • spark or apache spark
  • open-source distributed computing system
  • installed on top of the operating system on a cluster of machines
  • allows users to process large datasets in parallel across a cluster of computers
  • provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
  • supports a variety of programming languages including java, scala, python, and R
  • its main programming abstraction is the Resilient Distributed Dataset (RDD): a fault-tolerant collection of data that can be processed in parallel across a cluster (see the RDD sketch after this list)
  • includes higher-level libraries such as Spark SQL for SQL queries, Spark Streaming for real-time processing of data streams, MLlib for machine learning, and GraphX for graph processing (a Spark SQL sketch also follows this list)
  • key advantage is speed (achieved through in-memory processing and optimized execution plans)
  • a fast and general engine for large-scale data processing
  • platform agnostic (can run on top of Hadoop YARN, Hadoop's resource manager)
  • can also read/write data from Hadoop ecosystem components such as HBase and Hive, as well as external stores like Cassandra
  • up to 100x faster than traditional MapReduce for workloads that fit in memory
  • cluster computing platform designed to be fast and general-purpose
  • also runs on Apache Mesos, Kubernetes, standalone or in the cloud
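
A minimal RDD sketch in PySpark (assumed here only as an illustration; the names `appName="rdd-sketch"` and the sample data are hypothetical, not from these notes). It shows the core idea behind the RDD abstraction: distribute a collection across the cluster, apply lazy transformations, and trigger computation with an action.

```python
# sketch: create an RDD and process it in parallel
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")  # hypothetical app name

# parallelize() distributes the local collection across the cluster as an RDD
numbers = sc.parallelize(range(1, 1001))

# transformations (filter, map) are lazy; the action (sum) triggers execution
total = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n).sum()
print(total)

sc.stop()
```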
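
And a minimal Spark SQL sketch (again an assumption-laden illustration; the `people` view and sample rows are made up). It shows how the higher-level Spark SQL library lets you run SQL queries over distributed data.

```python
# sketch: register a DataFrame as a temporary view and query it with SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# hypothetical in-memory sample data
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```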