-
Apache Spark is an open-source distributed computing system that is designed to process large amounts of data quickly and efficiently. It is a powerful tool for data processing, machine learning, and analytics, and is used by organizations around the world to analyze and understand complex data sets.
-
Spark is written primarily in Scala and runs on the JVM, with APIs for Scala, Java, Python, and R. It uses a distributed computing model that partitions data and processes it in parallel across the nodes of a cluster, which makes it well suited to large data sets and fast processing times.
-
Spark's applications include data processing, machine learning, and real-time analytics. It also scales from a single machine running in local mode to large cloud-based clusters, so it fits a variety of environments.
-
Overall, Apache Spark is a valuable tool for organizations looking to analyze and understand large data sets in a fast and efficient manner.
-
This repository contains a range of projects completed using Apache Spark. One folder covers the basics of PySpark, Spark's Python library, applied to an NBA log data set and a small employee data set.
-
Another folder in the repository contains notebooks for projects completed using the Spark ML library, utilizing data from various industries including healthcare, finance, beverages, automobiles, education, real estate, plants, and sports.
-
These projects use a range of Spark ML algorithms. For classification: decision tree, factorization machine, gradient-boosted tree, multilayer perceptron, Naive Bayes, random forest, support vector machine, and logistic regression. For regression: decision tree, factorization machine, gradient-boosted tree, generalized linear, linear, and random forest regression.
-
Overall, this repository contains a range of projects utilizing Apache Spark and the Spark ML library to analyze and understand data from a variety of different industries and contexts. The inclusion of a range of algorithms allows for flexibility and the ability to tackle a variety of different data analysis tasks.