This repository contains solutions to common mapper and reducer problems in Hadoop using Python. Most online resources for Hadoop are geared towards Java environments, so this repository aims to provide Python solutions for Hadoop streaming.
- Watch this video for Hadoop installation on Windows.
- Follow this video for Hadoop installation on Ubuntu.
- Format the NameNode:

  ```shell
  hdfs namenode -format
  ```

- Start the Hadoop services:

  ```shell
  start-all.sh
  ```

- Create an input directory in HDFS:

  ```shell
  hdfs dfs -mkdir /input
  ```

- Upload the input file to HDFS:

  ```shell
  hdfs dfs -put /path/to/input.txt /input/input.txt
  ```

- Run the Hadoop streaming job:

  ```shell
  hadoop jar /path/to/hadoop-streaming.jar \
      -input /input/input.txt \
      -output /output \
      -file "/path/to/mapper.py" \
      -mapper "python3 mapper.py" \
      -file "/path/to/reducer.py" \
      -reducer "python3 reducer.py"
  ```

- Copy the output from HDFS to a local file:

  ```shell
  hdfs dfs -text /output/* > /path/to/outputfile.txt
  ```

- Remove the output and input directories from HDFS:

  ```shell
  hadoop fs -rm -r /output
  hadoop fs -rm -r /input
  ```
You can test the mapper and reducer scripts separately to ensure they work correctly:
- Test the mapper script:

  ```shell
  cat /path/to/input.txt | python3 /path/to/mapper.py
  ```

- Test the reducer script:

  ```shell
  cat /path/to/mapper_output.txt | python3 /path/to/reducer.py
  ```

Note that Hadoop sorts the mapper output by key before it reaches the reducer, so pipe the mapper output through `sort` when testing the full pipeline locally.
**Recommendation System**

- Mapper: Preprocesses user-item ratings.
- Reducer: Generates recommendations based on similarity measures between users.
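One common similarity measure for this kind of reducer is cosine similarity between two users' rating vectors. The sketch below is illustrative only: the function name and the `{item: rating}` dict layout are assumptions, not taken from the repository's scripts.

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dicts {item: rating}.

    Only items rated by both users contribute to the dot product;
    users with no items in common get similarity 0.0.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)
```

Users with identical rating vectors score 1.0; users with no overlapping items score 0.0.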
**PageRank**

- Mapper: Prepares graph data with nodes and edges.
- Reducer: Applies the PageRank update to determine node importance in the graph.
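A single PageRank iteration maps naturally onto one map/reduce round: each node distributes its current rank across its out-links (map side), and the contributions arriving at each node are summed (reduce side). The following is a minimal in-memory sketch of that update, with assumed dict-based data structures rather than the repository's actual stream format:

```python
def pagerank_step(ranks, links, damping=0.85):
    """One PageRank iteration.

    ranks: {node: current rank}, links: {node: list of out-link targets}.
    Every node is assumed to have at least one out-link.
    """
    n = len(ranks)
    # Base teleportation share for every node.
    new = {node: (1 - damping) / n for node in ranks}
    # Each node splits its rank evenly across its out-links (the "map"),
    # and contributions are accumulated per target node (the "reduce").
    for node, outs in links.items():
        share = ranks[node] / len(outs)
        for target in outs:
            new[target] += damping * share
    return new
```

On a two-node cycle the ranks are already stable, so one step returns them unchanged; in general the step is repeated until the ranks converge.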
**K-Means Clustering**

- Mapper: Assigns data points to clusters based on centroid proximity.
- Reducer: Updates centroid positions based on cluster assignments.
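The core of both halves fits in a few lines: the mapper-side step picks the nearest centroid for each point, and the reducer-side step averages the points assigned to a cluster. A minimal sketch, assuming points and centroids are coordinate tuples (function names are illustrative):

```python
import math

def nearest_centroid(point, centroids):
    # Mapper side: index of the centroid closest to this point
    # by Euclidean distance.
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def update_centroid(points):
    # Reducer side: the new centroid is the coordinate-wise mean
    # of all points assigned to the cluster.
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))
```

In the streaming job, the mapper would emit `(cluster_index, point)` pairs and the reducer would apply `update_centroid` to each group; the job is rerun until the centroids stop moving.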
**Weather Data Analysis**

- Mapper: Extracts relevant weather data from input records.
- Reducer: Aggregates weather data and computes statistics like average temperature or precipitation.
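As an illustration of this pattern, the sketch below parses a hypothetical `station,date,temperature` CSV format (not necessarily the repository's input format) and averages temperatures per station:

```python
from collections import defaultdict

def mapper(records):
    # Emit (station, temperature) for each record.
    # Assumed record format: "station,date,temperature".
    for line in records:
        station, _date, temp = line.strip().split(",")
        yield station, float(temp)

def average_temperature(pairs):
    # Reducer side: accumulate (sum, count) per station,
    # then divide to get the mean temperature.
    totals = defaultdict(lambda: [0.0, 0])
    for station, temp in pairs:
        totals[station][0] += temp
        totals[station][1] += 1
    return {s: t / n for s, (t, n) in totals.items()}
```

The same sum-and-count accumulator works for precipitation or any other per-key average.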
**Word Count**

- Mapper: Splits text into words and emits key-value pairs for each word.
- Reducer: Counts the occurrences of each word.
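This mapper/reducer pair can be sketched as two small functions (combined into one file here for easy local testing; the actual `mapper.py` and `reducer.py` each read stdin line by line and print tab-separated key-value pairs):

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every whitespace-separated word.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Pairs must arrive sorted by key, as Hadoop guarantees
    # between the map and reduce phases.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)
```

Sorting the mapper output before feeding it to the reducer mirrors Hadoop's shuffle phase, which is why local pipeline tests need a `sort` in between.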
You can find sample input and output files in the repository to test the scripts.