This repository contains solutions to common mapper and reducer problems in Hadoop using Python

RobinMillford/Hadoop-MapRaduce-Problems

Hadoop Mapper and Reducer Scripts for Python

This repository contains solutions to common mapper and reducer problems in Hadoop using Python. Most online Hadoop resources target Java, so the scripts here are written in Python and run through Hadoop Streaming.

Hadoop Installation:

Windows:

  • Watch this video for Hadoop installation on Windows.

Ubuntu:

  • Follow this video for Hadoop installation on Ubuntu.

Basic Hadoop Commands:

  1. Format the NameNode (first-time setup only; formatting erases existing HDFS metadata):

    hdfs namenode -format
  2. Start Hadoop Services:

    start-all.sh
  3. Create Input Directory in HDFS:

    hdfs dfs -mkdir /input
  4. Upload Input File to HDFS:

    hdfs dfs -put /path/to/input.txt /input/input.txt
  5. Run Hadoop Streaming:

    hadoop jar /path/to/hadoop-streaming.jar \
    -input /input/input.txt \
    -output /output \
    -file "/path/to/mapper.py" \
    -mapper "python3 mapper.py" \
    -file "/path/to/reducer.py" \
    -reducer "python3 reducer.py"
  6. Copy Output from HDFS to Local File:

    hdfs dfs -text /output/* > /path/to/outputfile.txt
  7. Remove Output and Input Directories from HDFS:

    hadoop fs -rm -r /output
    hadoop fs -rm -r /input
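
When the same job is run repeatedly, it can help to build the step 5 command from Python rather than retyping it. The sketch below only assembles the argument list using the options already shown above; the function name `streaming_command` and its parameters are hypothetical, and actually launching the job (for example with `subprocess.run`) is left to the caller.

```python
import shlex
from pathlib import PurePosixPath


def streaming_command(jar, input_path, output_path, mapper, reducer):
    """Build the argument list for the Hadoop Streaming job from step 5.

    Hypothetical helper: it mirrors the options used in this README and
    does not run anything by itself.
    """
    mapper_name = PurePosixPath(mapper).name    # e.g. "mapper.py"
    reducer_name = PurePosixPath(reducer).name  # e.g. "reducer.py"
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-file", mapper,
        "-mapper", f"python3 {mapper_name}",
        "-file", reducer,
        "-reducer", f"python3 {reducer_name}",
    ]


cmd = streaming_command(
    "/path/to/hadoop-streaming.jar",
    "/input/input.txt",
    "/output",
    "/path/to/mapper.py",
    "/path/to/reducer.py",
)
# shlex.join quotes arguments that contain spaces, so the printed line
# can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the job, assuming the `hadoop` executable is on the PATH.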

Testing Mapper and Reducer Scripts:

You can test the mapper and reducer scripts separately to ensure they work correctly:

  1. Test Mapper Script:

    cat /path/to/input.txt | python3 /path/to/mapper.py
  2. Test Reducer Script (sort the mapper output by key first, as Hadoop does before the reduce phase):

    sort /path/to/mapper_output.txt | python3 /path/to/reducer.py
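
The two tests above can also be chained in a single Python process, which makes the hidden sort step between map and reduce explicit. This is a minimal sketch: `simulate_streaming`, `demo_mapper`, and `demo_reducer` are hypothetical names, and the demo functions are toy stand-ins for real scripts.

```python
def simulate_streaming(map_fn, reduce_fn, lines):
    """Run map_fn, sort its output by key, then feed the result to reduce_fn.

    Imitates the map -> shuffle/sort -> reduce order of a streaming job
    with a single reducer. Records are 'key<TAB>value' strings.
    """
    mapped = list(map_fn(lines))
    shuffled = sorted(mapped, key=lambda rec: rec.split("\t", 1)[0])
    return list(reduce_fn(shuffled))


def demo_mapper(lines):
    """Toy mapper: emit 'token<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def demo_reducer(sorted_records):
    """Toy reducer: total the values of each run of identical keys.

    It only works when equal keys are adjacent, which is why the sort
    step matters when testing a reducer locally.
    """
    current, total = None, 0
    for rec in sorted_records:
        key, value = rec.split("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"


result = simulate_streaming(demo_mapper, demo_reducer, ["b a", "a c a"])
print(result)
```

Feeding the unsorted mapper output straight into `demo_reducer` would emit the key `a` twice, which is the usual symptom of a forgotten sort.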

Algorithm Explanations:

Recommendation System:

  • Mapper: Preprocesses user-item ratings.
  • Reducer: Generates recommendations based on similarity measures between users.
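
One way to picture this split is a user-based sketch with cosine similarity. Everything here is an assumption rather than the repository's method: the `user,item,rating` input format, the function names, the choice of cosine similarity, and the use of a single reducer that can hold all users in memory.

```python
import math
from collections import defaultdict


def map_ratings(lines):
    """Mapper: turn 'user,item,rating' rows into 'user<TAB>item:rating'."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # skip malformed rows
        user, item, rating = fields
        try:
            value = float(rating)
        except ValueError:
            continue  # skip header rows or non-numeric ratings
        yield f"{user}\t{item}:{value}"


def cosine(a, b):
    """Cosine similarity between two sparse {item: rating} dicts."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reduce_recommend(records):
    """Reducer: recommend items rated by each user's most similar user.

    Assumes one reducer receives every record, so all rating vectors
    fit in memory; a large dataset would need a multi-stage job.
    """
    ratings = defaultdict(dict)
    for rec in records:
        if not rec.strip():
            continue
        user, payload = rec.strip().split("\t", 1)
        item, value = payload.rsplit(":", 1)
        ratings[user][item] = float(value)
    for user in sorted(ratings):
        others = sorted(u for u in ratings if u != user)
        if not others:
            continue
        best = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
        if cosine(ratings[user], ratings[best]) == 0.0:
            continue  # no overlapping taste, nothing to base a suggestion on
        unseen = sorted(set(ratings[best]) - set(ratings[user]))
        if unseen:
            yield f"{user}\t{','.join(unseen)}"
```

For example, if `alice` and `bob` rate items `A` and `B` identically and `bob` also rated `C`, the reducer suggests `C` to `alice`.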

PageRank:

  • Mapper: Prepares graph data with nodes and edges.
  • Reducer: Computes the PageRank score of each node to determine its importance in the graph.
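
A single PageRank iteration can be sketched as follows. This is one common MapReduce formulation, not necessarily the one used in the repository's scripts: the mapper passes the link structure through and shares each node's rank among its out-links, and the reducer applies the damped sum. The input format `node<TAB>rank<TAB>neighbor1,neighbor2`, the `LINKS`/`RANK` tags, the function names, and the damping factor of 0.85 are all assumptions.

```python
from itertools import groupby

DAMPING = 0.85  # conventional value; an assumption, not taken from the repository


def map_contributions(lines):
    """Mapper: re-emit each node's links and split its rank among them."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        node, rank, links = (line.split("\t") + [""])[:3]
        neighbors = [n for n in links.split(",") if n]
        joined = ",".join(neighbors)
        yield f"{node}\tLINKS\t{joined}"
        for neighbor in neighbors:
            yield f"{neighbor}\tRANK\t{float(rank) / len(neighbors)}"


def reduce_ranks(sorted_records, num_nodes):
    """Reducer: new_rank = (1 - d) / N + d * sum(incoming contributions).

    Expects records sorted by node. Rank held by nodes without out-links
    is simply dropped in this sketch.
    """
    rows = (rec.rstrip("\n").split("\t") for rec in sorted_records if rec.strip())
    for node, group in groupby(rows, key=lambda row: row[0]):
        links, total = "", 0.0
        for row in group:
            if row[1] == "LINKS":
                links = row[2] if len(row) > 2 else ""
            else:
                total += float(row[2])
        rank = (1 - DAMPING) / num_nodes + DAMPING * total
        yield f"{node}\t{rank:.6f}\t{links}"
```

The reducer output has the same layout as the mapper input, so the job can be rerun on its own output until the ranks stop changing.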

K-Means:

  • Mapper: Assigns data points to clusters based on centroid proximity.
  • Reducer: Updates centroid positions based on cluster assignments.
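
The two steps above correspond to one K-Means iteration, sketched below. The assumptions: each input line is a comma-separated point, the current centroids are passed in as a list (a real job would typically load them from a small file shipped with the script), and the function names are hypothetical.

```python
import math
from itertools import groupby


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_assign(lines, centroids):
    """Mapper: emit 'cluster_index<TAB>x,y,...' for every input point."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        point = [float(v) for v in line.split(",")]
        yield f"{nearest_centroid(point, centroids)}\t{line}"


def reduce_update(sorted_records):
    """Reducer: average the points of each cluster to get its new centroid.

    Expects records sorted by cluster index, as Hadoop delivers them.
    """
    pairs = (rec.strip().split("\t", 1) for rec in sorted_records if rec.strip())
    for cluster, group in groupby(pairs, key=lambda kv: kv[0]):
        points = [[float(v) for v in value.split(",")] for _, value in group]
        centroid = [sum(dim) / len(points) for dim in zip(*points)]
        coords = ",".join(f"{c:.4f}" for c in centroid)
        yield f"{cluster}\t{coords}"
```

The reducer output becomes the centroid list for the next run; repeating the job until the centroids barely move completes the algorithm.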

Weather Data Analysis:

  • Mapper: Extracts relevant weather data from input records.
  • Reducer: Aggregates weather data and computes statistics like average temperature or precipitation.
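
As an illustration of the average-temperature case, the sketch below assumes a simple `station,date,temperature` CSV layout, which may differ from the repository's sample data; the function names are hypothetical as well.

```python
from itertools import groupby


def map_temperatures(lines):
    """Mapper: emit 'station<TAB>temperature' for each well-formed row."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # skip malformed rows
        station, _date, raw_temp = fields
        try:
            temp = float(raw_temp)
        except ValueError:
            continue  # skip the header row or missing readings
        yield f"{station}\t{temp}"


def reduce_average(sorted_records):
    """Reducer: compute the mean temperature per station.

    Expects records sorted by station, as Hadoop delivers them.
    """
    pairs = (rec.strip().split("\t", 1) for rec in sorted_records if rec.strip())
    for station, group in groupby(pairs, key=lambda kv: kv[0]):
        temps = [float(value) for _, value in group]
        yield f"{station}\t{sum(temps) / len(temps):.2f}"
```

Other statistics such as minimum, maximum, or total precipitation follow the same pattern by changing the expression applied to `temps`.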

Word Count:

  • Mapper: Splits text into words and emits key-value pairs for each word.
  • Reducer: Counts the occurrences of each word.
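
A minimal sketch of this pair is shown below. Lowercasing the words and splitting on whitespace are assumptions, and `map_words` and `reduce_counts` are hypothetical names rather than the repository's code.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts of consecutive records that share a word.

    Relies on the input being sorted by key, which Hadoop Streaming
    does before handing records to the reducer.
    """
    pairs = (rec.strip().split("\t", 1) for rec in sorted_records if rec.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In a streaming job each function would sit in its own script and be driven by standard input, for example `for record in map_words(sys.stdin): print(record)` in `mapper.py`.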

Sample Input and Output:

You can find sample input and output files in the repository to test the scripts.

