
Distributed-processing-AWS

Description:

  • Tutorial on how to use Amazon Web Services to process big data with a Hadoop cluster.

First, extract some data from the Twitter API:

python3 twitter_stream.py > big_data.txt
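
The repository's twitter_stream.py is not reproduced in this README. Below is a minimal sketch of what such a streamer might look like, assuming the tweepy 3.x streaming API; the credential placeholders are hypothetical and must be replaced with your own Twitter API keys:

# Hypothetical sketch, not the repository's actual twitter_stream.py (assumes tweepy 3.x).
import tweepy

CONSUMER_KEY = "your-consumer-key"        # placeholders: supply your own credentials
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_SECRET = "your-access-secret"

class StdoutListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        # Print each raw JSON tweet so stdout can be redirected to big_data.txt.
        print(raw_data.strip())
        return True

    def on_error(self, status_code):
        # Disconnect if Twitter signals rate limiting (HTTP 420).
        return status_code != 420

if __name__ == "__main__":
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    tweepy.Stream(auth, StdoutListener()).sample(languages=["en"])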

To run locally:

python3 twitter_analyzer.py ./big_data.txt > output.txt
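
twitter_analyzer.py itself is not shown in this README. Since it is driven by mrjob (see the EMR command below), a plausible skeleton, assuming the job scores tweets against the AFINN-111 word list and groups them by location, might look like the following; the class name and details are illustrative, not the repository's actual code:

# Illustrative mrjob skeleton, not the repository's actual twitter_analyzer.py.
import json
from mrjob.job import MRJob

class MRTwitterSentiment(MRJob):

    def configure_args(self):
        # --dic and --states mirror the options used in the EMR command below;
        # add_file_arg ships each file to every Hadoop task.
        super(MRTwitterSentiment, self).configure_args()
        self.add_file_arg('--dic', default='AFINN-111.txt')
        self.add_file_arg('--states', default='States-USA.csv')

    def mapper_init(self):
        # Load the AFINN-111 word/score pairs once per mapper.
        self.scores = {}
        with open(self.options.dic) as f:
            for line in f:
                word, score = line.rsplit('\t', 1)
                self.scores[word] = int(score)

    def mapper(self, _, line):
        try:
            tweet = json.loads(line)
        except ValueError:
            return
        text = tweet.get('text', '')
        sentiment = sum(self.scores.get(w.lower(), 0) for w in text.split())
        # A real implementation would map the tweet's location to a US state
        # using States-USA.csv; the place's full_name is used here as a stand-in key.
        place = tweet.get('place') or {}
        yield place.get('full_name', 'unknown'), sentiment

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRTwitterSentiment.run()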

To run on a Hadoop cluster on AWS:

# On the cluster's master node (connect with ssh):
ssh -i Ireland.pem *your cluster*
sudo update-alternatives --set python /usr/bin/python3.4
sudo easy_install pip
sudo pip install mrjob
mkdir project
cd project/

# From your local machine, copy the job files to the cluster:
scp -i Ireland.pem mrjob.conf *your cluster*:project
scp -i Ireland.pem twitter_analyzer.py *your cluster*:project
scp -i Ireland.pem AFINN-111.txt *your cluster*:project
scp -i Ireland.pem States-USA.csv *your cluster*:project

Then upload big_data.txt to an S3 bucket, create a directory for your output (a scripted upload sketch follows the command below), and run:

python3 twitter_analyzer.py -r emr *your S3 path file*/big_data.txt \
    --conf-path mrjob.conf \
    --states=States-USA.csv \
    --dic=AFINN-111.txt \
    --output-dir=*your S3 path output* \
    --instance-type m3.xlarge \
    --num-core-instances 2 \
    --region eu-west-1
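
If you prefer to script the upload step above rather than use the S3 console, a minimal sketch with boto3 follows; the bucket and key names are placeholders, and your AWS credentials must already be configured:

# Minimal boto3 upload sketch; bucket and key names are placeholders.
import boto3

s3 = boto3.client('s3')
# Upload the raw tweet dump to the bucket the EMR job will read from.
s3.upload_file('big_data.txt', 'your-bucket-name', 'input/big_data.txt')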
