Data-Processing-With-Spark

This repository contains data processing using Spark for MapReduce as a part of an academic project for data intensive computing. The project required obtaining word cooccurrence (for n-grams, n = 2, 3) with “line” as the scope or context for co- occurrence.
We not only pick adjacent words but also all words (forward) co-occurring with the current word but within the line.

Word Count on Classical Latin Text

This activity involved performing multiple pass on the input to obtain a specialized wordount.

Pass 1: Lemmetization using the lemmas.csv file

Pass 2: Identify the words in the texts by <word <docid, [chapter#, line#]> for two documents.

Pass 3: Repeat this for multiple documents.

The rough MR algorithm can be descirbed as

for each word in the text

normalize the word spelling by replacing j with i and v with u throughout check lemmatizer

for the normalized spelling of the word

if the word appears in the lemmatizer
	obtain the list of lemmas for this word
	for each lemma, create a key/value pair from the lemma and the location 
    where the word was found
    
else	
	create a key/value pair from the normalized spelling and the location 
    where the word was found

The output looked like this

Bigram Output

('{pronuntio, sinis}', [u'<sen. eld. fr. 1.4>'])
('{iuro, testis2}', [u'<sen. eld. fr. 1.3>'])
('{ego, non}', [u'<sen. eld. fr. 1.2>'])
(u'{inuolaturum, sui}', [u'<sen. eld. fr. 1.4>'])
('{qui, vinus}', [u'<mac. frag 5>', u'<mac. frag 5>'])
('{quidam, hic}', [u'<sen. eld. fr. 1.2>'])
('{vergilius, voco}', [u'<sen. eld. fr. 1.4>'])
('{dico2, qui}', [u'<mac. frag 7>'])
('{seneco, quis2}', [u'<sen. eld. fr. 1.2>', u'<sen. eld. fr. 1.2>', u'<sen. eld. fr. 1.2>'])

Trigram output

('{in, vir, convenio}', [u'<sen. eld. fr. 1.3>'])
('{audax, ut, sequor}', [u'<sen. eld. fr. 1.2>'])
('{non, quippe, vinum}', [u'<mac. frag 5>'])
('{sino, seneco, quocumque}', [u'<sen. eld. fr. 1.2>'])
('{verum2, quis2, filius}', [u'<sen. eld. fr. 1.2>', u'<sen. eld. fr. 1.2>', u'<sen. eld. fr. 1.2>', u'<sen. eld. fr. 1.2>'])
(u'{declamator, seneco, caligo\u201d}', [u'<sen. eld. fr. 1.2>'])

The code and instructions to run the code are present in Code folder

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
charts		charts
sample_2gram_output		sample_2gram_output
sample_3gram_output		sample_3gram_output
sample_input		sample_input
.gitignore		.gitignore
LICENSE		LICENSE
Lab5_PlotForFeaturedActivity.ipynb		Lab5_PlotForFeaturedActivity.ipynb
README.md		README.md
Report.pdf		Report.pdf
cooccurence.py		cooccurence.py
la.lexicon.csv		la.lexicon.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data-Processing-With-Spark

Word Count on Classical Latin Text

Comparison of runtime for different n-grams

Bigram Scalability Chart

About

Releases

Packages

Languages

License

monisjaved/Data-Processing-With-Spark

Folders and files

Latest commit

History

Repository files navigation

Data-Processing-With-Spark

Word Count on Classical Latin Text

Comparison of runtime for different n-grams

Bigram Scalability Chart

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages