This project comprises four sub-projects (S1-S4), each testing a different approach to word counting and TF-IDF scoring.
- S1 takes the books contained in the data folder and, after converting them to a uniform letter case, calculates the top 40 most commonly used words (see the PySpark sketch after this list).
- S2 repeats the task of S1 but also removes “stop words” located in the data folder.
- S3 assumes the goals of S2 and, in addition, removes leading and trailing punctuation from each word.
- S4 calculates the top 5 TF-IDF scores from each book and compiles them together (a sketch follows the usage examples below).
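The S1-S3 steps boil down to a short PySpark pipeline. Below is a minimal sketch, assuming the books live as plain text under ./data and using illustrative names throughout; it is not the project's actual code.

```python
import string

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# S1: lowercase every whitespace-separated token in every book.
words = (sc.textFile("./data/*.txt")
           .flatMap(lambda line: line.lower().split()))

# S3: strip leading and trailing punctuation from each token.
words = words.map(lambda w: w.strip(string.punctuation))

# S2: drop stop words loaded from the side file, plus empty tokens.
stopwords = set(sc.textFile("./data/stopwords.txt").collect())
words = words.filter(lambda w: w and w not in stopwords)

# Count occurrences and keep the 40 most common, as S1 asks.
top40 = (words.map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
              .takeOrdered(40, key=lambda kv: -kv[1]))
print(top40)
```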
The project requires:
- Python 3.5
- Apache Spark
The project is segmented into two main files: s1_s3.py, which covers sections S1-S3, and s4.py, which handles the requirements of S4.
Example terminal commands for solving problems S1-S4 are listed below:
```bash
python s1_s3.py --outfile ./result/sp1.json
python s1_s3.py --stopwords ./data/stopwords.txt --outfile ./result/sp2.json
python s1_s3.py --stopwords ./data/stopwords.txt --outfile ./result/sp3.json --punctuations True
python s4.py --outfile ./result/sp4.json
```
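For S4, one common definition scores a word in a book as its count in that book times log(N / df), where N is the number of books and df is how many books contain the word. A hedged sketch of that computation, with assumed paths and names rather than the project's real code, might look like:

```python
import math
import os

from pyspark import SparkContext

sc = SparkContext(appName="tfidf-sketch")

# wholeTextFiles yields one (path, contents) pair per book.
books = sc.wholeTextFiles("./data/*.txt")
n_docs = books.count()

# ((book, word), 1) for every lowercased token.
pairs = books.flatMap(lambda bc: [((os.path.basename(bc[0]), w), 1)
                                  for w in bc[1].lower().split()])

# Term frequency: how often each word appears in each book.
tf = pairs.reduceByKey(lambda a, b: a + b)

# Document frequency: how many books contain each word.
df = (tf.map(lambda kv: (kv[0][1], 1))
        .reduceByKey(lambda a, b: a + b))

# TF-IDF = tf * log(N / df); regroup to keep the top 5 per book.
tfidf = (tf.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))
           .join(df)
           .map(lambda kv: (kv[1][0][0],
                            (kv[0], kv[1][0][1] * math.log(n_docs / kv[1][1])))))
top5 = tfidf.groupByKey().mapValues(
    lambda scores: sorted(scores, key=lambda s: -s[1])[:5])
print(top5.collect())
```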
Both scripts accept the following command-line arguments (a parsing sketch follows the list):
- --file: a single text file or a directory of files to process.
- --stopwords: location of the text file containing the stop words.
- --top: how many of the top-counted words to return.
- --punctuations: if True, leading and trailing punctuation is removed from each word.
- --outfile: location where the resulting JSON file is stored.
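As a rough illustration, the flags above could be wired up with argparse along these lines; the defaults shown are guesses rather than the scripts' actual values.

```python
import argparse

parser = argparse.ArgumentParser(description="Word counting / TF-IDF")
parser.add_argument("--file", default="./data",
                    help="a single text file or a directory of files")
parser.add_argument("--stopwords", default=None,
                    help="path to the stop-word list")
parser.add_argument("--top", type=int, default=40,
                    help="how many of the top-counted words to return")
parser.add_argument("--punctuations", default="False",
                    help="pass True to strip leading/trailing punctuation")
parser.add_argument("--outfile", required=True,
                    help="where the resulting JSON file is written")
args = parser.parse_args()
strip_punct = args.punctuations == "True"  # the flag arrives as a string
```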