kafka
spark
stanford-corenlp
mongodb
petrarch2
data -> kafka -> spark RDD -> simhash algorithm -> stanford-corenlp -> mongo -> create_final_xml -> petrarch
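The simhash step filters near-duplicate articles before they reach CoreNLP. Below is a minimal, illustrative sketch of a 64-bit SimHash fingerprint assuming a plain whitespace tokenizer; it is not the exact implementation used in the job.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative 64-bit SimHash: near-duplicate texts yield fingerprints
// with a small Hamming distance, so they can be dropped before parsing.
object SimHash {
  def fingerprint(text: String): Long = {
    val weights = new Array[Int](64)
    for (token <- text.toLowerCase.split("\\s+") if token.nonEmpty) {
      // Build a 64-bit token hash out of two 32-bit Murmur hashes
      val h = (MurmurHash3.stringHash(token).toLong << 32) |
              (MurmurHash3.stringHash(token.reverse) & 0xffffffffL)
      for (bit <- 0 until 64)
        weights(bit) += (if (((h >>> bit) & 1L) == 1L) 1 else -1)
    }
    (0 until 64).foldLeft(0L) { (fp, bit) =>
      if (weights(bit) > 0) fp | (1L << bit) else fp
    }
  }

  // Two texts are treated as duplicates if this falls below a small threshold (e.g. 3)
  def hammingDistance(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)
}
```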
sbt clean
sbt compile
sbt assembly
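`sbt assembly` relies on the sbt-assembly plugin to produce the fat jar used in the commands below. A rough build.sbt sketch, where the names and versions are assumptions rather than the actual build file:

```scala
// build.sbt (illustrative)
name := "deduplication"
scalaVersion := "2.11.12"

// requires addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.x") in project/plugins.sbt
assemblyJarName in assembly := "deduplication-fat.jar"
```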
# Producer is used to dump the initial data into Kafka
java -jar -Dinput=./src/main/input/test target/scala-2.11/deduplication-fat.jar # set main-class as Producer
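Roughly, the Producer main class reads the file given by `-Dinput` and publishes each record to a Kafka topic. A hedged sketch, where the topic name and broker address are assumptions:

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object Producer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // -Dinput=... is picked up here; the default matches the example command above
    val input = sys.props.getOrElse("input", "./src/main/input/test")
    Source.fromFile(input).getLines().foreach { line =>
      producer.send(new ProducerRecord[String, String]("articles", line))
    }
    producer.close()
  }
}
```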
# OneTime runs the deduplication algorithm + sentence parsing via CoreNLP
java -jar -Doffset=3 target/scala-2.11/deduplication-fat.jar # set main-class as OneTime
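The CoreNLP part of OneTime splits each surviving (deduplicated) article into sentences before it is written to Mongo. A minimal sketch of that step, assuming only the tokenize and ssplit annotators are needed; the real job also consumes from Kafka starting at `-Doffset` and runs the simhash filter first.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation

object SentenceSplitter {
  // Building the pipeline is expensive, so keep a single lazily created instance
  lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit")
    new StanfordCoreNLP(props)
  }

  def sentences(text: String): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    doc.get(classOf[SentencesAnnotation]).asScala.map(_.toString)
  }
}
```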
# MongoToXml reads the data from Mongo and creates the final XML file
java -jar -Doutput=./src/main/output/result.xml target/scala-2.11/deduplication-fat.jar # set main-class as MongoToXml
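MongoToXml essentially dumps the stored sentences into the `<Sentences>/<Sentence>/<Text>` layout that PETRARCH reads. A rough sketch, where the database, collection, and field names are assumptions:

```scala
import com.mongodb.client.MongoClients
import scala.collection.JavaConverters._
import scala.xml.XML

object MongoToXml {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")
    val docs = client.getDatabase("dedup").getCollection("articles").find().asScala.toSeq

    val xml =
      <Sentences>
        {docs.map { d =>
          <Sentence id={d.get("_id").toString} date={d.getString("date")}>
            <Text>{d.getString("text")}</Text>
          </Sentence>
        }}
      </Sentences>

    // -Doutput=... is picked up here; the default matches the example command above
    val output = sys.props.getOrElse("output", "./src/main/output/result.xml")
    XML.save(output, xml, "UTF-8", xmlDecl = true)
    client.close()
  }
}
```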
# The final XML file is provided as input to petrarch2; it is only compatible with Python 2.7
./src/main/scripts/run_event_coding.sh /anaconda3/lib/python3.6/site-packages/petrarch2/petrarch2.py ./src/main/resources/output/result.xml ./src/main/resources/output/petrarch_result.xml
You would need at least 6 GB of executor memory to parse the sentences via stanford-corenlp. The Spark cluster I was using had a hard limit of 4 GB.
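If you control how the Spark context is built, the executor memory request can be raised there; a hypothetical snippet (a cluster-level cap, like the 4 GB limit above, can still block the request):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Request 6 GB executors for the CoreNLP parsing stage
val conf = new SparkConf()
  .setAppName("deduplication")
  .set("spark.executor.memory", "6g")
val sc = new SparkContext(conf)
```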
Read config information from separate conf files.
Run the jar with multiple class names rather than a separate jar for each.
Create a separate Spark cluster with max executor memory of around 10 GB.