
PhraseExtract

Hadoop MapReduce assignment: distributed phrase extraction (unregistered word discovery).

This project performs distributed phrase extraction on top of a mature tokenizer, IK Analyzer. The pipeline is built with MapReduce; a Spark version is on the way.

Dataset

We use the FinancialDatasets corpus provided by https://github.com/smoothnlp/FinancialDatasets.

Main Pipeline

We calculate three kinds of scores: PMI (Pointwise Mutual Information), Entropy, and Segment Entropy.

The final score = PMI + Entropy - Segment Entropy.

These are computed successively, and the final results are sorted by score.
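
For orientation, these are the standard definitions such scores are usually built on (a sketch only; the exact formulas this project uses are defined in the scoring jobs). For a candidate phrase w formed from adjacent segments a and b, with L(w) the set of tokens that appear immediately to its left:

    \mathrm{PMI}(w) = \log \frac{p(w)}{p(a)\,p(b)}, \qquad
    H_{\mathrm{left}}(w) = -\sum_{l \in L(w)} p(l \mid w)\,\log p(l \mid w)

The right-neighbor entropy is defined symmetrically. High PMI means the two parts co-occur far more often than chance, and high boundary entropy means the candidate appears in many different contexts; the Segment Entropy term, being subtracted, acts as a penalty in the final score.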

How to use

We use Maven to manage dependencies; the exact dependency versions can be found in pom.xml.

A mature tokenizer needs to be installed in your local Maven repository. An easy tutorial is available here: https://github.com/wks/ik-analyzer.
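
As a sketch, installing the tokenizer locally usually looks like the following (the Maven coordinates are illustrative; use whatever the tutorial and your pom.xml actually declare):

    # Build and install the Maven-ized IK Analyzer fork into the local repository
    git clone https://github.com/wks/ik-analyzer.git
    cd ik-analyzer
    mvn install

    # Or install a prebuilt jar by hand (groupId/artifactId/version are illustrative)
    mvn install:install-file -Dfile=IKAnalyzer.jar \
        -DgroupId=org.wltea -DartifactId=ik-analyzer \
        -Dversion=3.2.8 -Dpackaging=jar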

You can plug in a DIY tokenizer and split sentences however you want by changing Utils.splitLine.

Every class has a main method for testing that its part works on its own. Main.main fuses them all; pass two parameters, InputPath and OutputPath, to run the whole pipeline.
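
A sketch of a typical run (the jar path, main-class name, and HDFS paths are illustrative; adjust them to however you actually build the jar, e.g. the IDE artifact mentioned in the FAQ below):

    # Package the project (if pom.xml is set up to build a runnable jar)
    mvn clean package

    # Submit the full pipeline with the two required arguments: InputPath and OutputPath
    hadoop jar PhraseExtract.jar Main hdfs:///data/financial_corpus hdfs:///output/phrases

If the jar's manifest already declares a main class, the class-name argument can be dropped.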

FAQ

  1. Connection Timed Out while running preWordEntropyJob and postWordEntropyJob:

    Try starting the history server:

    bash mr-jobhistory-daemon.sh start historyserver
  2. Hadoop java.io.IOException: Mkdirs failed to create /some/path:

    Remove META-INF/LICENSE from the jar, as discussed here.

    An example:

    zip -d out/artifacts/PhraseExtract_jar/PhraseExtract.jar META-INF/LICENSE
    zip -d out/artifacts/PhraseExtract_jar/PhraseExtract.jar LICENSE
