GitHub - NengQian/text-analysis-lab

Overview

Written in Python, using the spaCy library
Applies the TF/IDF and LDA methods to analyze text

Folder/Structure

html contains the HTML versions of all files, including the graphs
img contains the charts produced by the two methods, TF/IDF and LDA
presentation contains the slides of our presentation
src contains all the configuration we need
file LDA applies the LDA method
file TFIDF applies the TF/IDF method

Theory and Algorithm

preprocessing

remove all unnecessary symbols
remove all stop words
remove all numbers
convert all words to lowercase
reduce every word to its lemma, so that inflected forms are counted together
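
The steps above can be sketched with spaCy's token attributes. This is a minimal illustration, not the code from this repository: the function names, the choice of `is_alpha` as the symbol filter, and the model name `en_core_web_sm` are all assumptions. The filtering helpers only rely on the attributes `is_alpha`, `is_stop`, `like_num` and `lemma_`, so the demo at the bottom uses stand-in tokens and runs without spaCy installed.

```python
from types import SimpleNamespace


def keep_token(token):
    """Return True for tokens that survive the filtering steps."""
    return (
        token.is_alpha          # drops symbols, punctuation, whitespace and digits
        and not token.is_stop   # drops stop words
        and not token.like_num  # drops spelled-out numbers such as "ten"
    )


def normalize(tokens):
    """Filter the tokens, then map each one to its lowercased lemma."""
    return [token.lemma_.lower() for token in tokens if keep_token(token)]


def preprocess_text(text, model_name="en_core_web_sm"):
    """Run spaCy on raw text and normalize the result.

    Assumption: the named model is installed; the repository may use another.
    """
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model_name)
    return normalize(nlp(text))


def fake(text, lemma, stop=False, num=False):
    """Hypothetical stand-in that mimics the spaCy token attributes used above."""
    return SimpleNamespace(
        lemma_=lemma, is_alpha=text.isalpha(), is_stop=stop, like_num=num
    )


demo = [
    fake("The", "the", stop=True),
    fake("Sales", "sale"),
    fake("increased", "increase"),
    fake("10", "10", num=True),
    fake("%", "%"),
]
print(normalize(demo))  # ['sale', 'increase']
```

Note that filtering on `is_alpha` also discards mixed tokens such as model names containing digits; a looser filter may suit some reports better.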

TF/IDF

Stands for term frequency-inverse document frequency
Our goal: find the most important words in a given text and track how the keywords change over several years
Processing: input text file → results

Taking BMW as an example:

input [BMW-AnnualReport-2010 to 2017]
do Preprocessing
output content

input content
do TF/IDF
output TF/IDF-matrix

Result: every word together with its importance value
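
As a rough illustration of how such an importance value can be computed, the sketch below implements one common TF/IDF variant with the standard library only: term frequency is the share of the document's tokens, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the toy tokens are assumptions; the repository may use a library implementation with different weighting or smoothing.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists, already preprocessed.
    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented toy tokens, one list per report year; not taken from the real reports.
reports = {
    2010: ["vehicle", "sale", "crisis", "vehicle"],
    2017: ["vehicle", "sale", "electric", "electric"],
}
matrix = tf_idf(list(reports.values()))
for year, row in zip(reports, matrix):
    top = max(row, key=row.get)
    print(year, top)
# prints:
# 2010 crisis
# 2017 electric
```

Words that occur in every report, such as "vehicle" here, get a score of zero under this variant, which is why the year-specific words rise to the top.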

LDA

Stands for Latent Dirichlet Allocation
Our goal: find several topics in a given text, identify the main pages for each topic, and track how the topics change over several years
Processing: train the model → input text file → results
Taking Commerzbank as an example:

input [many reports from the banks]
do Preprocessing
do Train the model
output model

input commerzbank_report
do LDA
output topics

Result: topics of the text
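
To make the idea concrete, here is a small standard-library sketch of LDA fitted by collapsed Gibbs sampling, treating each page as one document. It is an illustration under stated assumptions, not the repository's implementation, which may rely on a library and a different inference method. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy pages are all invented for this example. It fits the topics and the per-page mixtures in one pass and does not cover applying a trained model to an unseen report.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    docs: list of token lists. Returns (vocab, theta, phi), where
    theta[d][k] is the share of topic k in document d and
    phi[k][v] is the probability of vocab[v] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
         for v in range(n_words)]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


# Invented toy pages; not taken from any real report.
pages = [
    ["loan", "credit", "interest", "loan"],
    ["credit", "loan", "risk", "interest"],
    ["employee", "training", "diversity", "employee"],
    ["training", "employee", "diversity", "staff"],
]
vocab, theta, phi = lda_gibbs(pages, num_topics=2)
for k, row in enumerate(phi):
    top_words = [w for _, w in sorted(zip(row, vocab), reverse=True)[:3]]
    print("topic", k, top_words)
```

Each row of `theta` describes one page, so the pages with the largest value in column `k` are the main pages for topic `k`.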

Result

Visualization: the charts are drawn with the plotly library
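
A keyword trend chart of this kind could be produced roughly as follows. This is a sketch under assumptions: the helper names `trend_series` and `plot_trend` and the placeholder scores are invented for illustration, and the repository's notebooks may build their charts differently. plotly is imported inside the plotting function, so the data preparation runs without it.

```python
def trend_series(scores_by_year, word):
    """Return (years, values) for one word, using 0.0 where the word is absent."""
    years = sorted(scores_by_year)
    values = [scores_by_year[year].get(word, 0.0) for year in years]
    return years, values


def plot_trend(scores_by_year, words):
    """Draw one line per word and return the plotly Figure."""
    import plotly.graph_objects as go  # lazy import: only needed for plotting

    fig = go.Figure()
    for word in words:
        years, values = trend_series(scores_by_year, word)
        fig.add_trace(
            go.Scatter(x=years, y=values, mode="lines+markers", name=word)
        )
    fig.update_layout(
        title="Keyword trend", xaxis_title="Year", yaxis_title="TF/IDF score"
    )
    return fig


# Placeholder scores for illustration only; not results from the repository.
scores_by_year = {
    2010: {"crisis": 0.17},
    2017: {"electric": 0.35},
}
years, values = trend_series(scores_by_year, "electric")
# With plotly installed, the chart can be shown with:
# plot_trend(scores_by_year, ["crisis", "electric"]).show()
```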

TF/IDF

LDA

Reference

spaCy library: https://spacy.io/

