
NengQian/text-analysis-lab

 
 


Overview

  • written in Python, using the spaCy library
  • applies the methods TF/IDF and LDA to analyze text

Folder/Structure

  • html contains the HTML versions of all files, including the graphs
  • img contains the charts produced by the two methods, TF/IDF and LDA
  • presentation contains the slides of our presentation
  • src contains all the configuration we need
  • the file LDA applies the LDA method
  • the file TFIDF applies the TF/IDF method

Theory and Algorithm

Preprocessing

  • remove all unnecessary symbols
  • remove all stop words
  • remove all numbers
  • convert all words to lowercase
  • reduce every word to its lemma
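
The five steps above can be sketched in a few lines. This is only an illustration using the standard library: the project itself uses spaCy, whose stop-word list and lemmatizer are far more complete. The names `STOP_WORDS`, `LEMMAS`, and `preprocess` are hypothetical and do not come from the repository.

```python
import re

# Tiny illustrative stop-word list and lemma table (assumptions for this
# sketch; spaCy ships much larger, language-specific resources).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "by", "is", "are"}
LEMMAS = {"cars": "car", "sales": "sale", "increased": "increase"}


def preprocess(text):
    """Return the cleaned lemmas of `text` as a list of strings."""
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # drop symbols and numbers (ASCII only)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [LEMMAS.get(t, t) for t in tokens]            # map each word to its lemma


tokens = preprocess("The sales of BMW cars increased by 5% in 2017!")
print(tokens)
```

Lowercasing is done first here so that the stop-word and lemma lookups only need lowercase keys; the order of the remaining steps does not change the result for this input.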

TF/IDF

  • Stands for term frequency-inverse document frequency
  • Our goal: find the most important words in a given text and track how the key words change over several years
  • Processing: input text file → results
  • Taking BMW as an example:
input [BMW-AnnualReport-2010 to 2017]
do Preprocessing
output content
input content
do TF/IDF
output TF/IDF-matrix
  • Result: every word with its importance value
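
To make the "importance value" concrete, here is a minimal sketch of one common TF/IDF formulation: term frequency (count divided by document length) multiplied by the logarithm of the number of documents over the number of documents containing the term. It is not taken from the repository, and library implementations often use smoothed variants, so the numbers may differ. The function name `tf_idf` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns a dict: name -> {term: TF/IDF score}.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


# Toy stand-ins for two preprocessed annual reports (made-up tokens).
reports = {
    "2010": ["car", "sale", "crisis", "car"],
    "2017": ["car", "sale", "electric", "electric"],
}
scores = tf_idf(reports)
top_2017 = max(scores["2017"], key=scores["2017"].get)
print(top_2017)
```

A word that appears in every report, such as "car" here, gets a score of zero, which is why the method highlights the words that distinguish one year from another.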

LDA

  • Stands for Latent Dirichlet Allocation
  • Our goal: find several topics in a given text, identify the main pages for each topic, and track how the topics change over several years
  • Processing: train the model → input text file → results
  • Taking Commerzbank as an example:
input [many reports from the banks]
do Preprocessing
do Train the model
output model
input commerzbank_report
do LDA
output topics
  • Result: the topics of the text
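
For readers who want to see what "train the model" involves, the following is a small sketch of LDA fitted by collapsed Gibbs sampling, using only the standard library. It is an assumption-laden illustration, not the repository's code: the notebooks may well rely on a library implementation, and the names `lda_gibbs` and `top_words`, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on `docs` (a list of token lists) by collapsed Gibbs sampling.

    Returns (doc_topic, topic_word, vocab), where the first two are count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, k, n=3):
    """Return the n most frequent words assigned to topic k."""
    row = topic_word[k]
    order = sorted(range(len(vocab)), key=lambda i: -row[i])
    return [vocab[i] for i in order[:n]]


# Toy stand-ins for preprocessed report pages (made-up tokens).
docs = [
    ["loan", "credit", "risk", "loan"],
    ["credit", "risk", "capital"],
    ["employee", "training", "diversity"],
    ["training", "employee", "employee"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for k in range(2):
    print(k, top_words(topic_word, vocab, k))
```

The `doc_topic` table is what connects topics back to pages: the row for a page shows how many of its tokens were assigned to each topic, so the pages with the highest counts for a topic are its main pages.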

Result

  • Visualization: the charts are drawn with the plotly library

TF/IDF

[Three TF/IDF result charts; images not shown]

LDA

[Two LDA result charts; images not shown]

