Predict Documents Label Project

  1. I use the Google Natural Language API to classify the text of the AI news data that Dr. Eckroth provided for the CREU project (a sketch of the API call follows this item). ----> BDfinal.py

    • running time for this step was approximately 5 hours for the entire dataset of AI news
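
A minimal sketch of the classification call for a single article, assuming BDfinal.py uses the google-cloud-language client with credentials set via GOOGLE_APPLICATION_CREDENTIALS (the actual script may batch the dataset differently):

```python
# Sketch only: classify one article with the Google Natural Language API.
# Assumes google-cloud-language is installed and credentials are configured.
from google.cloud import language_v1

def classify_article(text):
    """Return (category_name, confidence) pairs for one article."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.classify_text(request={"document": document})
    return [(c.name, c.confidence) for c in response.categories]

if __name__ == "__main__":
    print(classify_article("Researchers released a new AI model for news summarization."))
```
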
  2. After classifying those news articles, I split the labeled data into training and testing sets and attempt to train a bag-of-words model for future classification -----> spark_train.py

    • since there is not enough memory to train the model on the entire dataset, I train it on a subset of the data. However, the accuracy is quite low; the table below lists the accuracy for each training size (the number of lines taken from the csv file). A minimal sketch of this training setup appears after this item.

      training size   tf-idf   CountVectorizer
      5000            0.38     0.371
      6000            0.329    0.328
      7000            0.365    0.3859
      8000            0.350    0.378
    • at this point I figure that increasing the training size will not improve the accuracy, so I suspect the problem lies in the training data itself and run some exploratory checks:

      -- the training data is quite skewed: for example, some categories have more than 4000 observations while others have only 1.

      -- the categories are also not uniform in granularity: some are much more detailed than others, e.g. one article is classified as Arts & Entertainment/Fun & Trivia/Flash-Based Entertainment while another is simply "Reference"
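
A minimal sketch of the bag-of-words experiment above. The classified_articles.csv path, the text/label column names, and the LogisticRegression classifier are assumptions for illustration; spark_train.py may differ:

```python
# Sketch only: train and score tf-idf and CountVectorizer models on a subset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("classified_articles.csv").head(8000)  # subset that fits in memory
print(df["label"].value_counts())                       # shows the class skew noted above

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

for name, vec in [("tf-idf", TfidfVectorizer()), ("CountVectorizer", CountVectorizer())]:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
    print(f"{name}: accuracy = {acc:.3f}")
```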

  3. Classification step -----> save_and_load_model.py

    • suppose we want to classify the contents of text.txt

    • run this command: python save_and_load_model.py text.txt
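
A minimal sketch of what that command might do, assuming the fitted vectorizer and classifier from the training step were saved with joblib under hypothetical file names (the real save_and_load_model.py may differ):

```python
# Sketch only: load a saved vectorizer/model and predict the label of a text file.
import sys

import joblib

vectorizer = joblib.load("vectorizer.pkl")  # fitted CountVectorizer / TfidfVectorizer
model = joblib.load("model.pkl")            # classifier trained in spark_train.py

with open(sys.argv[1], encoding="utf-8") as f:
    text = f.read()

features = vectorizer.transform([text])
print("Predicted label:", model.predict(features)[0])
```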

  4. Summary of tools:

    • Exploratory analysis: Spark

      -- sample code to create a list of labeled data from the Google API

      -- sample code to train my own model from that labeled data

      -- retrieve a summary of the labeled data

    • Distributed workers: Spark -- pipe articles to the Python code (see the sketch after this list)

    • Numeric/string processing: Spark + Google API + sklearn

    • Machine learning: sklearn within Spark
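
A minimal sketch of the "pipe articles to the Python code" idea, using Spark's RDD.pipe to stream one article per line through an external worker script; articles.txt and classify_article.py are placeholders, not files from this repository:

```python
# Sketch only: distribute articles across Spark workers and pipe each partition
# through an external Python script that reads stdin and writes one label per line.
from pyspark import SparkContext

sc = SparkContext(appName="pipe-articles")

articles = sc.textFile("articles.txt")                 # one article per line
labels = articles.pipe("python classify_article.py")   # external worker script
for line in labels.take(5):
    print(line)

sc.stop()
```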

  5. Current work:

    • Separate out the hierarchy of categories and get the largest counts

    • narrow the counts to the top 5 categories
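
A minimal sketch of this planned step, splitting the slash-delimited Google category paths into their top-level category and keeping the five most frequent ones (file and column names are carried over from the training sketch above):

```python
# Sketch only: count top-level categories and keep the 5 largest.
import pandas as pd

df = pd.read_csv("classified_articles.csv")
df["top_level"] = df["label"].str.lstrip("/").str.split("/").str[0]

top5 = df["top_level"].value_counts().head(5)
print(top5)

# Optionally restrict the training data to those 5 broad categories.
df_top5 = df[df["top_level"].isin(top5.index)]
```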
