# semantic_doc

Testing some semantic models for document classification.

## Files

`edgar_scrapper.py` -- Downloads document data from https://www.sec.gov/Archives/edgar/data/51143/ and saves each document as a PDF named `[name_of_folder]_[name_of_file]_[label(type)].pdf`. Over 7,000 files were gathered across 34 classes.
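The naming scheme can be sketched as a small helper (hypothetical; the actual script may build the names differently):

```python
import os

def make_filename(folder: str, filename: str, label: str) -> str:
    """Build a name of the form [name_of_folder]_[name_of_file]_[label(type)].pdf."""
    base, _ = os.path.splitext(filename)  # drop the original extension
    return f"{folder}_{base}_{label}.pdf"
```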

`get_data_classes.py` -- Extracts the content of the files and provides functions to split the data into training and test sets.
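A minimal sketch of such a split, using only the standard library (the repo's actual function names and ratios are unknown):

```python
import random

def split_data(texts, labels, test_size=0.2, seed=42):
    """Shuffle (document, label) pairs and split them into train/test sets."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    n_test = int(len(pairs) * test_size)
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test
```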

`data_preprocess.py` -- Provides functions to preprocess the data for the models.
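A typical preprocessing step for topic models looks like the following stdlib-only sketch (the stopword set and token rules here are illustrative assumptions, not the repo's actual choices):

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller set.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for"}

def preprocess(text: str) -> list:
    """Lowercase, keep alphabetic tokens only, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]
```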

`models.py` -- Code for training and testing the different models.

## Results

LDA: topic modeling used as input features for classification. Partially based on https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28
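In this approach each document's topic distribution becomes the feature vector fed to the classifiers. With gensim-style LDA, a document comes back as a sparse list of `(topic_id, probability)` pairs, which must be densified before downstream classifiers can consume it. A minimal sketch of that conversion (assuming gensim's `get_document_topics` output format):

```python
def topic_vector(doc_topics, num_topics):
    """Convert a sparse [(topic_id, prob), ...] LDA output into a dense
    feature vector of length num_topics for classifier input."""
    vec = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        vec[topic_id] = prob
    return vec
```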

Classifier scores for the LDA bag-of-words model:

- Logistic Regression (SGD): 0.640555
- SVM (Huber loss): 0.640555
- Random Forest: 0.640555

Classifier scores for the LDA bigram model:

- Logistic Regression (SGD): 0.640555
- SVM (Huber loss): 0.000112
- Random Forest: 0.640555

Classifier scores for the LDA trigram model:

- Logistic Regression (SGD): 0.641057
- SVM (Huber loss): 0.640555
- Random Forest: 0.000075

Note: There was a problem with the default BLAS library used by NumPy and SciPy, where the multicore functionality did not work properly. Switching to the OpenBLAS library fixed the multicore issues. OpenBLAS is also considerably better optimized than the default library, so the models now run much faster. The TF-IDF based model is still too heavy for a single laptop.
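To verify which BLAS backend NumPy is actually linked against (and confirm the switch to OpenBLAS took effect), the build configuration can be inspected:

```python
import numpy as np

# Print NumPy's build configuration; after switching libraries the BLAS
# section of the output should mention "openblas" rather than the
# reference BLAS implementation.
np.show_config()
```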