The repository implements common algorithms for multi-class text classification. Note that these are prototypes intended for experimental purposes only.
- Word- or char-level representation: chi-square + tf-idf, word2vec, GloVe, fastText, ELMo, BERT, or a concatenation of these
- Model: CNN, BiLSTM, Self-attention, C-LSTM, RCNN, Capsule, HAN, SVM, XGBoost
- Multi-task learning: for cases with more than one set of labels (multi_labels)
pip install -r requirements.txt
- Python 3.6
- TensorFlow 1.12.0
python run_classifier.py
- In config.py, set new_data=True to generate ./data/*.tf_record; parameters are then taken from config.py
- In config.py, set new_data=False to reuse the data in ./data/*.tf_record; parameters are then taken from config.json
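
The new_data switch described above can be sketched roughly as follows. This is only an illustration, not the repository's code: `load_params`, `defaults` and `json_path` are hypothetical names, and the assumption that a fresh run writes its parameters to config.json for later reuse is mine, not something the README states.

```python
import json
import os
import tempfile


def load_params(new_data, defaults, json_path):
    """Return run parameters depending on the new_data switch.

    Hypothetical helper: `defaults` stands in for values defined in
    config.py, `json_path` for a config.json kept next to the data.
    """
    if new_data:
        # Fresh run: use the in-code defaults and (assumption) persist
        # them so that a later run can reuse exactly the same settings.
        params = dict(defaults)
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(params, f, indent=2)
    else:
        # Reuse run: ignore the in-code defaults, read the saved file.
        with open(json_path, encoding="utf-8") as f:
            params = json.load(f)
    return params


# Usage with a temporary directory standing in for the project folder
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "config.json")
first = load_params(True, {"batch_size": 32}, path)
second = load_params(False, {"batch_size": 64}, path)
print(first, second)
```

In this sketch the second call returns the saved parameters, so edits made to the defaults after the data was generated do not silently change a run that reuses the existing records.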
- word2vec Chinese pretrained download
- fasttext Chinese pretrained download
- BERT Chinese pretrained download from Google
- Tip: make sure the text is preprocessed (for example, segmented) in the same way as the corpus used for pretraining
- Reference for creating a word2vec pretrained model
- The classifier is used to identify the damaged part and the damage type from vehicle comments
- Check in tensorboard:
tensorboard --logdir=./outputs
- Because we have many label categories (ca. 500 classes for 100,000 examples) and they are not equally important, we don't use macro-averaged metrics. Micro-averaged precision, recall and F1 are all identical to accuracy in single-label multi-class classification, so we report accuracy and weighted F1.
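
The reasoning about metrics above can be checked on a toy example. The sketch below is a plain-Python illustration, not the repository's evaluation code; `f1_report` and the sample labels are made up for the demonstration. It assumes single-label multi-class predictions.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Accuracy plus micro, macro and weighted F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)        # true examples per class
    predicted = Counter(y_pred)      # predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    per_class = {}
    for c in labels:
        precision = tp[c] / predicted[c] if predicted[c] else 0.0
        recall = tp[c] / support[c] if support[c] else 0.0
        denom = precision + recall
        per_class[c] = 2 * precision * recall / denom if denom else 0.0

    n = len(y_true)
    total_tp = sum(tp.values())
    # With one label per sample, every error is one false positive (for
    # the predicted class) and one false negative (for the true class).
    total_fp = total_fn = n - total_tp
    micro_p = total_tp / (total_tp + total_fp)
    micro_r = total_tp / (total_tp + total_fn)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if total_tp else 0.0

    return {
        "accuracy": total_tp / n,
        "micro_f1": micro_f1,
        "macro_f1": sum(per_class.values()) / len(labels),
        "weighted_f1": sum(per_class[c] * support[c] for c in labels) / n,
    }


# Toy data: class "c" is rare and never predicted
y_true = ["a", "a", "a", "b", "c"]
y_pred = ["a", "a", "b", "b", "a"]
report = f1_report(y_true, y_pred)
print(report)
```

Here micro F1 coincides with accuracy, while macro F1 is pulled down by the rare class that is never predicted. Weighted F1 scales each class by its support, which is why it is less sensitive to the many small classes.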
- Sometimes more than one label is valid for a single sample
- Some labels have a hierarchical relationship
- Imbalance issue: weighted loss, data augmentation, anomaly detection, upsampling and downsampling
- Multi-task learning
- Multi-label classification
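
Two of the imbalance ideas listed above, weighted loss and upsampling, can be sketched in a few lines. This is a minimal illustration under my own assumptions, not what the repository does: the inverse-frequency weighting formula and the helper names `inverse_frequency_weights` and `upsample` are chosen for the example.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / class frequency.

    Scaled so that a perfectly balanced dataset gets weight 1.0 per class.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def upsample(examples, labels, seed=0):
    """Randomly duplicate minority-class examples up to the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())

    out = []
    for y, xs in by_class.items():
        # Sample with replacement to fill the gap to the target size
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return out


# Toy data: four examples of class "a", one of class "b"
texts = ["t1", "t2", "t3", "t4", "t5"]
tags = ["a", "a", "a", "a", "b"]
weights = inverse_frequency_weights(tags)
balanced = upsample(texts, tags)
print(weights, Counter(y for _, y in balanced))
```

The weights could be multiplied into a per-example loss, whereas upsampling changes the training set itself. With several hundred classes, exact duplication of very rare classes can encourage overfitting, so a cap on the target size may be worth considering.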