Fine-tuning BERT model for text classification with TensorFlow and TensorFlow Hub

nipun-goyal/BERT-Text-Classification


BERT Architecture

Classify Alignment Sheets

At CER, we receive company applications that contain thousands of pages of documents. We wanted to develop a machine learning algorithm that distinguishes pages that are alignment sheets (maps) from pages that are not.

Sample Maps:

Sample Non-Maps:

Approach

The problem was tackled by building and training several machine learning classification models on features extracted from each page of a PDF file with the Python PyMuPDF library. The extracted features include the area of images on a page, the number of images on a page, and the word count of a page. A few more features were generated by checking whether the page contains certain words: "North" or "N", "Figure", "Map", "Alignment Sheet" or "Sheet", "Legend", "scale", and "kilometers" or "km".
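The per-page features described above could be computed roughly as follows. This is a minimal sketch, not the notebook's actual code: the feature names, the `keyword_features` and `extract_page_features` helpers, and the choice of case-insensitive whole-word matching are all assumptions made for illustration. PyMuPDF is imported inside the second function so that the keyword logic can be tried without it.

```python
import re

# Keyword groups taken from the README; the feature names are hypothetical.
KEYWORD_GROUPS = {
    "has_north": ["North", "N"],
    "has_figure": ["Figure"],
    "has_map": ["Map"],
    "has_sheet": ["Alignment Sheet", "Sheet"],
    "has_legend": ["Legend"],
    "has_scale": ["scale"],
    "has_km": ["kilometers", "km"],
}


def keyword_features(text):
    """Return the word count and one 0/1 flag per keyword group for a page's text."""
    feats = {"word_count": len(text.split())}
    for name, words in KEYWORD_GROUPS.items():
        # Assumption: whole-word, case-insensitive matching.
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        feats[name] = int(re.search(pattern, text, flags=re.IGNORECASE) is not None)
    return feats


def extract_page_features(pdf_path):
    """Return one feature dict per page of the PDF at pdf_path."""
    import fitz  # PyMuPDF, imported lazily

    rows = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc):
            feats = keyword_features(page.get_text())
            # Each image info dict carries the image's bounding box on the page.
            boxes = [info["bbox"] for info in page.get_image_info()]
            feats["image_count"] = len(boxes)
            feats["image_area"] = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
            feats["page"] = page_number
            rows.append(feats)
    return rows
```

The list of dicts returned by `extract_page_features` can be written to a CSV with one row per page, which matches the shape of output the first notebook is described as producing.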

After feature extraction, several models were built and trained: XGBoost Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier, Random Forest Regressor, and XGBoost Regressor. After training, each model's accuracy and performance were evaluated on the validation dataset and on unseen data, i.e. the test dataset. After the evaluation phase, the best-performing model was saved in the models repo for future use.
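The train, validate, and test workflow can be sketched with scikit-learn as below. The data here is synthetic (generated by `make_classification`) and stands in for the real page-feature table; the split ratios, hyperparameters, and the use of accuracy as the only selection metric are assumptions, and the XGBoost models are left out to keep the example dependent on a single library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the page features (10 features, binary map / non-map label).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Assumed split: 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svc": SVC(),
}

# Fit every candidate on the training set and score it on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Pick the best model on validation accuracy, then check it once on the test set.
best_name = max(val_scores, key=val_scores.get)
test_accuracy = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(best_name, val_scores[best_name], test_accuracy)
```

Keeping the test set out of model selection means the final test accuracy is an estimate on data the chosen model has not influenced.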

Note: The output of the regressor models was converted into binary output using a sigmoid function; hence, these regression models are referred to as classification models here.
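One way to turn a regressor's raw output into a 0/1 label with a sigmoid is sketched below. A sigmoid on its own yields a value between 0 and 1, so a cutoff is still needed; the `threshold` parameter and its default of 0.5 are assumptions, since the README does not state which cutoff was used.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_binary(raw_outputs, threshold=0.5):
    """Map raw regressor outputs to 0/1 labels (threshold is a hypothetical choice)."""
    return [int(sigmoid(value) >= threshold) for value in raw_outputs]


print(to_binary([-2.0, 0.0, 3.0]))  # [0, 1, 1]
```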

Model training is not discussed in depth here. Instead, the structure of this repo and how to run the Jupyter notebook files are described below.

Description of the folder structure

  1. 0. Download PDFs and extract features of Alignment Sheets.ipynb: This file contains the functions to download the PDF documents and to extract the features from each page of a PDF file. The output of this Jupyter notebook file is a CSV containing all the extracted features

  2. 1. Save Alignment Sheets.ipynb: This file takes the feature CSV as input and classifies whether a PDF page is an alignment sheet or not, using the best-performing classifier that was saved in the models repo. The latter section of this Jupyter notebook file contains the functions to extract and assign the titles for alignment sheets

How to use the files in this repo?

  • Clone or download the GitHub files into a local directory
  • Create a virtual environment
  • Activate the virtual environment and install the required Python packages from the requirements.txt file
  • Open Jupyter Notebook and run the files in the following order, observing the results:
    • 0. Download PDFs and extract features of Alignment Sheets.ipynb
    • 1. Save Alignment Sheets.ipynb
