This project classifies Quora questions into two categories: sincere and insincere. It uses the BERT (Bidirectional Encoder Representations from Transformers) model for text classification. The project covers data loading, text pre-processing, exploratory data analysis (EDA), BERT model implementation, training, evaluation, and prediction. The trained model achieves high accuracy in distinguishing between sincere and insincere questions.
- Dataset
- Data Loading and Pre-processing
- Exploratory Data Analysis (EDA)
- BERT Model Implementation
- Training and Evaluation
- Predictions and Classification Report
- Demo
- The Quora Insincere Questions Classification dataset used in this project can be found here.
- The project uses the "train.csv" file, which contains question texts with their corresponding target labels (0 for sincere, 1 for insincere).
The dataset is loaded using Pandas and then pre-processed to prepare it for analysis. The pre-processing steps include:
- Punctuation Removal: All punctuation marks are removed from the question text.
- Lowercasing the Text: The cleaned text is converted to lowercase.
- Tokenization: The text is tokenized into individual words.
- Stop Words Removal: Common English stopwords are removed from the tokens.
- Lemmatization: Words are lemmatized to convert them to their base forms.
- The script will clean, tokenize, and lemmatize the text data, and save the preprocessed dataset as "quora_preprocessed.csv".
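The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's script: the stop-word set is a small hypothetical subset, tokenization is a plain whitespace split, and `identity_lemmatizer` is a placeholder that returns each token unchanged, so a real lemmatizer would need to be passed in for step 5 to have any effect.

```python
import string

# Illustrative subset only (assumption); a full English stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "do", "does",
              "of", "to", "in", "on", "for", "and", "or", "why", "what", "how"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def identity_lemmatizer(token):
    """Placeholder (assumption): returns the token unchanged."""
    return token


def preprocess(text, lemmatize=identity_lemmatizer, stop_words=STOP_WORDS):
    """Apply the five steps in order and return the cleaned tokens."""
    no_punct = text.translate(_PUNCT_TABLE)             # 1. punctuation removal
    lowered = no_punct.lower()                          # 2. lowercasing
    tokens = lowered.split()                            # 3. whitespace tokenization
    kept = [t for t in tokens if t not in stop_words]   # 4. stop-word removal
    return [lemmatize(t) for t in kept]                 # 5. lemmatization


if __name__ == "__main__":
    print(preprocess("Why are the Questions on Quora moderated?"))
```

Passing the lemmatizer as an argument keeps the pipeline independent of any particular NLP library; any callable that maps a token to its base form can be used.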
EDA is performed to gain insights into the dataset. Various visualizations are generated, including:
- Bar Graph - Target Distribution: Shows the distribution of sincere and insincere questions.
- Bar Graph - Word Frequency: Displays the most frequent words in both categories.
- Word Cloud - Most Frequent Words: Visually represents the most common words.
- Histogram - Question Length: Analyzes the distribution of question lengths.
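The numbers behind these plots can be computed with `collections.Counter` before any plotting library is involved. The sketch below assumes the pre-processed data is available as `(tokens, target)` pairs; the `rows` sample and the function names are hypothetical and only serve to show the calculations.

```python
from collections import Counter

# Hypothetical pre-processed rows: (tokens, target), 0 = sincere, 1 = insincere.
rows = [
    (["learn", "python", "fast"], 0),
    (["best", "book", "learn", "statistic"], 0),
    (["people", "hate", "people"], 1),
]


def target_distribution(rows):
    """Count how many questions fall into each target class."""
    return Counter(target for _, target in rows)


def top_words(rows, target, n=10):
    """Return the n most frequent tokens among questions with the given target."""
    counts = Counter(tok for tokens, t in rows if t == target for tok in tokens)
    return counts.most_common(n)


def question_lengths(rows):
    """Length of each question in tokens: the raw data for a length histogram."""
    return [len(tokens) for tokens, _ in rows]


if __name__ == "__main__":
    print(target_distribution(rows))
    print(top_words(rows, target=1, n=3))
    print(question_lengths(rows))
```

The three results map directly onto the target-distribution bar graph, the word-frequency bar graph and word cloud, and the question-length histogram.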
The BERT model is implemented for text classification using the "transformers" library. BERT tokenization is performed, and a custom dataset class is created to organize the data.
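To show what each item of such a dataset class typically looks like, here is a dependency-free sketch. It is a stand-in, not the project's code: a real run would use the BERT tokenizer from "transformers" and return tensors, whereas this version uses a hypothetical toy vocabulary with arbitrary ids and a whitespace split in place of WordPiece. The special-token ids are the ones used by the `bert-base-uncased` vocabulary; `encode` and `QuestionDataset` are illustrative names.

```python
# Special-token ids as used by the bert-base-uncased vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical toy vocabulary with arbitrary ids (assumption, not WordPiece).
TOY_VOCAB = {"how": 1000, "do": 1001, "i": 1002, "learn": 1003}


def encode(text, max_len=8, vocab=TOY_VOCAB):
    """Build input_ids and attention_mask in the [CLS] ... [SEP] + padding layout."""
    words = text.lower().split()[: max_len - 2]   # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + [vocab.get(w, UNK_ID) for w in words] + [SEP_ID]
    padding = max_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,  # 1 = real token, 0 = padding
    }


class QuestionDataset:
    """Map-style dataset: indexable, with a length, one encoded question per item."""

    def __init__(self, questions, targets, max_len=8):
        if len(questions) != len(targets):
            raise ValueError("questions and targets must have the same length")
        self.questions = questions
        self.targets = targets
        self.max_len = max_len

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        item = encode(self.questions[idx], self.max_len)
        item["label"] = self.targets[idx]
        return item


if __name__ == "__main__":
    dataset = QuestionDataset(["How do I learn Python"], [0])
    print(len(dataset), dataset[0])
```

Every item has the same length (`max_len`), which is what allows items to be stacked into fixed-size batches; the attention mask tells the model which positions are padding.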
The BERT model is trained and evaluated. The training function iterates through the training data, while the evaluation function assesses the model's performance on the validation set. The model with the highest validation accuracy is saved as the best model.
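The "save the best model" logic can be sketched independently of any deep learning framework. In the example below, `train_one_epoch`, `evaluate`, and `save_checkpoint` are hypothetical hooks supplied by the caller; in the actual project they would wrap the BERT training pass, the validation pass, and the model-saving call.

```python
def fit(train_one_epoch, evaluate, save_checkpoint, epochs):
    """Run training epochs and checkpoint whenever validation accuracy improves.

    All three callables are hypothetical hooks (assumption):
    train_one_epoch() -> training loss, evaluate() -> validation accuracy,
    save_checkpoint(epoch) -> persists the current model.
    """
    best_acc = float("-inf")
    history = []
    for epoch in range(1, epochs + 1):
        train_loss = train_one_epoch()   # one full pass over the training data
        val_acc = evaluate()             # accuracy on the validation set
        history.append((epoch, train_loss, val_acc))
        if val_acc > best_acc:           # strict improvement only
            best_acc = val_acc
            save_checkpoint(epoch)
    return best_acc, history


if __name__ == "__main__":
    # Fake hooks that replay fixed validation accuracies, for illustration only.
    fake_accuracies = iter([0.90, 0.94, 0.93])
    saved_epochs = []
    best, _ = fit(lambda: 0.5, lambda: next(fake_accuracies), saved_epochs.append, epochs=3)
    print(best, saved_epochs)
```

Because a checkpoint is written only on strict improvement, the saved model always corresponds to the epoch with the best validation accuracy seen so far, even if later epochs get worse.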
The trained BERT model is used to make predictions on the test set. Classification metrics are computed and displayed, including precision, recall, F1-score, etc.
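For reference, the headline metrics can be computed by hand from the true and predicted labels. The function below is a minimal sketch with a hypothetical name; it treats the insincere class (label 1) as the positive class and is not a replacement for a library classification report.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    print(classification_metrics(y_true, y_pred))
```

Reporting precision, recall, and F1 alongside accuracy matters when the classes are imbalanced, since a model that always predicts the majority class can still score a high accuracy.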
Check out the video demonstration of the project here