This repository contains my solution for the Kaggle competition Automated Essay Scoring 2.0. The goal of this project is to develop an automated system capable of scoring essays based on their content and quality using machine learning techniques.
- Data Loading & Preprocessing
- Feature Engineering
- Feature Selection
- Model Building & Training
- Inference
In this section, I loaded the dataset provided by Kaggle and performed initial preprocessing steps to prepare the data for feature engineering and model training.
- Loading Data: Imported the dataset using Pandas.
- Handling Missing Values: Checked for and handled any missing values to ensure data integrity.
- Data Cleaning: Removed any irrelevant information and standardized the format of the essays.
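The loading and cleaning steps above can be sketched as follows. The `full_text` column name reflects the competition's training file; the whitespace normalization shown is one illustrative cleaning choice, not the full pipeline.

```python
import pandas as pd

def load_and_clean(path="train.csv"):
    """Load the Kaggle training file and apply light cleaning."""
    df = pd.read_csv(path)
    # Drop rows whose essay text is missing to ensure data integrity.
    df = df.dropna(subset=["full_text"])
    # Normalize whitespace so downstream length features are stable.
    df["full_text"] = (
        df["full_text"].str.replace(r"\s+", " ", regex=True).str.strip()
    )
    return df
```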
Feature engineering is crucial for improving the performance of the model. Various features were extracted from the essays to capture different aspects of the text.
- Paragraph Features: number of paragraphs, average paragraph length, and coherence between paragraphs.
- Sentence Features: number of sentences, average sentence length, grammatical correctness, and sentence complexity.
- Word Features: vocabulary richness, unique-word usage, stop-word frequency, and sentiment.
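A minimal sketch of the structural features described above. The paragraph delimiter (`\n\n`) and sentence splitter are simplifying assumptions; the real pipeline may use a proper tokenizer.

```python
import re

def text_features(essay: str) -> dict:
    """Simple structural features at paragraph, sentence, and word level."""
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z']+", essay.lower())
    return {
        "n_paragraphs": len(paragraphs),
        "avg_paragraph_len": sum(len(p.split()) for p in paragraphs) / max(len(paragraphs), 1),
        "n_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "n_words": len(words),
        "unique_word_ratio": len(set(words)) / max(len(words), 1),
    }
```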
Converted the text data into numerical format using different vectorization techniques to capture the importance and frequency of words.
- TF-IDF Vectorizer: used TF-IDF (Term Frequency-Inverse Document Frequency) to weight words by their frequency within an essay and their rarity across the corpus, highlighting the words that carry the most meaning.
- Count Vectorizer: converted the text into a matrix of token counts, a simple yet effective way to quantify word frequency in the essays.
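Both vectorizers can be fit on the training essays and applied to the test essays like this; the `max_features` and `ngram_range` values here are illustrative, not the tuned settings.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def vectorize(train_texts, test_texts):
    """Fit TF-IDF and count vectorizers on training essays, stack the outputs."""
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    counts = CountVectorizer(max_features=5000)
    # Fit only on training data, then reuse the fitted vocabularies for test data.
    X_train = hstack([tfidf.fit_transform(train_texts), counts.fit_transform(train_texts)])
    X_test = hstack([tfidf.transform(test_texts), counts.transform(test_texts)])
    return X_train, X_test
```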
Utilized the DeBERTa (Decoding-enhanced BERT with disentangled attention) model to generate predictions for the essays. These predictions were then used as additional features in the LightGBM (LGBM) model to improve its performance.
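Once the DeBERTa predictions exist, merging them into the LGBM feature matrix is straightforward. The sketch below assumes the transformer scores have already been computed (the DeBERTa inference itself is omitted) and simply appends them as extra columns.

```python
import numpy as np

def add_model_predictions(features: np.ndarray, deberta_preds: np.ndarray) -> np.ndarray:
    """Append transformer predictions (e.g. DeBERTa essay scores) as
    additional feature columns alongside the engineered features."""
    deberta_preds = np.asarray(deberta_preds, dtype=float)
    if deberta_preds.ndim == 1:
        # Promote a flat vector of per-essay scores to a single column.
        deberta_preds = deberta_preds[:, None]
    return np.hstack([features, deberta_preds])
```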
Performed feature selection to identify the most significant features that contribute to the essay scores. This step helps in reducing the dimensionality of the data and improving the efficiency of the model.
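One way to realize this step is univariate selection, shown below with scikit-learn's `SelectKBest`; this is a sketch of the idea, and the actual selection method and `k` may differ.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

def select_features(X, y, k=100):
    """Keep the k features most linearly associated with the target score."""
    k = min(k, X.shape[1])
    selector = SelectKBest(score_func=f_regression, k=k)
    X_selected = selector.fit_transform(X, y)
    # Return the fitted selector so the same columns can be kept at inference.
    return X_selected, selector
```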
In this section, I built and trained the machine learning models using the selected features.
- Model Selection: Chose LightGBM (LGBM) as the primary model for its fast training and strong performance on tabular features.
- Training: Trained the LGBM model on the training dataset with the selected features.
- Validation: Validated the model using cross-validation techniques to ensure its robustness and generalizability.
During the inference phase, the trained model was used to score new, unseen essays. The process involved transforming the new essays using the same feature engineering pipeline and then predicting the scores using the trained model.
- Data Transformation: Transformed the new essays using the same preprocessing and feature engineering steps as the training data.
- Prediction: Used the trained LGBM model to predict the scores for the new essays.
- Output: Generated the final scores and saved the results in the required format for submission.
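A minimal version of the prediction and output steps, assuming the submission uses `essay_id`/`score` columns and integer scores in the 1-6 range (assumptions about the competition format):

```python
import numpy as np
import pandas as pd

def make_submission(model, X_test, essay_ids, path="submission.csv"):
    """Predict scores, snap them to the valid range, and write the submission."""
    preds = model.predict(X_test)
    # Round the regression output and clip into the assumed 1-6 score range.
    scores = np.clip(np.round(preds), 1, 6).astype(int)
    submission = pd.DataFrame({"essay_id": essay_ids, "score": scores})
    submission.to_csv(path, index=False)
    return submission
```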
This project showcases a comprehensive approach to automated essay scoring using advanced machine learning techniques. By leveraging powerful models like DeBERTa and LGBM, and performing thorough feature engineering and selection, the solution aims to achieve high accuracy and robustness in scoring essays.
Special thanks to the Learning Agency Lab for providing the dataset and hosting the competition, and to the open-source community for the tools and libraries that made this project possible.
Competition link - https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2