Representation learning is a subfield of machine learning that focuses on learning how to represent input data. The goal is to transform raw data into a format that makes it easier for machine learning algorithms to make predictions.
In NLP, textual representations can capture syntax, semantics, and context, which are all essential for tasks like sentiment analysis, machine translation, text summarization, etc.
Representation learning can be categorized based on:
- Learning Type: Supervised, Unsupervised, Semi-Supervised, and Self-Supervised learning.
- Data Format: Tabular, Graph, Text, Audio, Image, Video, etc.
This repository focuses on text data.
This repository implements various methods for learning text representations, each suited for different tasks and goals in NLP. Below is a summary of the text representation methods implemented:
TF-IDF converts text into vectors by measuring the importance of words in a document relative to the entire corpus, enhancing document classification and information retrieval.
Citation: Salton, G., & Buckley, C. (1988). Link
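As an illustration (not this repository's implementation), a minimal TF-IDF vectorization with scikit-learn might look like this:

```python
# Minimal TF-IDF sketch using scikit-learn (illustrative, not the repo's code)
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
]

vectorizer = TfidfVectorizer()             # word-level TF-IDF features
X = vectorizer.fit_transform(corpus)       # sparse matrix of shape (n_docs, n_terms)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
```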
Word2Vec learns dense vector embeddings by predicting a word from its surrounding context (CBOW) or the surrounding context from a word (skip-gram), capturing semantic relationships in text.
Citation: Mikolov, T. et al. (2013). Link
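For intuition, a small skip-gram model can be trained with gensim (library choice and parameters here are illustrative):

```python
# Word2Vec sketch with gensim; sg=1 selects the skip-gram objective
from gensim.models import Word2Vec

sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "fantastic"],
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vector = model.wv["movie"]                # 100-dimensional dense embedding
similar = model.wv.most_similar("movie")  # nearest neighbours in embedding space
```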
LSTMs are a type of RNN whose gating mechanisms maintain long-term dependencies in sequential data, making them effective for tasks like sentiment analysis.
Citation: Hochreiter, S., & Schmidhuber, J. (1997). Link
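A sketch of an LSTM encoder in PyTorch that maps token ids to a fixed-size text representation (class name and dimensions are assumptions, not the repository's API):

```python
# Illustrative LSTM encoder in PyTorch: last hidden state as the text representation
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (num_layers, batch, hidden_dim)
        return hidden[-1]                     # (batch, hidden_dim)

encoder = LSTMEncoder(vocab_size=10_000)
batch = torch.randint(0, 10_000, (4, 20))    # 4 sequences of 20 token ids
representation = encoder(batch)              # shape (4, 256)
```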
BERT reads text bidirectionally, capturing context from both sides, and is well-suited for various NLP tasks through fine-tuning.
Citation: Devlin, J. et al. (2019). Link
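Extracting a sentence representation from pretrained BERT via Hugging Face transformers might look like this (checkpoint choice is an assumption):

```python
# Sketch: [CLS] embedding from pretrained BERT as a sentence representation
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0, :]  # (1, 768) sentence vector
```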
GPT is trained to generate text by predicting one token at a time, excelling in generative tasks like text completion and summarization.
Citation: Radford, A. et al. (2019). Link
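Token-by-token generation with a pretrained GPT-2 checkpoint (model and decoding settings are illustrative):

```python
# Sketch: autoregressive text generation with GPT-2
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The movie was", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```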
BART combines a bidirectional encoder with an autoregressive decoder, making it effective for both understanding and generating text.
Citation: Lewis, M. et al. (2020). Link
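A summarization sketch with a pretrained BART checkpoint (model choice is an assumption):

```python
# Sketch: encoder-decoder summarization with BART
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

review = "A long movie review to be summarized ..."
inputs = tokenizer(review, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```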
T5 treats every NLP problem as a text-to-text task, enabling versatile applications across various NLP challenges.
Citation: Raffel, C. et al. (2020). Link
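Framing sentiment classification as text-to-text with T5 (the `sst2 sentence:` prefix comes from T5's multi-task pretraining; the checkpoint choice is illustrative):

```python
# Sketch: sentiment classification as a text-to-text task with T5
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("sst2 sentence: The movie was great!", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "positive"
```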
The data module handles:
- Loading and splitting the dataset
- Preparing the dataset for batch processing
- Preprocessing the text (removing stop words, lemmatization, etc.), sketched below
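The snippet below sketches the kind of preprocessing described above using NLTK; the function name and library choice are illustrative, not the module's actual API:

```python
# Illustrative preprocessing: lowercase, tokenize, drop stop words, lemmatize
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text: str) -> list[str]:
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("The movies were surprisingly great!"))  # ['movie', 'surprisingly', 'great']
```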
The model module is responsible for:
- Managing models (training, evaluating, saving, loading, etc.)
- Model selection and algorithm optimization
- Definition of the `TextRepresentation` abstract class, which all text representation methods implement
- Implementation of a linear classifier used across all approaches
- Definition of the `TextClassifier`, which combines a text representation with the linear classifier for sentiment classification (see the sketch after this list)
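A sketch of how these pieces could fit together (the class names `TextRepresentation` and `TextClassifier` come from this repository; the signatures and dimensions are assumptions):

```python
# Illustrative skeleton: abstract representation + shared linear classifier head
from abc import ABC, abstractmethod

import torch
import torch.nn as nn

class TextRepresentation(nn.Module, ABC):
    """Abstract base class that every representation method implements."""

    @abstractmethod
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Map a batch of inputs to a (batch, dim) feature tensor."""

class TextClassifier(nn.Module):
    """Combines a text representation with the shared linear classifier."""

    def __init__(self, representation: TextRepresentation, dim: int, num_classes: int = 2):
        super().__init__()
        self.representation = representation
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        features = self.representation(token_ids)  # (batch, dim)
        return self.classifier(features)           # (batch, num_classes) logits
```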
In the `represents` folder, there are implementations of various text representation methods:
- TF-IDF, Word2Vec, LSTM, BERT, GPT, BART, and T5
These modules demonstrate different approaches to text representation and showcase their effectiveness on the sentiment classification task. For the state-of-the-art methods, there is also an option to use only the pretrained representations or to fine-tune them.
The raw dataset can be found in the `data/raw` directory. The processed dataset can be found in the `data/processed` directory.
The dataset is the Large Movie Review Dataset (Maas et al., 2011), sourced from Kaggle; the task is to classify movie reviews as positive or negative.
- Expansion of Methods: Add further text representation methods
- Diverse Problem Types: Tackle different types of problems beyond sentiment classification
- Automated Comparison Scripts: Automate the comparison of representation methods to streamline experimentation
Ensure you have Python 3.11 installed. This project is compatible with CUDA, so if you have an NVIDIA GPU, it will be automatically detected for accelerated computation.
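A quick way to verify which device will be used (assuming the project is PyTorch-based, as the models above suggest):

```python
# Check whether an NVIDIA GPU is visible to PyTorch
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Computation will run on: {device}")
```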
- Git: Version control system
- Poetry: Dependency management tool for Python
- Python 3.11: Programming language version
- Clone the repository:

  ```bash
  git clone https://github.com/milan-pavlovic-ai/representation-learning.git
  ```

- Install dependencies in the virtual environment:

  ```bash
  poetry install
  ```

- Activate the virtual environment:

  ```bash
  poetry shell
  ```
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Make your changes and commit them (`git commit -m 'Add feature'`).
- Push to the branch (`git push origin feature-branch`).
- Open a pull request.
Feel free to open issues for bug reports, feature requests, or discussions.