1. Problem Statement

The project’s goal is to develop a model that can accurately classify tweets as related to a real disaster or not. This task involves using natural language processing (NLP) to distinguish between genuine disaster-related content and other unrelated tweets, a key capability for improving information flow during emergencies.
2. Approach

The solution applies transfer learning with a pre-trained BERT model. BERT is widely used in NLP due to its bidirectional transformer architecture, which allows the model to capture the context of words from both directions in a sentence, making it ideal for nuanced language tasks. The benefits of fine-tuning BERT include:
- Reduced Training Time: BERT’s pre-trained weights contain general language knowledge, which helps the model learn from the new data more efficiently.
- Lower Data Requirements: Since the BERT model has been trained on vast corpora, even a limited dataset (like disaster-related tweets) can yield high performance.
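As a concrete illustration of this transfer-learning setup, the sketch below loads a pre-trained BERT checkpoint with a fresh two-way classification head. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is named in the write-up above.

```python
from transformers import BertForSequenceClassification

# Load pre-trained BERT weights; only the new 2-way classification head
# is initialized from scratch and must be learned from the tweet data.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # assumed checkpoint; any BERT variant would work
    num_labels=2,         # real disaster vs. not
)
```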
3. Data Preprocessing

Custom data cleaning steps were applied to ensure quality input for the BERT model. The preprocessing steps include:

- Removing URLs: Hyperlinks in tweets were stripped to focus on the main text.
- Removing HTML Tags: Any embedded HTML tags were removed.
- Removing Punctuation: Special characters were deleted to standardize the text.
- Removing Stopwords: Common words with little semantic value (like “the,” “is”) were filtered out.
- Removing Emojis: Emojis were removed since they are not directly meaningful for this task and could introduce noise.

These steps aimed to enhance the classifier’s ability to focus on the content relevant to disaster classification.
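The exact cleaning code is not reproduced here, but a minimal sketch of these steps might look like the following, assuming NLTK’s English stopword list and a regex-based emoji filter:

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))


def remove_emojis(text: str) -> str:
    # Strip characters in common emoji/pictograph Unicode ranges.
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)


def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                  # remove URLs
    text = re.sub(r"<.*?>", "", text)                                   # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))    # remove punctuation
    text = remove_emojis(text)                                          # remove emojis
    words = [w for w in text.lower().split() if w not in STOPWORDS]     # remove stopwords
    return " ".join(words)


# Example usage on a sample tweet:
print(clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/abc <b>update</b>"))
```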
4. Model Setup and Fine-Tuning BERT

The BERT model architecture consists of multiple transformer layers, and for this task:
- Tokenizer Setup: The BERT tokenizer converts each tweet into tokens (numerical IDs), which are required as input to the BERT model.
- Input Preparation: Each tokenized tweet is prefixed with the special [CLS] token, since BERT’s architecture uses the embedding of this token for classification tasks.
- Classification Layer: A simple dense (fully connected) layer is added on top of the [CLS] embedding from the final layer to enable binary classification (real disaster vs. not).
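A sketch of this setup is shown below, assuming PyTorch and the Hugging Face transformers library; the DisasterClassifier class name and hyperparameters are illustrative, not the project’s actual module.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a batch of cleaned tweets; [CLS] and [SEP] are added automatically.
encoded = tokenizer(
    ["forest fire near la ronge sask canada"],
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)


class DisasterClassifier(nn.Module):
    """BERT encoder with a dense layer on the [CLS] embedding for binary classification."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.3)                                  # assumed dropout rate
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)   # dense layer on [CLS]

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.classifier(self.dropout(cls_embedding))


model = DisasterClassifier()
logits = model(encoded["input_ids"], encoded["attention_mask"])  # shape: (batch, 2)
```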
5. Model Training and Optimization

Fine-tuning adjusts the weights of BERT’s layers to adapt the model to the disaster classification task. Key techniques used during training include:

- Adam Optimizer with Weight Decay: a modified Adam optimizer (commonly AdamW) tailored to BERT’s parameter space.
- Learning Rate Scheduler: a linear decay scheduler adjusts the learning rate dynamically, which stabilizes training and improves convergence.
- Early Stopping and Dropout: regularization techniques that help prevent overfitting.

A sketch combining these training techniques with the evaluation step appears after the next section.

6. Model Evaluation

After training, the model is evaluated using metrics such as accuracy, precision, recall, and F1-score to measure its performance in distinguishing real disaster tweets from unrelated content. The evaluation provides insights into model reliability, and these metrics guide any further tuning for optimal results.
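Below is a minimal sketch tying the training and evaluation steps together, assuming PyTorch’s AdamW, Hugging Face’s get_linear_schedule_with_warmup, and scikit-learn metrics. The names model, train_loader, and val_loader are placeholders for the classifier and DataLoaders built earlier, the learning rate and epoch count are illustrative, and early stopping is omitted for brevity.

```python
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import get_linear_schedule_with_warmup

EPOCHS = 3  # assumed value

# Adam with weight decay, plus a linear learning-rate decay schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=EPOCHS * len(train_loader),
)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # linear decay of the learning rate each step

# Evaluation: accuracy, precision, recall, and F1 on the validation set.
model.eval()
preds, golds = [], []
with torch.no_grad():
    for input_ids, attention_mask, labels in val_loader:
        logits = model(input_ids, attention_mask)
        preds.extend(logits.argmax(dim=1).tolist())
        golds.extend(labels.tolist())

precision, recall, f1, _ = precision_recall_fscore_support(golds, preds, average="binary")
print(f"accuracy={accuracy_score(golds, preds):.3f} "
      f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```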