This repository contains a comprehensive Multi-Language NER analysis project utilizing XLM-RoBERTa for Named Entity Recognition tasks across multiple languages. The project demonstrates cross-lingual transfer capabilities and includes both fine-tuned and zero-shot models. It features a Gradio-based application for real-time NER predictions, making it interactive and user-friendly.
- Project Overview
- Dataset
- Understanding the Dataset
- Data Preprocessing
- Tokenizer Comparison: BERT vs XLM-R
- Tokenizer for NER Analysis
- Model Metrics
- Model Training
- Cross-Lingual Transfer
- Zero-Shot vs Fine-Tuned Model
- Gradio Application
- Installation
- Usage
- Uploading to Hugging Face
- Contributing
- License
This project investigates the application of a multi-language transformer model, specifically XLM-RoBERTa, for Named Entity Recognition (NER) across languages. By fine-tuning on a German dataset and transferring the model to other languages such as French, Italian, and English, the project explores cross-lingual transfer in NER tasks.
The primary dataset for this project is the XTREME dataset, commonly used for benchmarking multi-lingual models. We specifically use the German, French, Italian, and English subsets for NER analysis. The dataset includes tokens, labels, and language identifiers.
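For reference, these subsets (the PAN-X/WikiANN portion of XTREME) can be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming the standard `PAN-X.<lang>` configuration names on the Hub:

```python
from datasets import load_dataset

# PAN-X (WikiANN) subsets of XTREME, one configuration per language.
langs = ["de", "fr", "it", "en"]
panx = {lang: load_dataset("xtreme", name=f"PAN-X.{lang}") for lang in langs}

# Each example carries "tokens", "ner_tags", and "langs" fields.
print(panx["de"]["train"][0])
```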
Each dataset entry consists of:
- Tokens: The individual (pre-split) words of each sentence, used for token-level classification.
- Labels: Corresponding entity labels (e.g., Person, Organization, Location).
- Language: The language in which the data is written.
This section of the project inspects the distribution of labels and languages, ensuring a balanced dataset representative of multiple languages.
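For example, the label distribution of the German training split can be inspected with a simple counter (a sketch assuming the `panx` dictionary from the loading example above):

```python
from collections import Counter

# Map integer tags back to their string names (O, B-PER, I-PER, ...).
tags = panx["de"]["train"].features["ner_tags"].feature
label_counts = Counter(
    tags.int2str(tag)
    for example in panx["de"]["train"]
    for tag in example["ner_tags"]
)
print(label_counts)
```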
Preprocessing includes the following steps (the tokenization and label-alignment step is sketched after this list):
- Converting entity labels into numerical representations.
- Tokenizing text for model compatibility.
- Splitting data into training, validation, and test sets based on language proportions.
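The main subtlety is aligning word-level labels with XLM-R's subword tokens. A minimal sketch, assuming only the first subword of each word keeps its label (the helper name is illustrative, not the project's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(example):
    # Tokenize pre-split words so word_ids() can map subwords back to words.
    tokenized = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    labels, previous_word_id = [], None
    for word_id in tokenized.word_ids():
        if word_id is None:                    # special tokens (<s>, </s>)
            labels.append(-100)
        elif word_id != previous_word_id:      # first subword keeps the word's label
            labels.append(example["ner_tags"][word_id])
        else:                                  # later subwords are ignored in the loss
            labels.append(-100)
        previous_word_id = word_id
    tokenized["labels"] = labels
    return tokenized
```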
Comparing the BERT and XLM-RoBERTa tokenizers reveals differences in how text is split (a short comparison on a sample sentence is sketched below):
- BERT: Uses WordPiece tokenization with the special tokens `[CLS]` and `[SEP]`.
- XLM-R: Uses a single SentencePiece vocabulary shared across languages, marking word starts with an underscore-like character (`▁`), with `<s>` and `</s>` as its special tokens.
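A minimal sketch of such a comparison on an arbitrary sample sentence (the checkpoints are the standard base models):

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "Jeff Dean lives in California."
# BERT marks word-internal continuations with a "##" prefix.
print(bert_tokenizer.tokenize(text))
# XLM-R marks word starts with the SentencePiece "▁" character.
print(xlmr_tokenizer.tokenize(text))
```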
The XLM-RoBERTa tokenizer is configured specifically for token classification in NER, allowing robust handling of multiple languages and entity labels.
Standard NER metrics are used for evaluation (an example computation follows this list):
- Precision, Recall, F1 Score: To evaluate entity-level performance across languages.
- Per-language metrics: To assess cross-lingual transfer effectiveness.
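A minimal sketch of an entity-level metric computation with the `seqeval` library (the choice of library is an assumption; the tag sequences below are purely illustrative):

```python
from seqeval.metrics import classification_report, f1_score

# One inner list of IOB2 tags per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```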
The XLM-RoBERTa model is fine-tuned on German NER data. Key steps include:
- Setting up an NER-friendly architecture with token classification layers.
- Training with early stopping and monitoring validation loss for optimal performance (a minimal `Trainer` setup is sketched after this list).
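A minimal sketch of such a setup with the Hugging Face `Trainer`; the hyperparameters, output path, and reuse of the earlier loading/alignment sketches are assumptions, not the project's exact configuration:

```python
from transformers import (AutoModelForTokenClassification, DataCollatorForTokenClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

# Reuse the dataset, tokenizer, and alignment function from the earlier sketches.
tokenized_panx_de = panx["de"].map(
    tokenize_and_align_labels, remove_columns=panx["de"]["train"].column_names)

# PAN-X uses 7 tags: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC.
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=7)

args = TrainingArguments(
    output_dir="xlmr-ner-de",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
    num_train_epochs=5,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_panx_de["train"],
    eval_dataset=tokenized_panx_de["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```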
After fine-tuning on German data, the model is tested on French, Italian, and English to observe cross-lingual transfer capabilities. This section analyzes performance in zero-shot settings.
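A sketch of this per-language evaluation loop, assuming the `trainer` and preprocessing helper from the sketches above (with a `compute_metrics` function attached to the `Trainer`, the same loop reports per-language F1 rather than just loss):

```python
from datasets import load_dataset

for lang in ["fr", "it", "en"]:
    test_ds = load_dataset("xtreme", name=f"PAN-X.{lang}", split="test")
    test_ds = test_ds.map(tokenize_and_align_labels,
                          remove_columns=test_ds.column_names)
    metrics = trainer.evaluate(eval_dataset=test_ds)
    print(lang, metrics)
```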
We assess the difference between the zero-shot setting (applying the model to a target language without fine-tuning on it) and fine-tuned models. Fine-tuning improves accuracy significantly across languages, demonstrating the benefits of transfer learning in NER tasks.
A Gradio-based web application allows users to interact with the NER model in real time. Users can input text in different languages, and the application will highlight recognized entities.
- Enter text in one of the supported languages.
- The application will display the entities identified, along with their labels.
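A minimal sketch of such an app, assuming a fine-tuned checkpoint on the Hub (the model name below is a placeholder) and Gradio's `HighlightedText` component, which renders the spans produced by the token-classification pipeline:

```python
import gradio as gr
from transformers import pipeline

# Placeholder model id; point this at the fine-tuned checkpoint.
ner_pipeline = pipeline("token-classification",
                        model="your_hf_username/your_model_name")

def predict(text):
    # HighlightedText renders the pipeline's character-level entity spans.
    return {"text": text, "entities": ner_pipeline(text)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=3, placeholder="Enter text in German, French, Italian, or English..."),
    outputs=gr.HighlightedText(),
)
demo.launch()
```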
To run this project locally:

Additional dependencies for Gradio:

```bash
pip install gradio
```

Install the Hugging Face Transformers library:

```bash
pip install transformers
```
- Run the Gradio application:

  ```bash
  python gradio_app.py
  ```

- Model Fine-tuning and Evaluation: Run the provided notebook for full training and evaluation steps.
Once fine-tuned, the model can be uploaded to Hugging Face for easy sharing:
- Authenticate:

  ```bash
  huggingface-cli login
  ```

- Upload the model:

  ```python
  from transformers import AutoModelForTokenClassification

  model = AutoModelForTokenClassification.from_pretrained("path_to_your_fine_tuned_model")
  model.push_to_hub("your_hf_username/your_model_name")
  ```
A live demo is available at: https://huggingface.co/spaces/RidaDogrul/NER-Demo
Contributions are welcome! Please fork this repository, create a new branch for your feature, and submit a pull request.
This project is licensed under the MIT License.