LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 19 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi.
This repo includes scripts needed to run our full pipeline, including data preprocessing and sampling, instruction dataset creation, model fine-tuning, inference and evaluation.
- Multilingual Support: Arabic, English, and Hindi.
- Comprehensive NLP Tasks: 19 tasks utilizing 52 datasets.
- Domain Optimization: Tailored for news and social media content analysis.
The model was trained on the LlamaLens dataset.
Access the LlamaLens model on Hugging Face.
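The model can be loaded with the standard transformers API. The snippet below is only a minimal sketch: the model ID used here is a placeholder, so substitute the actual ID from the Hugging Face model page.

```python
# Minimal sketch of loading the model with transformers.
# NOTE: "organization/LlamaLens" is a placeholder model ID; use the ID from the Hugging Face page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "organization/LlamaLens"  # placeholder, not the actual Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = (
    "Analyze the given text and label it as 'claim' or 'not_claim'. "
    "Return only the label.\ninput: Example headline to analyze\nlabel: "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```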
- Ensure you have Python and pip installed on your system.
- Clone the repository (if applicable):
  git clone https://github.com/firojalam/LlamaLens.git
  cd LlamaLens
- Install the required packages:
  pip install -r requirements.txt
- You may need to update the transformers library:
  pip install --upgrade transformers
This repository includes a script to prepare the Llama 3.1 instruction dataset. You can customize the dataset preparation process using various parameters, including sample size, dataset split, shuffling strategy, and the output directory for the processed dataset. All datasets are available on Hugging Face.
- Description: Defines the number of samples to be used in the dataset (set via the --samples argument).
- Usage:
  - Set to -1 to use the full dataset.
  - Set to any positive integer to specify the maximum number of samples (using stratified sampling; see the sketch after the example command below).
- Description: Specifies the dataset split to generate (set via the --split argument).
- Choices:
  - "train": Training set
  - "test": Testing set
  - "dev": Development set
- Description: Path to the directory containing subdirectories for different languages (e.g., ar, en, hi), set via the --intermediate_datasets_base argument.
- Usage: The base directory should contain the following subdirectories:
  - ar: Arabic language dataset
  - en: English language dataset
  - hi: Hindi language dataset
- Example: /path/to/intermediate_datasets/ar, /path/to/intermediate_datasets/en, /path/to/intermediate_datasets/hi
- Description: Defines how the dataset is shuffled (set via the --shuffling argument).
- Choices:
  - "none": No shuffling.
  - "by_task": Shuffle the dataset within each task.
  - "by_language": Shuffle the dataset within each language.
  - "fully": Shuffle the entire dataset.
- Usage: Choose the option that best matches the configuration in the original paper; the sketch after the example command below illustrates these strategies.
- Description: Path to the directory where the prepared dataset will be saved (set via the --dataset_directory argument).
- Usage: Specify the directory where you want to save the final dataset after it is processed.
To run the dataset preparation script, use the following command. Adjust the parameters as needed:
python3 bin/data_preparation/llama3_dataset_preparation.py \
--samples -1 \
--split "train" \
--intermediate_datasets_base "data/instruction_datasets" \
--shuffling "none" \
--dataset_directory "finetuning_datasets/testing_dataset"
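For orientation only, the sketch below illustrates what the --samples cap (stratified by label) and the --shuffling strategies imply. It is not the actual implementation in llama3_dataset_preparation.py: the helper names and the pandas-based approach are assumptions, and the output, task, and lang column names follow the JSONL fields described later in this README.

```python
# Illustrative sketch (not the actual script): stratified sampling cap and shuffling strategies.
import pandas as pd

def stratified_cap(df: pd.DataFrame, max_samples: int, seed: int = 42) -> pd.DataFrame:
    """Keep at most max_samples rows, sampled proportionally per label (--samples > 0)."""
    if max_samples < 0 or len(df) <= max_samples:
        return df  # --samples -1 keeps the full dataset
    frac = max_samples / len(df)
    return df.groupby("output", group_keys=False).apply(
        lambda g: g.sample(frac=frac, random_state=seed)
    )

def apply_shuffling(df: pd.DataFrame, strategy: str, seed: int = 42) -> pd.DataFrame:
    """Apply one of the --shuffling strategies: none, by_task, by_language, fully."""
    if strategy == "none":
        return df
    if strategy == "by_task":
        return df.groupby("task", group_keys=False, sort=False).apply(
            lambda g: g.sample(frac=1, random_state=seed)
        )
    if strategy == "by_language":
        return df.groupby("lang", group_keys=False, sort=False).apply(
            lambda g: g.sample(frac=1, random_state=seed)
        )
    if strategy == "fully":
        return df.sample(frac=1, random_state=seed)
    raise ValueError(f"Unknown shuffling strategy: {strategy}")
```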
This is an example of how to run the training script in full-precision mode:
accelerate launch bin/model_training/parallel_fine_tuning_llama3_Full_precision.py \
--model_name "base_models/Meta-Llama-3.1-8B-Instruct" \
--max_seq_length 512 \
--quant_bits 4 \
--use_nested_quant False \
--batch_size 16 \
--grad_size 2 \
--epochs 1 \
--out_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/outputs" \
--save_steps 500 \
--train_set_dir "data/finetuning_datasets/shuffled_by_language_20k" \
--dev_set_dir "data/validation_data_500" \
--start_from_last_checkpoint False \
--lora_adapter_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/lora_adapter" \
--merged_model_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/merged_model"
This is an example of how to run the training script in quantized mode:
accelerate launch bin/model_training/parallel_fine_tuning_llama3_quantized.py \
--model_name "base_models/Meta-Llama-3.1-8B-Instruct" \
--max_seq_length 512 \
--quant_bits 4 \
--use_nested_quant False \
--batch_size 16 \
--grad_size 2 \
--epochs 1 \
--out_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/outputs" \
--save_steps 500 \
--train_set_dir "data/finetuning_datasets/shuffled_by_language_20k" \
--dev_set_dir "data/validation_data_500" \
--start_from_last_checkpoint False \
--lora_adapter_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/lora_adapter" \
--merged_model_dir "trained_models/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/merged_model"
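As background, fine-tuning scripts of this kind usually wrap a LoRA/QLoRA recipe around TRL's SFTTrainer. The sketch below is an assumption about that general recipe, written against the older trl API where dataset_text_field and max_seq_length are trainer arguments; it is not the repository's exact script, and the LoRA hyperparameters shown are illustrative.

```python
# Hedged sketch of a typical QLoRA fine-tuning loop (peft + trl); not the repo's exact script.
import torch
from datasets import load_from_disk
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_name = "base_models/Meta-Llama-3.1-8B-Instruct"
bnb_config = BitsAndBytesConfig(            # corresponds to --quant_bits 4 / --use_nested_quant False
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")  # illustrative values
train_ds = load_from_disk("data/finetuning_datasets/shuffled_by_language_20k")  # assumes a saved HF dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    peft_config=peft_config,
    dataset_text_field="text",   # the pre-formatted chat string from the instruction dataset
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=16,
        gradient_accumulation_steps=2,
        num_train_epochs=1,
        save_steps=500,
        output_dir="outputs",
    ),
)
trainer.train()
trainer.model.save_pretrained("lora_adapter")  # adapter can later be merged into the base model
```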
To run inference for a specific language, specify the intermediate folder that contains the datasets for that language.
python bin/evaluation/inference.py \
--instructions-path support_data/instructions/instructions_gpt-4o_claude-3-5-sonnet_ar.json \
--intermediate-base-path data/intermediate_datasets_ar \
--results-folder-path "results/Test_results" \
--model-path "base_models/Meta-Llama-3.1-8B-Instruct" \
--samples -1 \
--device 0
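Conceptually, inference walks every dataset file under the intermediate folder and asks the model for one label per input. The following is a hedged sketch of that loop, not the actual inference.py; generate_label stands in for whatever model.generate call is used.

```python
# Hedged sketch of per-dataset inference over intermediate JSONL files (not the actual inference.py).
import glob
import json

def run_inference(intermediate_base: str, generate_label) -> dict:
    """generate_label(instruction, text) -> predicted label string for one entry."""
    predictions = {}
    for path in glob.glob(f"{intermediate_base}/**/*.jsonl", recursive=True):
        preds = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                preds.append({"id": entry["id"], "prediction": generate_label(entry["instruction"], entry["input"])})
        predictions[path] = preds
    return predictions
```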
To score results, run the following script:
python bin/evaluation/evaluate.py \
--experiment_dir results/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/ar \
--output_dir scores/Meta-Llama-3.1-8B-Instruct-shuffled_by_language_20k_4bit/ar
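The scoring step compares predicted labels against the gold output field. A minimal sketch of such scoring is shown below; the actual evaluate.py may compute additional or task-specific metrics.

```python
# Minimal sketch: score predictions against gold labels (evaluate.py may use more metrics).
from sklearn.metrics import accuracy_score, f1_score

def score(gold: list[str], predicted: list[str]) -> dict:
    return {
        "accuracy": accuracy_score(gold, predicted),
        "macro_f1": f1_score(gold, predicted, average="macro"),
    }

print(score(["claim", "not_claim"], ["claim", "claim"]))  # {'accuracy': 0.5, 'macro_f1': 0.33...}
```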
Each JSONL file in the dataset follows a structured format with the following fields:
- id: Unique identifier for each data entry.
- original_id: Identifier from the original dataset, if available.
- input: The original text that needs to be analyzed.
- output: The label assigned to the text after analysis.
- dataset: Name of the dataset the entry belongs to.
- task: The specific task type.
- lang: The language of the input text.
- instruction: A brief set of instructions describing how the text should be labeled.
- text: A formatted structure including the instruction and response for the task in a conversation format between the system, user, and assistant, showing the decision process.
Example entry in JSONL file:
{
"id": "d1662e29-11cf-45cb-bf89-fa5cd993bc78",
"original_id": "nan",
"input": "الدفاع الجوي السوري يتصدى لهجوم صاروخي على قاعدة جوية في حمص",
"output": "not_claim",
"dataset": "ans-claim",
"task": "Claim detection",
"lang": "ar",
"instruction": "Analyze the given text and label it as 'claim' if it includes a factual statement that can be verified, or 'not_claim' if it's not a checkable assertion. Return only the label without any explanation, justification or additional text.",
"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a social media expert providing accurate analysis and insights.<|eot_id|><|start_header_id|>user<|end_header_id|>Analyze the given text and label it as 'claim' if it includes a factual statement that can be verified, or 'not_claim' if it's not a checkable assertion. Return only the label without any explanation, justification or additional text.\ninput: الدفاع الجوي السوري يتصدى لهجوم صاروخي على قاعدة جوية في حمص\nlabel: <|eot_id|><|start_header_id|>assistant<|end_header_id|>not_claim<|eot_id|><|end_of_text|>"
}
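Entries can be read line by line with the standard json module; the file path below is illustrative.

```python
# Read a LlamaLens JSONL file entry by entry (file path is illustrative).
import json

with open("data/instruction_datasets/ar/ans-claim_train.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["task"], entry["lang"], entry["output"])
```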
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Please cite our paper when using this model:
@article{kmainasi2024llamalensspecializedmultilingualllm,
title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam},
year={2024},
journal={arXiv preprint arXiv:2410.15308},
url={https://arxiv.org/abs/2410.15308},
eprint={2410.15308},
archivePrefix={arXiv},
primaryClass={cs.CL}
}