Spacy-Serbian-Transformer

This project aims to train a tagger for the Serbian language using SpaCy with the help of Large Language Models (LLMs). The training data is sourced from the Jerteh corpus.

The project workflow involves several stages, including data preparation, model training, and evaluation. This repository contains scripts and instructions for each of these steps.

Data Preparation

The SrpKor4Tagging corpus from Jerteh is provided in Verticalized Text (VRT) format. VRT is an annotated text format that includes essential linguistic information about each word, such as its lemma, part of speech, and morphological attributes.
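
To make the format concrete, here is a minimal sketch of reading VRT-style input in Python. It assumes one token per line with tab-separated columns in the order word, POS tag, lemma, and sentences wrapped in `<s>` ... `</s>` tags. The actual column order and structural tags in SrpKor4Tagging may differ, and `parse_vrt` is a hypothetical helper, not part of this repository.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_vrt(lines: Iterable[str]) -> List[List[Token]]:
    """Group VRT token lines into sentences.

    Assumed layout (hypothetical): word<TAB>pos<TAB>lemma, one token per line,
    with <s> ... </s> marking sentence boundaries.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # A line starting with "<" and containing no tab is a structural tag.
        if line.startswith("<") and "\t" not in line:
            if line.startswith("</s") and current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip malformed lines rather than guess
        current.append(Token(parts[0], parts[1], parts[2]))
    if current:  # input ended without a closing </s>
        sentences.append(current)
    return sentences


# Made-up sample sentence for illustration only.
sample = [
    "<s>\n",
    "Ovo\tPRON\tovaj\n",
    "je\tAUX\tjesam\n",
    "rečenica\tNOUN\trečenica\n",
    ".\tPUNCT\t.\n",
    "</s>\n",
]
for sentence in parse_vrt(sample):
    print([(t.word, t.pos, t.lemma) for t in sentence])
```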

There are two other corpora in the Corpus folder. The first is UD_Serbian-SET (Universal Dependencies), available [here](https://github.com/UniversalDependencies/UD_Serbian-SET).
It is in CoNLL-U format and was converted to VRT format using the standard SpaCy converter.

The second is SrpELTeC-gold, also from Jerteh. It is a set of old Serbian novels in BRAT format. The SrpELTeC-gold folder contains a script for converting BRAT to SpaCy format.
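
BRAT stores annotations in a standoff `.ann` file next to the plain `.txt` file. As a rough sketch of the first step of such a conversion, the snippet below reads text-bound (`T`) annotation lines into `(start, end, label)` character spans. The label names in the sample are made up for illustration, `read_brat_entities` is a hypothetical helper rather than the repository's script, and discontinuous spans are skipped for simplicity.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def read_brat_entities(ann_text: str) -> List[Span]:
    """Extract contiguous entity spans from the contents of a BRAT .ann file."""
    spans: List[Span] = []
    for line in ann_text.splitlines():
        # Only text-bound annotations (IDs starting with "T") carry entity spans.
        if not line.startswith("T"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        info = fields[1]
        if ";" in info:
            continue  # discontinuous span, skipped in this sketch
        parts = info.split()
        if len(parts) != 3:
            continue  # unexpected layout, skip rather than guess
        label, start, end = parts
        spans.append((int(start), int(end), label))
    return sorted(spans)


# Made-up text and annotations for illustration only.
text = "Marko živi u Beogradu."
ann = "T1\tPERS 0 5\tMarko\nT2\tLOC 13 21\tBeogradu\n"

for start, end, label in read_brat_entities(ann):
    print(label, repr(text[start:end]))
```

Character spans like these can then be attached to a SpaCy `Doc` with `doc.char_span(start, end, label=label)`.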

The vrtConvert.py script is used for converting the VRT files into a format compatible with SpaCy training. This script processes the VRT file and generates a binary .spacy file. The .spacy file comprises the words, lemmas, POS tags, and UD tags from the VRT file.

Here's how to use the script:

python vrtConvert.py path_to_your_vrt_file.vrt output.spacy

Note that the root path for the VRT file is set in the script; it is the Corpus folder of this repository.

The script will generate a .spacy file in the same directory as the VRT file. This file can be used for training the SpaCy model.
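
For readers unfamiliar with the `.spacy` format, the following is a minimal sketch of how annotated tokens can be packed into such a file with SpaCy's `Doc` and `DocBin` classes. It is not the code of vrtConvert.py: the sample sentence, the tag values, and the output file name `sample.spacy` are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin

# A blank Serbian pipeline supplies the vocabulary; no trained components needed.
nlp = spacy.blank("sr")

# Hypothetical annotated sentence, as it might come out of a VRT file.
words = ["Ovo", "je", "rečenica", "."]
lemmas = ["ovaj", "jesam", "rečenica", "."]
tags = ["PRO", "V", "N", "PUNCT"]        # corpus-specific tags (made up here)
pos = ["PRON", "AUX", "NOUN", "PUNCT"]   # Universal Dependencies POS tags

doc = Doc(nlp.vocab, words=words, lemmas=lemmas, tags=tags, pos=pos)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.spacy"
    # DocBin serializes a collection of Doc objects into one binary file.
    DocBin(docs=[doc]).to_disk(path)
    # Reading the file back checks that the annotations survive the round trip.
    loaded = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print([(t.text, t.lemma_, t.tag_, t.pos_) for t in loaded[0]])
```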

There was a problem with the naming convention in the SrpELTeC-gold corpus, so the script ner_filename_corrector.py was written to rename the files.

The conversion to SpaCy format is done with the nerConvert.py script. Note that the script uses a SpaCy pipeline, since SrpELTeC-gold has only NER tags and no other annotations. If only NER is to be trained, using a blank pipeline is enough.

Finally, the script train-test-split.py is used to separate the .spacy files into train, eval, and test sets.

Here is how to use it:

python train-test-split.py name_of_your_spacy_file.spacy

The script assumes that the file is in the Corpus folder. It generates the train, eval, and test files in the same folder, adding -train, -dev, and -test just before the .spacy extension.
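
The naming scheme and the split itself can be sketched as follows. This is an illustration rather than the code of train-test-split.py: the 80/10/10 ratio, the fixed random seed, and the helper names `split_names` and `split_items` are assumptions.

```python
import random
from pathlib import Path
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_names(spacy_file: str) -> Dict[str, Path]:
    """Build output paths by inserting -train, -dev, -test before .spacy."""
    path = Path(spacy_file)
    return {
        part: path.with_name(f"{path.stem}-{part}{path.suffix}")
        for part in ("train", "dev", "test")
    }


def split_items(
    items: Sequence[T], dev: float = 0.1, test: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items reproducibly and cut them into train, dev, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev)
    n_test = int(len(shuffled) * test)
    n_train = len(shuffled) - n_dev - n_test
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


# "Corpus/example.spacy" is a placeholder file name.
names = split_names("Corpus/example.spacy")
train, dev_set, test_set = split_items(range(100))
print(names["dev"].name, len(train), len(dev_set), len(test_set))
```

In a real run the items would be the `Doc` objects read from the input .spacy file, and each of the three lists would be written back out under the corresponding name.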

Baseline

Model 2: trained on the SrpKor4Tagging corpus using a basic token-to-vector layer, tagger, and lemmatizer.

Besides being useful on its own, it serves as a point of comparison for the language models trained using LLMs.

The next pipeline uses transformer-based models to provide tokenization, part-of-speech tagging, and lemmatization. Specifically:

Model 3: bert-base-multilingual-uncased is a transformer-based model that is pre-trained on a large corpus of text in multiple languages. It is part of the BERT family of models and is designed to handle text in multiple languages without the need for language-specific models.
Model 4: Classla Bertic is a transformer-based model that is trained on Serbian, Croatian, and other similar languages. It is part of the classla library, which provides a suite of natural language processing tools for Slavic languages.
Model 5: Berticovo is a transformer-based model that is specifically trained on Serbian. It is part of the bertic library, which provides a suite of natural language processing tools for Serbian.

Model 6-

Note that due to space constraints, only models 2 and 3 are uploaded on Git, but both the base configuration and configuration files are set for training. If anyone wants the files, they can contact me over Git and I will be happy to share them.

The next step in the pipeline is to add named entity recognition (NER) to the pipeline, but this is still a work in progress.

Model 5 was converted to a package and uploaded to the Hugging Face model hub. It can be found at https://huggingface.co/Tanor/sr_Spacy_Serbian_Model_SrpKor4Tagging_BERTICOVO.

It can be installed directly using pip:

!pip install https://huggingface.co/Tanor/sr_Spacy_Serbian_Model_SrpKor4Tagging_BERTICOVO/resolve/main/sr_Spacy_Serbian_Model_SrpKor4Tagging_BERTICOVO-any-py3-none-any.whl

It can then be used in a SpaCy pipeline:

# Using spacy.load().
import spacy
nlp = spacy.load("sr_Spacy_Serbian_Model_SrpKor4Tagging_BERTICOVO")

# Importing as module.
import sr_Spacy_Serbian_Model_SrpKor4Tagging_BERTICOVO
nlp = sr_Spacy_Serbian_Model_SrpKor4Tagging_BERTICOVO.load()
