Sentence alignment for ancient Greek using sentence embeddings.

kevinkrahn/grc-bertalign

 
 

Bertalign for Ancient Greek

This repo is a fork of Bertalign, a multilingual sentence aligner, adapted for aligning ancient Greek texts with English translations.

Bertalign is designed to facilitate the construction of multilingual parallel corpora and translation memories, which have a wide range of applications in translation-related research such as corpus-based translation studies, contrastive linguistics, computer-assisted translation, translator education and machine translation.

Approach

Bertalign uses sentence-transformers to embed source and target sentences so that semantically similar sentences in different languages are mapped to nearby points in a shared vector space. It then runs a two-step algorithm based on dynamic programming: 1) Step 1 finds 1-1 alignments that serve as approximate anchor points; 2) Step 2 limits the search path to the anchor points and extracts all valid alignments, including 1-to-many, many-to-1, and many-to-many relations between the source and target sentences.
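The first step described above can be illustrated with a small sketch. This is not the code used in this repo: the function name `anchor_alignments` and the `skip_penalty` parameter are hypothetical, and toy vectors stand in for the embeddings that a sentence-transformers model would produce. It only shows the general idea of picking monotonic 1-1 anchor points by dynamic programming over a cosine-similarity matrix.

```python
import numpy as np

def anchor_alignments(src_vecs, tgt_vecs, skip_penalty=0.0):
    """Monotonic 1-1 alignment by dynamic programming (illustrative only)."""
    # Normalize rows so that the dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T

    n, m = sim.shape
    score = np.zeros((n + 1, m + 1))
    # Backpointers: 0 = align the pair, 1 = skip a source sentence,
    # 2 = skip a target sentence.
    back = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        score[i, 0] = -i * skip_penalty
        back[i, 0] = 1
    for j in range(1, m + 1):
        score[0, j] = -j * skip_penalty
        back[0, j] = 2

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (
                score[i - 1, j - 1] + sim[i - 1, j - 1],  # align i-1 with j-1
                score[i - 1, j] - skip_penalty,           # leave source i-1 unaligned
                score[i, j - 1] - skip_penalty,           # leave target j-1 unaligned
            )
            back[i, j] = int(np.argmax(candidates))
            score[i, j] = candidates[back[i, j]]

    # Trace back from the bottom-right corner to collect the aligned pairs.
    anchors = []
    i, j = n, m
    while i > 0 and j > 0:
        if back[i, j] == 0:
            anchors.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return anchors[::-1]

# Toy "embeddings": three source sentences, two target sentences.
source = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
target = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 1.0]])
print(anchor_alignments(source, target))  # [(0, 0), (2, 1)]
```

In this toy input the middle source sentence has no close counterpart, so it is left without an anchor. A second pass would then search near the anchors for 1-to-many, many-to-1, and many-to-many groupings that could absorb such sentences.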

Performance

The gold alignment dataset is based on translations of the Didache, the letter of Polycarp, a Greek reader, and works of Josephus.

LaBSE on the evaluation dataset:

 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.946 |   0.999 |
| Recall      |   0.935 |   1.000 |
| F1          |   0.941 |   1.000 |
 ---------------------------------

Using a sentence transformer trained on Greek-English parallel data:

 ---------------------------------
|             |  Strict |    Lax  |
| Precision   |   0.970 |   0.994 |
| Recall      |   0.956 |   1.000 |
| F1          |   0.963 |   0.997 |
 ---------------------------------
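For readers unfamiliar with the Strict and Lax columns, the sketch below shows one common way such scores are computed for sentence aligners. It assumes the usual convention: a strict hit requires a predicted alignment to match a gold alignment exactly, while a lax hit only requires overlap on both the source and the target side. The function `score_alignments` and its data layout are hypothetical, and the evaluation script used for the tables above may differ in detail.

```python
def _overlaps(a, b):
    """True if two alignments share a source index and a target index."""
    return bool(set(a[0]) & set(b[0])) and bool(set(a[1]) & set(b[1]))

def _prf(hits_pred, hits_gold, n_pred, n_gold):
    """Precision, recall and F1 from hit counts, guarding against empty input."""
    precision = hits_pred / n_pred if n_pred else 0.0
    recall = hits_gold / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def score_alignments(predicted, gold):
    """Score alignments given as (source_indices, target_indices) pairs."""
    pred = [(tuple(s), tuple(t)) for s, t in predicted]
    ref = [(tuple(s), tuple(t)) for s, t in gold]

    # Strict: an alignment counts only if the same pair appears on the other side.
    strict = _prf(
        sum(a in ref for a in pred),
        sum(g in pred for g in ref),
        len(pred),
        len(ref),
    )
    # Lax: an alignment counts if it overlaps any alignment on the other side.
    lax = _prf(
        sum(any(_overlaps(a, g) for g in ref) for a in pred),
        sum(any(_overlaps(g, a) for a in pred) for g in ref),
        len(pred),
        len(ref),
    )
    return {"strict": strict, "lax": lax}

gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
predicted = [((0,), (0,)), ((1,), (1,)), ((2, 3), (2,))]
print(score_alignments(predicted, gold))
```

Here only the first predicted alignment matches the gold data exactly, so the strict scores are all 1/3, while every predicted alignment overlaps some gold alignment, so the lax scores are all 1.0. This is why lax figures are typically higher than strict ones.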

Installation

Install the dependencies listed in requirements.txt, for example with pip install -r requirements.txt.

Basic example

See example.py.

Citation

Lei Liu & Min Zhu. 2022. Bertalign: Improved word embedding-based sentence alignment for Chinese–English parallel corpora of literary texts, Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqac089.

Licence

Bertalign is released under the GNU General Public License v3.0.

Credits

Main Libraries
Other Sentence Aligners
