Analysis of Glyph and Writing System Similarities using Siamese Neural Networks

About

This repository contains the code and data associated with the paper Analysis of Glyph and Writing System Similarities using Siamese Neural Networks, presented at the Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), co-located with the LREC-COLING 2024 conference in Torino, Italy, on Saturday, May 25, 2024.

Here is the abstract of the article: In this paper we use siamese neural networks to compare glyphs and writing systems. These deep learning models define distance-like functions and are used to explore and visualize the space of scripts by performing multidimensional scaling and clustering analyses. From 51 historical European, Mediterranean and Middle Eastern alphabets, we use a Ward-linkage hierarchical clustering and obtain 10 clusters of scripts including three isolated writing systems. To collect the glyph database we use the Noto family fonts that encode in a standard form the Unicode character repertoire. This approach has the potential to reveal connections among scripts and civilizations and to help the deciphering of ancient scripts.

The paper is published in open access: https://aclanthology.org/2024.lt4hala-1.12/

The associated poster presented at the conference is available here: https://hal.science/hal-04597366/

Figure: Dendrogram of the Ward-linkage hierarchical clustering of the 51 scripts. Color chart: red: medoid, blue: isolated script.

Figure: Two-dimensional scaling of the Latin and Old Italic scripts, which are close scripts with an SNN distance of 0.26.

Figure: Two-dimensional scaling of the 51 scripts. Marker chart: clusters. Color chart: red: medoid, blue: isolated script.

Citation

Claire Roman and Philippe Meyer. 2024. Analysis of Glyph and Writing System Similarities Using Siamese Neural Networks. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 98–104, Torino, Italia. ELRA and ICCL.

or in bibtex:

@inproceedings{roman-meyer-2024-analysis,
    title = "Analysis of Glyph and Writing System Similarities Using {S}iamese Neural Networks",
    author = "Roman, Claire  and
      Meyer, Philippe",
    editor = "Sprugnoli, Rachele  and
      Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.12",
    pages = "98--104",
    abstract = "In this paper we use siamese neural networks to compare glyphs and writing systems. These deep learning models define distance-like functions and are used to explore and visualize the space of scripts by performing multidimensional scaling and clustering analyses. From 51 historical European, Mediterranean and Middle Eastern alphabets, we use a Ward-linkage hierarchical clustering and obtain 10 clusters of scripts including three isolated writing systems. To collect the glyph database we use the Noto family fonts that encode in a standard form the Unicode character repertoire. This approach has the potential to reveal connections among scripts and civilizations and to help the deciphering of ancient scripts.",
}

Requirements

To run this project, we recommend creating a new Python environment and installing the following Python packages (see requirements.txt):

keras==2.15.0
matplotlib==3.7.2
numpy==1.23.5
opencv_python==4.7.0.72
Pillow==10.2.0
scipy==1.12.0
skimage==0.0
tensorflow==2.15.0
tensorflow_intel==2.15.0
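
As a quick sanity check after installation, the pinned versions can be compared with what is actually installed. The sketch below is not part of the repository: the helper names parse_pins and check_pins are hypothetical, and it only handles plain name==version lines like the ones listed above. A pin whose name does not match any installed distribution is reported with None as the installed version.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (pinned, installed_or_None)} for every pin that does not match."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # distribution not found in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    # Two of the pins listed above, used here as sample input.
    sample = "numpy==1.23.5\nscipy==1.12.0\n"
    for name, (pinned, installed) in check_pins(parse_pins(sample)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

To check the whole file, pass the contents of requirements.txt to parse_pins instead of the sample string.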

Content description

The raw and the processed data used in this work are located in the folder data. Here are the descriptions of the raw data:

fonts is composed of NotoSans font ttf files necessary to create the database of 51 writing systems.
omniglot_invented is composed of the invented scripts of the omniglot database (see https://github.com/brendenlake/omniglot).

Here are the descriptions of the processed data:

alphabets is composed of the 51 writing systems in numpy arrays that are created from the font files.
distances is composed of the distances between the scripts obtained with the siamese neural network.
omniglot_invented_augmented is composed of the Omniglot invented scripts augmented by rotations, shears, zooms and shifts; this augmented set is used to train the model.
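
To illustrate the kind of augmentation described above, here is a minimal sketch that applies one random rotation, shear, zoom and shift to a glyph image with scipy.ndimage.affine_transform. It is an illustration under assumptions, not the code of omniglot_data_augmentation.py: the function name random_affine, the parameter ranges and the 28x28 test image are all made up for the example.

```python
import numpy as np
from scipy.ndimage import affine_transform


def random_affine(image, rng, max_rotation_deg=10.0, max_shear=0.2,
                  zoom_range=(0.9, 1.1), max_shift=2.0):
    """Apply one random rotation, shear, zoom and shift to a 2-D glyph image.

    All parameter ranges are illustrative assumptions.
    """
    angle = np.deg2rad(rng.uniform(-max_rotation_deg, max_rotation_deg))
    shear = rng.uniform(-max_shear, max_shear)
    zoom = rng.uniform(*zoom_range)
    shift = rng.uniform(-max_shift, max_shift, size=2)

    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle), np.cos(angle)]])
    shear_matrix = np.array([[1.0, shear],
                             [0.0, 1.0]])
    # affine_transform maps output coordinates to input coordinates,
    # so dividing by the zoom factor magnifies the glyph when zoom > 1.
    matrix = rotation @ shear_matrix / zoom

    # Keep the image centre fixed under the linear part, then add the shift.
    centre = (np.array(image.shape) - 1) / 2.0
    offset = centre - matrix @ centre + shift
    return affine_transform(image, matrix, offset=offset,
                            order=1, mode="constant", cval=0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    glyph = np.zeros((28, 28))
    glyph[8:20, 12:16] = 1.0  # a simple vertical bar as a stand-in glyph
    augmented = [random_affine(glyph, rng) for _ in range(5)]
    print(len(augmented), augmented[0].shape)
```

Combining the four transformations into a single affine matrix means each image is interpolated only once, which limits blurring compared with applying them one after another.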

The Python scripts located in the folder src recreate the processed data and train the model. They are included for reproducibility, but it is not necessary to run them to use the notebooks. Here are the descriptions of the scripts:

creation_alphabets_from_fonts.py creates alphabets from font ttf files and exports them as numpy arrays.
dictionary_alphabets.py stores the Unicode code points of the capital letters and the names of the font files of the scripts.
distance_functions.py defines the siamese-based distance between writing systems.
model.py defines the siamese neural network model.
model_prediction.py predicts the distances between alphabets using the siamese neural network model.
model_training.py trains the siamese neural network model and saves it in the folder models for reuse.
omniglot_data_augmentation.py uses rotations, shears, zooms and shifts to augment the Omniglot dataset that will be used to train the siamese neural network model.
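
To give an idea of how a glyph-level distance can be lifted to a distance between writing systems, here is a hedged sketch. The aggregation used (a symmetric average of nearest-glyph distances) is an assumption chosen for illustration; the definition actually used in this work is the one in distance_functions.py and in the paper. The function names are hypothetical, and euclidean_stand_in merely replaces the trained siamese network so that the example runs without the model weights.

```python
import numpy as np


def pairwise_glyph_distances(glyphs_a, glyphs_b, glyph_distance):
    """Matrix of glyph-to-glyph distances between two scripts."""
    return np.array([[glyph_distance(a, b) for b in glyphs_b] for a in glyphs_a])


def script_distance(glyphs_a, glyphs_b, glyph_distance):
    """Symmetric mean of nearest-glyph distances (illustrative assumption only)."""
    d = pairwise_glyph_distances(glyphs_a, glyphs_b, glyph_distance)
    a_to_b = d.min(axis=1).mean()  # each glyph of A matched to its closest glyph in B
    b_to_a = d.min(axis=0).mean()  # each glyph of B matched to its closest glyph in A
    return 0.5 * (a_to_b + b_to_a)


def euclidean_stand_in(a, b):
    """Stand-in for the siamese network output: plain pixel-wise Euclidean distance."""
    return float(np.linalg.norm(a - b))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    script_a = rng.random((5, 28, 28))  # 5 random "glyphs"
    script_b = rng.random((7, 28, 28))  # 7 random "glyphs"
    print(script_distance(script_a, script_b, euclidean_stand_in))
```

With this aggregation, two scripts of different sizes can be compared, and the distance of a script to itself is zero whenever the glyph-level distance of a glyph to itself is zero.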

The two notebooks located in the folder notebooks reproduce the scientific results of the paper. Here are the descriptions of the notebooks:

1.space_glyphs.ipynb imports pairs of scripts to visualize them with multidimensional scaling analysis.
2.clustering_scripts.ipynb imports the distances between scripts to perform a clustering of the writing systems.
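
For readers who want to see the clustering step in isolation, the following sketch runs a Ward-linkage hierarchical clustering on a precomputed distance matrix with SciPy and picks a medoid for each cluster. It is an illustration, not the notebook code: the function names and the toy distance matrix are assumptions, and the medoid is taken here to be the member with the smallest total distance to the other members of its cluster. Note that Ward linkage is formally defined for Euclidean distances, so applying it to other distances is a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ward_clusters(distance_matrix, n_clusters):
    """Ward-linkage clustering of a symmetric distance matrix with zero diagonal."""
    condensed = squareform(distance_matrix, checks=True)
    tree = linkage(condensed, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels


def medoids(distance_matrix, labels):
    """For each cluster, the index of the member with the smallest total distance."""
    result = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        within = distance_matrix[np.ix_(members, members)]
        result[int(label)] = int(members[within.sum(axis=1).argmin()])
    return result


if __name__ == "__main__":
    # Toy example: four points on a line, the last one far from the others.
    points = np.array([0.0, 1.0, 2.0, 10.0])
    distances = np.abs(points[:, None] - points[None, :])
    tree, labels = ward_clusters(distances, n_clusters=2)
    print(labels, medoids(distances, labels))
```

The tree returned by ward_clusters can be passed to scipy.cluster.hierarchy.dendrogram to draw a dendrogram.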

To get the fitted siamese neural network, download the weights from https://drive.google.com/file/d/1A1nXBWSTOWQbitYCaDXzwZX4FJK55jMy/view?usp=drive_link and extract them into the folder models.

Bonus

In the paper there are only two-dimensional scaling analyses. An interactive three-dimensional scaling analysis of our scripts can be found here: https://philippemeyer68.github.io/glyph.html.
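
As a rough illustration of how a distance matrix can be embedded in two or three dimensions, here is a sketch of classical (Torgerson) multidimensional scaling written with NumPy only. This is an assumption made for the example: the notebooks and the interactive page may use a different MDS variant, and the function name classical_mds is hypothetical.

```python
import numpy as np


def classical_mds(distance_matrix, n_components=2):
    """Embed points in n_components dimensions from pairwise distances."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the eigenvectors with the largest eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Negative eigenvalues can appear for non-Euclidean distances; clip them to zero.
    top_values = np.clip(eigenvalues[order], 0.0, None)
    return eigenvectors[:, order] * np.sqrt(top_values)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(6, 3))
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    embedding_2d = classical_mds(distances, n_components=2)
    embedding_3d = classical_mds(distances, n_components=3)
    print(embedding_2d.shape, embedding_3d.shape)
```

Switching from a two-dimensional to a three-dimensional scaling only requires changing n_components.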

Authors

Claire Roman
Philippe Meyer - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350, Jouy-en-Josas, France. Email: philippemeyer68@yahoo.fr

Repository: PhilippeMeyer68/glyph-SNN