Skip to content

Latest commit

 

History

History
106 lines (77 loc) · 3.77 KB

README.md

File metadata and controls

106 lines (77 loc) · 3.77 KB

ProtoScribe: Modeling the Evolution of Writing

CI status

𓏞

𒁾𒊬

This repository contains the supporting code for experimenting with machine learning approaches to evolution of writing.

NOTE: This is work in progress.

Installation and testing

Note, the instructions in this section apply to Debian Linux where this project was developed. Depending on your operating system some of the installation steps in setup.sh may need to be amended.

Ideally the installation should happen in a Python virtual environment. The installation is taken care of by the setup.sh script. Simply run

./setup.sh

from the root directory of the project. If all the dependencies are installed correctly, run the tests using pytest:

./test.sh

Note, use --continue-on-collection-errors flag to calls to pytest inside test.sh to see all the failing tests even if some of them cannot be loaded correctly.

The codebase

The important directories are:

  • corpus: Utilities for building and parsing the corpus.

  • data: This directory contains most of the bits out of which we generate our simulated dataset. For example, the various concept inventories can be found in concepts, while their corresponding numeric embeddings in semantics. The other types of data include things like various sets of SVGs for our glyphs and so on.

  • evolution: Pipeline utilities and supporting scripts for simulation of writing system evolution.

  • glyphs: Libraries for dealing with SVGs, but also with discrete glyph vocabularies, i.e. glyph_vocab.py.

  • language: Directory housing linguistic modeling APIs:

    • embeddings: Various interfaces for (semantic) embeddings. The most relevant one is the embedder.py. Our configuration defaults to BNC. In addition, in the past we played with representing concepts by glosses -- shorts snippets of Wikipedia text explaining what a thing is -- these can then be encoded using a pretrained language model.

    • morphology/phonology/syntax: Definitions of the phonology, morphology and syntax of the generated language. The core functionality for determining morpheme shape and what it means for two words to sound similar resides in phonology and includes libraries for computing phonetic embeddings.

  • glyphs: Libraries for dealing with SVGs, but also with discrete glyph vocabularies, i.e. glyph_vocab.py.

  • semantics: Basic helper packages for representing knowledge about categories.

  • scoring: Tools for evaluating and scoring the resulting models.

  • sketches: Libraries for manipulating and modeling glyphs as sketches. Includes core libraries for representing glyphs as sequences of (possibly quantized) strokes in our models.

  • speech: Acoustic front-end components.

  • texts: The libraries for constructing the actual "accounting documents".

  • vision: Components for building and representing image features corresponding to semantic concepts.

Disclaimers

This is not an official Google product.