---
theme: seriph
background: /uit_bakgrunn.png
class: text-center
highlighter: shiki
lineNumbers: false
info: |
  ## Slidev Starter Template
  Presentation slides for developers.
  Learn more at [Sli.dev](https://sli.dev)
drawings:
  persist: false
title: LT for minority languages & the GiellaLT infrastructure
---

LT for minority languages &
the GiellaLT infrastructure

Sjur Nørstebø Moshagen


Presentation plan:




layout: section

Introduction


About me



  • Sjur Nørstebø Moshagen
  • Linguistics, Nordic languages & computer science
  • Lingsoft
  • Sámi Parliament
  • UiT the Arctic University of Norway
  • heading the Divvun group at UiT
  • 30 years experience with language technology

What is language technology?



A very brief history — from cuneiform to speech recognition


The first language technology

Cuneiform


Some later instances of long-lasting language technology

Runes


Some later instances of long-lasting language technology


Today — information technology

The internet, among other things


Language technology proper

The term language technology is restricted to actual processing of language data, be it speech, text or video (as when processing signed languages).

The ultimate dream of language technology is speech-to-speech machine translation of unrestricted language.


Language technology is transformative

In all cases language (and information) technology has been pretty transformative.

… and divisive

Another typical characteristic of language technology is that it is divisive:

  • those with access
  • those without

digital divide


LT divide

Empowering those with access, leaving those without behind. As such, language technology can easily be a driver of language death: to take part in society at large, you cannot use your own language, because society expects the use of certain technologies:

  • a certain alphabet or writing system, i.e. literacy
  • access to a printing press
  • access to computers
  • access to your letters on that computer

For speakers of most of the world's languages (there are about 7000), one or several of the points above do not hold, and this only adds to all the other factors driving language death.

One of the main objectives of the GiellaLT infrastructure is to help counter this, by developing language technology for such languages, to make them easy to use on digital devices.

Our starting point and main focus is the Sámi languages, but everything that we make is language independent (except for the linguistic data, obviously), and we actively cooperate with other groups to extend the reach of our technology.


layout: section

Minority languages and requirements for LT development


Characteristics of minority language technology development

Typically, minority languages share a number of characteristics:

  • few or non-existent digital resources
  • restricted availability of dictionaries and grammars, or none at all
  • often complex morphology, morphophonology or both

That is, the dominant language technology paradigm (machine learning) has nothing to offer.


Ownership

It is important that language communities have control over the language resources relating to their language, in the sense that no private entity can block access to those resources. Otherwise the society risks vendor lock-in and expensive redevelopment of existing tools and resources.

— Despite being aware of this, we have experienced it twice!

The best solution is to ensure that everything is open source. All resources and tools in the GiellaLT infra are open source, unless we are forced to do otherwise by software we integrate with (MS Office is one such case). Also, some language communities do not want their language to be openly accessible, due to a history of colonisation, oppression and stigmatisation of their language. In such cases we of course follow their decision.

Open Source


Reuse and multi-use

Because of the cost of language technology projects, it is important to build your infrastructure and resources with reuse in mind, and to plan them so that everything is prepared for multiple usage scenarios.

E.g. in the GiellaLT infrastructure, we have standardised conventions that make it easy to build both normative and descriptive tools from the same codebase (see the sketch after the list below):

  • normative: tools that adhere strictly to an agreed-upon norm for writing, and try to correct text so that deviations are brought in line with the norm: spelling checkers and grammar checkers
  • descriptive: tools that try to process all texts in a language, including erroneous and non-standard texts
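
As an illustration of how one codebase can serve both purposes, a common GiellaLT convention is to tag non-normative forms in the lexicon and filter them out when building normative tools. The lexc entry below is a made-up sketch with toy words and a hypothetical continuation lexicon, not taken from any actual grammar:

```
Multichar_Symbols +Err/Orth

LEXICON Root
! normative form - always included:
  soft:soft            adj-inflection ;
! non-normative variant, tagged with +Err/Orth; a filter strips such entries
! when building normative tools (spellers, grammar checkers), while
! descriptive analysers keep them:
  soft+Err/Orth:sofft  adj-inflection ;
```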

Mainly rule-based

Language technology comes in several flavours:

  • rule-based
  • statistical
  • stochastic
  • neural nets

Typical of all but the rule-based approach is that they require large amounts of raw data to be trained on.

Rule-based technologies, on the other hand, in principle require only a mother tongue speaker and a linguist (who in the best of cases are one and the same person).


Basic working of rule-based technologies

Rule-based


Main sources for building the grammars and language resources



  • digital dictionaries
  • grammars
  • corpus
  • native speakers

layout: section

The GiellaLT infrastructure


Main features of the GiellaLT infrastructure



  • language-independent infrastructure
  • scalability in two dimensions: languages x tools/products
  • standardised directory & file structure
  • encourages and facilitates international cooperation
  • ~130 languages in our infra (at various stages), 30+ in active development
    • almost all of them minority languages
    • majority language grammars and LT resources mainly to support the minority languages

Scalability

  • for languages:
    • template for all resources needed
  • for tools:
    • add support for a new tool to the template, and propagate it to all existing languages
  • core design principle:
    • separate language independent processing from language-specific processing

The templating system and the split between language-independent and language-specific code ensure that we can add as many languages as we want, and easily add support for new tools and technologies.


Standardised dir structure

.
├── devtools
├── docs
├── src
│   ├── cg3
│   ├── filters
│   ├── fst
│   ├── hyphenation
│   ├── orthography
│   ├── phonetics
│   ├── scripts
│   ├── tagsets
│   └── transcriptions
├── test
│   ├── data
│   ├── src
│   └── tools
└── tools
    ├── analysers
    ├── grammarcheckers
    ├── hyphenators
    ├── mt
    ├── shellscripts
    ├── spellcheckers
    └── tokenisers

International cooperation



Some language repositories

With maturity, license, bug and build status (github.com/divvun/registry).


layout: section

Linguistic programming


Formalisms / technologies used



  • morphology / morphophonology: HFST / Foma / Xerox
    • lexc
    • twolc
    • xfst rewrite rules
    • Xerox-style pmatch scripts
  • syntax: Constraint Grammar (in the form of VISL CG-3)

All of these are open source except for the Xerox tools (which are free of charge, though). Foma does not support TwolC (see further down).


LexC



  • an excellent formalism for concatenative morphology
  • typically, you specify stems and affixes in different lexicons
  • ... to allow for abstractions over stem classes and inflections
  • it is in essence a programming language for linguists
  • ... where you spell out the morphology of a language such that a compiler can turn it into an executable program
LEXICON Root
  iloinen:iloi nen-adj-inflection ;

LEXICON nen-adj-inflection
  +A+Sg+Nom:nen # ;
  +A+Sg+Gen:sen # ;
  +A+Sg+Par:sta # ;

TwolC



  • formalism developed by Kimmo Koskenniemi in the early 1980s to describe phonological processes
  • closely resembles generative rewrite rules of the form:
    A -> B / C _ D
    
  • rules are unordered and applied in parallel (see the sketch below)
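
To give a flavour of the formalism, here is a minimal, made-up twolc rule (not taken from any actual GiellaLT grammar): it states that t is realised as d if and only if it stands between vowels.

```
Alphabet
  a e i o u d t t:d ;

Sets
  Vow = a e i o u ;

Rules

"t is realised as d between vowels"
  t:d <=> Vow _ Vow ;
```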

Xfst rewrite rules



  • another formalism for describing phonology
  • main difference from TwolC: rules are ordered and applied in sequence (see the sketch below)
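
A small, invented illustration of ordered rules in xfst notation (names and rules are made up for this sketch; such scripts can be compiled with hfst-xfst or foma):

```
! Invented example: two replace rules, applied in sequence via composition.
define Vow [ a | e | i | o | u ] ;

define Gradation       t -> d || Vow _ Vow ;  ! applies first
define FinalDevoicing  d -> t || _ .#. ;      ! applies to the output of the first rule
define Phonology       Gradation .o. FinalDevoicing ;
```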

Both TwolC and xfst rewrite rules are supported by the GiellaLT infrastructure; compilation support depends on the compiler used:

Foma does not support TwolC; everything else is supported by all tools.


Xerox-style pmatch scripts

HFST only. This formalism is an extension of the xfst rewrite rules, and is a reimplementation of work done by Xerox around 10 years ago. It allows for more complex text processing, and with a few modifications we have turned the formalism into a tokeniser-and-morphological-analyser that also outputs ambiguous tokens. Such ambiguity can then be resolved using Constraint Grammar (see next), followed by a simple reformatter that rewrites tokens that have been split in two.

Using this setup it is possible to get the tokenisation almost perfect. In practice we still have some work to do, but we are already well ahead of the alternative methods.

The pmatch scripts are key to a recent addition to our infrastructure: rule-based grammar checking. We are now also developing text-to-speech systems, using the pmatch scripts + VISL CG-3 processing to turn raw text into disambiguated IPA text streams that can be fed to the synthesis engine.

For speech synthesis this means that we use rule-based technologies for everything but the actual synthesis modelling, reducing the corpus need to about 10 hours of studio recordings. That is within reach for most language communities.


Constraint grammar



  • formalism developed at the University of Helsinki by Fred Karlsson, later extended by Pasi Tapanainen (CG2), and further by the VISL project (CG3)
  • the main idea is to remove or select specific readings of ambiguous words given context constraints:

    in the context of a subject personal pronoun, select a verb reading that agrees with the pronoun in person and number

    Cf. German haben: it can be Infinitive, 1Pl or 3Pl.

    But with the subject pronoun wir, only 1Pl makes sense, so select it.

  • used a lot in text parsers in combination with morphological analysers, giving very good results
  • also used in language technology tools and products such as machine translation and grammar checking since the late 1990s (see the rule sketch below)
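
The haben example above could be written as a VISL CG-3 rule roughly like this (a sketch with invented tag names, not following any particular GiellaLT tag set):

```
DELIMITERS = "<.>" "<!>" "<?>" ;

# If the word immediately to the left is the subject pronoun "wir",
# select the 1st person plural present-tense reading of the verb.
SELECT (V Prs Pl1) IF (-1 ("<wir>"));
```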

Testing

Systematic testing is essential, and the infrastructure supports several types of tests:

  • classes of words/inflections/alternations
  • lemmas
  • in-source test data

Example test data (South Sámi):

Tests:

  Verb - båetedh: # verb I, stem -ie, root vowel -åe-
    båetedh+V+IV+Inf: båetedh
    båetedh+V+IV+Ind+Prs+Sg1: båatam
    båetedh+V+IV+Ind+Prs+Sg2: båatah
    båetedh+V+IV+Ind+Prs+Sg3: båata
    båetedh+V+IV+Ind+Prs+Du1: båetien
    båetedh+V+IV+Ind+Prs+Du2: [båeteden, båetiejidien]
    båetedh+V+IV+Ind+Prs+Du3: båetiejægan
    båetedh+V+IV+Ind+Prs+Pl1: [båetebe, båetiejibie]
    båetedh+V+IV+Ind+Prs+Pl2: [båetede, båetiejidie]
    båetedh+V+IV+Ind+Prs+Pl3: båetieh

layout: section

Tools


Keyboards (desktop & mobile)

A very simple syntax (mobile keyboard shown):

modes:
  mobile-default: |
    á š e r t y u i o p ŋ
    a s d f g h j k l đ ŧ
       ž z č c v b n m
  mobile-shift: |
    Á Š E R T Y U I O P Ŋ
    A S D F G H J K L Đ Ŧ
       Ž Z Č C V B N M

This, plus a few more technical details, is used to produce ready-to-use installers and keyboard apps.

One can also add a speller file (an FST-based spell checker) and get spelling correction as part of the mobile keyboard.


Final keyboard

The end result looks like this:


It has a dark mode

The speller is exactly the same FST-based speller as described below, with slight adaptations of the error model to fit the keyboard layout and the errors typically made.


layout: two-cols

Locale registration

As part of the desktop keyboard installers, the locale
of the keyboard is added to the system:

::right::


So that languages unknown to Windows and macOS become known to the system and can subsequently be used for spell checking:

Plains Cree in MS Word


Spellers



A speller is made up of two parts:

  1. an acceptor - is this a correct word or not?
  2. an error model - if this is not a word, how is it most likely to be corrected?

In our infrastructure, both are finite state transducers. The acceptor is built from our general analyser, but restricted to only normatively correct forms.

The error model contains a standard permutation FST for the relevant alphabet, with language-specific additions based on typical errors made by writers, as sketched below.
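
As a simplified sketch of the idea (not the actual GiellaLT error model, which is generated and weighted), an edit-distance-one error model can be expressed as a regular expression over symbol pairs:

```
! Toy error model in xfst-style regular expression notation (illustrative only):
define Alpha    [ a | á | b | c | č | d ] ;            ! sample of the relevant alphabet
define Edit     [ Alpha:Alpha | Alpha:0 | 0:Alpha ] ;  ! substitute, delete or insert one symbol
define ErrModel [ Alpha* Edit Alpha* ] ;               ! one edit (possibly identity) anywhere in the word
```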


Short turnaround during development



  1. add a word, correct some part of the morphology
  2. compile
  3. test in e.g. LibreOffice or on the command line

Compilation time varies a lot depending on the language and the size and complexity of the lexicon, the morphology and the morphophonology.


layout: two-cols

Host app integration



  • MS Word (Windows, macOS coming)
  • LibreOffice (all OSes)
  • System-wide spellers (Windows, macOS, Linux)
  • mobile keyboard apps
  • web server

::right::

Speller online


Hyphenation



  • uses rewrite rules to identify syllable structure, i.e. hyphenation points (see the sketch below)
  • uses the analyser (lexicon) to find word boundaries and exceptional hyphenation
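
An invented example of such a rewrite rule in xfst notation, inserting a hyphenation point between a vowel and a following consonant-vowel sequence (real GiellaLT hyphenation rules are considerably more involved):

```
! Illustrative only - not actual GiellaLT hyphenation rules:
define Vow  [ a | e | i | o | u ] ;
define Cons [ b | d | g | k | l | m | n | p | r | s | t ] ;

! Insert a hyphenation point (written "-") between V and a following CV:
define Hyphenate  [..] -> %- || Vow _ Cons Vow ;
```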

Grammar checkers



  • morphological analyser for analysis and tokenisation
  • includes disambiguation of multiword expressions
  • a tagger for whitespace errors
  • runs the spelling checker on unknown words
  • constraint grammars for both disambiguation and error detection, as well as for selecting or filtering speller suggestions based on context (an illustrative error-detection rule follows below)
  • uses valency info and semantic tags to avoid relying on (possibly faulty) morphology and syntax
  • new research coming out of this:
    • improvements to sentence boundary detection (near-perfect results possible)
    • improvements to tokenisation and whitespace handling: we can detect compounds erroneously written apart, something most other grammar checkers handle poorly or not at all
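
To make the error-detection part concrete, here is a schematic VISL CG-3 rule with invented tags (GiellaLT grammar checkers mark errors with &-prefixed tags, but the tag names and context here are made up):

```
# Schematic example: tag a plural noun as an agreement error when it is
# preceded by a singular numeral; a later stage can map the &-tag to a
# user-visible message and suggestions.
ADD (&num-agr-err) TARGET (N Pl) IF (-1 (Num Sg));
```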

layout: center

Grammar checker flow chart


The grammar checker works in



  • MS Word (as a web-based office extension for now)
  • GoogleDocs
  • planned support:
    • macOS (system wide), possibly Windows
    • LibreOffice
    • regular MS Office grammar checker for Windows and macOS

Screen shot from MS Word (web version):

Grammar checker


Text-to-speech systems



  • Commercial, closed source since 2014 — North Sámi
  • Working on an open-source solution based on HFST, VISL CG-3 and machine learning

Closed source synthesis



  • recordings and text available
  • technology unfortunately from a commercial company = closed source code
    • this has now come back to haunt us: they are closing down the macOS version
    • fortunately, we had already planned a new project for Julev Sámi built completely with open source tools, so we should be good in a couple of years
  • quality very good

Open-source synthesis



  • the original plan was to use our own text processing for conversion to IPA or similar
    • we are doing that now, in a new project for Lule Sámi
  • using a similar pipeline to the grammar checker one to produce a phonetic transcription
  • feeding that to the synthesis engine
  • synthesis done using machine learning / neural nets
  • 10 hours of recordings should be enough for high quality synthesis

Dictionaries



  • content from several sources
  • morphological analysis to enable looking up directly in text
    • web browsers
    • macOS and Windows apps

NDS


Language learning



  • analysing reader input
  • adapting suggested forms according to user preferences

Korp



  • database and interface for searching an analysed corpus
  • morphological analysis, disambiguation, syntactic parsing using our tools
  • corpus data available in many languages

Korp


Summary

  • one source for everything
  • reuse and multiple usages
  • summarised in the following illustration:


Links

Everything is easily accessible on GitHub, and everyone can edit and contribute.



layout: end