---
theme: seriph
background: /uit_bakgrunn.png
class: text-center
highlighter: shiki
lineNumbers: false
info: |
  ## Slidev Starter Template
  Presentation slides for developers.
  Learn more at [Sli.dev](https://sli.dev)
drawings:
title: LT for minority languages & the GiellaLT infrastructure
---

# LT for minority languages & the GiellaLT infrastructure
Sjur Nørstebø Moshagen
Presentation plan:
- Sjur Nørstebø Moshagen
- Linguistics, Nordic languages & computer science
- Lingsoft
- Sámi Parliament
- UiT the Arctic University of Norway
- heading the Divvun group at UiT
- 30 years of experience with language technology
A very brief history — from cuneiform to speech recognition
The Internet, among other things.
The term language technology is here restricted to the actual processing of language data, be it speech, text or video (as when processing signed languages). The ultimate dream of language technology is speech-to-speech machine translation of unrestricted language.
In all cases language (and information) technology has been pretty transformative.
Another typical characteristic of language technology is that it is divisive:
- those with access
- those without
Empowering those with access, leaving those without behind. As such it can easily be a driver of language death — to take part in society at large, you cannot use your own language, because society expects the use of certain technologies:
- a certain alphabet or writing system — i.e. literacy
- access to a printing press
- access to computers
- access to your letters on that computer
For speakers of most of the world's roughly 7 000 languages, one or several of the points above do not hold, which only adds to all the other factors driving language death.
One of the main objectives of the GiellaLT infrastructure is to help counter this, by developing language technology for such languages, to make them easy to use on digital devices.
Our starting point and main focus is the Sámi languages, but everything that we make is language independent (except for the linguistic data, obviously), and we actively cooperate with other groups to extend the reach of our technology.
Typically, minority languages share a number of characteristics:
- few or no digital resources
- restricted availability of dictionaries and grammars, or none at all
- often complex morphology, morphophonology, or both
That is, the dominant language technology paradigm (machine learning) has nothing to offer them.
It is important that language communities have control over the language resources relating to their language, in the sense that no private entity can block access to those resources. Otherwise the language community risks vendor lock-in and expensive redevelopment of existing tools and resources.
— Despite being aware of this, we have experienced it twice!
The best solution is to ensure that everything is open source. All resources and tools in the GiellaLT infra are open source, unless we are forced to do otherwise by software we integrate with (MS Office is one such case). Also, some language communities do not want their language to be openly accessible, due to a history of being colonised and oppressed, and of their language being stigmatised. In such cases we of course follow their decision.
Because of the costs of language technology projects, it is important to build your infra and resources with reuse in mind, and also plan them so that everything is prepared for multiple usage scenarios.
E.g. in the GiellaLT infrastructure, we have standardised conventions that make it easy to build both normative and descriptive tools from the same codebase.
- normative: tools that adhere strictly to an agreed-upon norm for writing, and try to correct text so that deviations are brought in line with the norm (e.g. spelling checkers and grammar checkers)
- descriptive: tools that try to process all texts in a language, including erroneous and non-standard texts
Language technology comes in several flavours:
- rule-based
- statistical
- stochastic
- neural nets
Common to all but the rule-based approach is that they require large amounts of raw data for training.
Rule-based technologies, on the other hand, in principle require only a mother tongue speaker and a linguist (who in the best of cases are one and the same person).
- digital dictionaries
- grammars
- corpus
- native speakers
- language independent infrastructure
- scalability in two dimensions: languages x tools/products
- standardised dir & file structure
- encourages and facilitates international cooperation
- ~130 languages in our infra (at various stages), 30+ in active development
- almost all of them minority languages
- majority language grammars and LT resources mainly to support the minority languages
- for languages:
  - a template for all resources needed
- for tools:
  - add support for a new tool to the template, and propagate it to all existing languages
- core design principle:
  - separate language-independent processing from language-specific processing
The templating system and the split between language-independent and language-specific code ensure that we can add as many languages as we want, and easily add support for new tools and technologies.
.
├── devtools
├── docs
├── src
│ ├── cg3
│ ├── filters
│ ├── fst
│ ├── hyphenation
│ ├── orthography
│ ├── phonetics
│ ├── scripts
│ ├── tagsets
│ └── transcriptions
├── test
│ ├── data
│ ├── src
│ └── tools
└── tools
├── analysers
├── grammarcheckers
├── hyphenators
├── mt
├── shellscripts
├── spellcheckers
└── tokenisers
With maturity and license, bug and build status (github.com/divvun/registry).
- morphology / morphophonology: Hfst / Foma / Xerox
  - lexc
  - twolc
  - xfst rewrite rules
  - Xerox-style pmatch scripts
- syntax: Constraint Grammar (in the form of VISLCG3)
All of these are open source except for the Xerox tools (which are free, though). Foma does not support TwolC (see further down).
- an excellent formalism for concatenative morphology
- typically, you specify stems and affixes in different lexicons
- ... to allow for abstractions over stem classes and inflections
- it is in essence a programming language for linguists
- ... where you spell out the morphology of a language such that a compiler can turn it into an executable program
LEXICON Root
iloinen:iloi nen-adj-inflection ;
LEXICON nen-adj-inflection
+A+Sg+Nom:nen # ;
+A+Sg+Gen:sen # ;
+A+Sg+Par:sta # ;
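A minimal sketch of how such a lexc file can be compiled and used for generation with the HFST tools (the file names are illustrative):

```sh
# compile the lexc source into a transducer
hfst-lexc iloinen.lexc -o iloinen.hfst

# generate a surface form from a lemma+tags string
# (upper side = analysis, lower side = surface form)
echo "iloinen+A+Sg+Gen" | hfst-lookup iloinen.hfst
# prints the generated surface form, here iloisen (plus a weight)
```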
- Formalism developed by Kimmo Koskenniemi in the early 80's to describe phonological processes
- resembles quite closely generative rewrite rules of the form:
A -> B / C _ D
- rules are unordered and applied in parallel
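A minimal, made-up TwolC sketch; the alphabet, set and rule below are invented for illustration, not taken from an actual GiellaLT grammar:

```twolc
Alphabet
  a e i o u  b d g  p t k  t:d ;

Sets
  Vow = a e i o u ;

Rules

"t is realised as d between surface vowels"
  t:d <=> :Vow _ :Vow ;
```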
- another formalism to describe phonology
- main difference to TwolC: rules are ordered and applied in sequence
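For comparison, the same kind of alternation written as xfst-style replace rules, applied in sequence via composition (all names here are invented for the example):

```xfst
define Vow [ a | e | i | o | u ] ;

! rewrite rules are ordered: the second rule applies to the output of the first
define TWeakening     t -> d || Vow _ Vow ;
define FinalVowelLoss [ e | o ] -> 0 || _ .#. ;

define Phonology  TWeakening .o. FinalVowelLoss ;
```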
Both TwolC and xfst rewrite rules are supported by the GiellaLT infrastructure; compilation support depends on the compiler tool used:
- Foma does not support TwolC; everything else is supported by all tools
Hfst only. This formalism is an extension of the xfst rewrite rules, and is a reimplementation of work done by Xerox around 10 years ago. It allows for more complex text processing, and with a few modifications we have turned the formalism into a tokeniser-and-morphological-analyser that also outputs ambiguous tokens. Such ambiguity can then be resolved using Constraint Grammar (see next), followed by a simple reformatter that rewrites tokens that were split in two.
Using this setup it is possible to get the tokenisation almost perfect. In practice we still have some work to do, but we are already well ahead of the alternative methods.
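Roughly, the runtime tokenisation pipeline looks like the sketch below; the file names are typical of a GiellaLT build, and the exact tool options may differ between versions:

```sh
# tokenise and morphologically analyse raw North Sámi text with the
# pmatch-based tokeniser, then let a Constraint Grammar resolve
# ambiguous tokenisations and readings
echo "Dat lea buorre." \
  | hfst-tokenise --giella-cg tokeniser-disamb-gt-desc.pmhfst \
  | vislcg3 -g disambiguator.cg3
```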
The pmatch scripts are key to a recent addition to our infrastructure: rule-based grammar checking. We are also now developing text-to-speech systems using the pmatch scripts + VISLCG3 processing to turn raw text into disambiguated IPA text streams that can be fed to the synthesis engine.
For speech synthesis this means that we use rule-based technologies for everything but the actual synthesis modelling, reducing the corpus need to about 10 hours of studio recordings. That is within reach for most language communities.
- formalism developed at the University of Helsinki by Fred Karlsson, later extended by Tapanainen (CG2), and further by the VISL project (CG3)
- main idea is to remove or select specific possible readings of ambiguous words given context constraints:
in the context of a subject personal pronoun, select a verb reading that agrees with the pronoun in person and number
Cf. German *haben*: it can be Infinitive, 1Pl or 3Pl. But with the subject pronoun *wir*, only 1Pl makes sense, so select it (see the rule sketch after this list).
- used a lot in text parsers in combination with morphological analysers, giving very good results
- also used in language technology tools and products such as machine translation and grammar checking since the late 1990's
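For the *haben* / *wir* case, such a rule could look roughly like this in VISLCG3 (the tag names are invented for the example):

```cg3
# if the word immediately to the left is a 1st person plural pronoun
# in the nominative, select the finite 1Pl reading and discard the rest
SELECT (V Fin 1Pl) IF (-1 (Prn Nom 1Pl)) ;
```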
Systematic testing is essential, and the infrastructure supports several types of tests:
- classes of words/inflections/alternations
- lemmas
- in-source test data
Example test data (South Sámi):
```yaml
Tests:
  Verb - båetedh: # verb I, stem -ie, root vowel -åe-
    båetedh+V+IV+Inf: båetedh
    båetedh+V+IV+Ind+Prs+Sg1: båatam
    båetedh+V+IV+Ind+Prs+Sg2: båatah
    båetedh+V+IV+Ind+Prs+Sg3: båata
    båetedh+V+IV+Ind+Prs+Du1: båetien
    båetedh+V+IV+Ind+Prs+Du2: [båeteden, båetiejidien]
    båetedh+V+IV+Ind+Prs+Du3: båetiejægan
    båetedh+V+IV+Ind+Prs+Pl1: [båetebe, båetiejibie]
    båetedh+V+IV+Ind+Prs+Pl2: [båetede, båetiejidie]
    båetedh+V+IV+Ind+Prs+Pl3: båetieh
```
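These YAML test files are hooked into the normal build system, so in a configured language directory they run as part of the standard test target:

```sh
# run all in-source tests, including the YAML morphology tests
make check
```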
A very simple syntax (mobile keyboard shown):
```yaml
modes:
  mobile-default: |
    á š e r t y u i o p ŋ
    a s d f g h j k l đ ŧ
    ž z č c v b n m
  mobile-shift: |
    Á Š E R T Y U I O P Ŋ
    A S D F G H J K L Đ Ŧ
    Ž Z Č C V B N M
```
This, plus a few more technical details, is used to produce ready-to-use installers and keyboard apps.
One can also add a speller file (an fst-based spell checker) and get spelling correction as part of the mobile keyboard.
The end result looks like this:
The speller is exactly the same fst-based speller as described below, with slight adaptations of the error model to fit the keyboard layout and the errors typically made.
As part of the desktop keyboard installers, the locale of the keyboard is added to the system:
::right::
So that languages unknown to Windows and macOS become known and can subsequently be used for spell checking:
A speller is made up of two parts:
- an acceptor: is this a correct word or not?
- an error model: if it is not a correct word, how is it most likely to be corrected?
In our infrastructure, both are finite state transducers. The acceptor is built from our general analyser, but restricted to only normatively correct forms.
The error model contains a standard permutation fst for the relevant alphabet, with language specific additions based on likely errors made by writers.
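Both transducers are bundled into a speller archive that the host applications load. From the command line, such an archive can be tried out with hfst-ospell (the archive name is illustrative):

```sh
# a .zhfst archive bundles the acceptor and the error model;
# hfst-ospell reads words from stdin and prints corrections
echo "sámmegiella" | hfst-ospell se.zhfst
```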
- add a word, correct some part of the morphology
- compile
- test in e.g. LibreOffice or on the command line
Compilation time varies a lot depending on the language and the size and complexity of the lexicon, the morphology and the morphophonology.
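In a GiellaLT language directory, that loop typically looks something like the following sketch (configure options and paths vary between languages and setups):

```sh
# rebuild the analysers and the speller after editing the lexc/twolc sources
./configure --enable-spellers   # once per checkout (after an initial ./autogen.sh)
make                            # recompile, producing an updated .zhfst speller archive
# then test the archive on the command line or in LibreOffice
```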
- MS Word (Windows, macOS coming)
- LibreOffice (all OS's)
- System wide spellers (Windows, macOS, Linux)
- mobile keyboard apps
- web server
::right::
- uses rewrite rules to identify syllable structure = hyphenation points
- uses analyser (lexicon) to find word boundaries and exceptional hyphenation
- morphological analyser for analysis and tokenisation
- includes disambiguation of multiword expressions
- a tagger for whitespace errors
- runs the spelling checker on unknown words
- constraint grammars for both disambiguation and error detection, as well as for selecting or filtering speller suggestions based on context
- uses valency info and semantic tags to avoid reliance on (faulty) morphology and syntax
- new research coming out of this:
- improvements to sentence boundary detection (near-perfect results possible)
- improvements to tokenisation and whitespace handling: we can detect compounds erroneously written apart (handled poorly or not at all by most other grammar checkers)
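The whole chain is packaged into a single archive and run as one pipeline; a hypothetical command-line invocation could look like this (archive name, pipeline name and options are made up for the example):

```sh
# run a Sámi grammar checker pipeline on a sentence with a person-agreement error;
# the .zcheck archive bundles tokeniser, CG grammars, speller and suggestion generator
echo "Mun lea boahtán." | divvun-checker -a se.zcheck -n smegram
```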
- MS Word (as a web-based Office extension for now)
- Google Docs
- planned support:
  - macOS (system wide), possibly Windows
  - LibreOffice
  - the regular MS Office grammar checker for Windows and macOS
- Commercial, closed source since 2014 — North Sámi
- Working on an open source solution based on HFST, VislCG and ML
- recordings and text available
- technology unfortunately from a commercial company = closed source code
- we are now being haunted by this: they are closing down the macOS version
- fortunately, we had already planned a new project for Julev Sámi built completely on open source, so we should be good in a couple of years
- quality very good
- the original plan was to use our own text processing for conversion to IPA or similar
- we are doing that now, in a new project for Lule Sámi
- using a pipeline similar to the grammar checker's to produce a phonetic transcription
- feeding that to the synthesis engine
- synthesis done using machine learning / neural nets
- 10 hours of recordings should be enough for high quality synthesis
- content from several sources
- morphological analysis to enable looking up directly in text
- web browsers
- macOS and Windows apps
- analysing reader input
- adapting suggested forms according to user preferences
- database and interface for searching an analysed corpus
- morphological analysis, disambiguation, syntactic parsing using our tools
- corpus data available in many languages
- one source for everything
- reuse and multiple usages
- summarised in the following illustration:
Everything is easily accessible on GitHub, and everyone can edit and contribute.
- Divvun tools & download: divvun.no & divvun.org
- Language resources & source code: github.com/giellalt
- Tool source code: github.com/divvun
- Korp: gtweb.uit.no/korp/
- Machine translation: jorgal.uit.no