This document describes the current process to generate a complete `importjson` file for morphodict from the available sources (most of them described in the main README of this repo).
As referenced in #108, the process is:

- Update dictionary sources
- Update LEXC sources based on the updated dictionary sources
- Compile new FSTs in giellalt/lang-crk
- Aggregate and process dictionary sources into an `importjson` file for uploading to the intelligent dictionary
- Update the internal database for the intelligent dictionary, including the generation of forms in paradigms and English translation equivalents
(From #108): The following steps should be equivalent to running the `altlab/crk/bin/update-crk-dictionary-sources-2-lexc.sh` script from the private repo.
- Update the Cree Words (CW) source file `CreeDict-x` in the Carleton repo:

  ```sh
  svn up
  ```
- Remove Windows-style CR characters from the CW source, and copy it over to the ALTLab repo:

  ```sh
  cat PlainsLexUni/CreeDict-x | tr -d '\r' > altlab/crk/dicts/Wolvengrey_altlab.toolbox
  ```

  Note: a version of this file lives in the private altlab repo at `crk/dicts/`.
- Convert this Toolbox file into TSV format:

  ```sh
  cat altlab/crk/dicts/Wolvengrey_altlab.toolbox | altlab/crk/bin/toolbox2tsv.sh > altlab/crk/generated/Wolvengrey_altlab.tsv
  ```
- Compare against the Maskwacîs Dictionary content, and add its unique entries (and associated stem and inflectional class information) after the CW entries:

  ```sh
  altlab/crk/bin/add-md-entries-2-after-cw-tsv.sh altlab/crk/generated/Wolvengrey_altlab.tsv altlab/crk/dicts/Maskwacis_altlab.tsv > altlab/crk/generated/altlab.tsv
  ```
- Generate LEXC source for the individual parts of speech from this aggregated ALTLab TSV file:

  ```sh
  cat altlab/crk/generated/altlab.tsv | altlab/crk/bin/altlab2lexc.sh 'N' > altlab/crk/generated/noun_stems.lexc
  cat altlab/crk/generated/altlab.tsv | altlab/crk/bin/altlab2lexc.sh 'V' > altlab/crk/generated/verb_stems.lexc
  ```
- Add copyright headers to the LEXC sources, and copy them over to `giellalt/lang-crk/src/fst/stems/`:

  ```sh
  cat giellalt/lang-crk/src/fst/stems/noun_header.lexc altlab/crk/generated/noun_stems.lexc > giellalt/lang-crk/src/fst/stems/noun_stems.lexc
  cat giellalt/lang-crk/src/fst/stems/verb_header.lexc altlab/crk/generated/verb_stems.lexc > giellalt/lang-crk/src/fst/stems/verb_stems.lexc
  ```
Follow the instructions from https://github.com/giellalt/lang-crk; once you are in the repo:

```sh
./autogen.sh
./configure
make
```

Make sure you have provided the desired configuration options to `./configure` to generate the FSTs you are interested in.
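For example, to build the dictionary transducers referred to later in this document (a minimal sketch; `--enable-dicts` is the only option shown, and other options may be relevant to your setup):

```sh
# configure the build to also produce the dictionary FSTs
./configure --enable-dicts
make
```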
From both UAlbertaALTLab/morphodict#257 and #109, the process can also be performed manually. The following are explicit instructions on creating a descriptive analyzer and a normative generator (with morpheme boundaries) from updated LEXC source (undertaken in #108):
- Create the basic morphological model.

  If one has compiled the aggregate LEXC file, `lexicon.tmp.lexc`, with the regular GiellaLT compilation scheme, one can use that file as the primary source:

  ```
  read lexc src/fst/lexicon.tmp.lexc
  define Morphology
  ```

  Otherwise, one can compile the aggregate file as follows:

  ```sh
  cat src/fst/root.lexc src/fst/stems/noun_stems.lexc src/fst/stems/verb_stems.lexc src/fst/stems/particles.lexc src/fst/stems/pronouns.lexc src/fst/stems/numerals.lexc src/fst/affixes/noun_affixes.lexc src/fst/affixes/verb_affixes.lexc > lexicon.tmp.lexc
  ```
- Create the basic phonological model:

  ```
  source src/fst/phonology.xfscript
  define Phonology
  ```
- Create filters for removing (a) word fragments and (b) orthographically non-standard forms:

  ```
  regex ~[ $[ "+Err/Frag" ]];
  define removeFragments

  regex ~[ $[ "+Err/Orth" ]];
  define removeNonStandardForms
  ```
- Create a filter to select only forms belonging to dictionary parts of speech:

  ```
  regex $[ "+N" | "+V" | "+Ipc" | "+Pron" ];
  define selectDictPOS
  ```
- Compose the normative generator:

  ```
  set flag-is-epsilon ON
  regex [ selectDictPOS .o. removeNonStandardForms .o. removeFragments .o. Morphology .o. Phonology ];
  save stack generator-gt-dict-norm.hfst
  define NormativeGenerator
  ```
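The saved generator can be spot-checked with hfst-lookup; this is a sketch in which the analysis string is a hypothetical example, and the lemma and tag sequence must match the tag inventory actually used in the compiled lexicon:

```sh
# feed an analysis string to the generator; the output is the normative surface form
echo 'wâpamêw+V+TA+Ind+3Sg+4Sg/PlO' | hfst-lookup generator-gt-dict-norm.hfst
```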
- Specify a transcriptor to remove the special morpheme-boundary characters:

  ```
  regex [ [ "<" | ">" | "/" ] -> 0 ];
  define removeBoundaries
  ```
- Load in the basic model for spell relaxation:

  ```
  load src/orthography/spellrelax.compose.hfst
  define SpellRelax
  ```
- Compose the descriptive analyzer:

  ```
  regex [ selectDictPOS .o. removeFragments .o. Morphology .o. Phonology .o. removeBoundaries .o. SpellRelax ];
  # regex [ NormativeGenerator .o. removeBoundaries .o. SpellRelax ];
  invert net
  save stack analyser-gt-dict-desc.hfst
  define DescriptiveAnalyser
  ```
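The commands above are meant for an interactive hfst-xfst session. They can also be collected into a script file and run in one go; the file name here is hypothetical, assuming the full command sequence has been saved to it:

```sh
# run the saved command sequence non-interactively
hfst-xfst -F build-dict-fsts.xfscript
```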
Normally, the necessary FSTs would be created according to the standard GiellaLT compilation configuration, with the option `--enable-dicts`.
4. Aggregate and process dictionary sources into an `importjson` file for uploading to the intelligent dictionary

This process is described in the `README.md` file in https://github.com/UAlbertaALTLab/crk-db, but we currently document it here for completeness:
- Install node.js and run `npm install` in the repo.

- Collect the "original" data sources created in the previous steps in the `$CRK_DB_REPO/data` folder:

  - The `altlab.tsv` file
  - The `Maskwacis.tsv` file
  - The `Wolvengrey.toolbox` file
- Run the following command:

  ```sh
  npm run build
  ```

  After this command is run, the `data/crkeng_dictionary.importjson` file will be generated.
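As a quick sanity check (a sketch, assuming the generated importjson is a single JSON array of dictionary entries), the number of entries can be counted with node:

```sh
node -e "console.log(JSON.parse(require('fs').readFileSync('data/crkeng_dictionary.importjson', 'utf8')).length)"
```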
NOTE: The scripts currently pull two HFSTs via LFS: `src/crkeng/resources/fst/crk-relaxed-analyzer-for-dictionary.hfstol` and `src/crkeng/resources/fst/crk-strict-analyzer-for-dictionary.hfstol`. These FSTs are used in the `lib/aggregate/index.js` file.
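If those two files turn up as small Git LFS pointer files instead of actual transducers (for example, after cloning without LFS support), fetching them explicitly should resolve it, assuming Git LFS is installed:

```sh
git lfs pull
```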
5. Update the internal database for the intelligent dictionary, including the generation of forms in paradigms and English translation equivalents
For updating the server, please follow the instructions at https://github.com/fbanados/crk-db/tree/update-db-docs#steps-to-incrementally-update-the-production-database.

To update your local database, go to the morphodict repo and run:

```sh
./crkeng-manage importjsondict crkeng_dictionary.importjson
```
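To verify the import locally (a sketch, assuming `./crkeng-manage` wraps the standard Django management commands, as morphodict is a Django application), start the development server and browse the updated entries:

```sh
./crkeng-manage runserver
```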