Example of working with cltoolkit

In this example we'll use cltoolkit to compute linguistic features from lexical data in the WOLD dataset.

Loading CLDF Wordlists

cltoolkit provides an abstraction layer for accessing (collections of) pycldf.Wordlist datasets, so we load data as follows:

>>> from cltoolkit import Wordlist
>>> from pycldf import Dataset
>>> wl = Wordlist([Dataset.from_metadata("https://raw.githubusercontent.com/lexibank/wold/v4.0/cldf/cldf-metadata.json")])
loading forms for wold: 100%|██████████| 64289/64289 [00:01<00:00, 33125.96it/s]
>>> print(wl)
<cltoolkit.wordlist.Wordlist object at 0x7fa8de7504f0>

cltoolkit.features.Feature

A cltoolkit.features.Feature bundles basic metadata with a Python callable implementing the feature computation. In the simplest case this could be a lambda (i.e. an ad-hoc function) as shown below:

>>> from cltoolkit.features import Feature
>>> latitude = Feature(id="lat", name="Geographic Latitude", function=lambda l: l.latitude)
>>> for lang in wl.languages:
...     print('{}: {}'.format(lang.name, latitude(lang)))
...     break
...
Swahili: -6.5

A Feature is computed for a language by calling the Feature instance, passing a cltoolkit.models.Language instance.
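The calling convention can be illustrated without cltoolkit installed. The following is a minimal sketch, not the library's implementation: `SketchFeature` is a hypothetical stand-in for `Feature`, and the `SimpleNamespace` object stands in for a `cltoolkit.models.Language` instance with only the attributes used here.

```python
from types import SimpleNamespace


class SketchFeature:
    """Hypothetical stand-in for cltoolkit.features.Feature (not the real class)."""

    def __init__(self, id, name, function):
        # Basic metadata plus the callable that computes the value.
        self.id = id
        self.name = name
        self.function = function

    def __call__(self, param):
        # Calling the feature delegates to the wrapped callable.
        return self.function(param)


# Stand-in for a Language object; only `name` and `latitude` are assumed here.
swahili = SimpleNamespace(name="Swahili", latitude=-6.5)

latitude = SketchFeature(id="lat", name="Geographic Latitude", function=lambda l: l.latitude)
print("{}: {}".format(swahili.name, latitude(swahili)))  # prints "Swahili: -6.5"
```

Keeping the metadata on the same object as the callable is what later makes it possible to serialize a feature specification alongside its values.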

cltoolkit provides a couple of base classes for (sometimes parametrizable) derived feature implementations (for phonology and lexicon). E.g. we can compute basic properties of a language's phoneme inventory:

>>> from cltoolkit.features.phonology import InventoryQuery
>>> number_of_consonants = Feature(id='1', name="Number of consonants", function=InventoryQuery('consonants'))

Let's apply this feature:

>>> for lang in wl.languages:
...     print('{}: {}'.format(lang.name, number_of_consonants(lang)))
...     break
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/robert_forkel/projects/cldf/cltoolkit/src/cltoolkit/features/collection.py", line 85, in __call__
    return self.function(param)
  File "/home/robert_forkel/projects/cldf/cltoolkit/src/cltoolkit/features/reqs.py", line 58, in wrapper_requires
    raise MissingRequirement(' '.join(s[0] for s in status if not s[1]))
cltoolkit.features.reqs.MissingRequirement: inventory

Oops. Something went wrong. cltoolkit.features.reqs.MissingRequirement exceptions signal that a feature implementation cannot be applied to a particular Language object because required properties are missing (see reqs.py). Here, we loaded the wordlist without passing a CLTS transcription system, so cltoolkit could not compute CLTS-mapped phoneme inventories.
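When a feature is applied across many languages, it can be handy to collect the languages that lack a requirement rather than abort on the first one. The sketch below shows that pattern under stated assumptions: `MissingRequirement` and `requires` are local stand-ins written for illustration (they only mimic the behaviour visible in the traceback above), and the dictionary shape of `inventory` is invented for the example.

```python
import functools
from types import SimpleNamespace


class MissingRequirement(Exception):
    """Local stand-in for cltoolkit.features.reqs.MissingRequirement."""


def requires(*attribute_names):
    """Hypothetical decorator: refuse to run unless all named attributes are set."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(language):
            # Pair each requirement name with whether the language satisfies it.
            status = [(name, getattr(language, name, None) is not None)
                      for name in attribute_names]
            if not all(ok for _, ok in status):
                raise MissingRequirement(" ".join(name for name, ok in status if not ok))
            return func(language)
        return wrapper
    return decorator


@requires("inventory")
def number_of_consonants(language):
    # The inventory layout used here is an assumption made for this sketch.
    return len(language.inventory["consonants"])


languages = [
    SimpleNamespace(name="A", inventory={"consonants": ["p", "t", "k"]}),
    SimpleNamespace(name="B", inventory=None),  # e.g. loaded without a transcription system
]

results, skipped = {}, {}
for lang in languages:
    try:
        results[lang.name] = number_of_consonants(lang)
    except MissingRequirement as e:
        # Record which requirement was missing instead of stopping the loop.
        skipped[lang.name] = str(e)

print(results)  # prints {'A': 3}
print(skipped)  # prints {'B': 'inventory'}
```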

Let's fix this: We need to download the CLTS data, and pass the bipa transcription system from an appropriately initialized pyclts.CLTS object when creating the wordlist:

>>> from pyclts import CLTS
>>> wl = Wordlist([Dataset.from_metadata("https://raw.githubusercontent.com/lexibank/wold/v4.0/cldf/cldf-metadata.json")], ts=CLTS('clts').bipa)
loading forms for wold: 100%|███████████| 64289/64289 [00:10<00:00, 6271.07it/s]
>>> for lang in wl.languages:
...     print('{}: {}'.format(lang.name, number_of_consonants(lang)))
...     break
...
Swahili: 31

Persisting feature metadata

The main goal of cltoolkit is to enable rapid exploratory analysis of lexical data. Feature implementations are therefore expected to start out as simple functions (as shown above). Once they evolve into something worth keeping (and sharing), the importance of the metadata layer provided by the Feature class becomes apparent.

Let's add our parametrized InventoryQuery to a FeatureCollection and dump the feature specification to a JSON file:

>>> from cltoolkit.features import FeatureCollection
>>> fc = FeatureCollection([number_of_consonants])
>>> fc.dump('features.json')

features.json looks as follows:

[
    {
        "id": "1",
        "name": "Number of consonants",
        "function": {
            "class": "cltoolkit.features.phonology.InventoryQuery",
            "args": [
                "consonants"
            ]
        },
        "type": "int",
        "note": null,
        "categories": null,
        "requires": [
            "cltoolkit.features.reqs.inventory"
        ]
    }
]

If the feature implementation (here cltoolkit.features.phonology.InventoryQuery) is available from a properly distributed and installable Python package, we can share the JSON spec and allow others to recreate our features:

>>> from cltoolkit.features import FeatureCollection
>>> fc = FeatureCollection.load('features.json')
>>> fc[0].function
<cltoolkit.features.phonology.InventoryQuery object at 0x7f42a4e07820>
>>> from pyclts import CLTS
>>> from cltoolkit import Wordlist
>>> from pycldf import Dataset
>>> wl = Wordlist([Dataset.from_metadata("https://raw.githubusercontent.com/lexibank/wold/v4.0/cldf/cldf-metadata.json")], ts=CLTS('/home/robert_forkel/projects/cldf-clts/clts-data').bipa)
loading forms for wold: 100%|███████████| 64289/64289 [00:09<00:00, 6655.21it/s]
>>> fc[0](wl.languages[0])
31
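The round trip above works because the "function" entry of the spec names an importable class and its arguments. How such an entry could be turned back into a callable can be sketched with the standard library alone. This is an assumption about the general mechanism, not a description of how `FeatureCollection.load` is implemented: `resolve_function` is a hypothetical helper, and the spec points at `operator.attrgetter` so that the example runs without cltoolkit.

```python
import importlib
import json
from types import SimpleNamespace


def resolve_function(spec):
    """Import the dotted path in spec["class"] and instantiate it with spec["args"]."""
    module_name, _, class_name = spec["class"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(*spec.get("args", []))


# Illustrative spec in the same shape as the "function" entry of features.json,
# pointing at a standard-library class instead of a cltoolkit one.
spec = json.loads('{"class": "operator.attrgetter", "args": ["latitude"]}')
function = resolve_function(spec)

# Stand-in for a Language object with a latitude attribute.
print(function(SimpleNamespace(latitude=-6.5)))  # prints -6.5
```

This also shows why the implementation has to live in an installable package: the dotted path can only be resolved if the module can be imported on the receiving side.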

Persisting feature data

The metadata bundled in the Feature objects helps not only with sharing feature implementations, but also with sharing the computed feature values. A CLDF StructureDataset with the computed feature values can be created as follows:

>>> from pycldf import StructureDataset
>>> cldf = StructureDataset.in_dir('.')
>>> cldf.add_component('ParameterTable')
>>> cldf.add_component('LanguageTable')
>>> langs = [dict(ID=l.id, Name=l.name) for l in wl.languages]
>>> params = [dict(ID=f.id, Name=f.name) for f in fc]
>>> values = [dict(ID='{}-{}'.format(l.id, f.id), Value=f(l), Language_ID=l.id, Parameter_ID=f.id) for f in fc for l in wl.languages]
>>> cldf.write(LanguageTable=langs, ParameterTable=params, ValueTable=values)

This will create valid data with metadata in StructureDataset-metadata.json:

$ cldf validate StructureDataset-metadata.json
$ head values.csv 
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source
wold-Swahili-1,wold-Swahili,1,31,,,
wold-Iraqw-1,wold-Iraqw,1,34,,,
wold-Gawwada-1,wold-Gawwada,1,26,,,
wold-Hausa-1,wold-Hausa,1,47,,,
wold-Kanuri-1,wold-Kanuri,1,21,,,
wold-TarifiytBerber-1,wold-TarifiytBerber,1,72,,,
wold-SeychellesCreole-1,wold-SeychellesCreole,1,21,,,
wold-Romanian-1,wold-Romanian,1,40,,,
wold-SeliceRomani-1,wold-SeliceRomani,1,44,,,
$ head parameters.csv 
ID,Name,Description
1,Number of consonants,
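Because the result is plain CSV, the values can be consumed without any CLDF tooling. As a small sketch, the snippet below parses the first rows of values.csv shown above with the standard `csv` module and looks up the language with the most consonants among them. The rows are embedded as a string so the example is self-contained; a real script would open the file instead.

```python
import csv
import io

# First rows of values.csv as shown above.
VALUES_CSV = """\
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source
wold-Swahili-1,wold-Swahili,1,31,,,
wold-Iraqw-1,wold-Iraqw,1,34,,,
wold-Gawwada-1,wold-Gawwada,1,26,,,
wold-Hausa-1,wold-Hausa,1,47,,,
wold-Kanuri-1,wold-Kanuri,1,21,,,
wold-TarifiytBerber-1,wold-TarifiytBerber,1,72,,,
"""

rows = list(csv.DictReader(io.StringIO(VALUES_CSV)))

# Map each language to its value for parameter "1" (Number of consonants).
consonants = {
    row["Language_ID"]: int(row["Value"])
    for row in rows
    if row["Parameter_ID"] == "1"
}

largest = max(consonants, key=consonants.get)
print(largest, consonants[largest])  # prints "wold-TarifiytBerber 72"
```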