msklvsk/UD_Ukrainian-IU

Summary

Gold-standard Universal Dependencies corpus for Ukrainian, developed natively for UD by the Institute for Ukrainian, an NGO.
[in Ukrainian]

Introduction

UD Ukrainian comprises 122K tokens in 7000 sentences of fiction, news, opinion articles, Wikipedia, legal documents, letters, posts, and comments — from the last 15 years, as well as from the first half of the 20th century.

Consider using the latest version on the ‘dev’ branch on GitHub. It contains the latest stable improvements, while the official releases are up to 6 months old [discussion].

Acknowledgments

Major contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko.

A large portion of the annotation was made by Halyna Samoridna, Ivanka Kosovska, Olha Lytvyn, Oksana Orlenko and by students of the Kyiv-Mohyla Academy department of Ukrainian language (headed by Liudmyla Dyka): Hanna Brovko, Bohdana Matushko, Natalia Onyshchuk, Valeriia Pareviazko, Yaroslava Rychyk, Anastasiia Stetsenko, Snizhana Umanets.

We thank Prof. Larysa Masenko for guidance.

Documentation

Project homepage (in Ukrainian)

Search

You can also browse the entire treebank in Brat.

Stats

set     sentences   ~tokens
train   5496        92K
dev     672         13K
test    892         17K
TOTAL   7060        122K

See stats.xml for detail.
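
The sentence and token counts above can be approximated directly from the CoNLL-U files. The following is a minimal sketch, assuming standard CoNLL-U conventions (sentences separated by blank lines, `#` comment lines, multiword-token ranges such as `1-2`, empty nodes such as `2.1`). The function name and the sample rows are invented for illustration, and whether the official stats count tokens exactly this way is an assumption.

```python
def count_sentences_and_words(conllu_text):
    """Return (sentences, words) for a CoNLL-U string.

    Only rows whose ID is a plain integer are counted as words;
    multiword-token ranges (1-2) and empty nodes (2.1) are skipped.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, not a token row
        if not in_sentence:
            sentences += 1  # first token row of a new sentence
            in_sentence = True
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            words += 1
    return sentences, words


# Invented sample rows (not taken from the treebank).
SAMPLE = "\n".join([
    "# sent_id = demo",
    "1-2\tab\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\ta\ta\tX\t_\t_\t0\troot\t_\t_",
    "2\tb\tb\tX\t_\t_\t1\tdep\t_\t_",
    "2.1\tc\tc\tX\t_\t_\t_\t_\t1:dep\t_",
    "",
    "1\td\td\tX\t_\t_\t0\troot\t_\t_",
    "",
])

print(count_sentences_and_words(SAMPLE))  # (2, 3)
```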

Annotation procedure

Morphology is annotated using a 2+1 schema. Syntax is annotated in a single pass, followed by a supervisor’s check. Consistency is further enforced by ~300 validation and autofix rules (see warnings page) and by investigating errors made by a trained parser.

Data split

Data is split between train/dev/test linearly by hand at 75%/10%/15% to balance genre and complexity. Some large documents are divided across datasets.
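
As a quick check, the shares of each set can be computed from the figures in the Stats table. This is a small sketch using the rounded token counts from that table (so the token shares are approximate); the names `SPLITS` and `split_shares` are illustrative only.

```python
# (sentences, approximate tokens) per set, copied from the Stats table.
SPLITS = {
    "train": (5496, 92_000),
    "dev": (672, 13_000),
    "test": (892, 17_000),
}


def split_shares(splits):
    """Return {set: (sentence share %, token share %)}, rounded to 0.1."""
    total_sentences = sum(s for s, _ in splits.values())
    total_tokens = sum(t for _, t in splits.values())
    return {
        name: (
            round(100 * s / total_sentences, 1),
            round(100 * t / total_tokens, 1),
        )
        for name, (s, t) in splits.items()
    }


for name, (by_sentences, by_tokens) in split_shares(SPLITS).items():
    print(f"{name}: {by_sentences}% of sentences, {by_tokens}% of tokens")
```

With these figures the token shares (about 75.4%, 10.7%, 13.9%) are closer to the stated 75%/10%/15% than the sentence shares (about 77.8%, 9.5%, 12.6%), which suggests the split is measured in tokens, although the text does not say so explicitly.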

Format

UD Ukrainian data conforms to CoNLL-U format with the following specifics:

  • Sentence-level comments:
    • Document boundaries are present as # newdoc id = xxxx.
    • Sentence-level paragraph boundaries are present as # newpar id = xxxx.
    • Document titles are present as # doc_title = Назва.
    • Czech-style transliteration is present as # translit = ….
    • Gaps in the text are marked on the sentences following the gap as:
      • # annotation_gap for sentences not exported to CoNLL-U because the annotator was unable to parse them with confidence (e.g. new guidelines need to be created);
      • # gap for intentional gaps in texts (selected fragments).
  • The XPOSTAG column contains an MTE (MULTEXT-East) tag, with U for punctuation. UPOS+FEATS contain all the information in XPOSTAG and more. XPOSTAG is intended for legacy applications.
  • DEPS column contains Enhanced Dependencies.
  • MISC column:
    • Token-level paragraph boundaries are present as NewPar=Yes.
    • Token ids are present as Id=xxxx.
    • SpaceAfter=No markers are present.
    • Form (Translit) and lemma (LTranslit) transliterations are present.
    • The pipe (|) character is escaped with \p. Backslash is \\. See issue #569.
  • Document, paragraph, sentence, and token ids are 4-character base-32 numbers. They survive treebank updates.
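
The comment and MISC conventions listed above can be read with a few lines of code. The following is a minimal sketch, assuming MISC items are `Key=Value` pairs joined with `|` as in standard CoNLL-U; the function names and the sample ids are hypothetical. Escapes are resolved in a single left-to-right pass, so that the sequence `\\p` decodes to a backslash followed by `p` rather than to a pipe.

```python
import re

# A backslash followed by any single character.
_ESCAPE = re.compile(r"\\(.)")


def unescape_misc_value(value):
    # Decode "\p" to "|" and "\\" to "\" in one pass.
    def repl(match):
        char = match.group(1)
        if char == "p":
            return "|"
        if char == "\\":
            return "\\"
        return match.group(0)  # unknown escape: leave it untouched
    return _ESCAPE.sub(repl, value)


def parse_misc(misc):
    # Split a MISC cell into a dict and unescape each value.
    if misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        result[key] = unescape_misc_value(value)
    return result


def parse_comment(line):
    # "# key = value" gives (key, value); a bare "# flag" gives (flag, None).
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    return key.strip(), (value.strip() if sep else None)


# Hypothetical values, for illustration only.
print(parse_comment("# newdoc id = 0abc"))
print(parse_comment("# annotation_gap"))
print(parse_misc("Id=0abd|SpaceAfter=No|Translit=a\\pb"))
```

Splitting on `|` before unescaping is safe precisely because literal pipes inside values are stored as `\p`.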

Enhanced Dependencies

  1. Ellipsis. Elided predicates are manually reconstructed with word forms and full morphological info. The treebank currently contains ~200 of them.
  2. Propagation of conjuncts. Conjoined modifiers are propagated automatically. For heterogeneous conjuncts, a relation guesser is employed. Dependents of first conjuncts are propagated only if they are manually marked as shared (40% of such annotation is done).
  3. Controlled/raised subjects. All xcomp subjects are annotated manually as nsubj:x/csubj:x. Subjects of xcomp:sp (secondary predication) are nsubj:sp/csubj:sp. The latter are also used for the subjects of advcl:sp (see #476).
  4. Relative clauses. All relative clauses are manually annotated with enhanced dependencies. This includes all types mentioned in the universal docs plus Ukrainian clauses that use personal pronouns as relativizers: вузол, що його не переріжеш “the-knot, that it.Acc not you-can-cut”.
  5. Case information. We don’t case-mark relation names because this doesn’t bring any new information [discussion].
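
The enhanced relations described above live in the DEPS column. The following is a minimal sketch for extracting controlled/raised and secondary-predication subjects, assuming the standard CoNLL-U encoding of DEPS as `head:relation` pairs joined with `|`. The function names and the sample row are invented for illustration. Heads are kept as strings because reconstructed (elided) predicates are empty nodes with ids such as `8.1`.

```python
def parse_deps(deps):
    """Parse a DEPS cell into a list of (head, relation) pairs."""
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        # Split on the first colon only: relation names such as
        # "nsubj:x" contain colons themselves.
        head, _, relation = item.partition(":")
        pairs.append((head, relation))
    return pairs


SUBJECT_RELATIONS = {"nsubj:x", "csubj:x", "nsubj:sp", "csubj:sp"}


def controlled_subjects(rows):
    """Yield (token id, head, relation) for each enhanced subject edge."""
    for row in rows:
        cols = row.split("\t")
        for head, relation in parse_deps(cols[8]):  # DEPS is the 9th column
            if relation in SUBJECT_RELATIONS:
                yield cols[0], head, relation


# Invented sample row (not taken from the treebank).
ROW = "1\tx\tx\tPRON\t_\t_\t2\tnsubj\t2:nsubj|4:nsubj:x\t_"
print(list(controlled_subjects([ROW])))  # [('1', '4', 'nsubj:x')]
```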

Development

Data files are built from sources at mova-institute/zoloto, where the actual development happens.

Licensing

The data is licensed under CC BY-NC-SA 4.0 and is free for non-commercial use. For a commercial license, please contact us at org@mova.institute.

Contact

org@mova.institute

Changelog

  • 2021-05-15 v2.8

    • Undocumented PunctType=Ndash|Hyph|Bull converted to PunctType=Dash.
  • 2019-05-15 v2.4

    • Closed many annotation gaps: 116K→122K.
    • Fixed annotation errors.
    • Shared more dependents of a first conjunct.
    • Improved consistency by extending annotation guidelines to rarer phenomena.
    • Switched from ccomp to xcomp where nsubj:x is a phantom object.
    • Made clauses with ADV relativizers :relcl.
    • Added Polarity=Neg for conjunctions.
    • Escaped the pipe (|) character in MISC as \p. \\ is now a backslash.
  • 2018-11-15 v2.3

    • Added all types of enhanced dependencies except for case-marking, see Enhanced Dependencies section.
    • Closed many annotation gaps and added new texts: 100K→115K.
    • Fixed ~450 annotation errors including його/її/їх PRON vs DET ambiguity.
    • Improved consistency by extending annotation guidelines to many rarer phenomena.
    • Introduced multitokens for ні́кого, ні́де etc.
    • Split words with fused пів- numerals (e.g. півкласу) to multitokens.
    • Introduced flat:abs, flat:sibl, flat:range, advmod:det, acl:adv, parataxis:rel, vocative:cl.
    • Specified acl:relcl.
    • Removed :pass subtype from relations as it currently can be inferred from the morphology.
    • Added transliteration.
    • Fixed missing # annotation_gaps.
    • Updated readme with more description, links.
  • 2018-04-15 v2.2

    • Renamed the repository from UD_Ukrainian to UD_Ukrainian-IU to match the new UD naming convention.
    • Fixed some validation errors.
    • Added a couple of new sentences.
    • Orth=Khark feature renamed to Orth=Alt.
  • 2017-11-15 v2.1

    • Quadrupled the amount of data up to 100K, mostly with nonfiction; improved consistency.
    • Re-split train/dev/test.
  • 2017-02-15 v2.0

    • Replaced v1.4 data with 25K tokens of misc genres, mostly fiction.
  • 2016-11-01 v1.4

    • An initial experimental release containing 1.6K tokens of grammar examples and fiction.

=== Machine-readable metadata =================================================
Data available since: UD v1.4
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: blog email fiction grammar-examples legal news reviews social web wiki
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Kotsyba, Natalia; Moskalevskyi, Bohdan; Romanenko, Mykhailo
Contributing: elsewhere
Contact: org@mova.institute
===============================================================================
