UD Hebrew-IAHLT

האיגוד הישראלי לטכנולוגיות שפת אנוש
الرابطة الإسرائيلية لتكنولوجيا اللغة البشرية
The Israeli Association of Human Language Technologies
https://www.iahlt.org

A Universal Dependencies treebank with named entities for contemporary Hebrew, covering Knesset protocols. It is released under CC-BY-4.0; see COPYING for details.

Data set

The Universal Dependencies (UD) Hebrew-IAHLTKnesset treebank is a work in progress. The dataset consists of 2619 sentences (45538 tokens, 4382 unique lemmas) annotated for dependency syntax, part of speech, lemmas, and morphological features. The texts were sampled from Knesset protocols.
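
The counts above can be recomputed from any of the .conllu files. The following is a minimal sketch, not the script used to produce the figures in this README. It assumes that "tokens" means syntactic words (lines with a plain integer ID), so multiword-token ranges such as 1-2 and empty nodes such as 1.1 are skipped. The function name `conllu_stats` and the sample sentences are hypothetical, and the results may differ from the published figures if those were counted differently.

```python
from typing import Dict, Iterable


def conllu_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, syntactic words, and distinct lemmas in CoNLL-U text."""
    sentences = 0
    words = 0
    lemmas = set()
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the sentence that is currently open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# sent_id = ..."
        in_sentence = True
        cols = line.split("\t")
        # Assumption: only plain integer IDs are counted; ranges ("1-2")
        # and empty nodes ("1.1") are skipped.
        if not cols[0].isdigit():
            continue
        words += 1
        lemmas.add(cols[2])  # LEMMA is the third CoNLL-U column
    if in_sentence:
        sentences += 1  # file did not end with a blank line
    return {"sentences": sentences, "words": words, "lemmas": len(lemmas)}


# Hypothetical two-sentence sample in CoNLL-U format (not taken from the treebank).
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = ab c\n"
    "1-2\tab\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\ta\ta\tADP\t_\t_\t2\tcase\t_\t_\n"
    "2\tb\tb\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "3\tc\tc\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tb\tb\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

print(conllu_stats(SAMPLE.splitlines(keepends=True)))
```

To run it on the treebank itself, pass an open file handle instead of the sample, for example `conllu_stats(open("he_iahltknesset-ud-train.conllu", encoding="utf-8"))`.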

Note that although the sentnumber values are not necessarily consecutive, the sentences appear in their original order.

Introduction

The UD Hebrew-IAHLT treebank consists of texts originating from Knesset protocols. Its schema is based on the conversion of the Hebrew Treebank (HTB) to UD v2, and the data is checked against the Universal Dependencies validator as of UD release v2.8.

The HTB was first converted automatically. A subset of the converted data was then manually validated and adopted as the gold standard for training the UD parsing model used in Hebrew-IAHLT.

The parsed data has been manually edited to correct parsing errors. Quality assurance (QA) scripts were used to apply corrections following updates in the schema. All sentences in this release pass level-5 validation of the Universal Dependencies validator.

Metadata fields

Fields for technical use:
- sent_id - a unique identifier for the tree within this release
- text - the (Hebrew) text of the original sentence
- url - the link for the source entry/article
- source - the source of the sentence
- doc_id - a unique identifier for the source document
- protocol - the source protocol file
- parnumber - the paragraph sequence number within the source document
- sentnumber - the sentence sequence number within the source paragraph
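
The fields listed above can be read from the sentence-level comment lines of a .conllu file. The following is a minimal sketch, assuming the fields are stored in the standard CoNLL-U form `# key = value`. The function name `iter_sentence_metadata` and all sample values are hypothetical, and every value is returned as a string because the README does not specify the field types.

```python
import re
from typing import Dict, Iterable, Iterator

# Matches CoNLL-U sentence-level comments of the form "# key = value".
_COMMENT = re.compile(r"^#\s*([^=\s]+)\s*=\s*(.*)$")


def iter_sentence_metadata(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict of comment fields per sentence, in file order."""
    meta: Dict[str, str] = {}
    has_tokens = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence block.
            if meta or has_tokens:
                yield meta
            meta, has_tokens = {}, False
            continue
        match = _COMMENT.match(line)
        if match:
            meta[match.group(1)] = match.group(2).strip()
        elif not line.startswith("#"):
            has_tokens = True  # a token line; the annotation columns are ignored here
    if meta or has_tokens:
        yield meta  # file did not end with a blank line


# Hypothetical sample; the values are placeholders, not real treebank data.
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = a = b\n"
    "# doc_id = doc-7\n"
    "# parnumber = 3\n"
    "# sentnumber = 1\n"
    "1\ta\ta\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "# doc_id = doc-7\n"
    "# parnumber = 3\n"
    "# sentnumber = 4\n"
    "1\tb\tb\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

for fields in iter_sentence_metadata(SAMPLE.splitlines(keepends=True)):
    print(fields["sent_id"], fields["doc_id"], fields["parnumber"], fields["sentnumber"])
```

Only the first `=` on a comment line separates the key from the value, so values that themselves contain `=` (such as URLs with query strings) are kept intact.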

Guidelines

The annotation guidelines can be found at https://github.com/ivrit/IAHLT-HTB-GUIDELINES

Acknowledgments

We would like to thank all the people who contributed to this corpus:

Emmanuelle Ko, Israel Landau, Nick Howell, Noam Ordan, Omer Strass, Shira Wigderson, Yifat Ben Moshe

Repository files:
- COPYING
- README.md
- he_iahltknesset-ne-ud-dev.biose
- he_iahltknesset-ne-ud-dev.conllu
- he_iahltknesset-ne-ud-dev.jsonl
- he_iahltknesset-ne-ud-test.biose
- he_iahltknesset-ne-ud-test.conllu
- he_iahltknesset-ne-ud-test.jsonl
- he_iahltknesset-ne-ud-train.biose
- he_iahltknesset-ne-ud-train.conllu
- he_iahltknesset-ne-ud-train.jsonl
- he_iahltknesset-ud-dev.biose
- he_iahltknesset-ud-dev.conllu
- he_iahltknesset-ud-test.conllu
- he_iahltknesset-ud-train.conllu
