Public Domain Data for Uralic Languages, v. 0.1.3

This is an R data package that contains Public Domain text corpus data from Uralic languages. Individual subsets of the data are documented within the package. Here are statistics and brief explanations for the subsets that are planned to be included.

This package is at an extremely early testing stage and is not yet meant to be used in actual research. Currently, mostly Public Domain data from Zyrian Komi is included, and adding that into the package is one of the next steps. The whole setup is experimental and may change at any moment. We will also add explicit documentation on how to use and cite the whole data set or individual resources.

This version contains data from several Uralic languages, which can be aligned in parallel by sentence_id. If a sentence_id has no match in another dataset, there is no parallel for that sentence in that language. The book used for testing, Four Battles, is available in Erzya, Moksha, Komi Zyrian, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Mansi and Tundra Nenets. At the moment, however, only some of these have been proofread.
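The alignment by sentence_id described above can be sketched with toy data. This is only an illustration under assumptions: lang_a and lang_b are made-up stand-ins for two language subsets, and the real package objects (such as uralic::kpv) are assumed, not confirmed, to carry sentence_id and sentence columns in this shape.

```r
library(dplyr)

# Toy stand-ins for two language subsets (hypothetical; not package data).
lang_a <- data.frame(
  sentence_id = c(1L, 2L, 3L),
  sentence    = c("A: first sentence", "A: second sentence", "A: third sentence"),
  stringsAsFactors = FALSE
)
lang_b <- data.frame(
  sentence_id = c(1L, 3L, 4L),
  sentence    = c("B: first sentence", "B: third sentence", "B: fourth sentence"),
  stringsAsFactors = FALSE
)

# Sentences present in both subsets, side by side.
aligned <- inner_join(lang_a, lang_b, by = "sentence_id", suffix = c("_a", "_b"))

# Sentences in lang_a that have no parallel in lang_b.
unmatched <- anti_join(lang_a, lang_b, by = "sentence_id")
```

Here aligned keeps only the ids shared by both subsets, while unmatched lists the sentences for which no parallel exists in the other language.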

Installation

You can install uralic from GitHub with:

# install.packages("devtools")
devtools::install_github("langdoc/uralic")

Example

library(tidyverse, warn.conflicts = FALSE)
#> ── Attaching packages ────────────────────
#> ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.8
#> ✔ tidyr   0.8.2     ✔ stringr 1.3.1
#> ✔ readr   1.3.1     ✔ forcats 0.3.0
#> ── Conflicts ──── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(uralic)

uralic::kpv %>%
    as_tibble() %>%
    filter(str_detect(sentence, "станция")) %>%
    select(sentence) %>%
    knitr::kable()

sentence
Тайӧ сотчисны кутшӧмкӧ матыса станцияяс, кодъяс ӧзйӧмаӧсь белӧйяс снарядъясысь.
Коймӧд случай вӧлі Воропоново станцияын, неылын сійӧ жӧ Царицынсянь.
Дас кӧкъямысӧд числӧын штабнӧй вагон сійӧс вайис Воропоново станцияӧ.
Неыджыд станцияын сы вӧсна вӧлі полнӧй неразбериха.
Но век жӧ станцияыс эз вӧв дзик фронт вылас.
И сэн, на сайын, пу увъяс костӧд сійӧ аддзис: сё, кыксё, витсё, он гӧгӧрво мындаӧн белӧйяс мыйвынсьыныс локтӧны станциялань.
Белӧйяс — найӧ вӧлі тыдалӧны кипыдӧс вылын моз — скачитісны овраг дорӧдз, коді воӧ вӧлі степь кузя веськыда станция дорӧ, и ряд бӧрся ряд вошисны сэтчӧ.
Туйыс на водзын беззащитнӧй станцияӧ вӧлі восьса.
Но кӧн жӧ вӧлі станциялӧн охрана?

Citation

Please note that good scientific practice clearly demands proper references to the sources used, even when using Public Domain data. Public Domain data differs from virtually any other data source in that citation is not required from a legal perspective; it can and should instead be guided by what is reasonable.

Niko Partanen, Michael Rießler, Joshua Wilbur (2019). Public Domain Corpus Data for Uralic Languages. R data package. Version 0.1.3. https://github.com/langdoc/uralic/.

Explanations of how to cite single example sentences or larger subsets of the data are coming soon. Ideally there would be an R function that prints citations by id. Individual books will likely have several sources that are important to cite, among them the original collection of the National Library of Finland. Individual book citations will look something like this:

Успенский, Л 1939 Ныл боевой случай. Москва: Молодежын Всесоюзный Ленинский Коммунистический Союзшын Центральный Комитетше: Йоча Литератур издательство. URN: http://urn.fi/URN:NBN:fi-fe2014091844781

Of course, as this is Public Domain data, the exact data citation can be rationalized on a case by case basis.
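The citation-by-id function wished for above does not exist in the package yet. A minimal sketch of what it might look like follows; cite_by_id, the sources lookup table and its placeholder citation strings are all hypothetical and only illustrate the idea.

```r
# Hypothetical lookup table mapping each sentence_id to the citation of its source.
sources <- data.frame(
  sentence_id = c(1L, 2L, 3L),
  citation    = c("Placeholder citation for book A",
                  "Placeholder citation for book A",
                  "Placeholder citation for book B"),
  stringsAsFactors = FALSE
)

# Return the distinct citations needed for a set of sentence ids.
# Stops with an error if any id has no recorded source.
cite_by_id <- function(ids, sources) {
  hits <- sources$citation[match(ids, sources$sentence_id)]
  if (anyNA(hits)) {
    stop("No citation recorded for sentence_id: ",
         paste(ids[is.na(hits)], collapse = ", "))
  }
  unique(hits)
}

cite_by_id(c(1L, 2L), sources)
```

Returning only the distinct citations means that quoting several sentences from the same book yields a single reference.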

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
