Open Parallel Corpus

This corpus contains a growing collection of multilingual texts aligned to Tibetan texts (bo) at the sentence level.

Content summary

208,736 Tibetan segments
4,611 files
Files from Lotsawa House and 84000

Languages:	bo-en	bo-es	bo-fr	bo-de	bo-it	bo-nl	bo-zh	bo-pt
Segments:	208,736	3,481	8,971	5,892	1,129	889	2,573	2,018

Detailed content description

Lotsawa House

Source:	https://www.lotsawahouse.org/
Pairs:	76,135
Files:	4,405
Accessed on:	2023-01-04 12:44:15.146037
Crawler:	LH Crawler
Parser:	LH Parser
Layers:	Base + `Segments`
Included texts:	See text pairs catalog

Languages:	bo-en	bo-es	bo-fr	bo-de	bo-it	bo-nl	bo-zh	bo-pt
Segments:	76,135	3,481	8,971	5,892	1,129	889	2,573	2,018

84000-translation-memory

Source:	https://read.84000.co/
Pairs:	132,601
Files:	206
Accessed on:	2018-09-26T07:14:13.428Z
Crawler:	TMX Crawler
Parser:	TMX Parser
Layers:	Base + `Segments`
Included texts:	See text pairs catalog

Languages:	bo-en
Segments:	132,601

Views

This collection presents the same data in two views: text pairs and TMs.

View 1 - Text pairs

What it is

Plain text pairs in .txt format (see detailed catalog).

More

Each text pair is a matching set of .txt files: one file containing the Tibetan text, with one chunk of text per line, and one or more files containing translations of that text into other languages. Each translation file is split into lines that correspond to the line breaks in the Tibetan file.

The title of any file can be found on line 1 of the file or by searching for the file's identifying number (e.g. A00023033) in the corpus's text pairs catalog.
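The first lookup method can be sketched in a few lines of Python. The function name `read_title` is hypothetical, and UTF-8 encoding is an assumption about the files rather than something the corpus documents.

```python
def read_title(path):
    """Return line 1 of a text-pair file, which holds the title."""
    # "utf-8-sig" reads plain UTF-8 and also drops a byte-order mark if present.
    with open(path, encoding="utf-8-sig") as handle:
        return handle.readline().strip()
```

Calling `read_title("A00023033-en.txt")` would return the first line of that file.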

The files in a text pair or group share the same identifying number and differ only in the language tag at the end of the file name.

Example:

A00023033-bo.txt
A00023033-en.txt
A00023033-fr.txt
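As a rough illustration of this naming scheme, the sketch below groups file names by identifying number and language tag. The regular expression and the function name `group_by_text_id` are assumptions based only on the three example names above.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: <identifying number>-<language tag>.txt, e.g. A00023033-bo.txt
FILENAME_PATTERN = re.compile(r"^(?P<text_id>.+)-(?P<lang>[A-Za-z]+)\.txt$")


def group_by_text_id(filenames):
    """Map each identifying number to a {language tag: file name} dict."""
    groups = defaultdict(dict)
    for name in filenames:
        match = FILENAME_PATTERN.match(Path(name).name)
        if match:  # Ignore anything that does not follow the naming scheme.
            groups[match["text_id"]][match["lang"]] = name
    return dict(groups)


files = ["A00023033-bo.txt", "A00023033-en.txt", "A00023033-fr.txt"]
print(group_by_text_id(files))
```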

As stated above, these files are aligned by line to match the Tibetan version.

Example: Tibetan text

1 ༄༅། །འཆི་མེད་འཕགས་མའི་སྙིང་ཐིག་གི་བརྒྱུད་པའི་གསོལ་འདེབས་ཚེ་དབང་བཅུད་འཛིན་ཞེས་བྱ་བ་བཞུགས་སོ། །
2 རང་བྱུང་རྟག་པའི་རྡོ་རྗེ་ཚེ་དཔག་མེད། །
3 འཆི་བདག་བདུད་འཇོམས་གཙུག་ཏོར་རྣམ་པར་རྒྱལ། །

English text

1 The Fount of Longevity Chimé Phakmé Nyingtik Lineage Prayer
2 Amitāyus, Boundless Life, natural, everlasting and indestructible,
3 Uṣṇīṣa-Vijayā, Victorious Conqueror of māra , Lord of Death,

French text

1 La Fontaine de longévité La prière à la lignée de Chimé p'akmé nyingthik
2 Existant par lui-même et éternel, indestructible Amitāyus,
3 Celle qui triomphe du démon Seigneur de la mort, Uṣṇīṣa Vijayā,
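Because the files are aligned by line, a pair can be read by matching line N of the Tibetan file with line N of a translation file. The sketch below assumes UTF-8 files with equal line counts; the function names are hypothetical.

```python
def read_lines(path):
    """Read a file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_aligned_pairs(source_path, target_path):
    """Return (Tibetan line, translation line) tuples matched by line number."""
    source_lines = read_lines(source_path)
    target_lines = read_lines(target_path)
    if len(source_lines) != len(target_lines):
        # A mismatch means the two files are not aligned line by line.
        raise ValueError(
            f"Line counts differ: {len(source_lines)} vs {len(target_lines)}"
        )
    return list(zip(source_lines, target_lines))
```

Raising an error on a line-count mismatch is a design choice for this sketch: silently truncating to the shorter file would hide misaligned data.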

Who it's for

View 1 is intended for developers who want to train a translation model.

How to use it

This data can be fed into machine translation training pipelines.
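One possible preparation step is to write the aligned lines to a single tab-separated file. This is only a sketch: the two-column TSV layout and the function name `write_tsv` are assumptions, not a format required by the corpus or by any particular toolkit.

```python
def write_tsv(pairs, out_path):
    """Write (source, target) pairs as tab-separated lines; return the count written."""
    written = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for source, target in pairs:
            source, target = source.strip(), target.strip()
            if not source or not target:
                continue  # Skip pairs with an empty side.
            # A tab inside a segment would break the two-column layout.
            source = source.replace("\t", " ")
            target = target.replace("\t", " ")
            out.write(f"{source}\t{target}\n")
            written += 1
    return written
```

The `pairs` argument can be any iterable of two-item tuples, for example lines read from a `-bo.txt` file zipped with lines from the matching `-en.txt` file.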

View 2 - TMs

What it is

TM files in .tmx format (see detailed catalog).

Note: If you need a different format, see the "Questions about this collection?" section below.

Who it's for

View 2 is intended for developers who want to train a translation model.

How to use it

This data can be fed into machine translation training pipelines.
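A minimal sketch of extracting segment pairs from a .tmx file with Python's standard library follows. It assumes the usual TMX layout (`tu` units containing `tuv` variants, each with a `seg` element) and language codes beginning with `bo` and `en`; the actual files in this corpus may differ, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(path, source_lang="bo", target_lang="en"):
    """Return (source, target) segment pairs from a TMX file."""
    pairs = []
    root = ET.parse(path).getroot()
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = variant.get(XML_LANG) or variant.get("lang") or ""
            primary = lang.lower().split("-")[0]  # "en-US" becomes "en"
            seg = variant.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[primary] = "".join(seg.itertext()).strip()
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs
```

Units that lack either language are skipped rather than reported, which keeps the sketch short but may hide incomplete entries.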

Coming soon

700 more texts from Lotsawa House
87 texts from Oslo

Questions about this collection?

Email us at openpecha[at]gmail.com.
Join our Discord.
File an issue.

Acknowledgments

Thanks to the following organizations for providing data for this collection: Lotsawa House and 84000.

Terms of use

This corpus is provided by OpenPecha under the CC0 Public Domain Dedication v 1.0.
