GitHub - boun-tabi/SQuAD-TR

📜 SQuAD-TR

SQuAD-TR is a machine-translated Turkish version of the original SQuAD2.0 dataset, produced with Amazon Translate.

Dataset Description

Paper: Building Efficient and Effective OpenQA Systems for Low-Resource Languages [pdf].
Point of Contact: Emrah Budur

Dataset Structure

Data Instances

Our data instances follow the format of the original SQuAD2.0 dataset. Shown below is an example instance 🍫 from the train split of the default configuration.

Example from SQuAD2.0:

{
  "context": "Chocolate is New York City's leading specialty-food export, with up to US$234 million worth of exports each year. Entrepreneurs were forming a \"Chocolate District\" in Brooklyn as of 2014, while Godiva, one of the world's largest chocolatiers, continues to be headquartered in Manhattan.",
  "qas": [
    {
      "id": "56cff221234ae51400d9c140",
      "question": "Which one of the world's largest chocolate makers is stationed in Manhattan?",
      "is_impossible": false,
      "answers": [
        {
          "text": "Godiva",
          "answer_start": 194
        }
      ]
    }
  ]
}

Turkish translation:

{
    "context": "Çikolata, her yıl 234 milyon ABD dolarına varan ihracatı ile New York'un önde gelen özel gıda ihracatıdır. Girişimciler 2014 yılı itibariyle Brooklyn'de bir “Çikolata Bölgesi” kurarken, dünyanın en büyük çikolatacılarından biri olan Godiva merkezi Manhattan'da olmaya devam ediyor.",
    "qas": [
        {
            "id": "56cff221234ae51400d9c140",
            "question": "Dünyanın en büyük çikolata üreticilerinden hangisi Manhattan'da konuşlandırılmış?",
            "is_impossible": false,
            "answers": [
                {
                    "text": "Godiva",
                    "answer_start": 233
                }
            ]
        }
    ]
}
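In SQuAD-style data, answer_start is a character offset into context, so slicing the context at that offset should reproduce the answer text. The snippet below is a minimal sketch of that sanity check on the English example above; the helper name answer_span_is_consistent is hypothetical and not part of this repository.

```python
# Hypothetical helper (not part of this repository): checks that an answer's
# "answer_start" offset really points at its "text" inside the context.
def answer_span_is_consistent(context: str, answer: dict) -> bool:
    start = answer["answer_start"]
    text = answer["text"]
    return context[start:start + len(text)] == text


# Context and answer copied from the SQuAD2.0 example above.
context = (
    "Chocolate is New York City's leading specialty-food export, with up to "
    "US$234 million worth of exports each year. Entrepreneurs were forming a "
    "\"Chocolate District\" in Brooklyn as of 2014, while Godiva, one of the world's "
    "largest chocolatiers, continues to be headquartered in Manhattan."
)
answer = {"text": "Godiva", "answer_start": 194}

print(answer_span_is_consistent(context, answer))
```

The same check applies unchanged to the Turkish instance, since Python strings are indexed by Unicode code point.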

Dataset Creation

We translated the titles, context paragraphs, questions and answer spans of the original SQuAD2.0 dataset using Amazon Translate. Because automatic translation changes the positions of the answer spans within their contexts, we had to remap their starting positions.

We performed an automatic post-processing step to populate the start positions of the answer spans. We first checked whether the translated answer span had an exact match in the translated context paragraph; if so, we kept the answer text along with the start position found. If no exact match was found, we looked for approximate matches using a character-level edit distance algorithm.
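The two-stage lookup described above can be sketched as follows. This is an illustration only, not the code used to build SQuAD-TR: the function names, the sliding-window search, the window sizes (answer length plus or minus two characters) and the max_ratio threshold are all assumptions made for the example. A result of None corresponds to the case where neither an exact nor an approximate match is found.

```python
from typing import Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def locate_answer(
    context: str, answer: str, max_ratio: float = 0.3  # assumed threshold
) -> Optional[Tuple[int, str]]:
    """Return (answer_start, answer_text) or None if no acceptable match exists."""
    if not answer:
        return None
    # Stage 1: exact match of the translated answer in the translated context.
    start = context.find(answer)
    if start != -1:
        return start, answer
    # Stage 2: approximate match over windows of similar length (assumed scheme).
    best = None  # (distance, start, span)
    n = len(answer)
    for length in range(max(1, n - 2), n + 3):
        for s in range(0, len(context) - length + 1):
            span = context[s:s + length]
            d = edit_distance(answer, span)
            if best is None or d < best[0]:
                best = (d, s, span)
    if best is not None and best[0] <= max_ratio * n:
        return best[1], best[2]
    return None


print(locate_answer("Godiva merkezi Manhattan'da", "Manhattan"))
print(locate_answer("the quick brown fox", "quick brwn"))
```

When an approximate match is accepted, this sketch returns the span as it appears in the context rather than the translated answer string, so that the stored text and offset stay consistent.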

We excluded the question-answer pairs of the original dataset for which neither an exact nor an approximate match was found in the translated version. Our default configuration corresponds to this version.

The excluded examples are available in our excluded configuration.

As a result, the datasets in these two configurations are mutually exclusive. Below are the details for the corresponding dataset splits.

Data Splits

The SQuAD-TR dataset has two splits: train and validation. Below are the statistics for the most recent version of the dataset in the default configuration.

Split	Articles	Paragraphs	Answerable Questions	Unanswerable Questions	Total
train	442	18776	61293	43498	104791
validation	35	1204	2346	5945	8291

Split	Articles	Paragraphs	Questions w/o answers	Total
train-excluded	440	13490	25528	25528
dev-excluded	35	924	3582	3582

In addition to the default configuration, a different view of the train split can be obtained specifically for the OpenQA setting by combining the train and train-excluded splits. In this view, we only have question-answer pairs (without the answer_start field) along with their contexts.

Split	Articles	Paragraphs	Questions w/ answers	Total
openqa	442	18776	86821	86821
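The openqa total matches the tables above: 61293 answerable train questions plus 25528 train-excluded questions gives 86821. The snippet below is a rough sketch of how such a view could be assembled from SQuAD-style records. The flattened record layout and the function name to_openqa_view are assumptions for illustration, and the actual openqa configuration may be organized differently.

```python
# Hypothetical sketch: build an OpenQA-style view from SQuAD-style paragraphs
# by keeping answerable questions only and dropping the answer_start field.
def to_openqa_view(*splits):
    view = []
    for split in splits:
        for paragraph in split:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    continue  # unanswerable questions are not part of this view
                view.append({
                    "id": qa["id"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": [a["text"] for a in qa["answers"]],
                })
    return view


# Toy stand-ins for the train and train-excluded splits.
train = [{
    "context": "Godiva merkezi Manhattan'da olmaya devam ediyor.",
    "qas": [
        {"id": "q1", "question": "Soru 1?", "is_impossible": False,
         "answers": [{"text": "Godiva", "answer_start": 0}]},
        {"id": "q2", "question": "Soru 2?", "is_impossible": True, "answers": []},
    ],
}]
train_excluded = [{
    "context": "Başka bir paragraf.",
    "qas": [{"id": "q3", "question": "Soru 3?", "answers": [{"text": "paragraf"}]}],
}]

openqa = to_openqa_view(train, train_excluded)
print(len(openqa), openqa[0]["answers"])

# Totals reported in the tables above.
assert 61293 + 25528 == 86821
```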

More information on our translation strategy can be found in our linked paper.

Source Data

This dataset used the original SQuAD2.0 dataset as its source data.

Licensing Information

SQuAD-TR is released under CC BY-NC-ND 4.0.

📚 Resources

📖 Download SQuAD-TR

🔗 Raw files

All SQuAD-TR files can be downloaded from the data folder of this repository.
The XQuAD-TR file can be downloaded here.

🤗 HuggingFace datasets

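from datasets import load_dataset

# Default configuration: SQuAD-style examples with remapped answer_start positions
squad_tr_standard_qa = load_dataset("boun-tabi/squad_tr", "default")
# OpenQA view: question-answer pairs with contexts, without the answer_start field
squad_tr_open_qa = load_dataset("boun-tabi/squad_tr", "openqa")
# Excluded configuration: pairs whose answer span could not be located after translation
squad_tr_excluded = load_dataset("boun-tabi/squad_tr", "excluded")
xquad_tr = load_dataset("xquad", "xquad.tr") # External resource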

Demo application 👉 Google Colab.

🩺 Visualizations - 🆕

The visualizations for the dense retrievers in our paper are demonstrated here.

🔬 Reproducibility

You can find all code, models and samples of the input data here and the instructions to reproduce the experiment results here. Please feel free to reach out to us if you have any specific questions.

✍️ Citation

Emrah Budur, Rıza Özçelik, Dilara Soylu, Omar Khattab, Tunga Güngör and Christopher Potts.
Building Efficient and Effective OpenQA Systems for Low-Resource Languages. 2024. [pdf]

@misc{budur-etal-2024-squad-tr,
      title={Building Efficient and Effective OpenQA Systems for Low-Resource Languages}, 
      author={Emrah Budur and R{\i}za \"{O}z\c{c}elik and
              Dilara Soylu and Omar Khattab and
              Tunga G\"{u}ng\"{o}r and Christopher Potts},
      year={2024},
      eprint={2401.03590},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

❤ Acknowledgment

This research was supported by the AWS Cloud Credits for Research Program (formerly AWS Research Grants).

We thank Alara Dirik, Almira Bağlar, Berfu Büyüköz, Berna Erden, Gökçe Uludoğan, Havva Yüksel, Melih Barsbey, Murat Karademir, Selen Parlar, Tuğçe Ulutuğ and Utku Yavuz for their support of our application to the AWS Cloud Credits for Research Program, and Fatih Mehmet Güler for his valuable advice, discussion and insightful comments.
