MultiLS-Japanese Lexical Complexity Prediction and Lexical Simplification Dataset for Japanese: annotator profiles, unaggregated annotation, and annotation guidelines.

naist-nlp/multils-japanese

MultiLS-Japanese

License: CC BY-NC-SA 4.0

MultiLS-Japanese is a lexical complexity prediction (LCP) and lexical simplification (LS) dataset for Japanese.

This repository provides:

  1. Additional data for the original annotation, which was used to evaluate the MLSP 2024 shared task:

    • LCP and LS annotator profiles. Note that every instance in both the trial and test data was annotated by the same annotators.
    • Unaggregated trial and test ratings for LCP that can be merged with the Japanese dataset using the id column.
    • Empty Excel templates used for annotation including our annotation guidelines and the exact questions we asked in the annotator profiles.
  2. Non-Chinese/Korean L1 replication of the LCP trial set:

  3. Chinese L1 reannotation of the LCP trial set:

The last two trial set annotations were used for analysis in “Difficult for Whom? A Study of Japanese Lexical Complexity” (Nohejl et al., 2024). Only the original data was used for the MLSP shared task (Shardlow et al., 2024).

The LS and LCP Data

Please get the data for all languages, including Japanese (original annotation), from the MLSP2024 dataset on Hugging Face Hub. This multils-japanese repository only provides additional data specific to the Japanese subset of the MultiLS (MLSP2024) dataset.
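As a rough illustration of merging the unaggregated LCP ratings with the Japanese dataset on the id column, here is a minimal sketch using only the Python standard library. Only the `id` column name comes from this README. The inline TSV samples, the other column names (`target`, `complexity`, `annotator`, `rating`), the id values, and the helper names are hypothetical placeholders, so adjust them to the actual files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for the aggregated MLSP2024 Japanese data and the
# unaggregated ratings from this repository. Only `id` is a documented column;
# every other column name and value here is made up for illustration.
AGGREGATED_TSV = (
    "id\ttarget\tcomplexity\n"
    "ja_1\t語彙\t0.25\n"
    "ja_2\t平易\t0.50\n"
)
UNAGGREGATED_TSV = (
    "id\tannotator\trating\n"
    "ja_1\tA\t0.0\n"
    "ja_1\tB\t0.5\n"
    "ja_2\tA\t0.5\n"
    "ja_2\tB\t0.5\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on_id(aggregated_rows, rating_rows):
    """Attach the list of individual ratings to each aggregated row via `id`."""
    ratings_by_id = defaultdict(list)
    for row in rating_rows:
        ratings_by_id[row["id"]].append(float(row["rating"]))
    # Rows without any matching rating get an empty list rather than being dropped.
    return [{**row, "ratings": ratings_by_id.get(row["id"], [])} for row in aggregated_rows]


merged = merge_on_id(read_tsv(AGGREGATED_TSV), read_tsv(UNAGGREGATED_TSV))
for row in merged:
    print(row["id"], row["target"], row["ratings"])
```

With real files, the inline strings would be replaced by the contents of the downloaded dataset file and the ratings file from this repository, read with the delimiter those files actually use.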

Papers

The MultiLS-Japanese dataset was created by Adam Nohejl, Akio Hayakawa, and Yusuke Ide. You can learn more about it in the following papers. Please cite them if you use the data.

MultiLS-Japanese: Analysis and Additional Annotation

Paper

@inproceedings{nohejl-etal-2024-difficult,
  title = {Difficult for {{Whom}}? {{A Study}} of {{Japanese Lexical Complexity}}},
  author = {Nohejl, Adam and Hayakawa, Akio and Ide, Yusuke and Watanabe, Taro},
  booktitle = {Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)},
  year = {2024},
  url = {https://aclanthology.org/2024.tsar-1.8},
}

MultiLS (all MLSP2024 data): Shared Task Report and Dataset

Paper

@inproceedings{shardlow2024bea,
  title={{The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline}},
  author={Shardlow, Matthew and Alva-Manchego, Fernando and Batista-Navarro, Riza and Bott, Stefan and Calderon Ramirez, Saul and Cardon, Rémi and François, Thomas and Hayakawa, Akio and Horbach, Andrea and Huelsing, Anna and Ide, Yusuke and Imperial, Joseph Marvin and Nohejl, Adam and North, Kai and Occhipinti, Laura and Peréz Rojas, Nelson and Raihan, Nishat and Ranasinghe, Tharindu and Solis Salazar, Martin and \v{S}tajner, Sanja and Zampieri, Marcos and Saggion, Horacio},
  booktitle={Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},
  year={2024}
}

MultiLS (all MLSP2024 data): Dataset Creation

Paper

@inproceedings{shardlow2024readi,
  title={{An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework}},
  author={Shardlow, Matthew and Alva-Manchego, Fernando and Batista-Navarro, Riza and Bott, Stefan and Calderon Ramirez, Saul and Cardon, Rémi and François, Thomas and Hayakawa, Akio and Horbach, Andrea and Huelsing, Anna and Ide, Yusuke and Imperial, Joseph Marvin and Nohejl, Adam and North, Kai and Occhipinti, Laura and Peréz Rojas, Nelson and Raihan, Nishat and Ranasinghe, Tharindu and Solis Salazar, Martin and Zampieri, Marcos and Saggion, Horacio},
  booktitle={Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)},
  year={2024}
}

Related Work

JaLeCoN, a Dataset of Japanese Lexical Complexity for Non-Native Readers

Paper. The annotation was done with a slightly different scale and in a dense setting.

@inproceedings{ide2023,
  title     = "Japanese Lexical Complexity for Non-Native Readers: A New Dataset",
  author    = "Ide, Yusuke and Mita, Masato and Nohejl, Adam and Ouchi, Hiroki and Watanabe, Taro",
  booktitle = "Proceedings of the Eighteenth Workshop on Innovative Use of {NLP} for Building Educational Applications",
  month     = jul,
  year      = 2023,
  publisher = "Association for Computational Linguistics",
}

MultiLS Framework

Paper

@article{north2024multils,
  title={MultiLS: A Multi-task Lexical Simplification Framework},
  author={North, Kai and Ranasinghe, Tharindu and Shardlow, Matthew and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2402.14972},
  year={2024}
}

MultiLS-SP/CA: Spanish and Catalan Datasets

Paper

@misc{bott2024multilsspca,
      title={MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish},
      author={Stefan Bott and Horacio Saggion and Nelson Peréz Rojas and Martin Solis Salazar and Saul Calderon Ramirez},
      year={2024},
      eprint={2404.07814},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

You may also be interested in Japanese lexical simplification datasets targeting native speakers (by different authors):

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please cite our papers if you use the data.

See the sources and license information for the trial set and test set for details.
