Plain text:
K. Sonmezoz and M. F. Amasyali, "Same Sentence Prediction: A new Pre-training Task for BERT," 2021 Innovations in Intelligent Systems and Applications Conference (ASYU), 2021, pp. 1-6, doi: 10.1109/ASYU52992.2021.9598954.
BibTex:
@INPROCEEDINGS{9598954,
author={Sonmezoz, Kaan and Amasyali, Mehmet Fatih},
booktitle={2021 Innovations in Intelligent Systems and Applications Conference (ASYU)},
title={Same Sentence Prediction: A new Pre-training Task for BERT},
year={2021},
volume={},
number={},
pages={1-6},
doi={10.1109/ASYU52992.2021.9598954}}
The corpus has been prepared in the same way as in BERTurk.
- Latest Wikipedia dump
- The latest dump as of 1 November 2021 was used.
- BERTurk used the Wikipedia dump of 2 February 2020 for pre-training.
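Fetching the latest dump can be sketched as below. The URL follows the standard Wikimedia dumps layout for Turkish Wikipedia (trwiki); the extraction step via the wikiextractor package is one common option, not necessarily what BERTurk used.

```shell
# Build the "latest" dump URL for Turkish Wikipedia (standard Wikimedia layout).
LANG_WIKI=trwiki
DUMP_URL="https://dumps.wikimedia.org/${LANG_WIKI}/latest/${LANG_WIKI}-latest-pages-articles.xml.bz2"
echo "$DUMP_URL"
# Then, for example:
#   wget "$DUMP_URL"
#   pip install wikiextractor
#   python -m wikiextractor.WikiExtractor "${LANG_WIKI}-latest-pages-articles.xml.bz2" -o extracted/
```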
- Kemal Oflazer's corpus
- Private
- Contact Mr. Oflazer to obtain the corpus.
- OSCAR
- Public
- The deduplicated version has been used.
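The idea behind OSCAR's deduplicated release is exact line-level deduplication. A minimal sketch of that idea (this is an illustration, not the actual OSCAR pipeline):

```python
def dedup_lines(lines):
    """Keep only the first occurrence of each exact line, preserving order.

    This mirrors the line-level deduplication applied to OSCAR's
    "deduplicated" variant (illustrative only).
    """
    seen = set()
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out
```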
- OPUS
- Public
- Various datasets:
- Bible Uedin
- GNOME
- JW300
- OpenSubtitles
- OPUS All (?)
- Not sure whether this is the one used in BERTurk pre-training.
- QED
- SETIMES
- Tanzil
- Tatoeba
- TED2013
- Wikipedia
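Once the corpora above are obtained, BERT-style pre-training typically expects a single plain-text file. Merging the parts can be sketched as below (file names and the empty-line filtering are assumptions, not BERTurk's exact procedure):

```python
from pathlib import Path

def merge_corpora(parts, out_path):
    """Concatenate several plain-text corpus files into one file,
    dropping empty lines. Illustrative sketch only."""
    with open(out_path, "w", encoding="utf-8") as out:
        for part in parts:
            for line in Path(part).read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if line:
                    out.write(line + "\n")
```

Usage would be e.g. `merge_corpora(["wiki.txt", "oscar.txt", "opus.txt"], "corpus.txt")`.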
- 03.05.2022:
- Added citation
- 03.11.2021:
- Added datasets used
- Repo initialized