This is a list of tutorials, workshops, talks, books, papers, and resources on computational linguistic approaches to research in Indonesian languages. The list will be updated over time. You are welcome to send a pull request to update the list and become one of the contributors!
If you are working on anything related to Indonesian or any of the local languages of Indonesia, don't hesitate to contact me or send a pull request!
- Jan Wira Gotama Putra (2019) Pengenalan Konsep Pembelajaran Mesin dan Deep Learning (in Indonesian). [Book]
- Bedah Paper Series by INACL (in Indonesian) [Video]
- Aji, et al. (2022) One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia. ACL [Paper]
- Winata, et al. (2022) NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages. Preprint [Paper] [Benchmark]
- Cahyawijaya, et al. (2021) IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation. EMNLP [Paper] [Benchmark] [Huggingface Models]
- Wibowo, et al. (2021) IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words. ACL Findings [Paper] [Benchmark]
- Koto, et al. (2020) IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. COLING [Paper] [Benchmark]
- Fajri Koto and Ikhwan Koto (2020) Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation. PACLIC [Paper] [Benchmark]
- Wilie, et al. (2020) IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. AACL [Paper] [Benchmark] [Huggingface Models]
- Wongso, et al. (2022) Pre-Trained Transformer-Based Language Models for Sundanese. Journal of Big Data [Paper]
- Pimentel, et al. (2021) SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages. Workshop on Computational Research in Phonetics, Phonology, and Morphology [Paper] [Dataset]
- Devin Hoesen and Ayu Purwarianti (2018) Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger. International Conference on Asian Language Processing [Paper] [Benchmark]
- Dinakaramani, et al. (2014) Designing an Indonesian Part of speech Tagset and Manually Tagged Indonesian Corpus. International Conference on Asian Language Processing [Paper] [Dataset]
- Devin Hoesen and Ayu Purwarianti (2018) Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger. International Conference on Asian Language Processing [Paper] [Benchmark]
- Muhammad Fachri (2014) Named Entity Recognition for Indonesian Text using Hidden Markov Model. Undergraduate Thesis [Paper] [Dataset]
- Alfina, et al. (2016) DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER. International Conference on Advanced Computer Science and Information Systems [Paper] [Dataset]
- Mahendra, et al. (2018) Cross-Lingual and Supervised Learning Approach for Indonesian Word Sense Disambiguation Task. Global Wordnet Conference [Paper] [Dataset]
- Arwidarasti, et al. (2019) Converting an Indonesian Constituency Treebank to the Penn Treebank Format. International Conference on Asian Language Processing [Paper] [Dataset]
- Moeljadi, et al. (2018) Building Cendana: a Treebank for Informal Indonesian. Global Wordnet Conference [Paper] [Dataset]
- David Moeljadi (2017) Building JATI: A Treebank for Indonesian. Global Wordnet Conference [Paper] [Dataset]
- Zeman, et al. (2018) CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. CoNLL Shared Task [Paper] [Dataset]
- McDonald, et al. (2013) Universal Dependency Annotation for Multilingual Parsing. ACL [Paper] [Dataset]
- Artari, et al. (2021) A Multi-Pass Sieve Coreference Resolution for Indonesian. RANLP [Paper] [Dataset]
- Lin, et al. (2021) XPersona: Evaluating Multilingual Personalized Chatbot. NLP4ConvAI [Paper] [Benchmark] [Dataset]
- Clark, et al. (2020) TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. TACL [Paper] [Dataset]
- Purwarianti, et al. (2007) A Machine Learning Approach for Indonesian Question Answering System. RANLP [Paper] [Benchmark]
- Kemal Kurniawan and Samuel Louvan (2018) A New Benchmark Dataset for Indonesian Text Summarization. International Conference on Asian Language Processing [Paper] [Benchmark] [Dataset]
- Koto, et al. (2020) A Large-scale Indonesian Dataset for Text Summarization. AACL [Paper] [Benchmark] [Dataset]
- Mahfuzh, et al. (2019) Improving Joint Layer RNN based Keyphrase Extraction by Using Syntactical Features. International Conference of Advanced Informatics: Concepts, Theory and Applications [Paper] [Benchmark]
- Mahendra, et al. (2021) IndoNLI: A Natural Language Inference Dataset for Indonesian. EMNLP [Paper] [Dataset]
- Ken Nabila Setya and Rahmad Mahendra (2018) Semi-supervised Textual Entailment on Indonesian Wikipedia Data. International Conference on Computational Linguistics and Intelligent Text Processing [Paper] [Benchmark]
- Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti (2019) Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector. International Conference of Advanced Informatics: Concepts, Theory and Applications [Paper] [IndoNLU Benchmark] [NusaX Benchmark]
- Azhar, et al. (2019) Multi-label Aspect Categorization with Convolutional Neural Networks and Extreme Gradient Boosting. International Conference on Electrical Engineering and Informatics [Paper] [Benchmark]
- Ilmania, et al. (2018) Aspect Detection and Sentiment Classification Using Deep Neural Network for Indonesian Aspect-Based Sentiment Analysis. International Conference on Asian Language Processing [Paper] [Benchmark]
- Saputri, et al. (2018) Emotion Classification on Indonesian Twitter Dataset. International Conference on Asian Language Processing [Paper] [Dataset]
- Jannati, et al. (2018) Stance Classification Towards Political Figures on Blog Writing. International Conference on Asian Language Processing [Paper] [Dataset]
- Alfina, et al. (2017) Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study. International Conference on Advanced Computer Science and Information Systems [Paper] [Dataset]
- Muhammad Okky Ibrohim and Indra Budi (2018) A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media. International Conference on Computer Science and Computational Intelligence [Paper] [Dataset]
- Muhammad Okky Ibrohim and Indra Budi (2019) Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. Workshop on Abusive Language Online [Paper] [Dataset]
- Andika William and Yunita Sari (2020) CLICK-ID: A Novel Dataset for Indonesian Clickbait Headlines. Data in Brief [Paper] [Dataset]
- Wibowo, et al. (2020) Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation. International Conference on Asian Language Processing [Paper] [Dataset]
IndoNLP will begin collecting new datasets at https://github.com/orgs/IndoNLP, with submissions opening in mid-June 2022. Stay tuned!