CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh) is an interdisciplinary and multi-institutional project that has created a large-scale, open-source corpus of contemporary Welsh. The CorCenCC corpus contains over 11 million words (circa 14 million tokens) from written, spoken and electronic (online, digital texts) Welsh language sources, taken from a range of genres, language varieties (regional and social) and contexts. The contributors to CorCenCC are representative of the more than half a million Welsh speakers in the country.
Information on how to request access to the CorCenCC dataset is available at www.corcencc.org/download
The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool. When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged, as follows:
- CorCenCC corpus: Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., Thomas, E-M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey-Walsh, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M. and Scannell, K. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh. Cardiff University. http://doi.org/10.17035/d.2020.0119878310
- Report: Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E. M. (2020). The National Corpus of Contemporary Welsh: Project Report | Y Corpws Cenedlaethol Cymraeg Cyfoes: Adroddiad y Prosiect. arXiv:2010.05542, October 2020. Available online at: https://arxiv.org/abs/2010.05542 (also see www.corcencc.org/outputs for a PDF version of this report)
- CorCenCC’s infrastructure and crowdsourcing app: Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV). https://doi.org/10.1007/s10579-020-09501-9
- CorCenCC’s part-of-speech (POS) tagger ‘CyTag’: Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1623/
- CorCenCC’s semantic tagger ‘CySemTagger’: Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1158/
- CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’: Davies, J., Thomas, E-M., Fitzpatrick, T., Needs, J., Anthony, L., Cobb, T. and Knight, D. (2020). Y Tiwtiadur. [Digital Resource]. Available at: https://www.corcencc.org/Y-Tiwtiadur
- CorCenCC’s word frequency lists ‘Yr Amliadur’: Details coming soon
This work was carried out as part of the project Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction (Grant Number ES/M011348/1), funded by the UK Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC). For more information, go to www.corcencc.org or www.corcencc.cymru