From 26ef5e5c64aa638b94a97119334c450b3ea1354a Mon Sep 17 00:00:00 2001
From: Sjur N Moshagen
Date: Wed, 28 Feb 2024 09:54:57 +0200
Subject: [PATCH] Started on a chapter about large language models
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 slides.md | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/slides.md b/slides.md
index cc635e6..5d20090 100644
--- a/slides.md
+++ b/slides.md
@@ -1031,13 +1031,51 @@ layout: section
 
 # Large language models, AI and minority languages
 
-XXX
+
+
+- large language models
 - data scarcity
 - lack of community involvement
 - speech technology
 - hybrid systems
+---
+layout: two-cols
+---
+
+## Large language models
+
+
+
+
+
+- ChatGPT
+- Google Translate
+- [Tartu NLP/Neurotõlge](https://neurotolge.ee)
+
+Our experience:
+
+- bad at low-resource languages
+- the less data, the worse the output
+- and vice versa: the more data, the better the output
+
+Example (from Wiechetek et al., forthcoming: _The Ethical Question – Use of Indigenous Corpora for Large Language Models_):
+
+::right::
+
+English original:
+
+> Hundreds of Indigenous and environmental campaigners have blocked a main thoroughfare in Oslo to demand the demolition of two windfarms that have been described by the Norwegian government as a «violation of human rights».
+
+South Sámi output:
+
+> Tjuetie *aalkoealmetji jïh *byjresekampanjh leah *aktem *åejviehaerniem *Oslosne *biegkemeurhkedh, juktie *rïjvestidh göökte *bïegkefaamoeh, *mejtie nöörjen reerenasse lea *gohtjeme "*almetjereaktide *mïedtelidh".
+
+Literal back-translation from South Sámi to English:
+
+> Hundred indigenous __people's__ and environmental __campaigns__ have __one main-haerniem__ in Oslo to __wind-blowing__, which __tear__ two __wind powers__, __to which__ the Norwegian government __has called "to offend to__ human rights".
+
 ---
 
 # Data scarcity