lecture2

tsamardzic · Sep 28, 2023 · 170938e · 170938e
1 parent 853e6fb
commit 170938e
Show file tree

Hide file tree

Showing 10 changed files with 99 additions and 1 deletion.
diff --git a/2.md b/2.md
@@ -0,0 +1,98 @@
+## 2. NLP tasks, data sets, benchmarks
+
+> Explanations and visualisations 
+> - Crash Course Linguistics [#2](https://youtu.be/93sK4jTGrss?si=iBXbRHv_6npQduCH), [#3](https://youtu.be/B1r1grQiLdk?si=parMqegmCgLtmCWH), [#4](https://youtu.be/n1zpnN-6pZQ?si=IbWeV913ioUzcwG5)
+> - Universal Dependencies, [CoNLL-U format](https://universaldependencies.org/format.html)
+> - Jurafsky-Martin [2.4](https://web.stanford.edu/~jurafsky/slp3/2.pdf) 
+
+&nbsp; 
+
+### Text parsing 
+
+Because language is compositional, text parsing is performed at several levels. 
+
+#### Tokenisation 
+
+Here we decide what the units of processing are. In the CoNLL-like formats, tokenisation is deciding what goes in each row. Traditionally, each word is considered to be one token. But what is a word? What about punctuation?
+
+<img src="figures/tokenisation.png" alt="splits" width="200"/>
+
+&nbsp; 
+
+#### Lemmatisation
+
+Mapping different word forms into a single canonical form, e.g. journaux -> journal. It can be very difficult for some language due to:
+
+- non-concatenative morphology
+- not clear difference between derivation and morphology 
+- no clear word boundaries (e.g. Chinese) 
+
+Morphology
+
+<img src="figures/french_morph.png" alt="splits" width="100"/><img src="figures/arabic_morph.png" alt="splits" width="100"/><img src="figures/malay_morph.png" alt="splits" width="100"/>
+
+
+Derivation
+
+<img src="figures/morph_derivation.png" alt="splits" width="300"/>
+
+
+&nbsp; 
+
+#### Part-of-speech (PoS) tagging or morphosyntactic definition (MSD)
+
+Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we need additional features called morphosyntactic definitions, e.g. NOUN in the ACCUSATIVE case SINGULAR, MASCULINE gender
+
+
+<img src="figures/PoS.png" alt="splits" width="300"/>
+
+
+&nbsp; 
+
+#### Syntactic parsing 
+
+How tokens combine into phrases and sentences:
+
+No labels
+
+<img src="figures/syntax_tree.png" alt="splits" width="300"/>
+
+
+&nbsp; 
+
+Constituent analysis
+
+<img src="figures/syntax_tree2.png" alt="splits" width="400"/>
+
+&nbsp; 
+
+Dependency analysis
+
+
+<img src="figures/brat.png" alt="splits" width="400"/>
+
+
+&nbsp; 
+
+### End-user tasks
+
+- Examples in the [HuggingFace tutorial](https://huggingface.co/course/chapter1/3?fw=pt): 
+    - sentiment analysis: given a short text, is it positive or negative?
+    - named entity recognition: given a token, is it an ordinary word or does it refer to a specific real entity?
+    - question answering: given a question and a text snippet, what segments of the text respond to the question? 
+    - mask filling: given a sentence with empty slots, what tokens suit best the empty slots?  
+    - translation
+    - summarisation 
+    - text generation
+
+- Famous NLU benchmarks and data sets:
+    - [GLUE](https://gluebenchmark.com/tasks)
+    - [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
+    - [SNLI](https://nlp.stanford.edu/projects/snli/)
+    - [COPA](https://people.ict.usc.edu/~gordon/copa.html)
+
+
+--------------
+
+
+&nbsp; 
diff --git a/README.md b/README.md
@@ -27,7 +27,7 @@ These notes should be used as a guide for acquiring the most important notions a
 
 #### 1. [Large language models (LLMs), Artificial Intelligence (AI) and Natural language processing (NLP), history of NLP](https://tsamardzic.github.io/nlp_intro/1.html) 
 
-#### 2. NLP tasks, data sets, benchmarks
+#### 2. [NLP tasks, data sets, benchmarks]((https://tsamardzic.github.io/nlp_intro/2.html))
 
 #### 3. Evaluation, data splits
 

diff --git a/figures/PoS.png b/figures/PoS.png
diff --git a/figures/arabic_morph.png b/figures/arabic_morph.png
diff --git a/figures/brat.png b/figures/brat.png
diff --git a/figures/french_morph.png b/figures/french_morph.png
diff --git a/figures/malay_morph.png b/figures/malay_morph.png
diff --git a/figures/morph_derivation.png b/figures/morph_derivation.png
diff --git a/figures/syntax_tree.png b/figures/syntax_tree.png
diff --git a/figures/syntax_tree2.png b/figures/syntax_tree2.png