lecture2
tsamardzic committed Sep 28, 2023
1 parent 853e6fb commit 170938e
Showing 10 changed files with 99 additions and 1 deletion.
98 changes: 98 additions & 0 deletions 2.md
@@ -0,0 +1,98 @@
## 2. NLP tasks, data sets, benchmarks

> Explanations and visualisations
> - Crash Course Linguistics [#2](https://youtu.be/93sK4jTGrss?si=iBXbRHv_6npQduCH), [#3](https://youtu.be/B1r1grQiLdk?si=parMqegmCgLtmCWH), [#4](https://youtu.be/n1zpnN-6pZQ?si=IbWeV913ioUzcwG5)
> - Universal Dependencies, [CoNLL-U format](https://universaldependencies.org/format.html)
> - Jurafsky-Martin [2.4](https://web.stanford.edu/~jurafsky/slp3/2.pdf)
 

### Text parsing

Because language is compositional, text parsing is performed at several levels.

#### Tokenisation

Here we decide what the units of processing are. In CoNLL-like formats, tokenisation amounts to deciding what goes in each row. Traditionally, each word is considered to be one token. But what is a word? What about punctuation?

<img src="figures/tokenisation.png" alt="splits" width="200"/>
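
The "what is a word?" question can be made concrete with a minimal sketch. It compares two simple strategies: splitting on whitespace, and a regular expression that separates punctuation from words. The pattern `TOKEN_PATTERN` and both function names are assumptions made for illustration; real tokenisers use far more elaborate rules.

```python
import re

# Hypothetical pattern: a run of word characters, optionally continued by
# apostrophe- or hyphen-joined parts, OR a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenise_whitespace(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()


def tokenise_regex(text):
    """Separate punctuation into its own tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Don't panic, it's a state-of-the-art parser!"
print(tokenise_whitespace(sentence))
# ["Don't", 'panic,', "it's", 'a', 'state-of-the-art', 'parser!']
print(tokenise_regex(sentence))
# ["Don't", 'panic', ',', "it's", 'a', 'state-of-the-art', 'parser', '!']
```

Neither output is "correct" in itself: whether "Don't" is one token or two, and whether "state-of-the-art" should be split, are annotation decisions that differ between data sets.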

&nbsp;

#### Lemmatisation

Mapping different word forms to a single canonical form (the lemma), e.g. journaux -> journal. This can be very difficult for some languages due to:

- non-concatenative morphology
- no clear boundary between derivation and inflectional morphology
- no clear word boundaries (e.g. Chinese)
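
To see why lemmatisation is harder than it looks, here is a toy sketch for French plurals. It combines a small exception lexicon with a single suffix rule (-aux -> -al). The lexicon entries, the rule and the function name `lemmatise` are illustrative assumptions, not a description of any real lemmatiser.

```python
# Toy exception lexicon (illustrative entries only).
IRREGULAR = {
    "travaux": "travail",
    "yeux": "œil",
}


def lemmatise(form):
    """Return a guessed lemma: lexicon lookup first, then one suffix rule."""
    form = form.lower()
    if form in IRREGULAR:
        return IRREGULAR[form]
    if form.endswith("aux"):
        # Rule: plural -aux corresponds to singular -al.
        return form[:-3] + "al"
    return form


print(lemmatise("journaux"))  # journal
print(lemmatise("travaux"))   # travail (from the lexicon)
print(lemmatise("bateaux"))   # bateal  (wrong: the lemma is "bateau")
```

The last call shows the rule over-applying. Every new exception needs either another lexicon entry or a more specific rule, which is one reason why lemmatisers are usually learned from annotated data.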

Morphology

<img src="figures/french_morph.png" alt="splits" width="100"/><img src="figures/arabic_morph.png" alt="splits" width="100"/><img src="figures/malay_morph.png" alt="splits" width="100"/>


Derivation

<img src="figures/morph_derivation.png" alt="splits" width="300"/>


&nbsp;

#### Part-of-speech (PoS) tagging or morphosyntactic description (MSD)

Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number, MASCULINE gender.


<img src="figures/PoS.png" alt="splits" width="300"/>
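
In the CoNLL-U format, the category and the additional features sit in separate columns of each token row. The sketch below reads a three-token Latin sentence and extracts the PoS tag and the features. The sentence and its annotation are invented for illustration and are not taken from a treebank; the column names follow the ten fields of the CoNLL-U specification, and the helper names are assumptions.

```python
# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Invented example sentence: "puer lupum videt" (the boy sees the wolf).
SAMPLE = "\n".join([
    "# text = puer lupum videt",
    "\t".join(["1", "puer", "puer", "NOUN", "_",
               "Case=Nom|Gender=Masc|Number=Sing", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "lupum", "lupus", "NOUN", "_",
               "Case=Acc|Gender=Masc|Number=Sing", "3", "obj", "_", "_"]),
    "\t".join(["3", "videt", "video", "VERB", "_",
               "Number=Sing|Person=3|Tense=Pres", "0", "root", "_", "_"]),
])


def parse_feats(feats):
    """Turn 'Case=Acc|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu(block):
    """Parse one sentence block into a list of per-token dicts."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        row["feats"] = parse_feats(row["feats"])
        rows.append(row)
    return rows


for row in parse_conllu(SAMPLE):
    print(row["form"], row["upos"], row["feats"])
```

The second row is the example from the text above: a NOUN with `Case=Acc`, `Number=Sing` and `Gender=Masc`.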


&nbsp;

#### Syntactic parsing

How tokens combine into phrases and sentences:

No labels

<img src="figures/syntax_tree.png" alt="splits" width="300"/>


&nbsp;

Constituent analysis

<img src="figures/syntax_tree2.png" alt="splits" width="400"/>
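
A constituent tree can be stored as nested tuples, with the phrase label first and the children after it. The following sketch is only one possible representation: the example sentence, the labels and the function names are assumptions and are not read off the figure above.

```python
# Each node is (label, child, child, ...); a leaf is (label, word).
tree = ("S",
        ("NP", ("DET", "the"), ("NOUN", "cat")),
        ("VP", ("VERB", "sleeps")))


def is_leaf(node):
    """A leaf has exactly one child and that child is a word string."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Collect the words of the sentence from left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in the usual labelled bracket notation."""
    if is_leaf(node):
        return f"({node[0]} {node[1]})"
    inner = " ".join(bracketed(child) for child in node[1:])
    return f"({node[0]} {inner})"


print(leaves(tree))     # ['the', 'cat', 'sleeps']
print(bracketed(tree))  # (S (NP (DET the) (NOUN cat)) (VP (VERB sleeps)))
```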

&nbsp;

Dependency analysis


<img src="figures/brat.png" alt="splits" width="400"/>
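
In a dependency analysis there are no phrase nodes: every token simply points to its head. A minimal sketch of that representation follows, using 1-based head indices with 0 for the root, as in the HEAD column of CoNLL-U. The sentence, the relation labels and the function names are assumptions chosen for illustration.

```python
tokens = ["The", "cat", "chased", "a", "mouse"]
heads = [2, 3, 0, 5, 3]          # 1-based index of each token's head; 0 = root
deprels = ["det", "nsubj", "root", "det", "obj"]


def arcs(tokens, heads, deprels):
    """List (head word, relation, dependent word) triples."""
    result = []
    for token, head, rel in zip(tokens, heads, deprels):
        head_word = tokens[head - 1] if head else "ROOT"
        result.append((head_word, rel, token))
    return result


def dependents(heads):
    """Map each head index to the indices of its dependents."""
    children = {}
    for index, head in enumerate(heads, start=1):
        children.setdefault(head, []).append(index)
    return children


for arc in arcs(tokens, heads, deprels):
    print(arc)
print(dependents(heads))  # {2: [1], 3: [2, 5], 0: [3], 5: [4]}
```

Because each token has exactly one head, the whole analysis fits in one extra column per token, which is why dependency trees are convenient to store in row-based formats.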


&nbsp;

### End-user tasks

- Examples in the [HuggingFace tutorial](https://huggingface.co/course/chapter1/3?fw=pt):
  - sentiment analysis: given a short text, is it positive or negative?
  - named entity recognition: given a token, is it an ordinary word or does it refer to a specific real entity?
  - question answering: given a question and a text snippet, which segments of the text answer the question?
  - mask filling: given a sentence with empty slots, which tokens best fit the empty slots?
  - translation
  - summarisation
  - text generation
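
Named entity recognition is a good example of how an end-user task reduces to token-level classification. One common encoding, assumed here and not prescribed by the tutorial, is the BIO scheme: `B-X` marks the first token of an entity of type X, `I-X` a continuation, and `O` an ordinary word. The sketch below turns such tags back into entity spans; the sentence, the tags and the function name `bio_to_spans` are made up for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current:
            spans.append((" ".join(current), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # tag "O": an ordinary word ends any open entity
            flush()
            current, current_type = [], None
    flush()
    return spans


tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```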

- Famous NLU benchmarks and data sets:
  - [GLUE](https://gluebenchmark.com/tasks)
  - [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
  - [SNLI](https://nlp.stanford.edu/projects/snli/)
  - [COPA](https://people.ict.usc.edu/~gordon/copa.html)


--------------


&nbsp;
2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ These notes should be used as a guide for acquiring the most important notions a

#### 1. [Large language models (LLMs), Artificial Intelligence (AI) and Natural language processing (NLP), history of NLP](https://tsamardzic.github.io/nlp_intro/1.html)

#### 2. NLP tasks, data sets, benchmarks
#### 2. [NLP tasks, data sets, benchmarks](https://tsamardzic.github.io/nlp_intro/2.html)

#### 3. Evaluation, data splits

Binary file added figures/PoS.png
Binary file added figures/arabic_morph.png
Binary file added figures/brat.png
Binary file added figures/french_morph.png
Binary file added figures/malay_morph.png
Binary file added figures/morph_derivation.png
Binary file added figures/syntax_tree.png
Binary file added figures/syntax_tree2.png
