Commit 170938e (1 parent: 853e6fb). Showing 10 changed files with 99 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
## 2. NLP tasks, data sets, benchmarks | ||
|
||
> Explanations and visualisations | ||
> - Crash Course Linguistics [#2](https://youtu.be/93sK4jTGrss?si=iBXbRHv_6npQduCH), [#3](https://youtu.be/B1r1grQiLdk?si=parMqegmCgLtmCWH), [#4](https://youtu.be/n1zpnN-6pZQ?si=IbWeV913ioUzcwG5) | ||
> - Universal Dependencies, [CoNLL-U format](https://universaldependencies.org/format.html) | ||
> - Jurafsky-Martin [2.4](https://web.stanford.edu/~jurafsky/slp3/2.pdf) | ||
| ||
|
||
### Text parsing | ||
|
||
Because language is compositional, text parsing is performed at several levels. | ||
|
||
#### Tokenisation | ||
|
||
Here we decide what the units of processing are. In CoNLL-like formats, tokenisation amounts to deciding what goes in each row. Traditionally, each word is treated as one token. But what counts as a word, and how should punctuation be handled? | ||
|
||
<img src="figures/tokenisation.png" alt="splits" width="200"/> | ||
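
To make the "what is a word?" question concrete, here is a minimal sketch comparing two tokenisation policies using only the Python standard library. The function names and the regular expression are illustrative assumptions, not a standard tokeniser; real tokenisers handle many more cases (clitics, abbreviations, URLs, scripts without spaces).

```python
import re


def whitespace_tokenise(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def regex_tokenise(text):
    """Separate punctuation marks from runs of word characters and apostrophes."""
    return re.findall(r"[\w']+|[^\w\s]", text)


sentence = "Don't panic, it's only NLP!"
print(whitespace_tokenise(sentence))
print(regex_tokenise(sentence))
```

The first policy yields 5 tokens (`panic,` and `NLP!` keep their punctuation), the second yields 7 (the comma and the exclamation mark get their own rows). Neither splits `Don't` into `Do` + `n't`, which is yet another possible decision.
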
|
||
| ||
|
||
#### Lemmatisation | ||
|
||
Mapping different word forms to a single canonical form (the lemma), e.g. French journaux -> journal. This can be very difficult for some languages because of: | ||
|
||
- non-concatenative morphology | ||
- no clear distinction between derivation and inflection | ||
- no clear word boundaries (e.g. Chinese) | ||
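
As a rough illustration of why lemmatisation is hard, the sketch below combines a tiny exception table with suffix-rewriting rules for a few French plurals. The table, the rules and the function name are hypothetical and cover only a handful of forms; practical lemmatisers rely on large lexicons or trained models.

```python
# Hypothetical exception table: irregular forms are looked up directly.
EXCEPTIONS = {"travaux": "travail"}

# Hypothetical suffix rules, tried in order: (plural suffix, replacement).
SUFFIX_RULES = [("aux", "al"), ("s", "")]


def lemmatise(word):
    """Return a lemma guess for a French noun form (toy rules only)."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for form in ["journaux", "chevaux", "maisons", "travaux", "tuyaux"]:
    print(form, "->", lemmatise(form))
```

The rules get `journaux -> journal` right but turn `tuyaux` into `tuyal` (the lemma is `tuyau`), so even concatenative morphology needs exceptions; non-concatenative morphology cannot be handled by suffix stripping at all.
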
|
||
Morphology | ||
|
||
<img src="figures/french_morph.png" alt="splits" width="100"/><img src="figures/arabic_morph.png" alt="splits" width="100"/><img src="figures/malay_morph.png" alt="splits" width="100"/> | ||
|
||
|
||
Derivation | ||
|
||
<img src="figures/morph_derivation.png" alt="splits" width="300"/> | ||
|
||
|
||
| ||
|
||
#### Part-of-speech (PoS) tagging or morphosyntactic description (MSD) | ||
|
||
Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number, MASCULINE gender. | ||
|
||
|
||
<img src="figures/PoS.png" alt="splits" width="300"/> | ||
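
In the CoNLL-U format linked at the top of this section, each token row has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and the morphological features sit in FEATS as `Name=Value` pairs joined by `|`. The sketch below reads such rows into dictionaries. The Latin sentence and its annotation are hand-written for illustration and are not taken from a treebank; the helper names are assumptions.

```python
# Hand-written rows for the Latin sentence "Puella amicum videt"
# ("The girl sees the friend"); an illustrative annotation, not treebank data.
ROWS = [
    ("1", "Puella", "puella", "NOUN", "_",
     "Case=Nom|Gender=Fem|Number=Sing", "3", "nsubj", "_", "_"),
    ("2", "amicum", "amicus", "NOUN", "_",
     "Case=Acc|Gender=Masc|Number=Sing", "3", "obj", "_", "_"),
    ("3", "videt", "video", "VERB", "_",
     "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
     "0", "root", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def parse_feats(feats):
    """Turn 'Case=Acc|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_row(line):
    """Extract the word form, lemma, PoS tag and features from one row."""
    cols = line.split("\t")
    return {
        "form": cols[1],
        "lemma": cols[2],
        "upos": cols[3],
        "feats": parse_feats(cols[5]),
    }


for line in SAMPLE.splitlines():
    token = parse_row(line)
    print(token["form"], token["upos"], token["feats"])
```

The second row is the case from the text: a NOUN with `Case=Acc`, `Number=Sing`, `Gender=Masc`. The sketch skips comment lines and multiword-token ranges, which a full CoNLL-U reader would need to handle.
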
|
||
|
||
| ||
|
||
#### Syntactic parsing | ||
|
||
Determining how tokens combine into phrases and sentences. The resulting structure can be drawn in several ways: | ||
|
||
Unlabelled structure | ||
|
||
<img src="figures/syntax_tree.png" alt="splits" width="300"/> | ||
|
||
|
||
| ||
|
||
Constituent analysis | ||
|
||
<img src="figures/syntax_tree2.png" alt="splits" width="400"/> | ||
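
A constituent analysis can be stored as a nested structure in which every node carries a label and either a word or a list of child nodes. The sketch below uses plain Python tuples; the example sentence, the category labels and the function names are illustrative assumptions and do not reproduce the figure above.

```python
# (label, child, child, ...) for phrases; (label, word) for leaves.
TREE = ("S",
        ("NP", ("PRON", "She")),
        ("VP",
         ("VERB", "reads"),
         ("NP", ("ADJ", "long"), ("NOUN", "books"))))


def is_leaf(node):
    """A leaf is a (PoS label, word) pair."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Collect the words of the tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def bracket(node):
    """Render the tree in labelled bracket notation."""
    if is_leaf(node):
        return f"({node[0]} {node[1]})"
    return f"({node[0]} " + " ".join(bracket(child) for child in node[1:]) + ")"


print(leaves(TREE))
print(bracket(TREE))
```

Dropping the labels from the bracket string gives the unlabelled structure; keeping them gives the constituent analysis.
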
|
||
| ||
|
||
Dependency analysis | ||
|
||
|
||
<img src="figures/brat.png" alt="splits" width="400"/> | ||
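
In a dependency analysis there are no phrase nodes: every token points to its head token and the arc carries a relation label, which is what the HEAD and DEPREL columns of CoNLL-U encode. A minimal sketch, with a hand-annotated example sentence and hypothetical function names:

```python
from collections import defaultdict

# (id, form, head id, relation); head 0 marks the root of the sentence.
TOKENS = [
    (1, "She", 2, "nsubj"),
    (2, "reads", 0, "root"),
    (3, "long", 4, "amod"),
    (4, "books", 2, "obj"),
]


def find_root(tokens):
    """Return the form of the token whose head is 0."""
    return next(form for _, form, head, _ in tokens if head == 0)


def dependents_by_head(tokens):
    """Map each head id to the list of (form, relation) pairs it governs."""
    children = defaultdict(list)
    for _, form, head, deprel in tokens:
        children[head].append((form, deprel))
    return dict(children)


print(find_root(TOKENS))
print(dependents_by_head(TOKENS))
```

Here `reads` is the root and governs `She` (subject) and `books` (object), while `long` depends on `books`.
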
|
||
|
||
| ||
|
||
### End-user tasks | ||
|
||
- Examples in the [HuggingFace tutorial](https://huggingface.co/course/chapter1/3?fw=pt): | ||
  - sentiment analysis: given a short text, is it positive or negative? | ||
  - named entity recognition: given a token, is it an ordinary word or does it refer to a specific real-world entity? | ||
  - question answering: given a question and a text snippet, which segments of the text answer the question? | ||
  - mask filling: given a sentence with empty slots, which tokens best fit the empty slots? | ||
  - translation | ||
  - summarisation | ||
  - text generation | ||
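
Named entity recognition is framed above as a per-token decision. One common way to encode such decisions, assumed here and not stated in these notes, is BIO tagging: `B-X` opens an entity of type X, `I-X` continues it, and `O` marks ordinary words. The sketch below turns token-level tags back into entity spans; the tags are written by hand rather than predicted by a model, and the function name is hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-X / I-X into (text, type) entity spans."""
    spans = []
    words, entity_type = [], None
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if continues:
            words.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if words:
            spans.append((" ".join(words), entity_type))
            words, entity_type = [], None
        # B-X (or a stray I-X) opens a new entity; O opens nothing.
        if tag != "O":
            words, entity_type = [token], tag[2:]
    if words:
        spans.append((" ".join(words), entity_type))
    return spans


tokens = ["Marie", "Curie", "worked", "in", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

For this input the result contains two entities, `Marie Curie` (PER) and `Paris` (LOC).
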
|
||
- Famous NLU benchmarks and data sets: | ||
  - [GLUE](https://gluebenchmark.com/tasks) | ||
  - [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | ||
  - [SNLI](https://nlp.stanford.edu/projects/snli/) | ||
  - [COPA](https://people.ict.usc.edu/~gordon/copa.html) | ||
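
A benchmark pairs a data set with a scoring rule. As an illustration, extractive question answering of the SQuAD kind is commonly scored with exact match and token-level F1 between the predicted and the reference answer. The sketch below is a simplified version written for these notes, assuming lowercasing and whitespace splitting only; it is not the official evaluation script, which applies further normalisation.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True if the two answers are identical after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris ", "paris"))
print(round(token_f1("in Paris", "Paris"), 3))
```

The prediction `in Paris` is not an exact match for `Paris`, but it still earns partial credit: precision 1/2, recall 1, so F1 is 2/3.
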
|
||
|
||
-------------- | ||
|
||
|
||
|