From 0721857c265de4b2bf9506ff6ffa910ba5d08070 Mon Sep 17 00:00:00 2001
From: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Date: Sun, 6 Feb 2022 21:14:54 +0100
Subject: [PATCH] center tables

---
 backup.md | 98 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 32 deletions(-)

diff --git a/backup.md b/backup.md
index a1af188..6d37b5b 100644
--- a/backup.md
+++ b/backup.md
@@ -27,17 +27,37 @@ The encoder-decoder architecture was proposed in 2014, when several papers ([Cho

-| drawing |
-|:--:|
-| Figure 1: RNN-based encoder-decoder architecture [[1]](https://arxiv.org/abs/1409.3215) [[2]](https://arxiv.org/abs/1409.0473) <br> Left: without attention mechanism &nbsp;\|&nbsp; Right: with attention mechanism |
+[centered HTML table: image (alt "drawing") with caption "Figure 1: RNN-based encoder-decoder architecture [1] [2]. Left: without attention mechanism | Right: with attention mechanism"]
 In 2017, Vaswani et al. published the paper [Attention is all you need](https://arxiv.org/abs/1706.03762), which introduced a new model architecture called the `Transformer`. It still consists of an encoder and a decoder; however, instead of using RNNs/LSTMs for these components, it uses multi-head self-attention as the building block. This innovative attention mechanism has become the foundation of the breakthroughs in NLP since then, well beyond NMT tasks.

-| drawing |
-|:--:|
-| Figure 2: Transformer encoder-decoder architecture [[3]](https://arxiv.org/abs/1706.03762) |
+[centered HTML table: image (alt "drawing") with caption "Figure 2: Transformer encoder-decoder architecture [3]"]
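As a quick reference for the multi-head self-attention mentioned above, each attention head in [3] computes scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; multi-head attention runs several such heads in parallel and concatenates their outputs.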
 Combined with the idea of pretraining and transfer learning (for example, from [ULMFiT](https://arxiv.org/abs/1801.06146)), a golden age of NLP started in 2018-2019 with the release of OpenAI's [GPT](https://openai.com/blog/language-unsupervised/) and [GPT-2](https://openai.com/blog/better-language-models/) models and Google's [BERT](https://arxiv.org/abs/1810.04805) model. It's now common to call them Transformer models; however, they do not use the encoder-decoder architecture of the original Transformer: BERT is encoder-only (originally for text classification) and GPT models are decoder-only (for text auto-completion).

@@ -45,10 +65,20 @@ The above models and their variations focus on pretraining either the encoder or

-| drawing |
-|:--:|
-| Figure 3: The 3 pretraining paradigms for Transformer models [[4]](https://arxiv.org/abs/1810.04805) [[5]](https://openai.com/blog/language-unsupervised/) [[6]](https://arxiv.org/abs/1910.13461) |
-
+[centered HTML table: image (alt "drawing") with caption "Figure 3: The 3 pretraining paradigms for Transformer models [4] [5] [6]"]
+
 In 2020, the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) studied the effectiveness of initializing sequence-to-sequence models with pretrained encoder/decoder checkpoints for sequence generation tasks. It obtained new state-of-the-art results on machine translation, text summarization, etc. Following this idea, 🤗 [transformers](https://huggingface.co/docs/transformers/index) implements [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder), which allows users to easily combine almost any 🤗 pretrained encoder (BERT, RoBERTa, etc.) with a 🤗 pretrained decoder (GPT models, the decoder from BART or T5, etc.) and fine-tune the combination on downstream tasks. Instantiating an [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder) is super easy, and fine-tuning it on a sequence-to-sequence task usually yields decent results in just a few hours on a Google Cloud TPU (a minimal usage sketch follows Figure 5 below).

@@ -151,9 +181,19 @@ The obtained sequence of vectors plays the same role as token embeddings in [BER

-| drawing |
-|:--:|
-| Figure 4: BERT vs. ViT |
+[centered HTML table: image (alt "drawing") with caption "Figure 4: BERT vs. ViT"]
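To make the BERT vs. ViT analogy concrete, here is a minimal, illustrative PyTorch sketch of how an image can be turned into a sequence of patch embeddings that play the same role as BERT's token embeddings. The sizes (224x224 image, 16x16 patches, 768-dim hidden size) are assumed ViT-Base-like values, and this is a conceptual sketch rather than the actual ViT implementation; the footnote below notes why a convolution is used in practice.

```python
import torch

# Assumed, illustrative sizes (ViT-Base-like): 224x224 image, 16x16 patches, 768-dim embeddings.
image_size, patch_size, hidden_size = 224, 16, 768
num_patches = (image_size // patch_size) ** 2                    # 14 * 14 = 196 patches

pixel_values = torch.randn(1, 3, image_size, image_size)        # (batch, channels, height, width)

# Conceptual version: cut the image into patches, flatten each patch, then project it linearly.
patches = pixel_values.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, 3 * patch_size * patch_size)
projection = torch.nn.Linear(3 * patch_size * patch_size, hidden_size)
embeddings = projection(patches)                                 # (1, 196, 768): a "token" sequence

# The same computation can be expressed as a strided convolution, which is what real
# implementations do for efficiency (weights here are random, so outputs differ).
conv = torch.nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
embeddings_conv = conv(pixel_values).flatten(2).transpose(1, 2)  # also (1, 196, 768)
```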
 ² This is just the concept. The actual implementation uses convolution layers to perform this computation efficiently.

@@ -369,9 +409,19 @@ We have learned the encoder-decoder architecture in NLP and the vision Transform

-| drawing |
-|:--:|
-| Figure 5: Vision-Encoder-Decoder architecture |
+[centered HTML table: image (alt "drawing") with caption "Figure 5: Vision-Encoder-Decoder architecture"]
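As a preview of the section below, here is a hedged sketch of how the architecture in Figure 5 (and the text-only EncoderDecoderModel mentioned earlier) can be instantiated with 🤗 transformers. The checkpoint names are illustrative choices, not necessarily the ones used in this post.

```python
from transformers import EncoderDecoderModel, VisionEncoderDecoderModel

# Vision-Encoder-Decoder: a pretrained ViT encoder paired with a pretrained GPT-2 decoder.
# The decoder's cross-attention layers are newly added and randomly initialized, so the
# combined model still needs fine-tuning on an image-to-text task such as image captioning.
vision_model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder
    "gpt2",                               # text decoder
)

# The text-only analogue works the same way, e.g. a BERT encoder with a GPT-2 decoder.
text_model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
```

Because the cross-attention weights start out random, generation quality before fine-tuning should not be expected to be meaningful.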
 ### **Vision-Encoder-Decoder in 🤗 transformers**

@@ -567,14 +617,6 @@ display(df[:3].style.set_table_styles([{'selector': 'td', 'props': props}, {'sel

@@ -659,14 +701,6 @@ display(df[3:].style.set_table_styles([{'selector': 'td', 'props': props}, {'sel