Merge branch 'master' of https://github.com/ydshieh/notebooks
Showing 1 changed file with 163 additions and 0 deletions.
@@ -0,0 +1,163 @@ | ||
{ | ||
"nbformat": 4, | ||
"nbformat_minor": 0, | ||
"metadata": { | ||
"colab": { | ||
"name": "Untitled3.ipynb", | ||
"provenance": [], | ||
"collapsed_sections": [], | ||
"authorship_tag": "ABX9TyPp/OrdlZCWajXKkckPzuMm", | ||
"include_colab_link": true | ||
}, | ||
"kernelspec": { | ||
"name": "python3", | ||
"display_name": "Python 3" | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "view-in-github", | ||
"colab_type": "text" | ||
}, | ||
"source": [ | ||
"<a href=\"https://colab.research.google.com/github/ydshieh/notebooks/blob/master/vision_encoder_decoder_blog.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# **Vision Encoder Decoder**" | ||
], | ||
"metadata": { | ||
"id": "0-M4B1Tvd8mK" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"The encoder-decoder architecture is a general architecture for learning sequence-to-sequence problems. It is used extensively in NLP, originally for machine learning tasks (NMT). It is then adopted for other tasks, for example, document summarization, question answering, etc.\n", | ||
"\n", | ||
"With the success of the Transformer architecture and Transfer learning paradigm, the de-facto standard method nowadays for NLP tasks is to fine-tune a pretrained Transformer model on a downstream mask. This usually produces descent results within a few hours of training. Well known examples are [BERT](https://arxiv.org/abs/1810.04805) and [GPT](https://openai.com/blog/better-language-models/) models. When it comes to sequence-to-sequence problems, there are 2 ways to combine the transformer-based encoder-decoder architecture with Transfer learning paradigm:\n", | ||
"\n", | ||
" - Initialize an encoder-decoder model, pre-train it with different sequence-to-sequence objectives, then fine-tune it on downstream tasks. [BART](https://arxiv.org/abs/1910.13461) and [T5](https://arxiv.org/abs/1910.10683) models are 2 examples of this approach.\n", | ||
" - Take pretrained encoder and decoder models - which are pretrained with their own pretraining objectives, usaully being MLM (masked language modeling) and CLM (causal language modeling) respectively. Then combine them into an encoder-decoder model and fine-tune it. See [Rothe et al. (2019)](https://arxiv.org/abs/1907.12461).\n", | ||
"\n", | ||
"Since [BERT](https://arxiv.org/abs/1810.04805) and [GPT](https://openai.com/blog/better-language-models/) were introduced, there are a series of transformer-based auto-encoding and auto-regressive models being developed, usually with differences in pretraining methods and attention mechanisms (to deal with long documents). Furthermore, several variations have been used to pretrain on datasets in other languages ([CamemBERT](https://camembert-model.fr/), [XLM-RoBERTa](https://arxiv.org/abs/1911.02116), etc.), or to produce smaller models (for example, [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)).\n", | ||
"\n", | ||
"The approach in [Rothe et al. (2019)](https://arxiv.org/abs/1907.12461) allows us to combine different encoders and decoders from this ever-growing set of pretrained models. It is particular useful for machine translation - we can take an encoder in one language and a decoder in another language. This avoids to train each combination of language paris from scratch: sometimes we have little translation data for a low-resource language, while still having adequate mono-lingual data in that language.\n", | ||
"\n", | ||
"While the transformer-based encoder-decoder architecture dominates NLP conditional sequence generation tasks, it was not used for image-to-text generation tasks, like text recognition and image captioning. The pure transformer-based vision encoder introduced in [Vision Tranformer](https://arxiv.org/abs/2010.11929) in 2020 opens the door to use the same encoder-decoder architecture for image-to-text tasks, among which [TrOCR](https://arxiv.org/abs/2109.10282) is one example, which leverages pre-trained image Transformer encoder and text Transformer decoder models, similar to [Rothe et al. (2019)](https://arxiv.org/abs/1907.12461) for text-to-text tasks.\n", | ||
"\n", | ||
"In this post, we will give a short introduction to the encoder-decoder architecture along its history. We then expalin how the [Vision Transformer](https://arxiv.org/abs/2010.11929) works and its difference from the original Transformer. We provide a visualization of the vision-encoder-decoder architecture to better understand it. Finally, we show how to train an image-captioning model by using 🤗 [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/visionencoderdecoder) implementation with an example training script, and provide a few tips of using it.\n", | ||
"\n" | ||
], | ||
"metadata": { | ||
"id": "-mGwhY83r_oN" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"## **A short history of Encoder-Decoder architecture**\n", | ||
"\n", | ||
"The encoder-decoder architecture was proposed in 2014, when several papers ([Cho et al.](https://arxiv.org/pdf/1406.1078.pdf), [Sutskever et al.](https://arxiv.org/abs/1409.3215), [Bahdanau et al.](https://arxiv.org/abs/1409.0473), etc.) used it to tackle the machine translation tasks (NMT, neurla machine translation). At this time, the encoder-decoder architecutre was mainly based on recurrent neural networks (RNN or LSTM), and its combination with different variations of attention mechanisms dominate the domain of NMT for almost about 3 years.\n", | ||
"\n", | ||
"| <img src=\"https://raw.githubusercontent.com/ydshieh/notebooks/master/images/rnn_encoder_decoder.JPG\" alt=\"drawing\" width=\"550\"/> | \n", | ||
"|:--:| \n", | ||
"| <a id='figure-1'></a>Figure 1: RNN-based encoder-decoder architecture<br><br>Left: without attention mechanism \\| Right: with attention mechism <br><br> (figures from the original papers)|\n", | ||
"\n", | ||
"In 2017, Vaswani et al. published a paper [Attention is all you need](https://arxiv.org/abs/1706.03762) which introduced a new model architecture called `Transformer`. It still consists of an encoder and a decoder, however instead of using RNN/LSTM for the components, they use multi-head self-attention as the building blocks. This innovate attention mechanism becomes the fundamental of the breakthroughs in NLP since then, beyond the NMT tasks.\n", | ||
"\n", | ||
"| <img src=\"https://raw.githubusercontent.com/ydshieh/notebooks/master/images/transformer.JPG\" alt=\"drawing\" width=\"250\"/> | \n", | ||
"|:--:| \n", | ||
"| Figure 2: Transformer encoder-decoder architecture<br><br>(figures from the original papers)|\n", | ||
"\n", | ||
"Combined with the idea of pretraining and transfer learning (for example, from [ULMFiT](https://arxiv.org/abs/1801.06146)), a golden age of NLP started in 2018-2019 with the release of OpenAI's [GPT](https://openai.com/blog/language-unsupervised/) and [GPT-2](https://openai.com/blog/better-language-models/) models and Google's [BERT](https://arxiv.org/abs/1810.04805) model. It's now common to call them Transformer models, however they are not encoder-decoder architecture as the original Transformer: BERT is encoder-only (originally for text classification) and GPT models are decoder-only (for text auto-completion).\n", | ||
"\n", | ||
"The above models and their variations focus on pretraining either the encoder or the decoder only. The [BART](https://arxiv.org/abs/1910.13461) model is one example of a standalone encoder-decoder Transformer model adopting sequence-to-sequence pretraining method, which can be used for document summarization, question answering and machine translation tasks directly.[<sup>1</sup>](#fn1) The [T5](https://arxiv.org/abs/1910.10683) model converts all text-based NLP problems into a text-to-text format, and use the Transformer encoder-decoder to tackle all of them. During pretraining, these models are trained from scratch: their encoder and decoder models are initialized with random weights.\n", | ||
"\n", | ||
"| <img src=\"https://raw.githubusercontent.com/ydshieh/notebooks/master/images/bert-gpt-bart.JPG\" alt=\"drawing\" width=\"400\"/> | \n", | ||
"|:--:| \n", | ||
"| Figure 3: The 3 pretraining paradigms for Transformer models<br><br>(figures from the original papers)|\n", | ||
"\n", | ||
"In 2020, the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) studied the effectiveness of initializing sequence-to-sequence models with pretrained encoder/decoder checkpoints for sequence generation tasks. It obtained new state-of-the-art results on machine translation, text summarization, etc.\n", | ||
"\n", | ||
"Following this idea, 🤗 [transformers](https://huggingface.co/docs/transformers/index) implements [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder) that allows users to easily combine almost any 🤗 pretrained encoder (Bert, Robert, etc.) with a 🤗 pretrained decoder (GPT models, decoder from Bart or T5, etc.) to perform fine-tuning on downstream tasks. Instantiate a [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder) is super easy, and finetune it on a sequence-to-sequence task usually obtains descent results in just a few hours on Google Cloud TPU.\n", | ||
"\n", | ||
"Here is an example of creating an encoder-decoder model with BERT as encoder and GPT2 and decoder - just in 1 line!\n", | ||
"\n", | ||
"Let's check [Figure 1](#figure-1)." | ||
], | ||
"metadata": { | ||
"id": "EJW36LOCsMWh" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "vY5-nik2rlVc" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import EncoderDecoderModel\n", | ||
"\n", | ||
"# Initialize a bert-to-gpt2 model from pretrained BERT/GPT2 models.\n", | ||
"# Note that the cross-attention layers will be randomly initialized.\n", | ||
"model = EncoderDecoderModel.from_encoder_decoder_pretrained(\"bert-base-uncased\", \"gpt2\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"<span id=\"fn1\"> <sup>1</sup> It can be used for text classification and generation too, by using only its encoder and decoder respectively.</span>" | ||
], | ||
"metadata": { | ||
"id": "bO_gzO6_YNMU" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"## **Transformers in Computer Vision**" | ||
], | ||
"metadata": { | ||
"id": "IfM5pLIYr6DL" | ||
} | ||
}, | ||
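{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"The [Vision Transformer (ViT)](https://arxiv.org/abs/2010.11929) treats an image as a sequence: the image is split into fixed-size patches (for example, 16x16 pixels), each patch is flattened and linearly projected to a patch embedding, a `[CLS]` token is prepended, position embeddings are added, and the resulting sequence is fed to a standard Transformer encoder.\n", | ||
"\n", | ||
"As a minimal sketch (not part of the training example later in this post; the checkpoint name is only one possible choice), the following cell encodes a dummy 224x224 image with a pretrained ViT model and inspects the shape of the resulting sequence of hidden states - 196 patch embeddings plus the `[CLS]` token, each of dimension 768." | ||
], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import torch\n", | ||
"from PIL import Image\n", | ||
"from transformers import ViTFeatureExtractor, ViTModel\n", | ||
"\n", | ||
"# A dummy 224x224 RGB image, just to illustrate the shapes involved.\n", | ||
"image = Image.new(\"RGB\", (224, 224))\n", | ||
"\n", | ||
"feature_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224-in21k\")\n", | ||
"encoder = ViTModel.from_pretrained(\"google/vit-base-patch16-224-in21k\")\n", | ||
"\n", | ||
"# The feature extractor resizes and normalizes the image into a batch of pixel values.\n", | ||
"pixel_values = feature_extractor(images=image, return_tensors=\"pt\").pixel_values\n", | ||
"\n", | ||
"with torch.no_grad():\n", | ||
"    outputs = encoder(pixel_values=pixel_values)\n", | ||
"\n", | ||
"# (batch_size, 1 + 196 patches, hidden_size) -> torch.Size([1, 197, 768])\n", | ||
"print(outputs.last_hidden_state.shape)" | ||
] | ||
}, | ||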
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"## **Vision Encoder Decoder**" | ||
], | ||
"metadata": { | ||
"id": "WQXngwnuX75_" | ||
} | ||
}, | ||
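{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"Analogous to the text-only [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoderdecoder) shown above, 🤗 [transformers](https://huggingface.co/docs/transformers/index) provides [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/visionencoderdecoder), which combines a pretrained Transformer-based vision encoder (for example, ViT) with a pretrained text decoder. As a minimal sketch (the checkpoint names are only illustrative), a ViT-to-GPT2 model can again be created in one line:" | ||
], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from transformers import VisionEncoderDecoderModel\n", | ||
"\n", | ||
"# Initialize a vit-to-gpt2 model from pretrained ViT/GPT2 models.\n", | ||
"# As in the text-only case, the cross-attention layers are randomly initialized\n", | ||
"# and need to be trained on a downstream image-to-text task.\n", | ||
"model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(\n", | ||
"    \"google/vit-base-patch16-224-in21k\", \"gpt2\"\n", | ||
")" | ||
] | ||
}, | ||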
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"## **Train an image-captioning**" | ||
], | ||
"metadata": { | ||
"id": "Vhg2qVDUUD74" | ||
} | ||
}, | ||
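{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"Below is a minimal, self-contained sketch of a single training step for image captioning with [VisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/visionencoderdecoder). A dummy (image, caption) pair stands in for a real dataset, and the checkpoint names, maximum caption length and learning rate are only illustrative. A full training script follows the same pattern: encode the images into `pixel_values`, tokenize the captions into `labels`, and let the model compute the cross-entropy loss." | ||
], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import torch\n", | ||
"from PIL import Image\n", | ||
"from transformers import GPT2TokenizerFast, ViTFeatureExtractor, VisionEncoderDecoderModel\n", | ||
"\n", | ||
"model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(\n", | ||
"    \"google/vit-base-patch16-224-in21k\", \"gpt2\"\n", | ||
")\n", | ||
"feature_extractor = ViTFeatureExtractor.from_pretrained(\"google/vit-base-patch16-224-in21k\")\n", | ||
"tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n", | ||
"\n", | ||
"# GPT2 has no padding token, so reuse the end-of-sequence token for padding.\n", | ||
"tokenizer.pad_token = tokenizer.eos_token\n", | ||
"\n", | ||
"# Tell the model which special tokens to use during training and generation.\n", | ||
"model.config.decoder_start_token_id = tokenizer.bos_token_id\n", | ||
"model.config.pad_token_id = tokenizer.pad_token_id\n", | ||
"model.config.eos_token_id = tokenizer.eos_token_id\n", | ||
"\n", | ||
"# A dummy (image, caption) pair standing in for a real image-captioning dataset.\n", | ||
"image = Image.new(\"RGB\", (224, 224))\n", | ||
"caption = \"a cat sitting on a couch\"\n", | ||
"\n", | ||
"pixel_values = feature_extractor(images=image, return_tensors=\"pt\").pixel_values\n", | ||
"labels = tokenizer(\n", | ||
"    caption, padding=\"max_length\", max_length=32, truncation=True, return_tensors=\"pt\"\n", | ||
").input_ids\n", | ||
"# Ignore the padding positions when computing the loss.\n", | ||
"labels[labels == tokenizer.pad_token_id] = -100\n", | ||
"\n", | ||
"optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)\n", | ||
"\n", | ||
"# One training step: the model builds the decoder inputs by shifting the labels internally.\n", | ||
"outputs = model(pixel_values=pixel_values, labels=labels)\n", | ||
"outputs.loss.backward()\n", | ||
"optimizer.step()\n", | ||
"optimizer.zero_grad()\n", | ||
"\n", | ||
"# Generation works the same way after fine-tuning on a real dataset.\n", | ||
"generated_ids = model.generate(pixel_values, max_length=16)\n", | ||
"print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))" | ||
] | ||
}, | ||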
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"" | ||
], | ||
"metadata": { | ||
"id": "4NOYPCblX22X" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
} | ||
] | ||
} |