SUSTech EE340 Project 1: Mnist classification.
I implement a simple Vision Transformer (ViT) model with Pytorch.
This model was proposed by Dosovitskiy et al. in the paper "An image is worth 16x16 words: Transformers for image recognition at scale" (2020).
python main.py
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.