This repo implements a simple Vision Transformer (ViT) for a dummy classification task: predicting whether a person is wearing a hat or not. It covers:
- Image patcher and depatcher (see the patching sketch after this list)
- Positional encoding and Transformer encoder (my implementation follows "Attention Is All You Need")
- Model architecture
- Attention visualisation
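The patcher/depatcher mentioned above boils down to a few tensor reshapes. Below is a minimal sketch assuming PyTorch tensors, square images, and non-overlapping patches; the function names and defaults are illustrative, not the repo's actual API.

```python
import torch

def patchify(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches
    (B, num_patches, C * patch_size**2). Hypothetical helper; the
    repo's actual patcher may differ."""
    B, C, H, W = images.shape
    # Carve H and W into non-overlapping patch_size blocks.
    p = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/ps, W/ps, ps, ps) -> (B, num_patches, C * ps * ps)
    p = p.permute(0, 2, 3, 1, 4, 5).contiguous()
    return p.view(B, -1, C * patch_size * patch_size)

def depatchify(patches, image_size=224, patch_size=16, channels=3):
    """Inverse of patchify: reassemble flattened patches into images."""
    B = patches.shape[0]
    grid = image_size // patch_size
    p = patches.view(B, grid, grid, channels, patch_size, patch_size)
    p = p.permute(0, 3, 1, 4, 2, 5).contiguous()
    return p.view(B, channels, image_size, image_size)

# Round trip: depatchify(patchify(x)) recovers x for 224x224 RGB images.
x = torch.randn(2, 3, 224, 224)
assert torch.equal(depatchify(patchify(x)), x)
```

The depatcher is mainly useful for mapping attention weights back onto image patches when visualising what the model looks at.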
This repo contains three files: transformer_utils.py, ViT Experiments.py, and OBSERVATIONS.txt.
transformer_utils.py: contains the key components of the transformer encoder. In this file you will find:
- A single-head self-attention layer implementation (see the sketch after this list)
- A multi-head self-attention layer
- A positional encoder
- A transformer encoder, which you can download and import to quickly build your own architecture
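At their core, the attention and positional-encoding pieces are small. Here is a minimal PyTorch-style reference sketch of single-head scaled dot-product self-attention and the sinusoidal positional encoding from "Attention Is All You Need"; the class/function names and shapes are assumptions, not necessarily the API exposed by transformer_utils.py.

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Illustrative single-head self-attention (scaled dot-product)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(embed_dim, embed_dim)
        self.v = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim), e.g. a sequence of patch embeddings
        q, k, v = self.q(x), self.k(x), self.v(x)
        # softmax(Q K^T / sqrt(d)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = scores.softmax(dim=-1)  # attention map, useful for visualisation
        return weights @ v

def sinusoidal_positional_encoding(seq_len, embed_dim):
    """Fixed sin/cos positional encoding from "Attention Is All You Need".
    Assumes an even embed_dim."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, embed_dim, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / embed_dim))
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe  # (seq_len, embed_dim), added to the patch embeddings
```

The multi-head version simply runs several such heads on lower-dimensional projections and concatenates their outputs before a final linear layer.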
ViT Experiments.py: the notebook where I trained my transformer model to classify hat vs. no-hat images. All preprocessing and visualisation are done here.
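For orientation, a training run in the notebook comes down to a standard supervised loop. The sketch below uses a placeholder model and random tensors so it is self-contained; the real notebook uses the ViT assembled from transformer_utils.py and the 471-image hat dataset, trained for 100 epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 224x224 RGB images, labels hat = 1 / no hat = 0.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 2, (32,))
loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

# Placeholder model; the notebook uses the ViT built from transformer_utils.py.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for epoch in range(5):  # the repo's runs use 100 epochs
    for x, y in loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```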
OBSERVATIONS.txt: a summary of the observations and corrections I made, which I think will help with better understanding.
The dataset is small (471 images). Link to the data: here
The emphasis of this repo was more on model architecture than on performance. That being said, the results were good:
- EPOCHS: 100
- TRAIN: 0.93
- TEST: 0.91