Transformer model implementations from scratch in PyTorch, written to make these architectures more accessible for research purposes.
The best way to understand is to learn by doing.
Each example lives in a single notebook for readability and ease of understanding.
**Each example is implemented from scratch using PyTorch 2.0**
Task | dataset link | PyTorch 2.0 | description |
---|---|---|---|
text classification | clinc_oos | ✅ | encoder model for text classification |
masked language modeling | clinc_oos | ✅ | encoder model pretraining with MLM style |
electra language modeling | clinc_oos | ✅ | encoder model pretraining with ELECTRA style |
causal language modeling | mark-twain-books | ✅ | decoder model pretraining with GPT style and KV-cache for fast inference |
knowledge distillation | clinc_oos | ✅ | initialization of a model from a pretrained model |
seq2seq modeling | Flickr-30k | ✅ | seq2seq model training for image captioning with KV-cache |
adapters | clinc_oos | ✅ | LoRA and DoRA for parameter-efficient tuning |
vit | Scene-classification | ✅ | Vision Transformer (ViT) for image classification |
detr | Global-Wheat-Detection | ✅ | implementation of DETR (DEtection TRansformer), an encoder-decoder model for object detection |
clip | Flickr-30k | ✅ | implementation of contrastive language-image pre-training (CLIP) |
vision language multimodal-I | COCO | ✅ | a minimal vision-language model with image-text fusion that generates image captions, with RoPE and KV-cache |
vision language multimodal-II | COCO | ✅ | a multimodal model with image-text fusion that generates image captions, with RoPE and KV-cache; it can be extended to visual question answering, open-vocabulary object detection, and optical character recognition |
PaliGemma | Flickr-30k | ✅ | from-scratch implementation of PaliGemma, a multimodal model from Google AI |
**More to come** | | | |
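
The adapters entry in the table refers to LoRA, which keeps the pretrained weight frozen and learns a low-rank update. As a rough illustration of the idea, here is a dependency-free sketch in plain Python rather than PyTorch. The function names (`matmul`, `lora_effective_weight`, `lora_trainable_params`) and the toy sizes are assumptions made for this example and are not taken from the notebooks.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA layer effectively applies."""
    r = len(a)               # rank = number of rows of A
    scale = alpha / r
    delta = matmul(b, a)     # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_in, d_out, r):
    """Number of trainable parameters in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


random.seed(0)
d_in, d_out, r = 8, 6, 2     # toy sizes, chosen only for illustration
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # small random init
b = [[0.0] * r for _ in range(d_out)]                                   # zero init

# With B initialised to zero the adapted layer starts out identical to the base layer.
w_eff = lora_effective_weight(w, a, b, alpha=4)
print(w_eff == w)
print(lora_trainable_params(768, 768, 8), "trainable vs", 768 * 768, "frozen")
```

Only `a` and `b` would receive gradients during fine-tuning, which is where the parameter savings come from. DoRA additionally separates the magnitude and direction of the weight and is not covered by this sketch.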

    from VyomAI import EncoderModel, EncoderForMaskedLM
    from VyomAI import EncoderConfig

    config = EncoderConfig()
    encoder = EncoderModel(config, pos_embedding_type='rope')
    # pos_embedding_type supported: Absolute, sinusoidal, RoPE
    # attention_type supported: gqa, Vanila

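
To give a feel for what the `rope` option refers to, the following is a minimal plain-Python sketch of rotary position embeddings. It is not the library's implementation: the helper names `rope` and `dot`, the interleaved pairing of dimensions, and the base of 10000 are assumptions for illustration, and the actual code may pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25]   # toy query vector
k = [1.5, 0.5, -0.75, 1.0]   # toy key vector

# The attention score depends only on the offset between positions (2 in both cases).
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 12), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, vector norms are unchanged, and the query-key score encodes relative rather than absolute position.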
At a granular level, it supports the following components:
Component | Description |
---|---|
Encoder | Text encoder model with a BERT-like architecture that supports absolute, sinusoidal, and RoPE embeddings and GQA or vanilla attention |
Decoder | Text decoder model with a GPT-like architecture that supports absolute, sinusoidal, and RoPE embeddings, GQA or vanilla attention, and KV-cache for fast inference |
Seq2Seq | Model with a BART-like architecture that supports absolute, sinusoidal, and RoPE embeddings, GQA or vanilla attention, and KV-cache for fast inference; the encoder can be text or image type |
VisionEncoder | Model with a ViT-like architecture for image encoding (more to come) |
Multimodel | A minimal vision-language model (more to come) |
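
The KV-cache mentioned for the Decoder and Seq2Seq components stores the keys and values of earlier tokens so that each decoding step only has to process the newest token. Below is a small single-head sketch in plain Python. The `KVCache` class, `attend`, `decode_step`, and the toy vectors are hypothetical and only illustrate the idea; they are not the library's classes.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class KVCache:
    """Grows by one key/value pair per generated token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Add the new token's key/value to the cache, then attend over everything cached."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


# Toy (query, key, value) projections for three tokens; the numbers are made up.
tokens = [
    ([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),
    ([0.0, 1.0], [1.0, -1.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.25, 0.75], [3.0, -1.0]),
]

cache = KVCache()
cached_outputs = [decode_step(cache, q, k, v) for q, k, v in tokens]

# Reference: recompute attention for the last token over the full prefix.
all_keys = [k for _, k, _ in tokens]
all_values = [v for _, _, v in tokens]
full_last = attend(tokens[-1][0], all_keys, all_values)
print(cached_outputs[-1] == full_last)
```

Appending before attending also enforces causality, since a token can only see keys and values that are already in the cache. With GQA, several query heads would share one cached key/value head, which shrinks the cache further.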
We appreciate all contributions. If you want to contribute new features, utility functions, or tutorials please open an issue and discuss the feature with us.
Some helpful learning resources