Personal "cheatsheet" repository for my ideal machine learning tech stack. I use this repository to play around and familiarize myself with ML libraries, advanced git and GitHub features, virtualization and so on 🤓.
- The classics
- PyTorch, Lightning and W&B
- xformers
- transformers
- DeepSpeed
- Colossal-AI
- spaCy
- nvidia-ml-py3
- albumentations
- augly
- einops
- bitsandbytes
- vLLM
- SkyPilot
- Protected Branches
- Tags and Releases
- LFS
- Hidden Directory
- CircleCI
- GitHub Actions
- GitHub Pages
- Others
- Python
- RainbowCSV
- Remote
- CoPilot
- GitLens
- Docker
- Jupyter
- Gitignore
- vscode-pdf
- GitToolBox
- CoPilot
- Docker
- NumPy - Math operations, manipulations, linear algebra and more.
- Pandas - Tabular data management.
- Matplotlib and Seaborn - All sorts of plots.
- OpenCV, Pillow, and scikit-image - Image manipulation.
PyTorch is currently the reference ML framework for Python.
Weights and Biases (W&B) makes it easy to track experiments, performance, parameters and so on in a single place.
PyTorch Lightning gets rid of most of the usual PyTorch boilerplate code, like train/val/test loops, backward and optimizer steps and so on. It also makes it easy to use powerful PyTorch features and other libraries (like W&B) by passing just a few optional parameters here and there.
xformers allows you to define transformer architecture easily. It also features the latest and hottest techniques.
HuggingFace🤗 makes it easy to download, fine-tune and deploy pre-trained transformer models across a multitude of applications. It is also possible to share models and datasets on the platform, as well as "spaces", which are interactive live demos of the capabilities of the created models.
Related libraries:
- Datasets provides efficient loading of custom or common dataset samples (even online).
- Diffusers is HuggingFace🤗's package for diffusion models specifically. It comes with pre-trained SOTA models for vision and audio generation.
- timm provides a multitude of pre-trained and vanilla image models.
- Safetensors is HuggingFace🤗's package for storing tensors safely (unlike pickle files, loading them cannot execute arbitrary code).
- accelerate takes care of automatically finding the best available device for training (PyTorch).
- optimum provides multiple features to accelerate training and inference.
- tokenizers makes it simple to carry out popular tokenizations.
- evaluate allows you to evaluate and compare trained models.
- peft (Parameter-Efficient Fine-Tuning) provides implementations of algorithms like LoRA, which speed up fine-tuning while reducing memory consumption.
- xformers provides optimized implementations of the operations carried out in transformers (e.g. Memory Efficient Attention).
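The Safetensors entry above contrasts the format with pickle files. Here is a minimal sketch of why pickle is considered unsafe, using only the standard library. The `Payload` class is hypothetical and deliberately harmless; it only illustrates that loading a pickle can call whatever the file names.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle triggers a call when loaded."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" instead of object state.
        # A malicious file could name a far more dangerous callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
# Loading does not rebuild a Payload: it invokes the stored callable.
loaded = pickle.loads(blob)
print(loaded)
```

A tensor-only format avoids this class of problem because loading never has to call anything stored in the file.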
DeepSpeed allows for distributed high-performance and efficient training. DeepSpeed is supported in PyTorch Lightning.
Colossal-AI is a framework that improves the efficiency and speed of large model training, especially for HPC clusters.
spaCy offers a multitude of features and pre-trained pipelines for NLP tasks (like HuggingFace, but just for NLP).
nvidia-ml-py3 gives access to information about NVIDIA GPUs directly from Python code.
All sorts of popular image augmentations, like ColorJitter, ZoomBlur, Gaussian Noise... are implemented by albumentations.
augly is a data augmentation library for text, sound, image and video.
Manipulation of tensors (reshaping, concatenating, ...) with einops is extremely intuitive and time-saving.
bitsandbytes allows training with 8-bit precision. It is particularly useful for fine-tuning very large models.
vLLM is a high-level library for running LLM inference efficiently.
SkyPilot makes it easy to run LLM inference and more on any cloud platform (Google, AWS, Azure, ...).
Hydra allows you to manage multiple configurations smoothly and to define custom CLI commands. It is similar to jsonargparse and LightningCLI.
SciencePlots produces much nicer plots than the default matplotlib and seaborn styles.
python-dotenv allows you to define environment variables in a file and read them from there.
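To illustrate the idea of keeping environment variables in a file, here is a minimal standard-library sketch. It is a stand-in for the concept, not python-dotenv's implementation (the library's own entry point is `load_dotenv`). The helper `load_env_file` and the variable names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Read KEY=VALUE lines from a file into os.environ; return the parsed pairs."""
    parsed = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip().strip('"').strip("'")
    # Variables already set in the environment win over the file.
    for key, value in parsed.items():
        os.environ.setdefault(key, value)
    return parsed


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    # Hypothetical settings, written here only so the sketch is self-contained.
    env_path.write_text('# demo settings\nPROJECT_NAME="cheatsheet"\nBATCH_SIZE=32\n')
    settings = load_env_file(env_path)

print(settings)
```

Values always arrive as strings, so numeric settings such as `BATCH_SIZE` still need an explicit `int(...)` at the point of use.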
yacs allows you to manage configurations such as hyperparameters for experiments.
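The pattern behind this kind of configuration management is a set of defaults plus per-experiment overrides. Below is a rough sketch of that pattern with plain dictionaries. It does not use the yacs API; `DEFAULTS`, `apply_overrides` and all option names are assumptions made for illustration.

```python
import copy

# Hypothetical default hyperparameters for an experiment.
DEFAULTS = {
    "model": {"name": "resnet18", "dropout": 0.1},
    "train": {"lr": 1e-3, "epochs": 10},
}


def apply_overrides(defaults, overrides):
    """Return a copy of defaults with dotted-key overrides applied."""
    config = copy.deepcopy(defaults)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]
        # Reject typos instead of silently adding new options.
        if leaf not in node:
            raise KeyError(f"unknown option: {dotted_key}")
        node[leaf] = value
    return config


config = apply_overrides(DEFAULTS, {"train.lr": 3e-4, "model.dropout": 0.0})
print(config["train"])
```

Copying the defaults before applying overrides keeps every experiment's configuration independent of the others.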
poetry allows for easy dependency management and packaging.
Conda allows you to easily create and share virtual environments. The
command conda env export > environment.yml
creates a .yml file that can be used to create an identical virtual
environment.
Docker packages an application together with its whole user-space environment (system libraries and dependencies) into an isolated container.
Hyper.js, Alacritty and Kitty are among the most popular terminals on r/unixporn and run on multiple operating systems.
iTerm2 is a macOS-only terminal emulator with lots of functionalities.
Oh My Zsh is available on Unix-like machines. It provides terminal plug-ins and themes.
tmate allows you to connect via SSH to a machine that is not directly reachable from the internet. A sort of TeamViewer for SSH.
rich is a library to create amazing looking CLIs.
yabai together with skhd provides a nice window-manager-like experience on macOS.
- ~/.ssh/config and ~/.ssh/authorized_keys ➡️ Files to define known host names and authorized ssh keys.
- nvidia-smi ➡️ Check the current status of NVIDIA cards.
- ps, top, htop ➡️ Check currently running processes.
- bpytop ➡️ Like htop, but better.
- nvitop ➡️ Like nvidia-smi, but better.
- tmux ➡️ Terminal multiplexer, makes it easy to detach jobs.
- Fig ➡️ Intellisense (and much more) for command line commands.
- sshfs ➡️ Mounts file systems over ssh.
- ranger ➡️ CLI file browser with possible image preview on terminals like kitty and after installing w3m.
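As a small illustration of the nvidia-smi entry above, the sketch below reads GPU status from Python through the tool's CSV query mode (`--query-gpu` with `--format=csv,noheader,nounits`). The choice of fields and the sample line are assumptions made here, so the parser can be tried on a machine without a GPU.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, row)) for row in rows if row]


def query_gpus():
    """Call nvidia-smi; this needs an NVIDIA driver on the machine."""
    command = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Hypothetical sample output, used instead of a real call to nvidia-smi.
sample = "NVIDIA A100-SXM4-40GB, 1024, 40960, 17\n"
gpus = parse_gpu_csv(sample)
print(gpus)
```

On a GPU machine, `query_gpus()` returns the same structure for the installed cards.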
HPC clusters typically use a cluster management and job scheduling tool. Slurm allows you to schedule jobs, handle priorities, design partitions and much more. Cheatsheet files for Slurm are under the /slurm folder. The library submitit allows you to switch seamlessly between executing on Slurm and locally.
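To make the job-scheduling idea concrete, here is a hedged sketch that generates a Slurm batch script from Python. It does not use submitit. The function name, partition name and training command are placeholders, and the `#SBATCH` options are common ones that a given cluster may restrict or rename.

```python
def make_sbatch_script(job_name, command, partition, gpus=1, hours=1):
    """Build the text of a Slurm batch script for a single-node job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={hours:02d}:00:00",
        f"#SBATCH --output={job_name}-%j.out",  # %j expands to the job id
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the partition name and training command are placeholders.
script = make_sbatch_script("train-demo", "python train.py", partition="gpu")
print(script)
```

Writing the result to a file and passing that file to `sbatch` submits the job.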
Taking the time to go through most of GitHub's Documentation at least once is very important. Here are a few features to keep in mind.
Protected branches prevent code from being pushed directly onto selected branches.
Important commits can be tagged. Then, jumping to a tagged commit is as easy as:
git checkout <tag-name>
Git Large File Storage (LFS) allows you to push bigger files to the GitHub repository. Careful: there is a global usage quota per GitHub account that applies across repositories.
Hidden Directory
The .github
directory helps keep the landing page of the GitHub repository "clean" and includes:
- CONTRIBUTING.md ➡️ Guidelines to contribute to the repository.
- ISSUE_TEMPLATE.md ➡️ Template for issues.
- PULL_REQUEST_TEMPLATE.md ➡️ Template for pull requests.
- README.md ➡️ Repository's README (i.e. this) file.
- workflows ➡️ Directory which contains .yaml files for GitHub actions.
CircleCI hosts CI/CD pipelines and workflows, similarly to GitHub Actions.
GitHub Actions allows you to execute custom actions automatically when triggered by certain events (pull requests, pushes, opened issues, ...).
GitHub Pages allows you to host a webpage for each GitHub repository.
GitBook makes it simple to create documentation starting from a GitHub repository.
Pre-commit allows you to create customized pre-commit hooks to, e.g., run formatting or testing before committing. Some nice things to include there:
- Black formats Python files in compliance with PEP 8.
- autopep8 automatically formats files to be compliant with PEP 8.
- yapf is like autopep8, but with a search algorithm for the best possible formatting.
- isort automatically sorts import statements in Python files.
- flake8 wraps other tools to check for Python errors (pyflakes), correct use of PEP 8 conventions (pycodestyle) and more.
- pylint, similarly to flake8, analyzes the code and checks for errors without actually running it.
- ruff is yet another Python linter that can replace isort, flake8 and autoflake. It is also extremely fast.
- mypy is a static type checker that verifies type hints without running the code, which helps to migrate regular Python gradually towards statically typed code.
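As an illustration of what a type checker such as mypy adds on top of the linters above, here is a small typed function. The function and its names are hypothetical. The commented-out call at the end runs fine in plain Python, but a static checker is expected to reject it because the argument does not match the annotation.

```python
from typing import Optional


def parse_epochs(raw: Optional[str], default: int = 10) -> int:
    """Convert a CLI-style string into an epoch count, falling back to a default."""
    if raw is None:
        return default
    return int(raw)


epochs = parse_epochs("25")
print(epochs + 1)

# A static checker is expected to flag the next line without running it,
# because an int is passed where Optional[str] is declared:
# parse_epochs(25)
```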
Shields.io allows you to put neat banners in README files, such as the number of stars of the repository.
I find it extremely satisfying to build an actual prototype or product out of a Machine Learning project. Here are my favourite options:
To quickly create interactive apps based on trained machine learning models, gradio and streamlit are among the most popular frameworks. While it is easy to prototype using these frameworks, more complex applications are better built with a more complete stack. Figma is currently the best tool I could find to design an app / website.
On the frontend, Next.js is one of the most popular frameworks. It builds on top of React and provides additional functionalities and optimizations. Tailwind CSS allows for easy styling without the need for CSS style sheets. Chakra UI comes with pre-built and nice looking components. It also offers support for dark mode.
Since we are interested in Machine Learning applications, it makes sense to pick a python backend.
FastAPI is a Python backend framework that is extremely simple to set up and highly optimized for speed. Django and Flask are more popular frameworks. Django is a full-stack framework meant for big projects with a clearly defined structure, whereas Flask is lightweight and meant for smaller projects.
Auth0 allows for authentication and authorization. Stripe is a popular tool to deal with payments. Testing APIs is easily done with Postman.
MySQL, PostgreSQL, Redis and MongoDB are all very valid and popular databases.
PostgreSQL is preferable over MySQL for its better support for JSON data. Redis is a key-value database, which is very fast and useful for caching. MongoDB is a document-oriented database, which is very flexible and easy to use.
Prisma is a Node.js database toolkit compatible with MySQL, PostgreSQL, SQLite and SQL Server. It makes it easy to create and manage databases.
Applications can be hosted on a number of services. Heroku, DigitalOcean, AWS, Google Cloud and Microsoft Azure are among the most popular solutions.
Here are a few things that are not really ML-related but that I use in my work environment and find worth mentioning.
Gnome-look.org offers a variety of themes for Linux machines. My personal favourite is the Orchis GTK theme.
Window managers allow you to customize the look and feel of the desktop environment while making development more efficient (the idea is that you should never take your hands off the keyboard). I use i3, which is one of the most popular window managers for Linux.
Iriun allows you to use an iPhone or iPad as a webcam for a Linux or Windows machine, while UxPlay provides screen-mirroring of iPhone and iPad devices. Both are super useful for presentations, meetings, recording videos and so on.
Notion is possibly the best note-taking app out there. Full stop.
Clockify allows you to track the time spent on different projects. It is useful to stay aware of your productivity.
A few very helpful Chrome extensions are: NordPass, Acrobat Reader, and Grammarly.