
awesomelistsio/awesome-ai-infrastructure


Awesome AI Infrastructure


A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.


Distributed Training

  • Horovod - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.
  • Ray - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.
  • PyTorch Distributed - Tools and libraries for distributed training in PyTorch.
  • DeepSpeed - A deep learning optimization library from Microsoft that reduces the memory and compute cost of large-scale distributed training, notably through its ZeRO optimizations.
  • MPI for Machine Learning - Using the Message Passing Interface (MPI) standard for distributed machine learning.

Model Serving and Deployment

  • TensorFlow Serving - A flexible, high-performance serving system for machine learning models.
  • TorchServe - A model serving framework for PyTorch, providing fast and efficient model deployment.
  • NVIDIA Triton Inference Server - A scalable model serving platform supporting multiple frameworks.
  • ONNX Runtime - A cross-platform, high-performance inference engine for running ONNX models.
  • Seldon Core - An open-source platform for deploying and monitoring machine learning models on Kubernetes.
  • KServe (formerly KFServing) - A Kubernetes-based model serving platform that originated in the Kubeflow project.

MLOps and Automation

  • MLflow - An open-source platform for managing the end-to-end machine learning lifecycle.
  • Kubeflow - A platform for orchestrating machine learning workflows on Kubernetes.
  • DVC (Data Version Control) - A tool for version control and reproducibility in machine learning projects.
  • ZenML - An extensible MLOps framework for creating portable, production-ready machine learning pipelines.
  • Apache Airflow - A platform for authoring, scheduling, and monitoring complex workflows, commonly used to orchestrate machine learning pipelines.
  • Metaflow - A human-centric framework, originally developed at Netflix, for building and managing real-life data science projects.

Data Management

  • Delta Lake - An open-source storage layer that brings reliability to data lakes.
  • Apache Hudi - A data management framework that simplifies incremental data processing and streaming analytics.
  • Feast - An open-source feature store for managing and serving machine learning features.
  • Great Expectations - A tool for data validation and testing in machine learning workflows.
  • LakeFS - An open-source data versioning platform for managing data lakes.

Optimization Tools

  • NVIDIA TensorRT - A high-performance deep learning inference optimizer and runtime.
  • Apache TVM - A deep learning compiler stack for optimizing models on various hardware backends.
  • Intel OpenVINO - A toolkit for optimizing and deploying AI inference on Intel hardware.
  • OctoML - An AI model optimization platform for efficient deployment on edge and cloud.
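  • Quantization Aware Training (QAT) - A technique that simulates low-precision arithmetic during training so the model keeps its accuracy after quantization; supported by tooling in major frameworks such as TensorFlow and PyTorch.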

Infrastructure as Code

  • Terraform - A tool for building, changing, and versioning infrastructure safely and efficiently.
  • Pulumi - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.
  • Ansible - An open-source automation tool for provisioning and managing infrastructure.
  • AWS CloudFormation - A service for automating AWS resource deployment and management.
  • Google Cloud Deployment Manager - An infrastructure deployment and management service for Google Cloud Platform.

Cloud Platforms

  • AWS SageMaker - A comprehensive platform for building, training, and deploying machine learning models on AWS.
  • Google AI Platform (now Vertex AI) - Google Cloud’s integrated environment for AI development and deployment.
  • Azure Machine Learning - A cloud-based platform for training, deploying, and managing machine learning models.
  • IBM Watson Studio - A suite of tools for data science, machine learning, and AI model development.
  • Paperspace Gradient - A cloud platform for developing, training, and deploying machine learning models.

Learning Resources

Books

  • Machine Learning Engineering by Andriy Burkov - A book on building scalable machine learning infrastructure.
  • Building Machine Learning Powered Applications by Emmanuel Ameisen - A guide to building robust ML applications in production.
  • Designing Data-Intensive Applications by Martin Kleppmann - A comprehensive guide to building scalable and reliable data systems.
  • Introducing MLOps by Mark Treveil and the Dataiku Team - A book on best practices for MLOps and model deployment.
  • Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood - A book on applying SRE principles to build resilient machine learning systems in production.

Community

Contribute

Contributions are welcome!

License

CC0
