Skip to content

ExamonHPC/ExaMLOps

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ExaMLOps

Overview

MLOps (Machine Learning Operations) is a set of practices that aims to streamline and automate the processes involved in developing, deploying, and maintaining machine learning models in production. MLOps integrates machine learning workflows with DevOps principles to create a seamless end-to-end pipeline. The goal is to enhance collaboration between data scientists and operations teams, ensure reproducibility, and enable scalable model management.

To implement MLOps for this project, we are using Kubeflow.

Kubeflow

Kubeflow is an open-source platform designed to facilitate the deployment, orchestration, and management of machine learning workflows on Kubernetes. Originally developed by Google, Kubeflow provides a unified toolkit for managing the entire machine learning lifecycle, from data preprocessing and training to deployment and monitoring. Its Kubernetes-based infrastructure allows for easy scaling and efficient resource management, making it ideal for production-grade machine learning operations.

With Kubeflow, we can:

  • Automate complex ML workflows and pipelines
  • Scale training and deployment across multiple nodes
  • Streamline collaboration across teams with reproducible and reusable components

By leveraging Kubeflow, we enhance the robustness and scalability of our MLOps implementation, ensuring that models are efficiently deployed and managed in a production environment.

For more information on Kubeflow, visit the official Kubeflow website.

ExaMLOps

ExaMLOps is an enhanced MLOps solution that integrates MLOps with the Examon monitoring system. Examon is specifically designed for high-performance computing (HPC) environments, capturing and monitoring signals from HPC systems in real-time. With ExaMLOps, we aim to develop machine learning models that predict and detect anomalies or specific behaviors in HPC system signals, such as temperature fluctuations, energy consumption spikes, and performance degradation.

By leveraging both Kubeflow and Examon, ExaMLOps enables us to:

  • Automate the end-to-end machine learning pipeline on HPC systems
  • Use real-time monitoring data to train and fine-tune models continuously
  • Detect and predict potential issues in HPC operations, helping improve system reliability and performance

This integration of MLOps and Examon provides a powerful solution for predictive monitoring and proactive maintenance in HPC environments, ensuring optimized performance and efficient resource utilization.

Example Projects:

Citation

If you find the tools or code in this repository useful for your research or work, please consider citing our paper:

@article{graafe,
  title={GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems},
  author={Molan, Martin and Ardebili, Mohsen Seyedkazemi and Khan, Junaid Ahmed and Beneventi, Francesco and Cesarini, Daniele and Borghesi, Andrea and Bartolini, Andrea},
  journal={Future Generation Computer Systems},
  year={2024},
  publisher={Elsevier}
}

Thank you for supporting our work! If you have any questions or suggestions, feel free to reach out through the repository's Issues section or contact us directly.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published