
🐕 Batch: Resource monitoring with different input scenarios and systems #1162

Closed
5 tasks done
Malikbadmus opened this issue Jun 17, 2024 · 2 comments

@Malikbadmus (Contributor) commented Jun 17, 2024

Summary

We want to monitor and manage the performance and the various resources (such as CPU, memory, peak memory, and possibly others) used throughout a model's runtime.

Monitoring and tracking these resources can be a little complex, as every system has a different approach to exposing these metrics. For example, reading Docker memory usage statistics via cgroups relies on a Linux-specific feature; Windows and macOS do not use this filesystem layout for resource management, so we have to tailor our approach to account for containers on other systems (e.g. via WSL2 on Windows, or via Docker Desktop).

Also, cgroup v2 exposes only a subset of the memory stats: fields like max_usage and failcnt were not carried over to this version and are therefore not reported by the Docker driver.

We are also interested in the total physical memory consumed by the process, which requires taking into account memory that may have been cached or swapped to disk during runtime.
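As a minimal, Linux-only sketch of what "total physical memory" involves, the resident and swapped portions of a process can be read straight from /proc (field names as they appear in /proc/[pid]/status). This is exactly the kind of platform-specific code that psutil abstracts away:

```python
import os

def memory_usage_kib(pid="self"):
    """Read resident and swapped memory for a process from /proc (Linux only).

    Returns a dict with VmRSS and VmSwap in KiB, or None when /proc/<pid>/status
    is absent (e.g. on macOS or native Windows) -- illustrating why a portable
    layer such as psutil is needed.
    """
    status = f"/proc/{pid}/status"
    if not os.path.exists(status):
        return None
    usage = {}
    with open(status) as fh:
        for line in fh:
            # Lines look like "VmRSS:     12345 kB"
            if line.startswith(("VmRSS:", "VmSwap:")):
                key, value = line.split(":", 1)
                usage[key] = int(value.split()[0])
    return usage
```

The swapped portion (VmSwap) is what a naive RSS-only measurement would miss.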

Related to #1090

Objective(s)

  • Resource monitoring for the Python binary running the model, and for the Docker container running the model.
  • Work around platform-specific metrics to arrive at a solution that works across different systems.
  • Use psutil, which is platform-independent, instead of tracemalloc: tracemalloc does not provide memory information at the OS level and only tracks Python's internal allocator (pymalloc); it also wraps each malloc call to track memory, which adds overhead and slows down the run.
  • Track the accumulated CPU process times across sessions, including time spent waiting for blocking I/O to complete.
  • Track peak memory usage to ensure that sufficient memory is available to handle the highest memory demands of the model.
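A stdlib-only sketch of the CPU-time and peak-memory tracking described above. A production version would use psutil's Process.cpu_times() and memory_info() as proposed in the objectives; the resource module used here is Unix-only, and the snapshot shape is an assumption for illustration:

```python
import os
import resource
import sys

def runtime_snapshot():
    """Snapshot accumulated CPU times and peak memory for this process.

    os.times() gives accumulated user/system CPU time; getrusage gives the
    peak resident set size. ru_maxrss is reported in KiB on Linux but in
    bytes on macOS -- one of the portability quirks psutil hides.
    """
    t = os.times()
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024  # normalize macOS bytes to KiB
    return {"user_s": t.user, "system_s": t.system, "peak_kib": peak}
```

Sampling this at session start and end gives the accumulated deltas per run.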

Documentation

This blog post, which is referenced in the official psutil documentation, provides a great read.

Also, the Docker APIs for retrieving several resource metrics for a container can be found here.

The endpoint pertaining to our issue can be found here: Docker APIs.
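For illustration, this is roughly how `docker stats` derives memory usage from the /containers/{id}/stats payload: the raw usage counter minus the page cache, whose field name differs between cgroup v1 (total_inactive_file) and cgroup v2 (inactive_file). The sample payload below is made up, with field names following the Docker Engine API response:

```python
def memory_used_bytes(memory_stats):
    """Compute container memory usage the way the docker CLI does.

    Subtracts reclaimable page cache from the raw usage counter:
    cgroup v1 reports it as stats["total_inactive_file"], cgroup v2
    as stats["inactive_file"].
    """
    usage = memory_stats["usage"]
    stats = memory_stats.get("stats", {})
    cache = stats.get("total_inactive_file", stats.get("inactive_file", 0))
    return max(usage - cache, 0)

# Hypothetical payload shaped like a cgroup-v2 stats response:
sample = {"usage": 50 * 1024**2, "stats": {"inactive_file": 10 * 1024**2}}
```

Here memory_used_bytes(sample) yields 40 MiB rather than the raw 50 MiB counter.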

memory.peak was added to cgroup v2 by the Linux kernel maintainers here.
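A hedged sketch of reading that counter: memory.peak only exists on cgroup v2 with sufficiently recent kernels, so callers must tolerate its absence and fall back (e.g. to psutil sampling). The path and return convention here are illustrative assumptions, not Ersilia code:

```python
from pathlib import Path

def cgroup_memory_peak(cgroup_dir="/sys/fs/cgroup"):
    """Read the cgroup-v2 memory.peak counter (peak memory in bytes).

    Returns None when the file is missing (cgroup v1, older kernels,
    non-Linux hosts) or unreadable, so a fallback strategy can kick in.
    """
    peak_file = Path(cgroup_dir) / "memory.peak"
    try:
        return int(peak_file.read_text())
    except (OSError, ValueError):
        return None
```

This complements the per-process peak from getrusage with a container-wide view.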

@DhanshreeA (Member) commented:

Hey @Malikbadmus thank you for the issue! Could you describe the issue in a few lines? Feel free to copy what you mentioned on Slack here. :)

In the Objectives section, list out a few objectives - I can think of two - resource monitoring for the python binary running the model, and the docker container running the model.

Please also link relevant documentation (eg on docker stats, and psutil, or any other module you propose to use for this purpose).

@DhanshreeA (Member) commented:

This is done in #1161 and #1176
