
🐕 Batch: Resource monitoring with different input scenarios and systems #1162

Closed
5 tasks done
Malikbadmus opened this issue Jun 17, 2024 · 2 comments

@Malikbadmus (Contributor) commented Jun 17, 2024

Summary

We want to monitor and manage the performance and the various resources (such as CPU, memory, peak memory, and possibly others) used throughout a model's runtime.

Monitoring and tracking these resources can be a little complex, as every system has a different approach to exposing these metrics. For example, reading Docker memory usage statistics via cgroups relies on a Linux-specific feature; Windows and macOS do not use this filesystem layout for resource management, so we have to tailor our approach to account for containers on other systems (e.g. via WSL2 on Windows, or via Docker Desktop).

Also, cgroup v2 exposes only a subset of the memory stats: fields like max_usage and failcnt were not carried over to this version and are therefore not reported by the Docker driver.

We are also interested in the total physical memory consumed by the process, which requires taking into account memory that may have been cached or swapped to disk during runtime.
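As a minimal, Linux-only sketch of what "total physical memory" involves, the resident and swapped portions of a process can be read straight from /proc (field names as they appear in /proc/[pid]/status). This is exactly the kind of platform-specific code that psutil abstracts away:

```python
import os

def memory_usage_kib(pid="self"):
    """Read resident and swapped memory for a process from /proc (Linux only).

    Returns a dict with VmRSS and VmSwap in KiB, or None when /proc/<pid>/status
    is absent (e.g. on macOS or native Windows) -- illustrating why a portable
    layer such as psutil is needed.
    """
    status = f"/proc/{pid}/status"
    if not os.path.exists(status):
        return None
    usage = {}
    with open(status) as fh:
        for line in fh:
            # Lines look like "VmRSS:     12345 kB"
            if line.startswith(("VmRSS:", "VmSwap:")):
                key, value = line.split(":", 1)
                usage[key] = int(value.split()[0])
    return usage
```

The swapped portion (VmSwap) is what a naive RSS-only measurement would miss.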

Related to #1090

Objective(s)

  • Resource monitoring for the Python binary running the model, and for the Docker container running the model.
  • Work around platform-specific metrics to arrive at a solution that works across different systems.
  • Use psutil, which is platform-independent, instead of tracemalloc: tracemalloc does not provide memory information at the OS level and only tracks Python's internal allocator (pymalloc); it also wraps each malloc call to track memory, which adds overhead and slows down the run.
  • Track the accumulated CPU process times across sessions, including time spent waiting for blocking I/O to complete.
  • Track peak memory usage to ensure that sufficient memory is available to handle the highest memory demands of the model.
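A stdlib-only sketch of the CPU-time and peak-memory tracking described above. A production version would use psutil's Process.cpu_times() and memory_info() as proposed in the objectives; the resource module used here is Unix-only, and the snapshot shape is an assumption for illustration:

```python
import os
import resource
import sys

def runtime_snapshot():
    """Snapshot accumulated CPU times and peak memory for this process.

    os.times() gives accumulated user/system CPU time; getrusage gives the
    peak resident set size. ru_maxrss is reported in KiB on Linux but in
    bytes on macOS -- one of the portability quirks psutil hides.
    """
    t = os.times()
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024  # normalize macOS bytes to KiB
    return {"user_s": t.user, "system_s": t.system, "peak_kib": peak}
```

Sampling this at session start and end gives the accumulated deltas per run.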

Documentation

This blog post, which is referenced in the official psutil documentation, provides a great read.

Also, the Docker APIs for retrieving several resource metrics for a container can be found here.

The endpoint pertaining to our issue can be found here: Docker APIs.
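For illustration, this is roughly how `docker stats` derives memory usage from the /containers/{id}/stats payload: the raw usage counter minus the page cache, whose field name differs between cgroup v1 (total_inactive_file) and cgroup v2 (inactive_file). The sample payload below is made up, with field names following the Docker Engine API response:

```python
def memory_used_bytes(memory_stats):
    """Compute container memory usage the way the docker CLI does.

    Subtracts reclaimable page cache from the raw usage counter:
    cgroup v1 reports it as stats["total_inactive_file"], cgroup v2
    as stats["inactive_file"].
    """
    usage = memory_stats["usage"]
    stats = memory_stats.get("stats", {})
    cache = stats.get("total_inactive_file", stats.get("inactive_file", 0))
    return max(usage - cache, 0)

# Hypothetical payload shaped like a cgroup-v2 stats response:
sample = {"usage": 50 * 1024**2, "stats": {"inactive_file": 10 * 1024**2}}
```

Here memory_used_bytes(sample) yields 40 MiB rather than the raw 50 MiB counter.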

memory.peak was added to cgroup v2 by the Linux kernel maintainers here.
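A hedged sketch of reading that counter: memory.peak only exists on cgroup v2 with sufficiently recent kernels, so callers must tolerate its absence and fall back (e.g. to psutil sampling). The path and return convention here are illustrative assumptions, not Ersilia code:

```python
from pathlib import Path

def cgroup_memory_peak(cgroup_dir="/sys/fs/cgroup"):
    """Read the cgroup-v2 memory.peak counter (peak memory in bytes).

    Returns None when the file is missing (cgroup v1, older kernels,
    non-Linux hosts) or unreadable, so a fallback strategy can kick in.
    """
    peak_file = Path(cgroup_dir) / "memory.peak"
    try:
        return int(peak_file.read_text())
    except (OSError, ValueError):
        return None
```

This complements the per-process peak from getrusage with a container-wide view.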

@DhanshreeA (Member) commented:

Hey @Malikbadmus thank you for the issue! Could you describe the issue in a few lines? Feel free to copy what you mentioned on Slack here. :)

In the Objectives section, list out a few objectives - I can think of two - resource monitoring for the python binary running the model, and the docker container running the model.

Please also link relevant documentation (eg on docker stats, and psutil, or any other module you propose to use for this purpose).

@DhanshreeA (Member) commented:

This is done in #1161 and #1176
