-
Notifications
You must be signed in to change notification settings - Fork 4
Monitor processes and parallel workloads for hangs
License
grondo/io-watchdog
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
io-watchdog - The IO Watchdog. The IO Watchdog is a facility for monitoring user applications, most notably parallel jobs, for "hangs" which typically have a side-effect of ceasing all write activity (IO) in a cyclic application (i.e. an application that writes something to a log or data file during each cycle of computation). The io-watchdog attempts to monitor all write activity coming from an application and triggers a set of user-defined actions when IO has ceased for a configurable timeout period. The IO watchdog consists of a LD_PRELOAD library (the interposer) which intercepts calls to various output-related calls in libc, along with a watchdog server which wakes up periodically and ensures that the application has written something during the last timeout period. If not, the watchdog server issues a warning on the application's stderr, and invokes all user defined actions, which could include running a debugger on the application, sending email to the user, etc. Set up of the LD_PRELOAD library is facilitated with either the io-watchdog(1) utility, or a SPANK plug-in for SLURM which adds a new --io-watchdog command line option to srun(1). To enable the io-watchdog SLURM plugin, the following line must exist in /etc/slurm/plugstack.conf: required io-watchdog.so The io-watchdog supports the following tunable parameters: timeout The watchdog timeout. Default = 1 hour. rank The MPI rank for which the watchdog runs if a SLURM job. actions A list of actions to run on watchdog trigger. target A pattern match for target of io-watchdog if running multiple applications in a pipeline or single job. These may be set on the command line, or in an io-watchdog configuration file. Configuration files that are read automatically if they exist are /etc/io-watchdog/io-watchdog.conf System defaults ~/.io-watchdogrc User defaults. A config file may also be specified on the command line to override the default location of the user configuration. See "io-watchdog --help" and "srun --io-watchdog=help" for more information.
About
Monitor processes and parallel workloads for hangs
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published