-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Validation Data Collector #96
base: main
Are you sure you want to change the base?
Conversation
2523ea1
to
16e9a86
Compare
Stuff is gonna move around, don't worry about reviewing too in-depth for now |
Part of google#96
- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT. - Allows a user to bind processes to specific CPUs. - Allows local_worker_pool to be used outside of a context manager - Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker) Part of google#96
- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT. - Allows a user to bind processes to specific CPUs. - Allows local_worker_pool to be used outside of a context manager - Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker) Part of google#96
Will be used directly by google#96
Will be used directly by #96
- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT. - Allows a user to bind processes to specific CPUs. - Allows local_worker_pool to be used outside of a context manager - Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker) Part of google#96
* Add pause/resume/context to workers - Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT. - Allows a user to bind processes to specific CPUs. - Allows local_worker_pool to be used outside of a context manager - Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker) Part of #96
- Allows concurrent evaluation of models on a separate dataset during training, with --validation_data_path - This is done with minimal impact on training time by only utilizing the CPU for the validation dataset when it is mostly idle doing tf.train(), and pinning processes to specific CPUs - The amount of impact can be adjusted via a gin.config on cpu_affinity.py - CPU affinities are only optimized for internal AMD-Zen based systems at the moment, but can be extended in the future.
16e9a86
to
bd8f87a
Compare
_NR_CPUS = psutil.cpu_count() | ||
|
||
_CPU_CONFIG = { # List of CPU numbers in cache-sharing order. | ||
# 'google-epyc' assumes logical core 0 and N/2 are the same physical core. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can be probably named something more neutral, like 'default'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'default' kind of implies it'd work fine for any system, whereas we can only guarantee it'll work fine on a google server running Epyc CPUs
@@ -71,6 +71,8 @@ def compile_fn( | |||
cancelled work. | |||
RuntimeError: if llvm-size produces unexpected output. | |||
""" | |||
if cancellation_manager is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird... hmm, should we just not pass cancellation_manager
because there's self._cancellation_manager
anyway?
self._running_policy = None | ||
self._default_futures: List[worker.WorkerFuture] = [] | ||
self._current_work: List[Tuple[corpus.ModuleSpec, worker.WorkerFuture]] = [] | ||
self._last_time = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the time stuff anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the time stuff is purely cosmetic, just to get the total wall time spent compiling validation modules
--validation_data_path
Missing tests at the moment, will add after the interface is more concrete