Introduce Validation Data Collector #96

Open
wants to merge 1 commit into main

Conversation

Northbadge (Contributor)

  • Allows concurrent evaluation of models on a separate dataset during training, enabled via --validation_data_path
  • Keeps the impact on training time minimal by using the CPU for the validation dataset only while it is mostly idle during tf.train(), and by pinning processes to specific CPUs (see the sketch below)
  • The amount of impact can be adjusted via a gin config in cpu_affinity.py
  • CPU affinities are only tuned for internal AMD Zen-based systems at the moment, but can be extended in the future.

Tests are missing at the moment; I'll add them once the interface is more concrete.
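Below is a minimal sketch of the gin-configurable CPU pinning idea, assuming psutil; the helper name pin_to_cpus and its fraction parameter are hypothetical, not the actual interface of cpu_affinity.py:

```python
# Sketch only: illustrates gin-configurable CPU pinning with psutil.
# `pin_to_cpus` and its `fraction` knob are hypothetical, not this PR's API.
import gin
import psutil


@gin.configurable
def pin_to_cpus(pid: int, cpus: list, fraction: float = 1.0) -> None:
  """Pins `pid` to the first `fraction` of the CPUs listed in `cpus`."""
  count = max(1, int(len(cpus) * fraction))
  psutil.Process(pid).cpu_affinity(cpus[:count])
```

Adjusting fraction (or the CPU list itself) from a gin config is the kind of knob that lets validation collection trade CPU share against training-time impact without touching the training loop.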

@Northbadge (Contributor Author)

Stuff is going to move around, so don't worry about reviewing too in-depth for now.

Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 11, 2022
Northbadge mentioned this pull request Aug 11, 2022
Northbadge added a commit that referenced this pull request Aug 11, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 11, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 11, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 16, 2022
yundiqian pushed a commit that referenced this pull request Aug 16, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 18, 2022
mtrofin pushed a commit that referenced this pull request Aug 18, 2022
* Add pause/resume/context to workers

- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT.
- Allows a user to bind processes to specific CPUs.
- Allows local_worker_pool to be used outside of a context manager
- Switches workers to be Protocol-based, so Workers are effectively duck-typed (i.e. anything that has the required methods counts as a Worker; see the sketch below)

Part of #96
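
The pause/resume and Protocol-based (duck-typed) worker design described in that commit could look roughly like the sketch below; the class names here are illustrative, not the repository's actual worker interface:

```python
# Sketch only; class and method names are illustrative, not the repo's API.
import os
import signal
from typing import Protocol


class Worker(Protocol):
  """Anything with these methods structurally counts as a Worker."""

  def pause(self) -> None:
    ...

  def resume(self) -> None:
    ...


class LocalWorker:
  """Wraps a local process; satisfies Worker without inheriting from it."""

  def __init__(self, pid: int):
    self._pid = pid

  def pause(self) -> None:
    os.kill(self._pid, signal.SIGSTOP)  # freeze the process

  def resume(self) -> None:
    os.kill(self._pid, signal.SIGCONT)  # let it run again
```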
_NR_CPUS = psutil.cpu_count()

_CPU_CONFIG = { # List of CPU numbers in cache-sharing order.
# 'google-epyc' assumes logical core 0 and N/2 are the same physical core.
Collaborator

It could probably be named something more neutral, like 'default'?

Contributor Author

'default' kind of implies it'd work fine on any system, whereas we can only guarantee it'll work fine on a Google server running Epyc CPUs.
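
For illustration only, a 'google-epyc' entry consistent with that assumption (logical cores i and i + N/2 being siblings of the same physical core) might be built like this; the values are a guess, not the PR's actual table:

```python
# Illustrative only: pairs each physical core with its SMT sibling so that
# adjacent entries in the list share caches. Not the PR's actual values.
import psutil

_NR_CPUS = psutil.cpu_count()

_CPU_CONFIG = {
    'google-epyc': [
        cpu
        for i in range(_NR_CPUS // 2)
        for cpu in (i, i + _NR_CPUS // 2)
    ],
}
```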

@@ -71,6 +71,8 @@ def compile_fn(
cancelled work.
RuntimeError: if llvm-size produces unexpected output.
"""
if cancellation_manager is None:
Collaborator

weird... hmm, should we just not pass cancellation_manager because there's self._cancellation_manager anyway?
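
For context, the pattern being questioned is roughly the following default-to-member fallback (the real compile_fn signature and enclosing class are not fully visible in this diff):

```python
# Hypothetical sketch of the default-to-member fallback under discussion.
class CompilerWorker:

  def __init__(self, cancellation_manager):
    self._cancellation_manager = cancellation_manager

  def compile_fn(self, module_spec, cancellation_manager=None):
    # Falling back to the worker's own manager makes the explicit
    # argument look redundant, hence the review question above.
    if cancellation_manager is None:
      cancellation_manager = self._cancellation_manager
    # ... compile module_spec under cancellation_manager ...
```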

self._running_policy = None
self._default_futures: List[worker.WorkerFuture] = []
self._current_work: List[Tuple[corpus.ModuleSpec, worker.WorkerFuture]] = []
self._last_time = None
Collaborator

do we need the time stuff anymore?

Contributor Author

The time stuff is purely cosmetic; it just tracks the total wall time spent compiling validation modules.
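
That cosmetic accounting could be as simple as the sketch below (names illustrative, not the repo's code):

```python
# Sketch only: accumulate wall time spent compiling validation modules.
import time


class ValidationTimer:

  def __init__(self):
    self._last_time = None
    self.total_seconds = 0.0

  def start(self) -> None:
    self._last_time = time.time()

  def stop(self) -> None:
    if self._last_time is not None:
      self.total_seconds += time.time() - self._last_time
      self._last_time = None
```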

@Northbadge (Contributor Author) commented Aug 19, 2022

Rebase after #113, #114, and #115; _schedule_jobs might need some slight modification depending on cancellation_manager.

Also, add .values() to cpu_affinity_test's loop (see the sketch below).
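
Presumably the .values() note means iterating the configured CPU lists rather than the dict keys, roughly like this (stand-in data, not the real test body):

```python
# Illustrative only: the point of ".values()" is to loop over the CPU lists
# themselves, not the config names. `_CPU_CONFIG` here is a stand-in.
_CPU_CONFIG = {'google-epyc': [0, 4, 1, 5, 2, 6, 3, 7]}

for cpu_list in _CPU_CONFIG.values():
  assert len(cpu_list) == len(set(cpu_list))  # e.g. no duplicate CPU numbers
```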

mtrofin pushed a commit that referenced this pull request Sep 7, 2022