Introduce Validation Data Collector #96

Open
wants to merge 1 commit into main

Conversation

Northbadge (Contributor)

  • Allows concurrent evaluation of models on a separate dataset during training, enabled via --validation_data_path
  • Keeps the impact on training time minimal by using the CPU for the validation dataset only while it is mostly idle during tf.train(), and by pinning processes to specific CPUs (see the sketch below)
  • The amount of impact can be adjusted via a gin config in cpu_affinity.py
  • CPU affinities are only tuned for internal AMD Zen-based systems at the moment, but can be extended in the future.

Tests are missing at the moment; I'll add them once the interface is more concrete.
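Below is a minimal sketch of the gin-configurable CPU pinning idea, assuming psutil; the helper name pin_to_cpus and its fraction parameter are hypothetical, not the actual interface of cpu_affinity.py:

```python
# Sketch only: illustrates gin-configurable CPU pinning with psutil.
# `pin_to_cpus` and its `fraction` knob are hypothetical, not this PR's API.
import gin
import psutil


@gin.configurable
def pin_to_cpus(pid: int, cpus: list, fraction: float = 1.0) -> None:
  """Pins `pid` to the first `fraction` of the CPUs listed in `cpus`."""
  count = max(1, int(len(cpus) * fraction))
  psutil.Process(pid).cpu_affinity(cpus[:count])
```

Adjusting fraction (or the CPU list itself) from a gin config is the kind of knob that lets validation collection trade CPU share against training-time impact without touching the training loop.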

@Northbadge (Contributor Author)

Stuff is going to move around, so don't worry about reviewing too in-depth for now.

Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 11, 2022
Northbadge mentioned this pull request Aug 11, 2022
Northbadge added a commit that referenced this pull request Aug 11, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 11, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 11, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 16, 2022
yundiqian pushed a commit that referenced this pull request Aug 16, 2022
Northbadge added a commit to Northbadge/ml-compiler-opt that referenced this pull request Aug 18, 2022
mtrofin pushed a commit that referenced this pull request Aug 18, 2022
* Add pause/resume/context to workers

- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT.
- Allows a user to bind processes to specific CPUs.
- Allows local_worker_pool to be used outside of a context manager
- Switches workers to be Protocol-based, so Workers are effectively duck-typed (i.e. anything that has the required methods counts as a Worker; see the sketch below)

Part of #96
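
The pause/resume and Protocol-based (duck-typed) worker design described in that commit could look roughly like the sketch below; the class names here are illustrative, not the repository's actual worker interface:

```python
# Sketch only; class and method names are illustrative, not the repo's API.
import os
import signal
from typing import Protocol


class Worker(Protocol):
  """Anything with these methods structurally counts as a Worker."""

  def pause(self) -> None:
    ...

  def resume(self) -> None:
    ...


class LocalWorker:
  """Wraps a local process; satisfies Worker without inheriting from it."""

  def __init__(self, pid: int):
    self._pid = pid

  def pause(self) -> None:
    os.kill(self._pid, signal.SIGSTOP)  # freeze the process

  def resume(self) -> None:
    os.kill(self._pid, signal.SIGCONT)  # let it run again
```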
_NR_CPUS = psutil.cpu_count()

_CPU_CONFIG = { # List of CPU numbers in cache-sharing order.
# 'google-epyc' assumes logical core 0 and N/2 are the same physical core.
Collaborator

It could probably be named something more neutral, like 'default'?

Contributor Author

'default' kind of implies it'd work fine on any system, whereas we can only guarantee it'll work fine on a Google server running Epyc CPUs.
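
For illustration only, a 'google-epyc' entry consistent with that assumption (logical cores i and i + N/2 being siblings of the same physical core) might be built like this; the values are a guess, not the PR's actual table:

```python
# Illustrative only: pairs each physical core with its SMT sibling so that
# adjacent entries in the list share caches. Not the PR's actual values.
import psutil

_NR_CPUS = psutil.cpu_count()

_CPU_CONFIG = {
    'google-epyc': [
        cpu
        for i in range(_NR_CPUS // 2)
        for cpu in (i, i + _NR_CPUS // 2)
    ],
}
```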

@@ -71,6 +71,8 @@ def compile_fn(
cancelled work.
RuntimeError: if llvm-size produces unexpected output.
"""
if cancellation_manager is None:
Collaborator

weird... hmm, should we just not pass cancellation_manager because there's self._cancellation_manager anyway?
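
For context, the pattern being questioned is roughly the following default-to-member fallback (the real compile_fn signature and enclosing class are not fully visible in this diff):

```python
# Hypothetical sketch of the default-to-member fallback under discussion.
class CompilerWorker:

  def __init__(self, cancellation_manager):
    self._cancellation_manager = cancellation_manager

  def compile_fn(self, module_spec, cancellation_manager=None):
    # Falling back to the worker's own manager makes the explicit
    # argument look redundant, hence the review question above.
    if cancellation_manager is None:
      cancellation_manager = self._cancellation_manager
    # ... compile module_spec under cancellation_manager ...
```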

self._running_policy = None
self._default_futures: List[worker.WorkerFuture] = []
self._current_work: List[Tuple[corpus.ModuleSpec, worker.WorkerFuture]] = []
self._last_time = None
Collaborator

do we need the time stuff anymore?

Contributor Author

The time stuff is purely cosmetic; it just tracks the total wall time spent compiling validation modules.
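
That cosmetic accounting could be as simple as the sketch below (names illustrative, not the repo's code):

```python
# Sketch only: accumulate wall time spent compiling validation modules.
import time


class ValidationTimer:

  def __init__(self):
    self._last_time = None
    self.total_seconds = 0.0

  def start(self) -> None:
    self._last_time = time.time()

  def stop(self) -> None:
    if self._last_time is not None:
      self.total_seconds += time.time() - self._last_time
      self._last_time = None
```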

@Northbadge (Contributor Author) commented Aug 19, 2022

Rebase after #113, #114, and #115; _schedule_jobs might need some slight modification depending on cancellation_manager.

Also, add .values() to cpu_affinity_test's loop (see the sketch below).
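
Presumably the .values() note means iterating the configured CPU lists rather than the dict keys, roughly like this (stand-in data, not the real test body):

```python
# Illustrative only: the point of ".values()" is to loop over the CPU lists
# themselves, not the config names. `_CPU_CONFIG` here is a stand-in.
_CPU_CONFIG = {'google-epyc': [0, 4, 1, 5, 2, 6, 3, 7]}

for cpu_list in _CPU_CONFIG.values():
  assert len(cpu_list) == len(set(cpu_list))  # e.g. no duplicate CPU numbers
```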

mtrofin pushed a commit that referenced this pull request Sep 7, 2022