BatchNorm fixes for JAX and PyTorch workloads #798
Merged
Fixes to BatchNorm behavior in JAX and PyTorch; mainly, decouple updating the batch-norm statistics from using the running statistics.
Changes for PyTorch from @adefazio's #783
From pull/783:
There are some subtle issues with how BatchNorm is handled in the PyTorch version of the code. Currently, `workload.model_fn` has an `update_batch_norm` parameter, which in theory should allow the submission to control whether the batch-norm statistics are updated during a forward pass. The issues are the following:

1. The `update_batch_norm_fn` function stores the old momentum parameter for each BatchNorm layer in a `momentum_backup` variable, so it can be restored later, before zeroing the parameter. However, if it is called with `update_batch_norm=False` twice in a row, it overwrites the `momentum_backup` with 0 on the second call, so momentum then remains zero for the remainder of training.
2. In a standard PyTorch BatchNorm layer, a momentum of `0` indicates that the running statistics shouldn't be updated. This is the opposite of how EMA momentum is usually done (e.g. in Adam), where `1` would indicate that they shouldn't be updated and `0` means they are set to the latest batch value at every step (see the snippet after this list). The custom BatchNorm modules used in the two librispeech workloads follow this second, more standard convention instead. However, `update_batch_norm_fn` sets the momentum to zero for all three layer types, resulting in incorrect behavior for the librispeech workloads.
3. `update_batch_norm_fn` sets the BN layers to eval mode. This doesn't make sense, as it prevents the use case of normalizing with batch-computed statistics (train mode) without also updating the running statistics. The BN layers can be set to eval mode separately by passing `ForwardPassMode.EVAL` to the forward pass, so removing this `.eval()` call doesn't prevent the submission from using eval mode during a forward pass.
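To make the convention clash in issue 2 concrete, here is a minimal, standalone check of stock PyTorch behavior (the variable names are illustrative, not from the repo):

```python
import torch

# PyTorch convention: running_stat = (1 - momentum) * running_stat + momentum * batch_stat,
# so momentum=0 freezes the running statistics entirely.
bn = torch.nn.BatchNorm1d(4, momentum=0.0)
bn.train()  # train mode: normalization uses batch statistics
before = bn.running_mean.clone()
bn(torch.randn(8, 4))
assert torch.equal(bn.running_mean, before)  # running buffers unchanged

# Adam-style EMA convention (followed by the custom librispeech BN modules
# before this PR): stat = momentum * stat + (1 - momentum) * batch_stat,
# so momentum=1 freezes the statistics and momentum=0 overwrites them with
# the latest batch value. Zeroing momentum under this convention therefore
# does the exact opposite of what was intended.
```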
This PR switches the custom BN code to follow the PyTorch BN convention, so that momentum=0 doesn't update the running buffers. It also fixes the issues in the `update_batch_norm_fn` function mentioned above; a sketch of the corrected helper follows below.
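The following is a hedged sketch of what the corrected helper could look like, not the exact code merged in this PR. It only matches the stock `_BatchNorm` classes; the real helper also covers the custom librispeech BN modules, which after this change share the same momentum convention:

```python
import functools
import torch

def update_batch_norm_fn(module: torch.nn.Module, update_batch_norm: bool) -> None:
  """Illustrative fix: toggle running-stat updates without touching train/eval mode."""
  # The real helper also matches the two custom librispeech BN classes, omitted here.
  if not isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
    return
  if not update_batch_norm:
    # Back up momentum only on the first call, so a repeated call with
    # update_batch_norm=False can no longer overwrite the backup with 0 (issue 1).
    if not hasattr(module, 'momentum_backup'):
      module.momentum_backup = module.momentum
    # Standard PyTorch convention: momentum=0 freezes the running statistics (issue 2).
    module.momentum = 0.0
  elif hasattr(module, 'momentum_backup'):
    module.momentum = module.momentum_backup
    del module.momentum_backup
  # No module.eval() here (issue 3): train/eval mode is selected separately
  # by the forward-pass mode.

# Usage: applied recursively over all submodules of a model.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
model.apply(functools.partial(update_batch_norm_fn, update_batch_norm=False))
model.apply(functools.partial(update_batch_norm_fn, update_batch_norm=False))  # safe twice
model.apply(functools.partial(update_batch_norm_fn, update_batch_norm=True))
assert model[1].momentum == 0.1  # original (default) momentum restored
```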