
# An unofficial implementation of $\sigma$-Reparam

## Overview

This repository contains an implementation of $\sigma$-Reparam, proposed in *Stabilizing Transformer Training by Preventing Attention Entropy Collapse* (Zhai et al., ICML 2023).

Compared to spectral normalization, $\sigma$-Reparam introduces a dimensionless learnable scalar $\gamma$ that makes updates to the weight's spectral norm independent of its dimensionality:

$$ \hat{W} = \frac{\gamma}{\sigma(W)}W $$

where $\sigma(W)$ denotes the spectral norm (the largest singular value) of $W$.
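For intuition, here is a minimal sketch of $\sigma$-Reparam written as a weight parametrization. This is not the code in this repository: it estimates $\sigma(W)$ with power iteration, in the spirit of `spectral_norm`, and the class name `SigmaReparam` and its defaults (e.g. initializing $\gamma$ to 1) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparam(nn.Module):
    """Sketch of sigma-Reparam as a parametrization on a 2-D weight.

    sigma(W) is estimated with power iteration; gamma is the learnable
    scalar from the formula above (initialized to 1 here by assumption).
    """

    def __init__(self, weight: torch.Tensor, n_power_iterations: int = 1):
        super().__init__()
        self.n_power_iterations = n_power_iterations
        self.gamma = nn.Parameter(torch.ones(()))
        h, w = weight.shape
        # Persistent left/right singular-vector estimates for power iteration.
        self.register_buffer("u", F.normalize(torch.randn(h), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(w), dim=0))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Power iteration is not differentiated through; gradients
            # flow only through W, as in spectral_norm.
            with torch.no_grad():
                for _ in range(self.n_power_iterations):
                    self.v = F.normalize(weight.t() @ self.u, dim=0)
                    self.u = F.normalize(weight @ self.v, dim=0)
        sigma = torch.dot(self.u, weight @ self.v)  # spectral norm estimate
        return self.gamma / sigma * weight
```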

Feedback and discussion are welcome on how we could make use of $\sigma$-Reparam to enhance our models.

## Compatibility

The implementation is based on `torch.nn.utils.parametrizations.spectral_norm` in PyTorch v2.1.0. Incompatibilities may arise with newer versions.
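For reference, attaching a parametrization like the sketch above would look roughly as follows. This is hypothetical usage, and the layer sizes are arbitrary:

```python
import torch.nn as nn
from torch.nn.utils import parametrize

# Attach the SigmaReparam sketch to the weight of a linear layer,
# mirroring how spectral_norm is applied as a parametrization.
linear = nn.Linear(512, 512)
parametrize.register_parametrization(linear, "weight", SigmaReparam(linear.weight))

# Accessing linear.weight now returns gamma / sigma(W) * W.
print(linear.weight.shape)  # torch.Size([512, 512])
```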

## Reference

Please refer to the original repository for the official implementation.