https://arxiv.org/abs/2307.15771
The Hydra Effect: Emergent Self-repair in Language Model Computations (Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg)