Model Parallel - Sharding Model Parameters #3272
Replies: 1 comment
These GPU issues turned out to be unrelated to the model parameter sharding script (despite appearing to occur after script execution). Closing as a result.
Hi Flax Community,
I am a novice at sharding model parameters across devices and attempted to do so based on the following guide. Unfortunately, I am now encountering (and currently debugging) GPU errors that appear when I run my training script with model parameter sharding.
I attempted a clean install of the NVIDIA software and a reboot, but this results in the following output from `nvidia-smi`:

Does anyone have a resource on best practices for sharding sets of model parameters? I want to avoid encountering this issue again in the future, so any advice or resources on model parameter sharding would be much appreciated.
Update:
I will include my learnings here once I have come to a resolution on this. At the moment this is lower priority for my project so it may be a while before I update.
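For anyone landing here with the same sharding question, below is a minimal sketch of placing a small parameter dictionary across devices with the `jax.sharding` API. It is not taken from the guide mentioned above and is not the script from this post. The mesh axis name `"model"`, the parameter names `kernel` and `bias`, and the shapes are all made up for illustration. It assumes a reasonably recent JAX release and also runs on a single device, where each array simply has one shard.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over all visible devices.
# The axis name "model" is an arbitrary label chosen for this sketch.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("model",))
n = devices.size

# Hypothetical parameters. The sharded dimension is a multiple of the
# device count so that it splits evenly across the mesh.
params = {
    "kernel": jnp.ones((16, 8 * n)),
    "bias": jnp.zeros((8 * n,)),
}

# One sharding per parameter: split the output dimension of the kernel
# and the matching bias dimension along the "model" axis.
shardings = {
    "kernel": NamedSharding(mesh, P(None, "model")),
    "bias": NamedSharding(mesh, P("model")),
}

# Place each parameter on the devices according to its sharding.
sharded_params = jax.device_put(params, shardings)


@jax.jit
def forward(p, x):
    # jit compiles the computation for the shardings of its inputs.
    return x @ p["kernel"] + p["bias"]


x = jnp.ones((4, 16))
y = forward(sharded_params, x)
print(y.shape, len(sharded_params["kernel"].addressable_shards))
```

Printing `sharded_params["kernel"].sharding`, or calling `jax.debug.visualize_array_sharding` on a parameter, is a quick way to check that the layout is what you expected before starting a long training run.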