CUDA out of memory during training #236
Comments
With the ds_config configuration, the run immediately exceeds GPU memory.
You can switch to a higher optimization level, such as ZeRO-3, or otherwise you will need more machines.
Can I run it by lowering the dtype, for example changing it to torch.int8? Or is there any other feasible approach?

By the way, when I evaluate the llama3-8b-instruct model on dolly, the results are very poor, both before and after SFT. What could be the reason?
You can try ZeRO-3.
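A minimal sketch of what switching the DeepSpeed config to ZeRO-3 might look like (field values here are illustrative, not taken from this repo's ds_config files; ZeRO-3 additionally partitions the model parameters across GPUs, and CPU offload trades speed for memory):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true }
}
```

Compared with the ZeRO-2 offload config, stage 3 also shards parameters, which is usually what frees enough memory for an 8B student plus a 70B teacher on 4x A100.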
You can look at what the generated sentences look like, and check whether the output probability of each token is normal (the loss looks normal).
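A minimal sketch (assuming PyTorch; `per_token_probs` is a hypothetical helper, not part of this repo) of how one could inspect per-token output probabilities — in practice you would pass in the logits returned by the model for the generated sequence:

```python
import torch
import torch.nn.functional as F

def per_token_probs(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Probability the model assigned to each emitted token.

    logits:    (seq_len, vocab_size) logits at each position
    token_ids: (seq_len,) the token actually emitted at each position
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1).exp()

# Toy example with random logits; replace with model(input_ids).logits in practice.
torch.manual_seed(0)
logits = torch.randn(5, 32000)             # 5 positions, llama-sized vocab
token_ids = torch.randint(0, 32000, (5,))  # the sampled tokens
probs = per_token_probs(logits, token_ids)
print(probs)
```

Healthy generations usually show reasonably high probabilities on most tokens; values close to uniform (~1/vocab_size) across the sequence point to a broken model or a prompt-format mismatch.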
I evaluated directly with the llama3-8b-instruct released by Meta AI, and the generated sentences are garbled. The first three answers are as follows:
It looks like the generated sentences do not follow the format required by the task.
I am training with 4-way model parallelism on four A100s, with llama3-8b as the student and llama3-70b as the teacher. Using ds_config_zero2_offload, the run starts successfully with GPU usage at 47 GB/80 GB on each A100, but CUDA out of memory occurs during training. How can I solve this?