llama-33B/llama-65B both hit OOM; why won't they run on 8 * V100? #28

Open

alisyzhu opened this issue Jun 29, 2023 · 7 comments

@alisyzhu

Environment: 8 * V100 (32G)
Running run.sh
[Error log]
(screenshot)

[LOMO mode]
args_lomo.yaml configuration:
(screenshot)
ds_config.json configuration:
(screenshot)

[LOMO + LoRA mode]
args_lomo_lora.yaml configuration:
(screenshot)
ds_config_lora.json configuration:
(screenshot)

@KaiLv69
Collaborator

KaiLv69 commented Jun 29, 2023

Hi, could you please share run.sh and a more complete error log?

@alisyzhu
Author

> Hi, could you please share run.sh and a more complete error log?

run.sh script:
(screenshot)
[Error log]
(screenshots)

@KaiLv69
Collaborator

KaiLv69 commented Jun 29, 2023

> run.sh script: (screenshot)

Right now only one GPU is being used. You should set --include localhost:0,1,2,3,4,5,6,7 to use all eight GPUs.
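
For illustration, a minimal run.sh sketch that passes --include to the DeepSpeed launcher; the training script and config paths below are placeholders, and the repository's actual run.sh may differ:

```bash
# Minimal sketch: launch training on all 8 local GPUs with DeepSpeed.
# src/train_lomo.py and config/args_lomo.yaml are placeholder paths.
deepspeed --master_port 29500 \
    --include localhost:0,1,2,3,4,5,6,7 \
    src/train_lomo.py config/args_lomo.yaml
```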

@alisyzhu
Author

> Right now only one GPU is being used. You should set --include localhost:0,1,2,3,4,5,6,7 to use all eight GPUs.

My mistake; I only looked at the error portion of the log.
If I want to run on multiple machines with multiple GPUs, how should the localhost part be configured?

@KaiLv69
Collaborator

KaiLv69 commented Jun 29, 2023

You can refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
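
Roughly, following that guide, multi-node launches use a hostfile that lists each node and its GPU slot count, and --include then refers to node names instead of localhost. A minimal sketch, with placeholder hostnames and script/config paths:

```bash
# hostfile (one SSH-reachable node per line, with its GPU slot count):
#   worker-1 slots=8
#   worker-2 slots=8

# Launch on every GPU listed in the hostfile:
deepspeed --hostfile=hostfile src/train_lomo.py config/args_lomo.yaml

# Or select specific GPUs per node; nodes are separated by '@':
deepspeed --hostfile=hostfile --include="worker-1:0,1,2,3@worker-2:0,1,2,3" \
    src/train_lomo.py config/args_lomo.yaml
```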


@00drdelius

> You can refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node

Training a 13B model on 3 * 3090 also hits OOM 👇
(screenshots)

The parameter configuration is as follows:

args_lomo.yaml:
(screenshot)

ds_config.json:
(screenshot)

run.sh:
(screenshot)

The model is baichuan-13b.
The only change I made to the source code is saving to a special output directory when the loss drops below 0.46:
(screenshot)
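
(For reference only, since the actual change is in the screenshot above: such a check is usually a couple of lines in the training loop. The names save_model and best_output_dir below are hypothetical, not the repository's API.)

```python
# Hypothetical sketch: save to a separate directory once the loss drops below 0.46.
# save_model and best_output_dir are illustrative names, not taken from the repo.
if loss.item() < 0.46:
    save_model(model, output_dir=best_output_dir)
```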

How can I fix this?
