CUDA out of memory during training #236
Comments
With the ds_config configuration, the run immediately exceeds GPU memory.
You can switch to a higher optimization level, such as ZeRO-3, or otherwise you will need more machines.
Can I run it by lowering the dtype, for example changing it to torch.int8? Or is there any other feasible approach?

By the way, when I evaluate the llama3-8b-instruct model on dolly, the results are very poor, both before and after SFT. What could be the reason?
You can try ZeRO-3.
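A minimal sketch of what switching the DeepSpeed config to ZeRO-3 might look like (field values here are illustrative, not taken from this repo's ds_config files; ZeRO-3 additionally partitions the model parameters across GPUs, and CPU offload trades speed for memory):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true }
}
```

Compared with the ZeRO-2 offload config, stage 3 also shards parameters, which is usually what frees enough memory for an 8B student plus a 70B teacher on 4x A100.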
You can look at what the generated sentences look like, and check whether the output probability of each token is normal (the loss looks normal).
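A minimal sketch (assuming PyTorch; `per_token_probs` is a hypothetical helper, not part of this repo) of how one could inspect per-token output probabilities — in practice you would pass in the logits returned by the model for the generated sequence:

```python
import torch
import torch.nn.functional as F

def per_token_probs(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Probability the model assigned to each emitted token.

    logits:    (seq_len, vocab_size) logits at each position
    token_ids: (seq_len,) the token actually emitted at each position
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1).exp()

# Toy example with random logits; replace with model(input_ids).logits in practice.
torch.manual_seed(0)
logits = torch.randn(5, 32000)             # 5 positions, llama-sized vocab
token_ids = torch.randint(0, 32000, (5,))  # the sampled tokens
probs = per_token_probs(logits, token_ids)
print(probs)
```

Healthy generations usually show reasonably high probabilities on most tokens; values close to uniform (~1/vocab_size) across the sequence point to a broken model or a prompt-format mismatch.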
I evaluated directly with the llama3-8b-instruct released by Meta AI, and the generated sentences are garbled. The first three answers are as follows:
It looks like the generated sentences do not follow the format required by the task.
I am training with 4-way model parallelism on four A100s, with llama3-8b as the student and llama3-70b as the teacher. Using ds_config_zero2_offload, the run starts successfully with GPU usage at 47 GB/80 GB on each A100, but CUDA out of memory occurs during training. How can I solve this?