Serious conclusion: LOMO does not significantly reduce GPU memory usage! #72

Open
misonsky opened this issue Jan 3, 2024 · 17 comments

Comments

misonsky commented Jan 3, 2024

Through comparative experiments, we found that what really reduces GPU memory is "torch.set_default_dtype(torch.float16)" and DeepSpeed. We ran experiments with LLaMA-7B, using the following configuration to disable DeepSpeed's optimizations:
{
  "zero_optimization": {
    "stage": 0
  },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
When we do not enable mixed precision, the model's output is fp16, which is clearly abnormal. After checking, we found that "torch.set_default_dtype(torch.float16)" is what plays the key role. When we remove both DeepSpeed and "torch.set_default_dtype(torch.float16)" and use the default configuration on the WiC dataset, training runs out of memory on an 80 GB A100. After adding "torch.set_default_dtype(torch.float16)", memory usage drops to about 35 GB. Under normal mixed-precision training, the authors' LOMO still runs out of memory on an 80 GB A100.
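
For reference, a minimal sketch (my own, not code from this repository) of the effect described above: once torch.set_default_dtype(torch.float16) is set, floating-point parameters created afterwards default to fp16, so the static weight footprint is roughly halved.

```python
# Minimal sketch (not from this repository): after changing the default dtype,
# a freshly constructed module allocates fp16 parameters.
import torch
import torch.nn as nn

torch.set_default_dtype(torch.float16)

layer = nn.Linear(4096, 4096)
print(layer.weight.dtype)  # torch.float16

param_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())
print(f"{param_bytes / 2**20:.1f} MiB")  # ~32 MiB, vs ~64 MiB with the fp32 default
```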


misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the result of normal fine-tuning.

[screenshot]

This is mixed-precision fine-tuning.

The above results use only the LOMO optimizer, without DeepSpeed and without "torch.set_default_dtype(torch.float16)".

[screenshot]

These are the mixed-precision results.

misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the memory usage on the WiC dataset when we use only the LOMO optimizer, with sentence length 512 and batch size 1.

misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the result after adding torch.set_default_dtype(torch.float16)!

misonsky (Author) commented Jan 3, 2024

The authors call it mixed-precision training, but it is not! How much memory LOMO itself saves needs to be strictly verified through experiments, rather than attributing the effect of reduced precision and of DeepSpeed to LOMO! AdaLomo has the same problem.

@misonsky misonsky changed the title Serious conclusion: LOMO does not reduce the effectiveness of GPU memory ! Serious conclusion: LOMO does not reduce the of GPU memory effectiveness ! Jan 3, 2024
@misonsky misonsky changed the title Serious conclusion: LOMO does not reduce the of GPU memory effectiveness ! Serious conclusion: LOMO does not significantly reduce GPU memory usage! Jan 3, 2024
misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the normal weight type for mixed precision models.

KaiLv69 (Collaborator) commented Jan 4, 2024

Thank you for expressing your interest in our work. I appreciate the opportunity to address your queries and provide further insights.

  1. On Reducing Memory Usage with LOMO:
    To understand how LOMO achieves reduced memory usage, it's essential to first understand memory allocation in a mixed-precision training context. Taking Adam as an example, GPU memory typically holds a copy of the fp16 weights for the forward and backward passes. Additionally, there are fp32 momentum, variance, and an fp32 weight copy (because updating the weights directly in fp16 is not precise enough). After the backward pass, and before the parameters are updated, the fp16 gradients are also held in memory. Some training frameworks, like DeepSpeed, convert these gradients to fp32 for the weight update. Temporary variables such as activation values are stored in memory as well.

    In LOMO's approach, as described in our paper, we eliminate the fp32 copies of momentum, variance, and the weight backup. To further minimize memory usage, during the backward pass we immediately update each parameter with its calculated gradient and then set that gradient to None, a technique referred to as fused backward. This changes the gradients' memory requirement from O(n) to O(1). Hence, when training with LOMO, the primary memory consumption is due to fp16 weights and activation values. During the weight update in the fused backward, we convert each weight and its gradient to fp32 individually (instead of converting all of the model's parameters), improving the precision of the update. The memory cost of this conversion is O(1) instead of O(n). (A minimal illustrative sketch of this fused-backward idea is given at the end of this reply.)

    • Is LOMO Simply Reducing Precision to Decrease Memory Use?
      No. The Adam baseline we compared against also employs mixed-precision training, meaning both forward and backward calculations are conducted in fp16.

    • Why Does Using fp32 Training Increase Memory Usage?
      Firstly, using fp32 to train large language models (LLMs) is not a commonly adopted method. The reason is that fp16 calculations are significantly faster than fp32, and the loss in precision is minimal (discussed in Scaling Language Models: Methods, Analysis & Insights from Training Gopher). When training with LOMO, memory is primarily occupied by fp16 weights and activation values, so the increase in memory usage when switching the parameters to fp32 is expected.

    • Reasons for Out-of-Memory (OOM) in Your Experiments:
      Besides weights, when batch size and sequence length are large, memory consumption is predominantly driven by activation values. The memory used by activation values can be reduced through gradient checkpointing, although this is not the primary focus of LOMO.

    • Role of DeepSpeed in LOMO:
      In LOMO, DeepSpeed mainly facilitates parameter partitioning. As LOMO has no optimizer states and its gradient memory usage is O(1), DeepSpeed's ZeRO stages 1 and 2 do not significantly impact LOMO's memory efficiency.

    • Have We Overclaimed LOMO's Memory Usage Reduction?
      No, through practical testing, we have indeed been able to fine-tune the entire LLaMa-7b on a 1x24GB setup and LLaMa-65b on an 8x24GB setup. The code used for these tests is publicly available.

  2. Downstream Performance in the AdaLomo Paper:
    We trained LLaMa-1 on the Alpaca-GPT4 dataset and tested it across various benchmarks to validate AdaLomo's efficacy in instruction fine-tuning scenarios. The blog post you referenced trained and tested directly on the GSM8K dataset with LLaMa-2, which isn't directly comparable. You might consider comparing against the findings in "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources", particularly the results in Table 3. The performance achieved using AdamW+Alpaca-GPT4 is on par with ours (and not reported as higher). Therefore, there is no intentional understatement of baseline performance in our paper.

I hope these responses adequately address your concerns. Please review this reply carefully, and feel free to ask any further questions. However, I kindly request that you make sure you understand my replies to these specific queries before reiterating them.
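
A minimal illustrative sketch of the fused-backward idea (a simplified stand-in, not the repository's LOMO implementation: attach_fused_sgd is a hypothetical helper, the update is plain SGD, and PyTorch >= 2.1 is assumed for register_post_accumulate_grad_hook):

```python
# Illustrative sketch only -- a plain fused SGD update via per-parameter hooks,
# not the repository's LOMO implementation. Requires PyTorch >= 2.1 for
# Tensor.register_post_accumulate_grad_hook.
#
# Rough per-parameter cost in mixed-precision Adam (activations excluded):
#   fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights (4 B)
#   + fp32 momentum (4 B) + fp32 variance (4 B)  ~= 16 B/param.
# With the fused update below, the steady state is ~2 B/param (fp16 weights),
# since each gradient is consumed and freed inside the backward pass.
import torch
import torch.nn as nn


def attach_fused_sgd(model: nn.Module, lr: float = 1e-3) -> None:
    @torch.no_grad()
    def hook(param: torch.Tensor) -> None:
        # Promote this single parameter and its gradient to fp32 for the
        # update (O(1) extra memory), write the result back, drop the grad.
        p32 = param.data.float()
        p32.add_(param.grad.float(), alpha=-lr)
        param.data.copy_(p32)
        param.grad = None

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)


# Usage: the update happens inside loss.backward(); no optimizer.step() call.
model = nn.Linear(16, 16, dtype=torch.float16, device="cuda")
attach_fused_sgd(model, lr=1e-3)
x = torch.randn(4, 16, dtype=torch.float16, device="cuda")
model(x).float().pow(2).mean().backward()
```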

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply above)

I don't question the advantages of the fused update, but it falls far short of what the authors claim, and that matters. Mixed-precision training only reduces dynamic memory usage, which also matters. If you run normal mixed-precision training with the Hugging Face trainer, you will observe different results. The authors' approach is effectively quantization: you set the default to 16 bits, and if you set it to 8 bits the memory would drop even further. None of this is explained in the paper, which also matters.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply and my previous comment above)

Mixed precision only reduces dynamic memory usage. Even for comparative experiments, I think the authors should clearly tell readers which component reduces memory usage. With LLaMA-7B, a batch size of 1, and a sentence length of 512, I observed about 63 GB of memory usage.
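
For context, a rough back-of-envelope weight footprint for a ~7B-parameter model (my own illustrative arithmetic; activations, gradients, optimizer buffers, and allocator overhead are excluded):

```python
# Back-of-envelope static weight footprint for roughly 7e9 parameters.
params = 7e9
print(f"fp32 weights: {params * 4 / 2**30:.0f} GiB")  # ~26 GiB
print(f"fp16 weights: {params * 2 / 2**30:.0f} GiB")  # ~13 GiB
```

Weights alone thus differ by roughly 13 GiB between fp32 and fp16, which is only part of the gap observed above.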

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply above)

The authors should report how much memory the LOMO optimizer alone can save, rather than relying on the 16-bit default precision and on DeepSpeed. This is very confusing: if LOMO itself can do it, why use the other techniques? The code is here for anyone to try, and the conclusion is pretty clear.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply above)

I understand the authors' idea very well; no further explanation is needed. It is impossible to achieve the effect claimed in the paper by relying solely on the fused-update idea. This is easy to verify: remove the global 16-bit setting, use the LOMO optimizer alone, run mixed-precision training with the Hugging Face trainer, modify its training logic, and apply the authors' idea.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply and my previous comment above)

Alternatively, you should provide a clear ablation experiment showing which component is responsible for reducing memory.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply and my previous comments above)

For example, you should compare deepspeed+fp16+... against deepspeed+fp16+...+LOMO, just as, if you build on BERT, you should report both the results of BERT alone and of BERT plus your addition.
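
One way to make such an ablation concrete (a sketch of mine, not code from this repository; build_and_run_one_step is a hypothetical callable that sets up a given configuration and runs one training step) is to record the peak CUDA memory for each configuration:

```python
# Sketch of a per-configuration peak-memory probe (illustrative only).
import torch


def peak_memory_gib(build_and_run_one_step) -> float:
    """Run one forward/backward/update step and report peak GPU memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    build_and_run_one_step()
    return torch.cuda.max_memory_allocated() / 2**30


# Report peak_memory_gib(...) for each setting, e.g. fp16+AdamW, fp16+LOMO,
# fp32+LOMO, with and without DeepSpeed, so the source of the savings is clear.
```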

misonsky (Author) commented Jan 4, 2024

Thank you for your patient responses and the pleasant discussion; we just hope the results are rigorous rather than vague.

misonsky (Author) commented Jan 4, 2024

#47 (comment)

KaiLv69 (Collaborator) commented Jan 11, 2024

> For example, you should compare deepspeed+fp16+... and deepspeed+fp16+...+lomo. Just like if you use BERT, you should provide the results of BERT and the results of your BERT+anything.

yes, we compared deepspeed+fp16+adamw and deepspeed+fp16+lomo, didn't we? @misonsky

misonsky (Author) commented

I thought the authors would listen and make corrections, but in fact quite the opposite happened.

[screenshots]

Which experimental results support the authors' conclusion? Did the authors tell readers which component reduces memory? Is it torch.set_default_dtype(torch.float16), gradient checkpointing, or LOMO? Previous work has found that the computation graph (activations) can occupy 50% or more of the memory, but the authors' conclusion is the exact opposite.
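
As a point of reference on the activation-memory question, gradient checkpointing can be toggled on a Hugging Face model with a single call, which would make that part of an ablation easy to isolate (a hedged sketch; the checkpoint name is a placeholder):

```python
# Sketch: trading recomputation for activation memory in a Hugging Face model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder checkpoint
model.gradient_checkpointing_enable()  # activations are recomputed during backward
```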
