Serious conclusion: LOMO does not significantly reduce GPU memory usage! #72

Open
misonsky opened this issue Jan 3, 2024 · 17 comments

Comments

misonsky commented Jan 3, 2024

Through comparative experiments, we found that what really reduces GPU memory is "torch.set_default_dtype(torch.float16)" and DeepSpeed. We ran experiments with LLaMA-7B, using the following configuration to disable DeepSpeed's optimizations:
{
  "zero_optimization": {
    "stage": 0
  },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
When we do not enable mixed precision, the model's output is fp16, which is clearly abnormal. After checking, we found that "torch.set_default_dtype(torch.float16)" is what plays the key role. When we remove both DeepSpeed and "torch.set_default_dtype(torch.float16)" and use the default configuration on the WiC dataset, training runs out of memory on an 80 GB A100. After adding "torch.set_default_dtype(torch.float16)", memory usage drops to about 35 GB. Under normal mixed-precision training, the authors' LOMO still runs out of memory on an 80 GB A100.
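
For reference, a minimal sketch (my own, not code from this repository) of the effect described above: once torch.set_default_dtype(torch.float16) is set, floating-point parameters created afterwards default to fp16, so the static weight footprint is roughly halved.

```python
# Minimal sketch (not from this repository): after changing the default dtype,
# a freshly constructed module allocates fp16 parameters.
import torch
import torch.nn as nn

torch.set_default_dtype(torch.float16)

layer = nn.Linear(4096, 4096)
print(layer.weight.dtype)  # torch.float16

param_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())
print(f"{param_bytes / 2**20:.1f} MiB")  # ~32 MiB, vs ~64 MiB with the fp32 default
```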


misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the result of normal fine-tuning.

[screenshot]

This is mixed-precision fine-tuning.

The above results use only the LOMO optimizer, without DeepSpeed and without "torch.set_default_dtype(torch.float16)".

[screenshot]

These are the mixed-precision results.

misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the memory usage on the WiC dataset when we use only the LOMO optimizer, with sentence length 512 and batch size 1.

misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the result after adding torch.set_default_dtype(torch.float16)!

misonsky (Author) commented Jan 3, 2024

The authors call it mixed-precision training, but it is not! How much memory LOMO itself saves needs to be strictly verified through experiments, rather than attributing the effect of reduced precision and of DeepSpeed to LOMO! AdaLomo has the same problem.

@misonsky misonsky changed the title Serious conclusion: LOMO does not reduce the effectiveness of GPU memory ! Serious conclusion: LOMO does not reduce the of GPU memory effectiveness ! Jan 3, 2024
@misonsky misonsky changed the title Serious conclusion: LOMO does not reduce the of GPU memory effectiveness ! Serious conclusion: LOMO does not significantly reduce GPU memory usage! Jan 3, 2024
misonsky (Author) commented Jan 3, 2024

[screenshot]

This is the normal weight type for mixed precision models.

KaiLv69 (Collaborator) commented Jan 4, 2024

Thank you for expressing your interest in our work. I appreciate the opportunity to address your queries and provide further insights.

  1. On Reducing Memory Usage with LOMO:
    To understand how LOMO achieves reduced memory usage, it's essential to first understand memory allocation in a mixed-precision training context. Taking Adam as an example, GPU memory typically holds a copy of the fp16 weights for the forward and backward passes. Additionally, there are fp32 momentum, variance, and an fp32 weight copy (because updating the weights directly in fp16 is not precise enough). After the backward pass, and before the parameters are updated, the fp16 gradients are also held in memory. Some training frameworks, like DeepSpeed, convert these gradients to fp32 for the weight update. Temporary variables such as activation values are stored in memory as well.

    In LOMO's approach, as described in our paper, we eliminate the fp32 copies of momentum, variance, and the weight backup. To further minimize memory usage, during the backward pass we immediately update each parameter with its calculated gradient and then set that gradient to None, a technique referred to as fused backward. This changes the gradients' memory requirement from O(n) to O(1). Hence, when training with LOMO, the primary memory consumption is due to fp16 weights and activation values. During the weight update in the fused backward, we convert each weight and its gradient to fp32 individually (instead of converting all of the model's parameters), improving the precision of the update. The memory cost of this conversion is O(1) instead of O(n). (A minimal illustrative sketch of this fused-backward idea is given at the end of this reply.)

    • Is LOMO Simply Reducing Precision to Decrease Memory Use?
      No. The Adam baseline we compared against also employs mixed-precision training, meaning both forward and backward calculations are conducted in fp16.

    • Why Does Using fp32 Training Increase Memory Usage?
      Firstly, using fp32 to train large language models (LLMs) is not a commonly adopted method. The reason is that fp16 calculations are significantly faster than fp32, and the loss in precision is minimal (discussed in Scaling Language Models: Methods, Analysis & Insights from Training Gopher). When training with LOMO, memory is primarily occupied by fp16 weights and activation values, so the increase in memory usage when switching the parameters to fp32 is expected.

    • Reasons for Out-of-Memory (OOM) in Your Experiments:
      Besides weights, when batch size and sequence length are large, memory consumption is predominantly driven by activation values. The memory used by activation values can be reduced through gradient checkpointing, although this is not the primary focus of LOMO.

    • Role of DeepSpeed in LOMO:
      In LOMO, DeepSpeed mainly facilitates parameter partitioning. As LOMO has no optimizer states and its gradient memory usage is O(1), DeepSpeed's ZeRO stages 1 and 2 do not significantly impact LOMO's memory efficiency.

    • Have We Overclaimed LOMO's Memory Usage Reduction?
      No, through practical testing, we have indeed been able to fine-tune the entire LLaMa-7b on a 1x24GB setup and LLaMa-65b on an 8x24GB setup. The code used for these tests is publicly available.

  2. Downstream Performance in the AdaLomo Paper:
    We trained LLaMa-1 on the Alpaca-GPT4 dataset and tested it across various benchmarks to validate AdaLomo's efficacy in instruction fine-tuning scenarios. The blog post you referenced trained and tested directly on the GSM8K dataset with LLaMa-2, which isn't directly comparable. You might consider comparing against the findings in "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources", particularly the results in Table 3. The performance achieved using AdamW+Alpaca-GPT4 is on par with ours (and not reported as higher). Therefore, there is no intentional understatement of baseline performance in our paper.

I hope these responses adequately address your concerns. Please review this reply carefully, and feel free to ask any further questions. However, I kindly request that you make sure you understand my replies to these specific queries before reiterating them.
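
A minimal illustrative sketch of the fused-backward idea (a simplified stand-in, not the repository's LOMO implementation: attach_fused_sgd is a hypothetical helper, the update is plain SGD, and PyTorch >= 2.1 is assumed for register_post_accumulate_grad_hook):

```python
# Illustrative sketch only -- a plain fused SGD update via per-parameter hooks,
# not the repository's LOMO implementation. Requires PyTorch >= 2.1 for
# Tensor.register_post_accumulate_grad_hook.
#
# Rough per-parameter cost in mixed-precision Adam (activations excluded):
#   fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights (4 B)
#   + fp32 momentum (4 B) + fp32 variance (4 B)  ~= 16 B/param.
# With the fused update below, the steady state is ~2 B/param (fp16 weights),
# since each gradient is consumed and freed inside the backward pass.
import torch
import torch.nn as nn


def attach_fused_sgd(model: nn.Module, lr: float = 1e-3) -> None:
    @torch.no_grad()
    def hook(param: torch.Tensor) -> None:
        # Promote this single parameter and its gradient to fp32 for the
        # update (O(1) extra memory), write the result back, drop the grad.
        p32 = param.data.float()
        p32.add_(param.grad.float(), alpha=-lr)
        param.data.copy_(p32)
        param.grad = None

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)


# Usage: the update happens inside loss.backward(); no optimizer.step() call.
model = nn.Linear(16, 16, dtype=torch.float16, device="cuda")
attach_fused_sgd(model, lr=1e-3)
x = torch.randn(4, 16, dtype=torch.float16, device="cuda")
model(x).float().pow(2).mean().backward()
```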

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply above)

I don't question the advantages of the fused update, but it falls far short of what the authors claim, and that matters. Mixed-precision training only reduces dynamic memory usage, which also matters. If you run normal mixed-precision training with the Hugging Face trainer, you will observe different results. The authors' approach is effectively quantization: you set the default to 16 bits, and if you set it to 8 bits the memory would drop even further. None of this is explained in the paper, which also matters.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply and my previous comment above)

Mixed precision only reduces dynamic memory usage. Even for comparative experiments, I think the authors should clearly tell readers which component reduces memory usage. With LLaMA-7B, a batch size of 1, and a sentence length of 512, I observed about 63 GB of memory usage.
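
For context, a rough back-of-envelope weight footprint for a ~7B-parameter model (my own illustrative arithmetic; activations, gradients, optimizer buffers, and allocator overhead are excluded):

```python
# Back-of-envelope static weight footprint for roughly 7e9 parameters.
params = 7e9
print(f"fp32 weights: {params * 4 / 2**30:.0f} GiB")  # ~26 GiB
print(f"fp16 weights: {params * 2 / 2**30:.0f} GiB")  # ~13 GiB
```

Weights alone thus differ by roughly 13 GiB between fp32 and fp16, which is only part of the gap observed above.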

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply above)

The authors should report how much memory the LOMO optimizer alone can save, rather than relying on the 16-bit default precision and on DeepSpeed. This is very confusing: if LOMO itself can do it, why use the other techniques? The code is here for anyone to try, and the conclusion is pretty clear.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply above)

I understand the authors' idea very well; no further explanation is needed. It is impossible to achieve the effect claimed in the paper by relying solely on the fused-update idea. This is easy to verify: remove the global 16-bit setting, use the LOMO optimizer alone, run mixed-precision training with the Hugging Face trainer, modify its training logic, and apply the authors' idea.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply and my previous comment above)

Alternatively, you should provide a clear ablation experiment showing which component is responsible for reducing memory.

misonsky (Author) commented Jan 4, 2024

> (quoting @KaiLv69's reply and my previous comments above)

For example, you should compare deepspeed+fp16+... against deepspeed+fp16+...+LOMO, just as, if you build on BERT, you should report both the results of BERT alone and of BERT plus your addition.
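
One way to make such an ablation concrete (a sketch of mine, not code from this repository; build_and_run_one_step is a hypothetical callable that sets up a given configuration and runs one training step) is to record the peak CUDA memory for each configuration:

```python
# Sketch of a per-configuration peak-memory probe (illustrative only).
import torch


def peak_memory_gib(build_and_run_one_step) -> float:
    """Run one forward/backward/update step and report peak GPU memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    build_and_run_one_step()
    return torch.cuda.max_memory_allocated() / 2**30


# Report peak_memory_gib(...) for each setting, e.g. fp16+AdamW, fp16+LOMO,
# fp32+LOMO, with and without DeepSpeed, so the source of the savings is clear.
```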

misonsky (Author) commented Jan 4, 2024

Thank you for your patient responses and the pleasant discussion; we just hope the results are rigorous rather than vague.

misonsky (Author) commented Jan 4, 2024

#47 (comment)

KaiLv69 (Collaborator) commented Jan 11, 2024

> For example, you should compare deepspeed+fp16+... and deepspeed+fp16+...+lomo. Just like if you use BERT, you should provide the results of BERT and the results of your BERT+anything.

yes, we compared deepspeed+fp16+adamw and deepspeed+fp16+lomo, didn't we? @misonsky

misonsky (Author) commented

I thought the authors would listen and make corrections, but in fact quite the opposite happened.

[screenshots]

Which experimental results support the authors' conclusion? Did the authors tell readers which component reduces memory? Is it torch.set_default_dtype(torch.float16), gradient checkpointing, or LOMO? Previous work has found that the computation graph (activations) can occupy 50% or more of the memory, but the authors' conclusion is the exact opposite.
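
As a point of reference on the activation-memory question, gradient checkpointing can be toggled on a Hugging Face model with a single call, which would make that part of an ablation easy to isolate (a hedged sketch; the checkpoint name is a placeholder):

```python
# Sketch: trading recomputation for activation memory in a Hugging Face model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder checkpoint
model.gradient_checkpointing_enable()  # activations are recomputed during backward
```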
