[GLUTEN-8128][VL] Retry borrow when granted less than size in multi-slot and shared mode #8132

Open: kecookier wants to merge 2 commits into base: main

Conversation

kecookier
Contributor

@kecookier kecookier commented Dec 3, 2024

What changes were proposed in this pull request?

Retry the memory borrow when the granted size is less than the requested size while running with multiple task slots in Gluten shared mode. If ThrowOnOomMemoryTarget cannot acquire enough memory, spill more memory and then retry.

(Fixes: #8128)

When the case in issue #8128 occurs, the first borrow(sizeA) spills sizeA bytes of memory. If curMemory is still greater than maxPerTaskSize after that spill, the borrow returns 0. So we retry with spill(Long.MAX_VALUE) and then borrow again.

  • Add an isDynamicCapacity argument to the TreeMemoryTargets#Node constructor; it indicates multi-task-slot, shared mode.
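
To make the scenario above concrete, here is a minimal toy model of it. It is only a sketch under stated assumptions: the class PerTaskCapSketch, its methods, and the numbers are hypothetical and are not Gluten or Spark APIs. It assumes the pool grants at most the headroom left under maxPerTaskSize, and that all held memory is spillable.

```java
// Toy model: every name and number here is hypothetical, not a Gluten or Spark API.
final class PerTaskCapSketch {
  private final long maxPerTaskSize; // cap on what one task may hold
  private long curMemory;            // bytes the task holds now (all spillable in this toy)

  PerTaskCapSketch(long maxPerTaskSize, long curMemory) {
    this.maxPerTaskSize = maxPerTaskSize;
    this.curMemory = curMemory;
  }

  // Grants at most the headroom left under the cap; 0 when the task is already over it.
  private long grant(long size) {
    long granted = Math.min(size, Math.max(0L, maxPerTaskSize - curMemory));
    curMemory += granted;
    return granted;
  }

  // Releases up to `size` bytes and returns how many were released.
  long spill(long size) {
    long released = Math.min(size, curMemory);
    curMemory -= released;
    return released;
  }

  // Plain borrow: on a short grant, spill only the shortfall and try once more.
  long borrow(long size) {
    long granted = grant(size);
    if (granted < size) {
      spill(size - granted);
      granted += grant(size - granted);
    }
    return granted;
  }

  // Borrow with the retry described above: spill everything, then borrow the remainder.
  long borrowWithRetry(long size) {
    long granted = borrow(size);
    if (granted < size) {
      spill(Long.MAX_VALUE);
      granted += borrow(size - granted);
    }
    return granted;
  }

  public static void main(String[] args) {
    // The task holds 150 bytes while the cap is 100, so it is over its per-task limit.
    System.out.println(new PerTaskCapSketch(100, 150).borrow(20));          // 0
    System.out.println(new PerTaskCapSketch(100, 150).borrowWithRetry(20)); // 20
  }
}
```

In this toy, spilling only the 20 requested bytes leaves the task at 130, still above the cap of 100, so the plain borrow returns 0. Spilling everything first brings the task under the cap and the second borrow succeeds.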

How was this patch tested?

Tested with my production ETL.

@github-actions github-actions bot added the CORE works for Gluten Core label Dec 3, 2024

github-actions bot commented Dec 3, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


github-actions bot commented Dec 3, 2024

Run Gluten Clickhouse CI on x86

@kecookier kecookier changed the title [VL] Retry borrow when granted less than size in multi-core and shared mode [VL] Retry borrow when granted less than size in multi-slot and shared mode Dec 3, 2024
@zhouyuan zhouyuan changed the title [VL] Retry borrow when granted less than size in multi-slot and shared mode [GLUTEN-8128][VL] Retry borrow when granted less than size in multi-slot and shared mode Dec 4, 2024

github-actions bot commented Dec 4, 2024

#8128

@zhztheplayer
Member

Hi @kecookier thanks! Is this ready for review or still in draft?

@kecookier
Contributor Author

> Hi @kecookier thanks! Is this ready for review or still in draft?

@zhztheplayer Thanks, still in draft, more tests are needed for this PR.

@kecookier kecookier force-pushed the fix-multi-slot-shared branch from 057a94d to cce6aea Compare December 10, 2024 08:55
@kecookier kecookier marked this pull request as ready for review December 10, 2024 14:39
@kecookier
Contributor Author

@zhztheplayer This PR is ready. Can you help review it?

Comment on lines 128 to 150
  long requiredSize = Math.min(freeBytes(), size);
  long granted = borrow0(requiredSize);

  // If isDynamicCapacity is true, which means it is controlled by vanilla Spark,
  // and if the granted memory is less than the required size, it may be because
  // this task holds memory exceeding maxMemoryPerTask. Therefore, we actively retry
  // spilling all memory. After this, if there is still not enough memory acquired,
  // it should result in an OOM.
  if (granted < requiredSize && isDynamicCapacity) {
    LOGGER.info(
        "Exceed Spark perTaskLimit with maxTaskSizeDynamic when "
            + "require:{} got:{}, try spill all.",
        requiredSize,
        granted);
    long spilled = TreeMemoryTargets.spillTree(this, Long.MAX_VALUE);
    long remain = requiredSize - granted;
    if (spilled > remain) {
      granted += borrow0(remain);
    } else {
      // OOM
    }
  }
  return granted;
Member

@zhztheplayer zhztheplayer Dec 13, 2024

hi @kecookier, thank you for working on this.

Can we add an individual memory target for doing the retries?

Something like

  public static class RetryOnOomMemoryTarget extends MemoryTarget {
    private final TreeMemoryTarget target;

    @Override
    public long borrow(long size) {
      final long granted = target.borrow(size);
      if (granted >= size) {
        return granted;
      }
      // Granted < size. Spill the underlying memory target then retry borrowing.
      final long remaining = size - granted;
      TreeMemoryTargets.spillTree(target, Long.MAX_VALUE);
      final long granted2 = target.borrow(remaining);
      return granted + granted2;
    }
  }

I can help run a further benchmark to see whether chaining another memory target hurts overall performance, though we can give this a try anyway.

Contributor Author

@zhztheplayer I see, something like ThrowOnOomMemoryTarget( RetryOnOomMemoryTarget( Overacquired(...) ) )?
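
As a rough illustration of that nesting order, the sketch below layers a throwing wrapper over a retrying wrapper over a capped innermost target. Everything in it is an assumption made for illustration: the Target interface, CappedTarget, RetryOnShortGrant, ThrowOnShortGrant, and spillAll() are hypothetical stand-ins and do not reflect Gluten's actual MemoryTarget interface or its existing classes.

```java
// Sketch only: Target and all three classes are hypothetical stand-ins, not Gluten's classes.
final class ChainSketch {
  interface Target {
    long borrow(long size);

    long spillAll(); // stand-in for TreeMemoryTargets.spillTree(target, Long.MAX_VALUE)
  }

  // Innermost layer: grants only the headroom under a cap, 0 when already over it.
  static final class CappedTarget implements Target {
    private final long cap;
    private long held;

    CappedTarget(long cap, long held) {
      this.cap = cap;
      this.held = held;
    }

    @Override
    public long borrow(long size) {
      long granted = Math.min(size, Math.max(0L, cap - held));
      held += granted;
      return granted;
    }

    @Override
    public long spillAll() {
      long released = held;
      held = 0;
      return released;
    }
  }

  // Middle layer: on a short grant, spill everything once and borrow the remainder.
  static final class RetryOnShortGrant implements Target {
    private final Target inner;

    RetryOnShortGrant(Target inner) {
      this.inner = inner;
    }

    @Override
    public long borrow(long size) {
      long granted = inner.borrow(size);
      if (granted >= size) {
        return granted;
      }
      inner.spillAll();
      return granted + inner.borrow(size - granted);
    }

    @Override
    public long spillAll() {
      return inner.spillAll();
    }
  }

  // Outermost layer: a grant that is still short becomes an exception.
  // A real target would also have to give back the partial grant; omitted here.
  static final class ThrowOnShortGrant implements Target {
    private final Target inner;

    ThrowOnShortGrant(Target inner) {
      this.inner = inner;
    }

    @Override
    public long borrow(long size) {
      long granted = inner.borrow(size);
      if (granted < size) {
        throw new IllegalStateException(
            "OOM: requested " + size + " bytes, granted " + granted);
      }
      return granted;
    }

    @Override
    public long spillAll() {
      return inner.spillAll();
    }
  }
}
```

With this ordering the retry runs before the OOM check: new ThrowOnShortGrant(new RetryOnShortGrant(new CappedTarget(100, 150))).borrow(20) returns 20 in this toy, while the same chain without the middle layer throws.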

Member

Something like that, but I assume we would add it at this line, since only shared mode needs it.

Member

@zhztheplayer zhztheplayer Dec 13, 2024

We can add a comment like "Retry of spilling is needed in shared mode because ... " or something

    if (GlutenConfig.getConf().memoryIsolation()) {
      return TreeMemoryConsumers.isolated()
          .newConsumer(tmm, name, spiller, virtualChildren);
    } else {
      // Retry of spilling is needed in shared mode because...
      return MemoryTargets.retrySpillOnOom(
          TreeMemoryConsumers.shared()
              .newConsumer(tmm, name, spiller, virtualChildren));
    }

Contributor Author

@zhztheplayer Thanks, I'll update the code soon.

@kecookier kecookier force-pushed the fix-multi-slot-shared branch from c11d940 to 8637931 Compare December 16, 2024 08:30
@kecookier kecookier force-pushed the fix-multi-slot-shared branch from f0e0eea to 0875530 Compare December 16, 2024 10:44
Labels
CORE works for Gluten Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[VL] Gluten OOM with multi-slot executor configuration due to the Vanilla Spark memory acquisition strategy
2 participants