Insufficient memory of docker containers on CI #450

Open
3 tasks
fanyang-mono opened this issue Jul 25, 2023 · 23 comments

Comments

@fanyang-mono
Member

fanyang-mono commented Jul 25, 2023

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=351450

Build leg reported

Build / browser-wasm linux Release LibraryTests / Build product

Pull Request

dotnet/runtime#89217

Known issue core information

Fill out the known issue JSON section by following the step by step documentation on how to create a known issue

 {
    "ErrorMessage" : "[error]Exit code 137 returned from process: file name '/usr/bin/docker'",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
 }

@dotnet/dnceng

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Additional information about the issue reported

No response

Report

Summary

24-Hour Hit Count: 0
7-Day Hit Count: 0
1-Month Count: 0

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450
Error message validated: [error]Exit code 137 returned from process: file name '/usr/bin/docker'
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 7/26/2023 2:43:39 PM UTC
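
For readers unfamiliar with known-issue matching, here is a minimal sketch of how the JSON above could be checked against a build log line. It assumes that "ErrorMessage" is treated as a plain substring and "ErrorPattern" as a regular expression. The function name `line_matches` and the sample log line are hypothetical, and the real validation service may behave differently.

```python
import re

# Values copied from the known issue JSON in the issue body.
KNOWN_ISSUE = {
    "ErrorMessage": "[error]Exit code 137 returned from process: file name '/usr/bin/docker'",
    "BuildRetry": False,
    "ErrorPattern": "",
    "ExcludeConsoleLog": False,
}


def line_matches(known_issue: dict, log_line: str) -> bool:
    """Assumed semantics: ErrorMessage is a substring, ErrorPattern is a regex."""
    message = known_issue.get("ErrorMessage", "")
    pattern = known_issue.get("ErrorPattern", "")
    if message:
        return message in log_line
    if pattern:
        return re.search(pattern, log_line) is not None
    # Neither field is filled in, so nothing can match.
    return False


# Hypothetical log line, shaped like the error quoted in this issue.
sample = "##[error]Exit code 137 returned from process: file name '/usr/bin/docker', arguments '...'"
print(line_matches(KNOWN_ISSUE, sample))
```

An empty "ErrorMessage" (as in the first version of this issue) would match nothing under these assumptions, which is why the field had to be filled in.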

@andriipatsula
Member

Hello @fanyang-mono, could you please fill in the "ErrorMessage" : "" field by following the step-by-step documentation on how to create a known issue?

@fanyang-mono
Member Author

Updated.

@missymessa
Member

It's likely your process is using too much memory. Check to see when this started and if there were code changes around that time that could have caused this to occur.

https://www.airplane.dev/blog/exit-code-137

@missymessa
Member

@fanyang-mono, is this an infra issue? It looks like the errors are isolated to Runtime.

@fanyang-mono
Member Author

fanyang-mono commented Jul 27, 2023

@lewing Could you please confirm that this is a wasm build issue? This is the direct link to the build log https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450&view=logs&j=d4e38924-13a0-58bd-9074-6a4810543e7c&t=102a6595-1420-53fc-8f17-b0a3f4b1242a&l=5722

@lewing
Member

lewing commented Jul 27, 2023

Exit code 137 typically means the process was sent SIGKILL: 128 + 9 = 137. Given that this is happening inside docker containers, it is likely because they are hitting resource limits.
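
The arithmetic behind 137 can be sketched as follows, using the common shell convention that a process terminated by signal N is reported with exit status 128 + N. The helper name `describe_exit_code` and the small signal table (Linux numbering) are illustrative assumptions, not part of any CI tooling.

```python
# Linux signal numbers for a few common signals (illustrative subset).
SIGNAL_NAMES = {
    1: "SIGHUP",
    2: "SIGINT",
    6: "SIGABRT",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def describe_exit_code(code: int) -> str:
    """Decode an exit status using the 128 + signal-number convention."""
    if 128 < code < 256:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name} (128 + {signum})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
```

SIGKILL cannot be caught by the process, which is why an out-of-memory kill usually leaves no application-level error in the log.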

@lewing
Member

lewing commented Jul 27, 2023

what are the limits on the cloudtest containers?

@lewing
Member

lewing commented Aug 12, 2023

Based on the tracking we're seeing failures across multiple unrelated lanes (although they tend to be llvm related lanes). This is going to continue to cause pain unless we can get some idea of which processes are using memory at the point that the container is killed.
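
One way to see which processes are using memory is to sample resident set sizes from `/proc` periodically during the build, so that the last sample before the kill is in the log. The sketch below is only an illustration under that assumption. The function names are hypothetical, and it reads the standard `Name:` and `VmRSS:` fields of `/proc/<pid>/status` on Linux.

```python
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Return VmRSS in KiB from /proc/<pid>/status text, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def top_memory_processes(proc_root: str = "/proc", limit: int = 10):
    """Return up to `limit` tuples (rss_kib, pid, name), largest RSS first."""
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            text = (entry / "status").read_text()
        except OSError:
            continue  # the process exited while we were scanning
        name = "?"
        for line in text.splitlines():
            if line.startswith("Name:"):
                name = line.split(":", 1)[1].strip()
                break
        rows.append((parse_vm_rss_kib(text), int(entry.name), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for rss_kib, pid, name in top_memory_processes():
        print(f"{rss_kib:>12} KiB  pid={pid:<8} {name}")
```

Run from inside the container, this only sees processes in the container's PID namespace, which is the view that matters for a container-level limit.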

@radical
Member

radical commented Aug 12, 2023

@missymessa It would be very helpful to know what the limits are on the container. We might be running too close to the limits, in which case it would be helpful to have those bumped up.
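
As a rough way to find the limit from inside a container, the cgroup memory files can be read directly. This is a sketch that assumes the standard Linux locations (`memory.max` for cgroup v2, `memory/memory.limit_in_bytes` for cgroup v1) are mounted under `/sys/fs/cgroup`. The function name is hypothetical, and the CI images may be configured differently.

```python
from pathlib import Path
from typing import Optional


def read_memory_limit_bytes(cgroup_root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    root = Path(cgroup_root)
    candidates = [
        root / "memory.max",                       # cgroup v2
        root / "memory" / "memory.limit_in_bytes",  # cgroup v1
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file not present in this cgroup layout
        if raw == "max":
            return None  # cgroup v2 spelling of "no limit"
        value = int(raw)
        # cgroup v1 reports a very large sentinel value when no limit is set.
        if value >= 1 << 62:
            return None
        return value
    return None


if __name__ == "__main__":
    limit = read_memory_limit_bytes()
    print("no explicit limit found" if limit is None else f"{limit / 2**30:.2f} GiB")
```

A result of `None` would suggest the container is bounded only by the host VM's memory, which changes where to look for the cause.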

radical added a commit to radical/runtime that referenced this issue Aug 12, 2023
lewing pushed a commit to dotnet/runtime that referenced this issue Aug 12, 2023
@lewing
Member

lewing commented Aug 13, 2023

@dotnet/dnceng this is causing considerable pain. How should we escalate it? We can't diagnose the failures across multiple lanes and different runtimes without more detail.

@dougbu
Member

dougbu commented Aug 13, 2023

previous teams dealing w/ exit code 137 have worked w/ people on the runtime team to collect crash dumps and determine the root cause. it's also likely something changed in the runtime repo about a month ago that led to this issue.

@lewing
Member

lewing commented Aug 13, 2023

@dougbu the failures here are fairly random and span very different runtimes so a crash dump isn't likely to be deterministic. I would love to see the state of the container at shutdown time.
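
If the stopped container is still present on the host, part of its final state can be recovered from `docker inspect <container>`, which prints a JSON array whose `State` object includes `ExitCode` and `OOMKilled`. The sketch below only parses that output. The function name and the sample JSON are hypothetical, and the flag may not be set in every out-of-memory scenario, so it should be read as a hint rather than proof.

```python
import json


def summarize_container_state(inspect_json: str) -> str:
    """Summarize the State object from `docker inspect <container>` output."""
    state = json.loads(inspect_json)[0]["State"]
    exit_code = state.get("ExitCode")
    if state.get("OOMKilled"):
        return f"exit code {exit_code}: OOM-killed by the kernel"
    return f"exit code {exit_code}: not flagged as OOM-killed"


# Hypothetical, heavily trimmed inspect output for illustration only.
sample = '[{"State": {"Status": "exited", "ExitCode": 137, "OOMKilled": true}}]'
print(summarize_container_state(sample))
```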

@lewing
Member

lewing commented Aug 13, 2023

cc @agocke for the nativeAOT failures

@lewing
Member

lewing commented Aug 13, 2023

@dougbu, or could you edit the core information to enable retry? I can't.

@lewing
Member

lewing commented Aug 13, 2023

also dotnet/runtime#89402

@dougbu
Member

dougbu commented Aug 14, 2023

@lewing we don't have much to go on here. for one thing, we don't mess w/ "limits" in the Helix queues other than the file count maximum.

suggest you use the helix-repro-vms DevTest Labs to create a VM matching the queue used in your tests. then, do whatever you can to run the tests on that VM in a way that captures a dump. the dump should at least indicate what is causing the exit code. note the core dump should be created in the main process, not w/in the Docker container. I believe @agocke has experience using dumps to debug occasional build and test strangeness.

we can increase whatever limit appears to be the problem, within limits.

@dougbu
Member

dougbu commented Aug 14, 2023

on test retries, please consider changing your eng/test-configuration.json file. that's documented in https://github.com/dotnet/arcade/blob/d3b8861e20aaf0179034c6076d156e2442b26f9b/src/Microsoft.DotNet.Helix/Sdk/Readme.md#test-retry and dotnet/runtime's file already automatically retries based on a handful of error messages

@dougbu
Member

dougbu commented Aug 14, 2023

oh, btw, if it's a true memory restriction as dotnet/runtime#89402 was, we might be able to bump things up. however there might not be budget and the problem certainly isn't related to a decrease in anything on our side. more likely the test count or memory footprint went up before this issue was observed. if that's the case, the most straightforward fix would be to split a large test project in two

@fanyang-mono
Member Author

fanyang-mono commented Aug 24, 2023

According to the table, the linux-x64 Mono LLVMFullAot RuntimeTests lane also very often runs out of docker container memory during AOT compilation.

fanyang-mono changed the title from "[error]Exit code 137 returned from process: file name '/usr/bin/docker'" to "Insufficient memory of docker containers" on Aug 24, 2023
fanyang-mono changed the title from "Insufficient memory of docker containers" to "Insufficient memory of docker containers on CI" on Aug 24, 2023

@riarenas
Member

riarenas commented Aug 24, 2023

Should this be moved to the runtime repo since it only affects that repo, especially since we're waiting for additional information while they check a repro vm?

@dougbu
Member

dougbu commented Nov 23, 2023

@lewing please move this to the runtime repo (and, perhaps, work using the helix-repro-vms to narrow the issue down). when you've found a specific action to take, please describe it in the First Responders channel. we may have a way to bump limits but it's more likely the runtime team will need to reduce or simplify something to resolve this issue.

@dougbu
Member

dougbu commented Jan 8, 2024

ping @lewing. we're still hitting this problem occasionally but I'm not seeing anything outside runtime builds. there might be some change we could make but we don't have any information on our side. if you have a suggestion…
