Insufficient memory of docker containers on CI #450
Comments
Hello @fanyang-mono, could you please update the |
Updated. |
It's likely your process is using too much memory. Check to see when this started and if there were code changes around that time that could have caused this to occur. |
@fanyang-mono, is this an infra issue? It looks like the errors are isolated to Runtime. |
@lewing Could you please confirm that this is a wasm build issue? This is the direct link to the build log https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450&view=logs&j=d4e38924-13a0-58bd-9074-6a4810543e7c&t=102a6595-1420-53fc-8f17-b0a3f4b1242a&l=5722 |
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/352553/logs/541 is definitely not a wasm build issue |
Exit code 137 typically means the process was sent SIGKILL (128 + 9 = 137). Given that this is happening inside docker containers, it is likely because they are hitting resource limits. |
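To make the arithmetic concrete, here is a minimal sketch of decoding a shell-style exit code, assuming the usual convention that values above 128 mean "terminated by signal (code - 128)". The helper name `describe_exit_code` is made up for illustration and is not part of any CI tooling; signal names are resolved by the host's `signal` module, so this assumes a Linux agent.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code (hypothetical helper for illustration)."""
    if code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"exit code {code}: killed by signal {signum} ({name})"
    if code == 127:
        # Shell convention: the command could not be found.
        return "exit code 127: command not found"
    return f"exit code {code}: process exited on its own"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

A SIGKILL on its own does not prove an out-of-memory kill; it only says something outside the process terminated it, which is consistent with a container hitting a memory limit.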
what are the limits on the cloudtest containers? |
Based on the tracking we're seeing failures across multiple unrelated lanes (although they tend to be llvm related lanes). This is going to continue to cause pain unless we can get some idea of which processes are using memory at the point that the container is killed. |
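One possible way to capture which processes are using memory is to snapshot the largest resident set sizes from `/proc` shortly before or during the heavy build step. The sketch below is only an assumption about how that could be done; `top_memory_processes` is a hypothetical helper, not something the build or Helix provides, and it only sees processes visible in the PID namespace it runs in.

```python
from pathlib import Path


def top_memory_processes(proc_root: str = "/proc", count: int = 5):
    """Return up to `count` (rss_kb, pid, name) tuples, largest RSS first.

    Hypothetical helper: reads VmRSS from each /proc/<pid>/status file.
    """
    results = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            lines = (entry / "status").read_text().splitlines()
        except OSError:
            continue  # process exited or is not readable
        fields = dict(line.split(":", 1) for line in lines if ":" in line)
        rss = fields.get("VmRSS")
        if rss is None:
            continue  # kernel threads report no VmRSS
        rss_kb = int(rss.split()[0])  # value looks like "  1234 kB"
        results.append((rss_kb, int(entry.name), fields.get("Name", "").strip()))
    results.sort(reverse=True)
    return results[:count]


if __name__ == "__main__":
    for rss_kb, pid, name in top_memory_processes():
        print(f"{rss_kb:>10} kB  pid={pid}  {name}")
```

Running something like this periodically in the background of a failing lane would at least show which process was growing before the container was killed.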
@missymessa It would be very helpful to know what the limits are on the container. We might be running too close to the limits, in which case it would be helpful to have those bumped up. |
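As a rough way to check the limit from inside a container, the cgroup memory files can be read directly. This is a sketch under assumptions: it looks at the standard cgroup v2 files (`memory.max`, `memory.current`) and falls back to the cgroup v1 files (`memory/memory.limit_in_bytes`, `memory/memory.usage_in_bytes`); whether the CI containers expose these at `/sys/fs/cgroup` has not been verified here, and `container_memory` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def read_cgroup_value(path: Path) -> Optional[int]:
    """Read an integer cgroup file; return None if missing or set to 'max'."""
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    if text == "max":  # cgroup v2 spelling of "no limit"
        return None
    return int(text)


def container_memory(root: str = "/sys/fs/cgroup") -> dict:
    """Return the memory limit and current usage in bytes (hypothetical helper)."""
    base = Path(root)
    candidates = [
        # cgroup v2 layout
        (base / "memory.max", base / "memory.current"),
        # cgroup v1 layout
        (base / "memory" / "memory.limit_in_bytes",
         base / "memory" / "memory.usage_in_bytes"),
    ]
    for limit_path, usage_path in candidates:
        if limit_path.exists():
            return {
                "limit_bytes": read_cgroup_value(limit_path),
                "usage_bytes": read_cgroup_value(usage_path),
            }
    return {"limit_bytes": None, "usage_bytes": None}


if __name__ == "__main__":
    print(container_memory())
```

A `limit_bytes` of `None` means no limit file was found or no limit is set. On cgroup v1 an unlimited container reports a very large number rather than `max`.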
Prompted by failures described in dotnet/dnceng#450
@dotnet/dnceng this is causing considerable pain. How should we escalate it? We can't diagnose the failures across multiple lanes and different runtimes without more detail. |
previous teams dealing w/ exit code |
@dougbu the failures here are fairly random and span very different runtimes so a crash dump isn't likely to be deterministic. I would love to see the state of the container at shutdown time. |
cc @agocke for the nativeAOT failures |
@dougbu or edit the core information to retry, I can't |
also dotnet/runtime#89402 |
@lewing we don't have much to go on here. for one thing, we don't mess w/ "limits" in the Helix queues other than the file count maximum. suggest you use the we can increase whatever limit appears to be the problem, within limits. |
on test retries, please consider changing your eng/test-configuration.json file. that's documented in https://github.com/dotnet/arcade/blob/d3b8861e20aaf0179034c6076d156e2442b26f9b/src/Microsoft.DotNet.Helix/Sdk/Readme.md#test-retry and dotnet/runtime's file already automatically retries based on a handful of error messages |
oh, btw, if it's a true memory restriction as dotnet/runtime#89402 was, we might be able to bump things up. however there might not be budget and the problem certainly isn't related to a decrease in anything on our side. more likely the test count or memory footprint went up before this issue was observed. if that's the case, the most straightforward fix would be to split a large test project in two |
According to the table, the linux-x64 Mono LLVMFullAot RuntimeTests lane also ran out of docker container memory during AOT compilation very often. |
Should this be moved to the runtime repo since it only affects that repo, especially since we're waiting for additional information while they check a repro vm? |
@lewing please move this to the runtime repo (and, perhaps, work using the |
ping @lewing. we're still hitting this problem occasionally but I'm not seeing anything outside runtime builds. there might be some change we could make but we don't have any information on our side. if you have a suggestion… |
Build
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=351450
Build leg reported
Build / browser-wasm linux Release LibraryTests / Build product
Pull Request
dotnet/runtime#89217
Known issue core information
Fill out the known issue JSON section by following the step by step documentation on how to create a known issue
@dotnet/dnceng
Release Note Category
Release Note Description
Additional information about the issue reported
No response
Report
Summary
Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=351450
Error message validated:
[error]Exit code 137 returned from process: file name '/usr/bin/docker'
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 7/26/2023 2:43:39 PM UTC