
JsonSerializer.Deserialize(ROS<byte>, JsonTypeInfo<T>) hangs with AllowOutOfOrderMetadataProperties on Amazon Linux 2023 with native AoT #107740

Closed
martincostello opened this issue Sep 12, 2024 · 11 comments


@martincostello
Member

martincostello commented Sep 12, 2024

Description

In trying to deploy a native AoT application to AWS Lambda using Amazon Linux 2023 that uses JSON polymorphism with AllowOutOfOrderMetadataProperties=true, the code appears to hang when deserializing a payload. Requests time out after 10 seconds (the Lambda timeout), and increasing the timeout as high as 2 minutes makes no difference. Because of the limited ability to reach into the AWS Lambda execution environment, most of my observations have come from Console.WriteLine() calls showing which log messages are or aren't reached.

The debugging trial and error can be observed here: martincostello/alexa-london-travel#1434

The use of AllowOutOfOrderMetadataProperties is needed because the payloads I wish to deserialize are AWS-owned events for Alexa devices, where the type discriminator is not the first property of the JSON document, and this is not something that can be changed.
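
For illustration, here's a minimal sketch of the shape involved (these are hypothetical stand-in types, not the actual types from the skill project). The type discriminator sits in the middle of the request object, so out-of-order metadata support has to be enabled:

using System;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Text.Json.Serialization.Metadata;

// Hypothetical stand-ins for the Alexa request types
[JsonPolymorphic(TypeDiscriminatorPropertyName = "type")]
[JsonDerivedType(typeof(IntentRequest), "IntentRequest")]
[JsonDerivedType(typeof(LaunchRequest), "LaunchRequest")]
public abstract class Request
{
    public string? Locale { get; set; }
    public string? RequestId { get; set; }
    public string? Timestamp { get; set; }
}

public sealed class IntentRequest : Request
{
    public Intent? Intent { get; set; }
}

public sealed class LaunchRequest : Request
{
}

public sealed class Intent
{
    public string? Name { get; set; }
}

[JsonSourceGenerationOptions(PropertyNamingPolicy = JsonKnownNamingPolicy.CamelCase)]
[JsonSerializable(typeof(Request))]
public partial class AppJsonContext : JsonSerializerContext
{
}

public static class Example
{
    // AllowOutOfOrderMetadataProperties is required because "type" is not the
    // first property of the object; without it the serializer throws, as noted below.
    private static readonly JsonSerializerOptions Options = new()
    {
        AllowOutOfOrderMetadataProperties = true,
        TypeInfoResolver = AppJsonContext.Default,
    };

    public static Request? Deserialize(ReadOnlySpan<byte> utf8Json)
        => JsonSerializer.Deserialize(utf8Json, (JsonTypeInfo<Request>)Options.GetTypeInfo(typeof(Request)));
}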

I haven't yet been able to create a minimal repro for this, but these are the points of interest I've deduced through lots of trial and error.

  • The application works fine with .NET 9 RC1 in general if polymorphism isn't used.
  • The application works with polymorphism if native AoT is disabled.
  • Setting AllowOutOfOrderMetadataProperties=false causes the serializer to throw, as expected, so the issue seems specific to enabling this feature.
  • The issue still reproduces with the latest nightly builds (I've been testing today with version 9.0.100-rc.2.24462.3 of the SDK).
  • I haven't been able to replicate this issue with my tests, including ones that run under the Microsoft Test SDK's native AoT support on macOS, Linux or Windows in GitHub Actions. It's possible it's specific to Amazon Linux 2023 running on Graviton (ARM64).

I also did a bit of digging into whether this was an exception being thrown whose logging then hangs when writing to the console (see microsoft/testfx#3485 (comment)), but I think I've ruled that out by turning logging off and other experiments (unless it is that, and the fix there is incomplete).

Reproduction Steps

Currently all I can offer is this commit for inspection: martincostello/alexa-london-travel@d8ced7f

I'll try to distil things down to a more minimal repro, but if it's specific to Amazon Linux 2023 and/or Graviton, that might be tricky.

Here's an example payload:

{
  "session": {
    "new": true,
    "sessionId": "amzn1.echo-api.session.LAMBDA-TEST-DEV",
    "attributes": {},
    "user": {
      "userId": "amzn1.ask.account.LAMBDA-TEST-DEV"
    },
    "application": {
      "applicationId": "amzn1.ask.skill.6ccfbdb5-3e68-47fa-bf80-b3ed54a64ac5"
    }
  },
  "version": "1.0",
  "request": {
    "locale": "en-US",
    "timestamp": "2016-10-27T18:21:44Z",
    "type": "IntentRequest",
    "requestId": "amzn1.echo-api.request.LAMBDA-TEST-DEV",
    "intent": {
      "name": "AMAZON.HelpIntent",
      "slots": {}
    }
  },
  "context": {
    "AudioPlayer": {
      "playerActivity": "IDLE"
    },
    "System": {
      "device": {
        "supportedInterfaces": {
          "AudioPlayer": {}
        }
      },
      "application": {
        "applicationId": "amzn1.ask.skill.6ccfbdb5-3e68-47fa-bf80-b3ed54a64ac5"
      },
      "user": {
        "userId": "amzn1.ask.account.LAMBDA-TEST-DEV"
      }
    }
  }
}

Expected behavior

The JSON payloads are (de)serialized correctly, as they are when native AoT is not used.

Actual behavior

The code hangs. The AWS Lambda runtime doesn't appear to emit any logs or telemetry that suggest the runtime itself is crashing.

Regression?

It was working at some point during the .NET 9 development cycle: once I got out-of-order metadata polymorphism working, I ran into #105034. Some time after that I started using testfx's AoT support to get feedback on native AoT issues, so deploying to the real environment to validate things became less necessary.

Known Workarounds

Don't use JSON polymorphism.

Configuration

  • .NET SDK 9.0.100-rc.1.24452.12 and 9.0.100-rc.2.24462.3
  • Amazon Linux 2023

Other information

No response

The dotnet-policy-service bot added the untriaged (New issue has not been triaged by the area owner) label on Sep 12, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

Looks like a tricky one. Does the hang repro consistently with the same JSON payload that you shared here? Are you able to insert any tracing that might pinpoint where exactly the hang is happening? What is the smallest possible polymorphic type that reproduces the hang?

@vcsjones
Member

I have a Graviton / Amazon Linux instance I can test this on.

Any tips or commands I should run to try and repro? Even if it isn't minimal, I can try to whittle it down from there.

@martincostello
Member Author

Does the hang repro consistently with the same JSON payload that you shared here?

The hangs occur when invoking the Lambda directly via the AWS Lambda APIs (to bypass the need to simulate an Alexa device), using these payloads from the test project. Each one seems to hit the same issue and precipitate the hang.

Are you able to insert any tracing that might pinpoint where exactly the hang is happening?

The closest I've got to tracing so far is these two Console.WriteLine() calls: code.

The first one is output to CloudWatch, but the second one is never seen. There's just a $TIMEOUT-length gap after the base64-encoded payload is logged, and then the timeout is logged by the Lambda runtime. The console messages in the serializer method below are never emitted, nor are my "normal" logs from the application code that receives the deserialized payload.
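
For context, the two trace calls bracket the deserialization roughly like this (a hypothetical reconstruction, not the actual linked code):

Console.WriteLine($"Request: {Convert.ToBase64String(content)}"); // reaches CloudWatch
var request = JsonSerializer.Deserialize(content, serializerContext.SkillRequest); // appears to never return
Console.WriteLine("Deserialized request"); // never seen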

Here's an example of the logs I see for a single invocation:

2024-09-12T12:56:21.602Z - START RequestId: 6535ff38-3723-40e9-a4e3-e9229e344360 Version: $LATEST
2024-09-12T12:56:21.603Z - 6535ff38-3723-40e9-a4e3-e9229e344360	info	Request: ewogICJzZXNzaW9uIjogewogICAgIm5ldyI6IHRydWUsCiAgICAic2Vzc2lvbklkIjogImFtem4xLmVjaG8tYXBpLnNlc3Npb24uTEFNQkRBLVRFU1QtREVWIiwKICAgICJhdHRyaWJ1dGVzIjoge30sCiAgICAidXNlciI6IHsKICAgICAgInVzZXJJZCI6ICJhbXpuMS5hc2suYWNjb3VudC5MQU1CREEtVEVTVC1ERVYiCiAgICB9LAogICAgImFwcGxpY2F0aW9uIjogewogICAgICAiYXBwbGljYXRpb25JZCI6ICJhbXpuMS5hc2suc2tpbGwuNDlmMTM1NzQtODEzNC00NzQ4LWFmZWItNjJlZjFkZWZmZmE2IgogICAgfQogIH0sCiAgInZlcnNpb24iOiAiMS4wIiwKICAicmVxdWVzdCI6IHsKICAgICJsb2NhbGUiOiAiZW4tVVMiLAogICAgInRpbWVzdGFtcCI6ICIyMDE2LTEwLTI3VDE4OjIxOjQ0WiIsCiAgICAidHlwZSI6ICJJbnRlbnRSZXF1ZXN0IiwKICAgICJyZXF1ZXN0SWQiOiAiYW16bjEuZWNoby1hcGkucmVxdWVzdC5MQU1CREEtVEVTVC1ERVYiLAogICAgImludGVudCI6IHsKICAgICAgIm5hbWUiOiAiU3RhdHVzSW50ZW50IiwKICAgICAgInNsb3RzIjogewogICAgICAgICJMSU5FIjogewogICAgICAgICAgIm5hbWUiOiAiTElORSIsCiAgICAgICAgICAidmFsdWUiOiAiTm9ydGhlcm4iCiAgICAgICAgfQogICAgICB9CiAgICB9CiAgfSwKICAiY29udGV4dCI6IHsKICAgICJBdWRpb1BsYXllciI6IHsKICAgICAgInBsYXllckFjdGl2aXR5IjogIklETEUiCiAgICB9LAogICAgIlN5c3RlbSI6IHsKICAgICAgImRldmljZSI6IHsKICAgICAgICAic3VwcG9ydGVkSW50ZXJmYWNlcyI6IHsKICAgICAgICAgICJBdWRpb1BsYXllciI6IHt9CiAgICAgICAgfQogICAgICB9LAogICAgICAiYXBwbGljYXRpb24iOiB7CiAgICAgICAgImFwcGxpY2F0aW9uSWQiOiAiYW16bjEuYXNrLnNraWxsLjQ5ZjEzNTc0LTgxMzQtNDc0OC1hZmViLTYyZWYxZGVmZmZhNiIKICAgICAgfSwKICAgICAgInVzZXIiOiB7CiAgICAgICAgInVzZXJJZCI6ICJhbXpuMS5hc2suYWNjb3VudC5MQU1CREEtVEVTVC1ERVYiCiAgICAgIH0KICAgIH0KICB9Cn0K
2024-09-12T12:56:31.707Z - 6535ff38-3723-40e9-a4e3-e9229e344360 Task timed out after 10.10 seconds
2024-09-12T12:56:31.707Z - END RequestId: 6535ff38-3723-40e9-a4e3-e9229e344360
2024-09-12T12:56:31.707Z - REPORT RequestId: 6535ff38-3723-40e9-a4e3-e9229e344360	Duration: 10104.59 ms	Billed Duration: 10000 ms	Memory Size: 192 MB	Max Memory Used: 58 MB	XRAY TraceId: 1-66e2e4f5-14e0125b0ec936b12c541a63	SegmentId: 015480636e94dbf6	Sampled: true	

Unfortunately X-Ray/OTel doesn't give me any granularity to get deeper.

What is the smallest possible polymorphic type that reproduces the hang?

I'm just about to try and see if I can put something together using the Amazon Linux 2023 docker image.

@martincostello
Member Author

Any tips or commands I should run to try and repro? Even if it isn't minimal, I can try to whittle it down from there.

As a starting point I guess run the end to end test project - if it's as simple as "Graviton bad" then maybe the tests will just hang/explode immediately.

Failing that, you could try hacking apart FunctionEntrypoint so that, instead of creating the bootstrap and calling RunAsync() (which assumes the AWS Lambda runtime API is there), it just loads the sample JSON payload directly and runs it through the deserializer manually.
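
Something along these lines should be enough as a rough sketch (hypothetical file and type names rather than the project's real ones):

// Minimal console repro: load a sample payload and push it through the
// source-generated, polymorphic deserialization path directly.
using System;
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization.Metadata;

byte[] payload = File.ReadAllBytes("IntentRequest.json");

var options = new JsonSerializerOptions
{
    AllowOutOfOrderMetadataProperties = true,
    TypeInfoResolver = AppJsonContext.Default,
};

Console.WriteLine("Deserializing...");

var request = JsonSerializer.Deserialize(
    (ReadOnlySpan<byte>)payload,
    (JsonTypeInfo<Request>)options.GetTypeInfo(typeof(Request)));

// If the hang reproduces under native AoT, this line is never reached.
Console.WriteLine($"Deserialized: {request?.GetType().Name}");

Published with PublishAot=true and run on the arm64 box, it should either hang or complete more or less immediately.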

@jkotas
Member

jkotas commented Sep 12, 2024

This may be duplicate of #107347 . cc @MichalStrehovsky

Can you try to add:

  <ItemGroup>
    <PackageReference Include="Microsoft.Dotnet.ILCompiler" Version="9.0.0-rc.2.24461.16" />
  </ItemGroup>

to your csproj and check whether it still repros? It will reference the latest nightly build of dotnet/runtime. The SDK is always a few days behind, and the SDK that you are testing with does not yet have the fix that I linked.
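
If the restore can't find that version, you'll probably also need the .NET nightly package feed in a NuGet.config, something like this (assuming the dotnet9 feed is the right one):

<configuration>
  <packageSources>
    <clear />
    <add key="dotnet9" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet9/nuget/v3/index.json" />
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
  </packageSources>
</configuration>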

@martincostello
Member Author

Sure I'll try that now.

@martincostello
Member Author

Just testing now (it's looking promising). For reference, I put together a Dockerfile to run my native AoT tests on arm64 Amazon Linux 2023, which @hwoodiwiss kindly ran for me on an arm64 machine (my Windows install of Docker wasn't having a good time of it), and that didn't repro the issue I get when deployed either:

FROM --platform=arm64 amazonlinux:latest

# Packages needed to run the dotnet-install script and the .NET SDK
RUN dnf install findutils gzip libicu krb5-libs openssl-libs tar zlib -y

COPY . /source
WORKDIR /source

# Install the .NET SDK version pinned by global.json
RUN curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --jsonfile ./global.json --install-dir $HOME/.dotnet

# Run the native AoT test project, caching restored NuGet packages between builds
RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages \
    export DOTNET_ROOT=$HOME/.dotnet \
    && export PATH=$PATH:$DOTNET_ROOT:$DOTNET_ROOT/tools \
    && dotnet run --project test/LondonTravel.Skill.NativeAotTests

@martincostello
Member Author

Yep, thanks @jkotas, adding these two packages resolved the issue:

<PackageReference Include="Microsoft.Dotnet.ILCompiler" Version="9.0.0-rc.2.24461.16" />
<PackageReference Include="runtime.linux-x64.Microsoft.DotNet.ILCompiler" Version="9.0.0-rc.2.24461.16" />

Now the tests pass as expected: martincostello/alexa-london-travel#1434 (comment)

Looks like this was a duplicate of #107300.

@eiriktsarpalis
Member

Thanks, closing as duplicate of #107300.

@martincostello
Member Author

Confirmed the issue is resolved using 9.0.100-rc.2.24463.1 of the SDK without needing the additional package references.
