Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(route53): new Route53 CrossAccountRecordSet construct #31281

Closed
wants to merge 1 commit into from

Conversation

lukepafford
Copy link

@lukepafford lukepafford commented Sep 2, 2024

Issue

Closes #15213

Reason for this change

Deploying your APEX Route53 HostedZone (example.com) to a designated AWS account, and delegating child HostedZones (us-east-1.prod.example.com, us-east-2.prod.example.com) to other AWS accounts is a common setup. Managing this setup through CDK becomes impossible once a child account contains RecordSet information that needs to be created in the APEX zone. This scenario is most common when you want to create a RecordSet at the root of your domain that contains a non-simple routing policy, however the information needed to create those RecordSets is distributed between multiple child CloudFormation stacks.

This CrossAccountRecordSet construct bridges this gap by creating the RecordSet in the APEX HostedZone, and it only needs the following props:

  • A crossAccountRole: iam.IRole that the custom resource will assume which authorizes the Route53 changes in the APEX HostedZone.
  • The same RecordSetProps that exists for a RecordSet.

This issue is quite popular, as #15213 has 36 reactions, and there are already a couple constructs addressing this on ConstructHub:

This change was created because I believe this functionality should exist as a core part of the CDK. As for using this implementation over the ConstructHub implementations, I think the implementations are too simple for how many rough edges the Route53 ChangeResourceRecordSets API has, which is explained further in the change. I also think the CrossAccountRecordSet construct interface is intuitive, and easy for existing users to adopt.


This change doesn't solve all CrossAccount DNS related problems. A specific pain for me is that the CertificateValidation doesn't support cross account updates (see #8934), which diminishes the impact of this change. Still, this change is already quite large and those problems can be addressed in later changes. I first wanted to confirm this change moves in a direction the CDK team finds acceptable considering the DnsValidatedCertificate deprecation has so much back and forth dialogue around it.

Side note: The CloudFormation AWS::CertificateManager::Certificate should make the CNAME RecordName and RecordValue a Return Value accessible through Fn::GetAtt. You could then pass those values to this CrossAccountRecordSet construct to enable cross account DNS validation for your ACM certificate!
...One problem at a time.

Key Design Decisions

  • I spent a significant amount of time debating back and forth between the construct accepting RecordSetProps versus accepting an array of RecordSetProps[] which would fully address the concerns of aws_route53: Allow updates to existing records in a single operation #26754. The aws-route53/README.md update defends why this decision was made.

  • The @aws-cdk/custom-resource-handlers/lib/aws-route53/cross-account-record-set-handler/index.ts custom resource handler may appear to be overly complicated because the implementation performs DELETE/CREATE ChangeBatches, rather than an UPSERT. I faced so much pain when testing various RecordSet changes relying on only Route53 UPSERT, and I firmly believe this will result in a better user experience for the price of a more complicated implementation.

  • The @aws-cdk/custom-resource-handlers/lib/aws-route53/cross-account-record-set-handler/index.ts returns its own PhysicalResourceId, which requires sending a Custom Resource Response Object. The PhysicalResourceId class explains why this was necessary. Route53 uniquely identifies a RecordSet in a non obvious way, and this implementation 🪄 just does the right thing 🪄 .

  • Up until the very end, the CrossAccountRecordSet construct supported an evaluateTargetHealth prop, but I decided to remove it because the RecordSetProps doesn't support that prop, and I only wanted this change to match the features for what already exists (See feat(route53): added EvaluateTargetHealth to Route53 Alias targets (#9481) #30664). There are a few parameters to the ChangeResourceRecordSet API that aren't available in the Route53 CDK package, and should probably be addressed in a separate change. I'm most interested in figuring out a sane way to configure health checks, but looks like this is an open issue that is being worked on ([route53] Support for health checks #9481 feat(route53): added L2 construct for Route53's health checks #30739).

  • By the time I was mostly done with the implementation, I reread the NEW_CONSTRUCTS_GUIDE and realized this construct should probably go into an alpha package. If reviewers want this to be made an alpha package, I'll do the work of refactoring the code, but I don't want to spend more time on this unless I know the changes would be merged.

Description of changes

  • Creates a new CrossAccountRecordSet construct that creates Route53 RecordSet resources in a HostedZone that exists in another account.
  • Minor refactors to utility files for code reuse:
    • route53Region from @aws-cdk/custom-resource-handlers/lib/aws-route53/cross-account-zone-delegation-handler/index.ts -> @aws-cdk/custom-resource-handlers/lib/aws-route53/sts-util.ts.
    • validateRecordSetProps from aws-cdk-lib/aws-route53/lib/record-set.ts -> aws-cdk-lib/aws-route53/lib/util.ts.
  • submitResponse was made a public export from packages/@aws-cdk/custom-resource-handlers/lib/core/nodejs-entrypoint-handler/index.ts This seemed like a safe change to me, and it would have been a waste to rewrite this functionality in the custom handler.

Description of how you validated changes

  • Unit Tests (100% cov)
  • Integ Tests

Personal Issues

  • This may be an issue on my part, but I couldn't find any good documentation on creating and validating .lit.ts files. I basically had to copy/paste packages/aws-cdk-lib/aws-route53/test/integ.cross-account-record-set-simple-example.lit.ts into the integ test at packages/@aws-cdk-testing/framework-integ/test/aws-route53/test/integ.cross-account-record-set-simple-example.lit.ts. One file purely exists for README documentation, and the other exists as an IntegTest. From what I can tell, past changes are basically duplicating these files as well. Hopefully I'm missing something.

  • In order to get Istanbul to report my code coverage correctly, I had to make the following tsconfig.json changes:

diff --git a/packages/@aws-cdk/custom-resource-handlers/tsconfig.json b/packages/@aws-cdk/custom-resource-handlers/tsconfig.json
index 312bd74a61..eb97e8ad3d 100644
--- a/packages/@aws-cdk/custom-resource-handlers/tsconfig.json
+++ b/packages/@aws-cdk/custom-resource-handlers/tsconfig.json
@@ -6,8 +6,7 @@
         "strict": true,
         "alwaysStrict": true,
         "declaration": true,
-        "sourceMap": false,
-        "inlineSourceMap": false,
+        "inlineSourceMap": true,
         "inlineSources": false,
         "noUnusedLocals": true,
         "noUnusedParameters": true,
  • I spent half my time fighting the build system. npx lerna run build never compiled the files like I thought it would, and yarn build was inconsistent. I would frequently deploy changes and be confused as to why it didn't work, only to realize the build commands didn't actually work. I finally realized I should use yarn watch as that was mostly consistent for compiling my changes on file save, but even that had to be frequently killed and restarted. I'm not sure if this is a problem with my setup, but I basically learned to not trust that my files were actually built, and always inspect the compiled .js files to confirm they contained the changes that I expected.

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK label Sep 2, 2024
@aws-cdk-automation aws-cdk-automation requested a review from a team September 2, 2024 06:13
@github-actions github-actions bot added the p2 label Sep 2, 2024
@lukepafford lukepafford force-pushed the cross-account-record-set branch from 74fbbd8 to c96be8d Compare September 2, 2024 06:13
if (props.SetIdentifier) {
listResourceRecordSetArgs.StartRecordIdentifier = props.SetIdentifier;
}
defaultLog('Searching for recordsets with args: ', listResourceRecordSetArgs);
Copy link
Author

@lukepafford lukepafford Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module has a lot of log statements which were very beneficial when testing the construct, and troubleshooting in the CloudWatch log groups. I see some custom resources that have log statements, but most don't. I'm fine with removing the log statements if required.

Comment on lines -95 to -132
/**
* Return the region that hosts the Route53 endpoint
*
* Route53 is a partitional service: the control plane lives in one particular region,
* which is different for every partition.
*
* The SDK knows how to convert a "target region" to a "route53 endpoint", which
* equates to a (potentially different) region. However, when we use STS
* AssumeRole credentials, we must grab credentials that will work in that
* region.
*
* By default, STS AssumeRole will call the STS endpoint for the same region
* as the Lambda runs in. Normally, this is all good. However, when the AssumeRole
* is used to assume a role in a different account A, the AssumeRole will fail if the
* Lambda is executing in an an opt-in region R to which account A has not been opted in.
*
* To solve this, we will always AssumeRole in the same region as the Route53 call will
* resolve to.
*/
function route53Region(region: string) {
const partitions = {
'cn': 'cn-northwest-1',
'us-gov': 'us-gov-west-1',
'us-iso': 'us-iso-east-1',
'us-isob': 'us-isob-east-1',
'eu-isoe': 'eu-isoe-west-1',
'us-isof': 'us-isof-south-1',
};

for (const [prefix, mainRegion] of Object.entries(partitions)) {
if (region.startsWith(`${prefix}-`)) {
return mainRegion;
}
}

// Default for commercial partition
return 'us-east-1';
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved into sts-util.ts so the logic can be reused. I didn't want to take any chances and reintroduce problems solved in #26593

Comment on lines -98 to +99
async function submitResponse(status: 'SUCCESS' | 'FAILED', event: Response) {
export async function submitResponse(status: 'SUCCESS' | 'FAILED', event: Response) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make submitResponse public so we can reuse it in the CrossAccountRecordSet custom resource handler.

@lukepafford lukepafford force-pushed the cross-account-record-set branch 4 times, most recently from dd9f8ed to 4c22422 Compare September 2, 2024 09:21
@lukepafford
Copy link
Author

lukepafford commented Sep 2, 2024

Need to figure out the correct way to get the integration tests to pass in the CodeBuild job, then the change will be ready for review.

Test snapshots seem to have the account ID magically referenced as 12345678. It's very confusing where this value is coming from.


Finally found it it's in the runner-base.ts file. Still not clear how anyone correctly checks in integ tests with a second AWS account, but still experimenting..


Looks like I have to modify the integ test so that the CodeBuild job doesn't try to replace these resources, but that requires changing the test quite a bit which is unfortunate:

@aws-cdk-testing/framework-integ: Failed: /codebuild/output/src2376603837/src/github.com/aws/aws-cdk/packages/@aws-cdk-testing/framework-integ/test/aws-route53/test/integ.cross-account-record-set.js
@aws-cdk-testing/framework-integ: !!! This test contains destructive changes !!!
@aws-cdk-testing/framework-integ: Stack: Api1Stack - Resource: ChildHostedZone4B14AC71 - Impact: WILL_REPLACE
@aws-cdk-testing/framework-integ: Stack: Api1Stack - Resource: mockApiCustomDomainF001DAB6 - Impact: WILL_REPLACE
@aws-cdk-testing/framework-integ: Stack: Api1Stack - Resource: mockApiCustomDomainMapApi1StackmockApi2124752F62A59016 - Impact: WILL_REPLACE
@aws-cdk-testing/framework-integ: !!! If these destructive changes are necessary, please indicate this on the PR !!!


I rewrote the test so that there are no WILL_REPLACE errors, but the build still fails saying that 18 out of 937 integ tests failed and this change only adds two of those integ tests.. I'm not sure if it's just expected that the build fails when new integ tests are added or not, I've been digging for past commits to see if that's true or not but haven't found much luck.


Yeah, not super clear if there's anything I'm supposed to do to fix this build issue.
Screenshot 2024-09-03 at 1 56 00 PM

I'll publish the PR and push any fixes for snapshot weirdness if requested.


So... I definitely created a hack that will make it so the snapshots succeed:

  1. Run the integ tests with the correct ENV variables
  2. Create Recursive find/replace script which contains the values of my real ENV values, along with the default values used within the integ test:
    #!/bin/bash
    # generalize.sh
    if [[ "$#" -ne 1 ]]; then
        echo "$0: replaceDir"
        exit 1
    fi
    
    find "${1}" -type f -print0 | xargs -0 sed -i '' -e 's/MyRealAwsAccountId1/12345678/g'
    find "${1}" -type f -print0 | xargs -0 sed -i '' -e 's/MyRealAwsAccountId2/12345677/g'
    find "${1}" -type f -print0 | xargs -0 sed -i '' -e 's/my-real-domain.com/example.com/g'
    find "${1}" -type f -print0 | xargs -0 sed -i '' -e 's/my-real-hosted-zone-id/notarealzoneid/g'
  3. Find/Replace snapshot files:
    ./generalize.sh integ.cross-account-record-set-simple-example.lit.js.snapshot
    ./generalize.sh integ.cross-account-record-set.js.snapshot
  4. Commit the snapshot files so the build server will detect that no changes are required.

This feels really clunky, and I'm not sure if this is what everyone else has been doing, but it should work...


Well, this definitely worked! except 16 integ tests unrelated to this change are still failing:

@aws-cdk-testing/framework-integ: Snapshot Results: 
@aws-cdk-testing/framework-integ: Tests:    16 failed, 937 total

This is definitely confusing as to why the build fails on these tests...


Maybe there is something to these 16 failing tests.
Locally, it tells me the S3 keys are different:

yarn integ aws-ec2/test/integ.launch-template.js
yarn run v1.22.19
$ integ-runner --language javascript aws-ec2/test/integ.launch-template.js

Verifying integration test snapshots...

  CHANGED    aws-ec2/test/integ.launch-template 0.821s
      Resources
[~] AWS::Lambda::Function CustomVpcRestrictDefaultSGCustomResourceProviderHandlerDC833E5E
 └─ [~] Code
     └─ [~] .S3Key:
         ├─ [-] bde7b5c89cb43285f884c94f0b9e17cdb0f5eb5345005114dd60342e0b8a85a1.zip
         └─ [+] 4fabb91af38bfa4b232ef9e8db9a055aa28ba261170742fddb4a53ebe05f6393.zip



Snapshot Results:

Tests:    1 failed, 1 total

My changes seemingly have no reason to cause these changes, but I assume there is a valid reason somewhere.


huh.. I suspect the cause is most likely the single line change in packages/@aws-cdk/custom-resource-handlers/lib/core/nodejs-entrypoint-handler/index.ts that made submitResponse public. This change likely changed the hash of all the handlers that relied on this... Fantastic. I'll figure out how I want to handle this tomorrow.

I can either commit the new hash of all the 16 integ tests, or rewrite the functionality. It probably makes sense to just commit the hashes.


Actually I was too annoyed and through the almighty power of ChatGPT I cooked up an absolutely cursed script that updated the S3 key in all the complaining tests:

#!/bin/bash
# fix-s3-keys.sh
failingTests=(
    aws-cloudfront/test/integ.cloudfront-cross-region-cert.js
    aws-cloudfront/test/integ.distribution-lambda-cross-region.js
    aws-ec2/test/integ.launch-template.js
    aws-ec2/test/integ.vpc-flow-logs-kinesis.js
    aws-ec2/test/integ.vpc-restrict-default-sg.js
    aws-elasticloadbalancingv2/test/integ.alb.oidc.js
    aws-iam/test/integ.oidc-provider.js
    aws-iam/test/integ.condition-with-ref.js
    aws-route53/test/integ.cross-account-zone-delegation.js
    aws-route53/test/integ.delete-existing-record-set.js
    aws-route53/test/integ.rename-cross-account-zone-delegation.js
    aws-route53-patterns/test/integ.hosted-redirect.js
    aws-s3/test/integ.bucket-auto-delete-objects.js
    core/test/integ.cross-region-references.js
    custom-resources/test/provider-framework/integ.provider-with-waiter-state-machine.js
    triggers/test/integ.triggers.js
)

for failingTest in "${failingTests[@]}"; do
    output=$(yarn integ "${failingTest}" 2>&1 >/dev/null)

    # Extract the old and new keys
    oldKey=$(echo "$output" | grep -oE '\[-\] [a-f0-9]{64}' | awk '{print $2}')
    newKey=$(echo "$output" | grep -oE '\[\+\] [a-f0-9]{64}' | awk '{print $2}')

    echo "Replacing $oldKey with $newKey for $failingTest"
    read -p "Are you sure you want to continue? <y/N> " prompt
    if [[ $prompt == "y" || $prompt == "Y" || $prompt == "yes" || $prompt == "Yes" ]]
    then
        find "${failingTest}.snapshot" -type f -print0 | xargs -0 sed -i '' -e "s/${oldKey}/${newKey}/g"
        continue
    fi

    # Handle multiple keys
    oldKeys=($(echo "$output" | grep -oE '\[-\] [a-f0-9]{64}' | awk '{print $2}'))
    newKeys=($(echo "$output" | grep -oE '\[\+\] [a-f0-9]{64}' | awk '{print $2}'))
    alternateOldKey="${oldKeys[0]}"
    alternateNewKey="${newKeys[0]}"

    echo "Replacing $alternateOldKey with $alternateNewKey for $failingTest"
    read -p "Are you sure you want to continue? <y/N> " prompt
    if [[ $prompt == "y" || $prompt == "Y" || $prompt == "yes" || $prompt == "Yes" ]]
    then
        find "${failingTest}.snapshot" -type f -print0 | xargs -0 sed -i '' -e "s/${alternateOldKey}/${alternateNewKey}/g"
        continue
    else
        echo "${failingTest} failed to parse yarn integ"
        exit 1
    fi
done

Hopefully this is the last of the build issues..


Well the build still tries changing the S3 keys on the build job even though local tests now report success... The fix is probably something silly, will try and look more tomorrow.


It looks like modifying @aws-cdk/custom-resource-handlers/lib/custom-resources-framework/config.ts forces you to rerun the integration tests for every test within that file such as in #31166

@lukepafford lukepafford force-pushed the cross-account-record-set branch 8 times, most recently from 33e2c1b to 90739a8 Compare September 3, 2024 18:46
@lukepafford lukepafford marked this pull request as ready for review September 3, 2024 18:57
@lukepafford lukepafford force-pushed the cross-account-record-set branch 4 times, most recently from 384b1ac to 7072fdd Compare September 5, 2024 02:23
@lukepafford lukepafford marked this pull request as draft September 5, 2024 04:58
@lukepafford lukepafford force-pushed the cross-account-record-set branch 3 times, most recently from 6a75a29 to 2f6ac12 Compare September 9, 2024 01:54
@lukepafford
Copy link
Author

lukepafford commented Sep 9, 2024

A critical piece of information has to be missing from INTEGRATION_TESTS.md; I have now spent twice as long trying to successfully update snapshots as I did as making the change.

Catch 22 Integ Tests

Integ tests can run with 3 different environment variables:

  1. The default values set by the test runner itself.
  2. Environment variables you set when you run the integ tests.
  3. Default values set by the integ test when no environment variables are found.

When the CodeBuild project checks all the snapshots, it won't run with your environment variable values. It will either use the default values set in the integ runner, or the default values set in the integ test.

These are dummy values that you can't use when you rerun the snapshot tests. If you did try and use them, the tests would surely fail. So you fundamentally have to regenerate the snapshots with valid environment variables that the CodeBuild project will never have access to. Your snapshot, and the CodeBuild snapshot will never align.

So how did past snapshots ever get generated? If you inspect a lot of the existing tests, you can see that the snapshot uses the AWS account 12345678, indicating someone committed a snapshot using a fake account ID. How did they do this? Was there some script that retroactively found/replaced real environment variables with the dummy values? I've now ran git blame on so many past integration tests at this point just to get the tiniest bit of insight into how these past snapshots were generated, but no wisdom seems to have been left behind.

It's almost as if people are running the integ tests with real environment values, but the snapshot is saved with the fake values. I have not been able to figure out the incantation to make that spell work.

What's going wrong?

I'm torn between two conclusions which dramatically impacts my mental decision tree going forward. One of the two scenarios is true:

  1. I'm doing something stupid, and creating snapshots works fine
    • Past committers literally just run yarn integ --update-on-failed integ/test.js. Their snapshots are somehow correctly generated with fake environment values.
    • When I do the same, my real env values are saved into the snapshot.
  2. Past committers do something unique in their environment to get their snapshots to align with what the CodeBuild project also expects.

Common sense tells me scenario 2 can't be true, because that would mean there has been lots of past commits where people have been successfully following a magical undocumented process. However I'm quickly running out of other conclusions to reach, since I have tried everything under the sun. I would love to be missing something, but I've thought about this from every possible angle and I'm starting to lose hope.

@lukepafford
Copy link
Author

I thought I almost found the secret sauce. Perhaps if the explicit env declaration was removed from the stack, and the profile was passed to the CDK command, the generated snapshot wouldn't contain real env values:

@@ -158,9 +156,12 @@ const crossAccountRoleStack = new CrossAccountRoleStack(app, 'CrossAccountRoleSt
   authorizedAccounts: [appAccountId],
   changeResourceRecordSetsRecordTypes: ['A'],
   changeResourceRecordSetsNormalizedRecordNames: [`test.${hostedZoneName}`],
-  env: {
-    account: apexAccountId,
-    region: 'us-east-1',
+  cdkCommandOptions: {
+    deploy: {
+      args: {
+        profile: 'cdk_integ_account',
+      },
+    },
   },
 });

@@ -170,14 +171,17 @@ const appStack = new AppStack(app, 'AppStack', {
   apexAccountId: crossAccountRoleStack.account,
   crossAccountRoleName: crossAccountRoleName,
   recordName: 'test',
-  env: {
-    account: appAccountId,
-    region: 'us-west-1',
+  cdkCommandOptions: {
+    deploy: {
+      args: {
+        profile: 'cdk_integ_cross_account',
+      },
+    },
   },
 });

This unfortunately did not work, and the generated snapshot still contained real values.

@lukepafford lukepafford force-pushed the cross-account-record-set branch 5 times, most recently from 2ababcf to 1ad78d5 Compare September 15, 2024 21:04
@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 1ad78d5
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@aws-cdk-automation
Copy link
Collaborator

This PR has been in the BUILD FAILING state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@aws-cdk-automation
Copy link
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. label Oct 15, 2024
Copy link

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. p2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Route53] cdk-route53 Cross account records.
2 participants