
Ensure Capture Nodes Autoscale #31

Open
chelma opened this issue Apr 25, 2023 · 4 comments
Labels
Capture Resilience: Work to make traffic capture more resilient to changes in load, configuration, and sources
Milestone
Arkimeet

Comments

@chelma
Collaborator

chelma commented Apr 25, 2023

Description

The Capture Nodes use ECS-on-EC2 for their compute. However, it's unclear whether the current CDK configuration will actually enable scaling of the containers as expected when their CPU/Memory usage increases. This task is to ensure the ECS capture containers do scale up to the limit provided by their backing EC2 ASG.

Acceptance Criteria

  • Demonstrate the ability of the Capture Nodes to automatically scale up/down within the boundary of the backing EC2 ASG
  • Update code/configuration as necessary to achieve that demo
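For context, the knob under test is the ECS Service's own scaling policy, which can only add Tasks as long as the backing ASG has room for them. A minimal sketch of the sort of configuration we'd want to verify (construct names are illustrative, and captureService stands in for the existing capture ECS Service; this is not code lifted from the repo):

import * as ecs from 'aws-cdk-lib/aws-ecs';

// captureService (an ecs.Ec2Service) is assumed to already exist in the capture stack.
// Let the Service add/remove Tasks based on observed CPU and memory, bounded by
// maxCapacity; the backing EC2 ASG's size is the hard ceiling on how many Tasks
// can actually be placed.
const scaling = captureService.autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 });
scaling.scaleOnCpuUtilization('CaptureCpuScaling', {
  targetUtilizationPercent: 60,
});
scaling.scaleOnMemoryUtilization('CaptureMemScaling', {
  targetUtilizationPercent: 60,
});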
@chelma chelma added the Capture Resilience label Apr 25, 2023
@chelma chelma changed the title from Fix Capture Node Autoscaling to Ensure Capture Nodes Autoscale Apr 25, 2023
@chelma chelma added this to the Arkimeet milestone Apr 27, 2023
@chelma
Collaborator Author

chelma commented May 11, 2023

After thinking about this task a bit, it seems to present a need for a better way to generate test traffic than our existing demo generators allow. Specifically, the current demo generators hit third-party websites (Alexa top 100) that we don't own. The amount of traffic we're currently driving against them is negligible, but in order to stress-test our capture setups we'll want to drive substantial volumes of traffic through our mirroring mechanism. Therefore, the responsible (and practical) thing to do seems to be to create our own traffic sink(s) to receive our test traffic and to update our traffic generation mechanism to drive more traffic per host.

Basically, I'd propose that we create a new pair of top-level CLI commands: create-stress-test-setup and destroy-stress-test-setup.

  • create-stress-test-setup
    • Create one or more VPCs containing ECS Fargate tasks that execute a Docker container that will generate large volumes of traffic to a specified location (rough sketch after this list)
    • Create a VPC containing ECS Fargate tasks that receive/respond to the traffic generated
    • Communication between source and sink will occur over the public internet to trigger our mirroring filter
  • destroy-stress-test-setup
    • Tear down the CDK stacks for the test bed
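To make the generator half more concrete, here is a rough CDK sketch of what one traffic-generation stack could look like. The generator image, its Dockerfile name, and the TARGET_URL environment variable are placeholders for whatever load-generation tool we end up picking:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

export class TrafficGenStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // VPC for the traffic-generation tasks; traffic egresses to the sink over
    // the public internet so that it crosses the mirroring path
    const vpc = new ec2.Vpc(this, 'VPC', { maxAzs: 2 });

    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // Fargate task running a load-generation container; the image and the
    // TARGET_URL value are placeholders
    const taskDef = new ecs.FargateTaskDefinition(this, 'GenTaskDef', {
      cpu: 1024,
      memoryLimitMiB: 2048,
    });
    taskDef.addContainer('Generator', {
      image: ecs.ContainerImage.fromAsset(__dirname, { file: 'Dockerfile.generator' }),
      environment: { TARGET_URL: 'http://<sink-alb-dns-name>' },
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'traffic-gen' }),
    });

    // Run N copies of the generator; raise desiredCount to drive more load
    new ecs.FargateService(this, 'GenService', {
      cluster,
      taskDefinition: taskDef,
      desiredCount: 4,
      assignPublicIp: true,
    });
  }
}

Scaling the desiredCount (or the per-task load settings) would be how we dial the aggregate traffic volume up and down.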

@chelma
Collaborator Author

chelma commented May 11, 2023

It looks like there are a few tools we can use as a traffic sink. HTTPBin even has an official Docker image we can simply reuse.

FROM kennethreitz/httpbin:latest

# Expose the port the app runs on
EXPOSE 80

Sample CDK Snippet to generate the sink VPC:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

export class TrafficSinkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create a VPC
    const vpc = new ec2.Vpc(this, 'VPC', {
      maxAzs: 2,
    });

    // Create an ECS cluster
    const cluster = new ecs.Cluster(this, 'Cluster', {
      vpc: vpc,
    });

    // Create a Fargate service
    const fargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'FargateService',
      {
        cluster: cluster,
        taskImageOptions: {
          image: ecs.ContainerImage.fromAsset(__dirname, {
            file: 'Dockerfile',
          }),
          containerPort: 80,
        },
        publicLoadBalancer: true,
      }
    );

    // Output the DNS name of the ALB
    new cdk.CfnOutput(this, 'LoadBalancerDNS', {
      value: fargateService.loadBalancer.loadBalancerDnsName,
    });
  }
}
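For completeness, the stack above would be instantiated from a small CDK app entrypoint, something like the following (the import path is assumed):

import * as cdk from 'aws-cdk-lib';
import { TrafficSinkStack } from './traffic-sink-stack';

const app = new cdk.App();
new TrafficSinkStack(app, 'TrafficSinkStack');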

@awick
Contributor

awick commented May 11, 2023

In general I'm concerned about capture auto scaling because we need the same traffic flows to go to the same capture instances. It looks like the GWLB handles scaling up by using sticky flows; it's the scaling down that I want to make sure we test well. The GWLB does appear to support a 350 second draining state where it will continue to send old flows to a draining target, but not new flows. We should probably add something about this to the acceptance criteria: that we deregister targets on scale down, wait the 350s before terminating, etc.
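A rough sketch of what tuning that drain window could look like in CDK; the target group here is a stand-in for whatever GWLB target group the capture stack actually creates, not our real code:

import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Stand-in GWLB target group (vpc assumed to exist in the surrounding stack).
// The deregistration delay controls how long a draining target keeps receiving
// its existing flows before it can safely be terminated on scale-in.
new elbv2.CfnTargetGroup(this, 'CaptureTargetGroup', {
  protocol: 'GENEVE',
  port: 6081,
  vpcId: vpc.vpcId,
  targetType: 'instance',
  targetGroupAttributes: [
    { key: 'deregistration_delay.timeout_seconds', value: '350' },
  ],
});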

This is all to say that it is more important that we get the create-cluster --expected-gbps (#34) feature implemented first and use that for initial scaling.

It also means that our testing should vary not just the size of flows but also the number of flows.

We should also decide what unit of capture we want each capture instance to handle.

@chelma
Collaborator Author

chelma commented Jan 17, 2024

I spent some time testing the scaling-up of our ECS-on-EC2 Capture Nodes as part of #147. It's currently not working as I would expect, and I'm unsure how to get it working without a further deep dive.

Based on the ECS docs/Blogs [1] [2], what should happen is the following:

  1. Demand increases on the ECS containers (mem, cpu, etc), exceeding the set limit according to the ECS Service's scaling policy
  2. A CloudWatch Alarm created by the ECS Service's scaling policy should fire, kicking off a scaling action that tells the Service to place another Task on the cluster's EC2 instances.
  3. If there is not enough space in the ASG to fit another ECS Task, it continues with creating the task but places it in the PROVISIONING state.
  4. ECS Tasks in the PROVISIONING state increase the CloudWatch Metric CapacityProviderReservation in the namespace AWS/ECS/ManagedScaling, with a separate metric for each Capacity Provider. When the metric goes over 100%, that tells the associated ASG to provision new instances up to its own scaling limits.
  5. The ASG spins up new instances according to the combined scaling policy
  6. ECS attempts to place the PROVISIONING Tasks onto the new Instances, continuing the process until the scaling limits are reached or all Tasks are placed.

With our current CDK code, everything appears to be set up and linked correctly, but when the ECS Service attempts to spin up a new Task and finds there isn't room (step 3), no Tasks are created in the PROVISIONING state, so the metric the linked ASG is looking at in order to scale (step 4) never increases. Instead, we just get the standard (and expected) "unable to place a task because no container instance met all of its requirements" error message in the Service event history, which is supposed to precede Tasks sitting in the PROVISIONING state.
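For reference, the capacity-provider half of the wiring described in steps 4-6 looks roughly like the following in CDK; the construct names are illustrative, and vpc, cluster, and the capture Service are assumed to already exist in the stack:

import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Backing EC2 ASG for the capture containers (vpc assumed to exist)
const captureAsg = new autoscaling.AutoScalingGroup(this, 'CaptureAsg', {
  vpc,
  instanceType: new ec2.InstanceType('m5.xlarge'),
  machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
  minCapacity: 1,
  maxCapacity: 10,
});

// Managed scaling is what publishes CapacityProviderReservation (step 4) and
// attaches the target-tracking policy that grows the ASG (step 5)
const capacityProvider = new ecs.AsgCapacityProvider(this, 'CaptureCapacityProvider', {
  autoScalingGroup: captureAsg,
  enableManagedScaling: true,
  targetCapacityPercent: 100,
});
cluster.addAsgCapacityProvider(capacityProvider);

// One thing the ECS docs call out: Tasks only sit in PROVISIONING (step 3) when the
// Service launches them through a capacity provider strategy; Tasks launched with the
// plain EC2 launch type fail placement immediately instead, which looks a lot like the
// behavior described above. On the Ec2Service that would be something like:
//   capacityProviderStrategies: [{ capacityProvider: capacityProvider.capacityProviderName, weight: 1 }]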

[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-auto-scaling.html
[2] https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
