Ensure Capture Nodes Autoscale #31
Comments
After thinking about this task a bit, it seems to present a need for a better way to generate test traffic than our existing demo generators allow. Specifically, the current demo generators hit third-party websites (Alexa top 100) that we don't own. The amount of traffic we're currently driving against them is negligible, but in order to stress-test our capture setups we'll want to drive substantial volumes of traffic through our mirroring mechanism. Therefore, the responsible (and practical) thing to do seems to be to create our own traffic sink(s) to receive our test traffic, and to update our traffic generation mechanism to drive more traffic per host. Basically, I'd propose that we create a new pair of top-level CLI commands to cover this.
It looks like there are a few tools we can use as a traffic sink. HTTPBin even has an official Docker image we can simply reuse.
Sample CDK Snippet to generate the sink VPC:
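A rough sketch of what such a sink could look like, assuming TypeScript with aws-cdk-lib v2 and the public `kennethreitz/httpbin` image running behind an internal ALB (construct names and sizing are illustrative, not settled choices):

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

export class TrafficSinkStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Dedicated VPC to receive the generated test traffic
    const sinkVpc = new ec2.Vpc(this, 'SinkVpc', { maxAzs: 2 });

    // httpbin on Fargate behind an internal ALB acts as the HTTP traffic sink
    const sinkCluster = new ecs.Cluster(this, 'SinkCluster', { vpc: sinkVpc });
    new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'HttpBinSink', {
      cluster: sinkCluster,
      desiredCount: 2,
      publicLoadBalancer: false,
      taskImageOptions: {
        image: ecs.ContainerImage.fromRegistry('kennethreitz/httpbin'),
        containerPort: 80,
      },
    });
  }
}
```

The load balancer's DNS name would then be what the updated traffic generators point their requests at.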
In general I'm concerned about capture auto scaling because we need the same traffic flows to go to the same capture instances. It looks like the GWLB handles scaling up by using sticky flows; it's the scaling down that I want to make sure we test well. It does look like the GWLB supports a 350-second draining state, during which it continues to send existing flows to a draining target but no new ones. We should probably add something about this to the acceptance criteria: that we deregister targets on scale-down, wait the 350s before terminating, etc. This is all to say that, initially, it is more important to get the create-cluster --expected-gbps feature (#34) implemented first and use that for initial scaling. It also means that we should test not just by the size of flows but by the number of flows as well. We should also decide what unit of capture we want each capture instance to handle.
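As a sketch of the scale-down side (assuming TypeScript with aws-cdk-lib v2; `asg` stands in for the Capture Nodes' Auto Scaling Group, and the names and timeout are illustrative), an ASG lifecycle hook could hold terminating instances long enough for something like a Lambda subscribed to the hook to deregister them from the GWLB target group and wait out the 350s draining window before completing the hook:

```typescript
import { Duration } from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

// Pause instance termination on scale-in so existing flows can drain from the GWLB target.
asg.addLifecycleHook('DrainBeforeTerminate', {
  lifecycleTransition: autoscaling.LifecycleTransition.INSTANCE_TERMINATING,
  // Give the handler a bit more than the 350s GWLB draining window.
  heartbeatTimeout: Duration.seconds(400),
  // Proceed with termination once the hook is completed or the timeout lapses.
  defaultResult: autoscaling.DefaultResult.CONTINUE,
});
```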
I spent some time testing the scaling-up of our ECS-on-EC2 Capture Nodes as part of #147. It's currently not working how I would expect, and I'm unsure how to get it working without a further deep dive. Based on the ECS docs/blogs [1] [2], what should happen is roughly the following:
1. The ECS Service needs to place a new Task (for example, because its desired count increased).
2. If an existing container instance has enough remaining CPU/memory, the Task is placed on it.
3. If no container instance has room, the Task is created and held in the PROVISIONING state.
4. The Capacity Provider's CapacityProviderReservation metric rises above its target, the linked ASG scales out, and once the new instance registers with the cluster the PROVISIONING Task is placed on it.
With our current CDK code, everything appears to be set up and linked correctly, but when the ECS Service attempts to spin up a new Task and finds there isn't room (Step 3), no Tasks are created in the PROVISIONING state, so the metric the linked ASG is looking at in order to scale (Step 4) never increases. Instead, we just get the standard (and expected) "unable to place a task because no container instance met all of its requirements" error message in the Service event history, which is supposed to precede Tasks being held in the PROVISIONING state.

[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-auto-scaling.html
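For reference, the wiring that ECS cluster auto scaling expects looks roughly like this in CDK (a sketch assuming TypeScript with aws-cdk-lib v2; `asg`, `cluster`, and `taskDef` are assumed to already exist, and the construct names are illustrative rather than what our CDK code actually uses):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Capacity provider with managed scaling: ECS publishes CapacityProviderReservation
// and drives a target tracking scaling policy on the backing ASG.
const capacityProvider = new ecs.AsgCapacityProvider(this, 'CaptureCapacityProvider', {
  autoScalingGroup: asg,
  enableManagedScaling: true,
  targetCapacityPercent: 100,
});
cluster.addAsgCapacityProvider(capacityProvider);

// Per the docs, Tasks are only held in PROVISIONING when the service uses a
// capacity provider strategy rather than a launch type.
new ecs.Ec2Service(this, 'CaptureService', {
  cluster,
  taskDefinition: taskDef,
  capacityProviderStrategies: [
    { capacityProvider: capacityProvider.capacityProviderName, weight: 1 },
  ],
});
```

With that strategy in place, an unplaceable Task should sit in PROVISIONING and push CapacityProviderReservation above the target, which is the signal the managed scaling policy reacts to.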
Description
The Capture Nodes use ECS-on-EC2 for their compute. However, it's unclear whether the current CDK configuration will actually enable scaling of the containers as expected when their CPU/Memory usage increases. This task is to ensure the ECS capture containers do scale up to the limit provided by their backing EC2 ASG.
Acceptance Criteria