Prefer routing traffic within an AWS availability zone to save $$$ #686

Open
iamdanfox opened this issue Apr 28, 2020 · 7 comments

@iamdanfox
Contributor

Most of our services have nodes in several availability zones. Since nothing constrains traffic in any AWS-specific way, every time a node of email-service wants to talk to a node of MP, it might pick a node in any availability zone. This means we're probably paying $$$ for cross-AZ traffic when we don't need to.

[Image: pricing diagram from this blog post]

It seems like if we can slightly bias connections towards staying in their availability zone (e.g. eu-west-1a <-> eu-west-1a), then we'd be able to cut down on our spend a bit.

Proposal

When a server is running on AWS, there's a magic link-local IP address (the instance metadata endpoint) it can call to find out which availability zone it's currently in, e.g.

$ curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone
eu-west-1a

(AtlasDB currently uses this.)
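As a minimal sketch of that lookup (not how AtlasDB or this library does it), the same call can be made from code with a short timeout, returning `None` whenever the endpoint is unreachable so that nothing breaks outside AWS. The function name, the injectable `fetch` parameter, and the 0.2 s timeout are assumptions for illustration. Instances configured to require session tokens (IMDSv2) would need an extra token request, which this sketch omits; in that case it simply returns `None`.

```python
import urllib.request
from typing import Callable, Optional

# Same URL as the curl example above.
METADATA_URL = "http://169.254.169.254/latest/meta-data/placement/availability-zone"


def _http_get(url: str, timeout: float) -> str:
    """Plain HTTP GET using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def detect_availability_zone(
    fetch: Callable[[str, float], str] = _http_get,  # hypothetical hook, eases testing
    timeout: float = 0.2,  # assumed value; keep it short so startup is not delayed
) -> Optional[str]:
    """Return the availability zone name, or None if it cannot be determined."""
    try:
        zone = fetch(METADATA_URL, timeout).strip()
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all OSError subclasses:
        # running locally, in Docker, or on another cloud ends up here.
        return None
    return zone or None
```

Callers would treat `None` as "zone unknown" and fall back to the existing, unbiased behaviour.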

Then, servers could advertise this information somehow: using a header, a dedicated metadata endpoint, or perhaps even plumbed through YAML. We might even be able to DNS-resolve the hosts we're given and match them against Amazon's published IP ranges: https://ip-ranges.amazonaws.com/ip-ranges.json.

With this information, I'd suggest that we add a tiny constant bias to the Balanced Channel's scores: rather than starting every host off at 0, hosts that are in other availability zones would get a minimum score of 1. This would mean that under zero utilization, the first request would always go intra-AZ.
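A rough sketch of that bias, assuming a host's score is simply its number of in-flight requests and that the lowest score wins (the real Balanced Channel scoring may differ). `Host`, `score`, `pick`, and `CROSS_AZ_BIAS` are hypothetical names. Treating unknown zones as "no bias" is also an assumption, chosen so that behaviour off AWS stays unchanged.

```python
from dataclasses import dataclass
from typing import List, Optional

CROSS_AZ_BIAS = 1  # the "minimum score of 1" for hosts in other zones


@dataclass
class Host:
    name: str
    zone: Optional[str]  # None when the host's zone is unknown
    in_flight: int = 0   # assumed stand-in for the channel's real score


def score(host: Host, local_zone: Optional[str]) -> int:
    """Lower is better. Only hosts known to be in a different zone are penalised."""
    crosses_zone = (
        local_zone is not None
        and host.zone is not None
        and host.zone != local_zone
    )
    return host.in_flight + (CROSS_AZ_BIAS if crosses_zone else 0)


def pick(hosts: List[Host], local_zone: Optional[str]) -> Host:
    """Choose the lowest-scoring host; ties go to the earliest host in the list."""
    return min(hosts, key=lambda h: score(h, local_zone))
```

Because the bias is a small constant, it only decides between otherwise idle or equally loaded hosts: once the same-zone host has two requests in flight and a remote host has none, the remote host wins again.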

Possible downsides?

Obviously this would need to fail gracefully when running locally, in Docker, or on Azure.

@carterkozak
Contributor

How do client-perceived latencies differ between nodes in different AZs? I'd rather use that data to rank targets than target specific cloud vendors in an RPC library.
Another option is for deployment infrastructure to provide a quality factor based on availability zones along with URIs, centralizing that discovery.

@iamdanfox
Contributor Author

So the idea here is more about $ savings than latencies tbh

@carterkozak
Contributor

Right, we can solve the problem without vendor-specific implementation.

@j-baker
j-baker commented Apr 28, 2020

Latencies are the same, give or take 0.1 to 0.2 ms.

@j-baker
j-baker commented Apr 28, 2020

Basically, this isn't a perf thing, it's a spend thing. And just to be clear, it's not $0.01 per GB as the doc implies: Amazon are sneaky and charge you both on the way out and on the way in, so it comes to $0.02 per GB.
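To put a rough number on that, here is a back-of-the-envelope sketch. It assumes $0.01 per GB charged in each direction (so $0.02 per GB crossing a zone boundary) and that callers pick nodes uniformly at random across equally sized zones; the function name and the traffic figure in the usage note are made up for illustration.

```python
PER_GB_EACH_DIRECTION_USD = 0.01  # charged once on the way out, once on the way in


def cross_az_cost_usd(total_gb: float, zones: int) -> float:
    """Estimated cross-AZ charge when targets are picked uniformly across zones."""
    # With uniform selection, (zones - 1) / zones of requests leave the caller's zone.
    cross_fraction = (zones - 1) / zones
    return total_gb * cross_fraction * 2 * PER_GB_EACH_DIRECTION_USD
```

For example, `cross_az_cost_usd(3000, 3)` estimates the charge for 3,000 GB of service-to-service traffic spread over three zones, two thirds of which would cross a zone boundary under these assumptions.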

@j-baker
j-baker commented Apr 28, 2020

And with latencies, especially when transitive calls are involved, you also start taking into account the downstream services' good or bad decisions, because with latency you can't help but care about all the hops, whereas you really want to care about only the one you'd like to make. But nice try :)

@carterkozak
Contributor

Again, my point is that this is the wrong place to approach that type of problem.
