Background |
---|
Speakers
- Matt Klein, Lyft
Envoy is an extensible proxy with many other products built upon it. Envoy was created with a very rapid velocity, but needs to transition into a more sustainable, long-term approach for future growth.
Envoy recognizes that its API is the most important part, and that the proxy is an implementation detail. There has been recent investment in improving security with help from Google; however, improving security affects velocity.
Envoy wants help from the open-source community with maintaining the project, code reviews, documentation, etc.
Background | Slides |
---|
Speakers
- Alex Sundström, Spotify
- Erik Lindblad, Spotify
- Kateryna Nezdolii, Spotify
They migrated to Envoy in their "perimeter" between the internet and their internal services.
They learned a lot during the migration effort, in particular how useful it was to have a "big red button" that anyone could hit to roll back changes.
Background | Slides | Envoy Mobile | GitHub |
---|
Speakers
- Jose Nino, Lyft
- Michael Schore, Lyft
Why Envoy on mobile? Lyft is looking to extend all of the benefits that Envoy provides for backend services to their mobile platforms and applications. The idea is to treat mobile devices like any other node in a network topology.
To accomplish this across multiple mobile platforms, they set out to turn Envoy into a library rather than just a single process. This would allow for standardization across all platforms, common tooling for common problems, and reduced cognitive load.
Designing a mobile approach meant supporting many different languages: Envoy is written in C++, Android uses Kotlin, iOS uses Swift, and C bindings serve as a common bridge across them all. The multithreading design required a significant amount of thinking, not just about how to convert the existing Envoy multithreaded process into a thread that can be run from another program, but also about how to allow native handling of threads and callbacks across platforms.
Another big advantage is gaining the observability that Envoy provides for mobile devices, something that was missing from their full picture of metrics gathering.
Going forward, they want to further refine Envoy Mobile as a drop-in replacement for other platform-specific implementations. They also want to explore protocol experimentation and intelligent network behavior: health checks and load balancing, as well as choosing between interfaces (wifi vs. cell networks), protocols, and IPv4 vs. IPv6 for best performance. They also want to implement dynamic configuration.
Background | Slides | Contour | GitHub |
---|
Speakers
- Steve Sloka, VMware
Contour is an ingress controller for Kubernetes. It runs all traffic through Envoy. VMware found that Contour was using far more memory than expected and learned some interesting characteristics during their debugging.
Changes to Secrets (e.g. Certificates) caused updates to LDS. A high rate of changes left many old configurations that needed to be drained from listeners. Listeners had a default drain timeout of 600s and held onto memory during the drain.
Background | Cilium | GitHub |
---|
Speakers
- Thomas Graf, Cilium / Isovalent
Cilium uses Envoy via the sidecar pattern in Kubernetes.
The talk covered a shift to one Envoy per Node, instead of one per Pod, to alleviate scaling issues.
The concept of "namespaces" for Envoy is to allow fair sharing of the Node Envoy across the client pods.
Development is happening in SIG-Envoy.
Background | Slides |
---|
Speakers
- Shubha Rao, AWS
Presentation of AWS architecture for how they manage large volumes of Envoy instances.
Background | Slides |
---|
Speakers
- Dylan Carney, Stripe
Stripe is a technology company that builds economic infrastructure for the internet.
TLS negotiation is expensive, especially when client and server are physically far from each other over the internet. To solve this, they employ Envoy near clients around the world and their servers in the US for mTLS and HTTP/2. HTTP/2 in particular is useful for multiplexing requests over a single TCP connection.
They use blue/green deployments, where traditionally 100% of traffic goes to one deployment, and components can be upgraded on the other before cutting clients over. With Envoy, they can route a certain percentage of traffic across both environments to prevent issues from client-side DNS caching. They can also route requests away from bad hardware/software deployments while they remediate.
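A percentage-based split between two environments could be expressed with Envoy's `weighted_clusters` route action. This is a minimal sketch, not Stripe's configuration: the cluster names `blue` and `green`, the virtual host name, and the 90/10 weights are assumptions chosen for illustration.

```yaml
# Hypothetical route configuration; names and weights are illustrative only.
route_config:
  name: local_route
  virtual_hosts:
  - name: api
    domains: ["*"]
    routes:
    - match:
        prefix: "/"
      route:
        weighted_clusters:
          clusters:
          - name: blue    # assumed name for the currently live deployment
            weight: 90
          - name: green   # assumed name for the deployment being rolled out
            weight: 10
```

Shifting traffic then becomes a matter of adjusting the two weights, which by default must sum to 100.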
From Microbenchmarks to HTTP2 Load-testing: 5 Performance Tools and Techniques to Improve Envoy Scalability
Background | Slides | Google Benchmark GitHub | Nighthawk GitHub |
---|
Speakers
- Joshua Marantz, Google
- Otto van der Schaaf, We-Amp B.V.
Information-packed presentation with a lot more details in the slides.
They added C++ macros to enable performance monitoring and found a lot of time was being spent in regular expression matching misses.
They also designed an HTTP load generator called Nighthawk to evaluate Envoy performance.
Additionally, they used fuzz testing techniques to find performance and security issues with Envoy; however, the fuzz testing is slow to run.
Background | Slides |
---|
Speakers
- Lita Cho, Lyft
- Tom Wanielista, Lyft
Prior to Kubernetes adoption within Lyft, they implemented a means of service discovery on their existing infrastructure. To help with their internal migration to Kubernetes, they designed a control-plane mechanism that exists outside of Kubernetes to allow legacy infrastructure and Kubernetes deployments to continue to discover each other. This is accomplished by the control-plane watching for events from apiserver for the components moved into Kubernetes and the components reaching out to the control-plane to discover the services.
They ran into issues with Kubernetes Pod scale up and down events, as well as the order of operations in which sidecars run, so they currently run a patched version of Kubernetes with some fixes they need.
Background | Slides |
---|
Speakers
- Ben Plotnick, Cruise
By using request headers, the request can be routed to a particular instance of a service. This could be useful for testing a new version of a service in a real-world "production" environment.
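Header-based routing of this kind could be sketched with a header matcher on an Envoy route. This is not Cruise's configuration: the header name `x-route-to`, its value, and both cluster names are hypothetical, and the `exact_match` field follows the v2-era API used elsewhere in these notes (newer API versions may spell it differently).

```yaml
routes:
# Requests carrying the (hypothetical) header go to the instance under test.
- match:
    prefix: "/"
    headers:
    - name: x-route-to        # assumed header name
      exact_match: "canary"   # assumed header value
  route:
    cluster: service_canary   # assumed cluster name
# Everything else goes to the regular deployment.
- match:
    prefix: "/"
  route:
    cluster: service_stable   # assumed cluster name
```

Routes are evaluated in order and the first match wins, so the header-specific route has to come before the catch-all.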
Background | Slides | Istio | GitHub |
---|
Speakers
- John Howard, Google
- Snow Petterson, Square
- Liam White, Tetrate
The goal here is to keep as much traffic as possible within the same locality/availability zone, both to improve responsiveness through reduced latency and to save on the cost of egress traffic across regions. They implement this idea using a load balancing algorithm that selects services within a region but allows fail-over into other regions based upon priorities.
Istio and Square implement locality-aware load balancing, but in different ways.
There are still some caveats to this approach, namely that load can end up unevenly distributed. That could increase latency by overloading a particular region while wasting money on under-utilized regions. Health checks can also be tricky, e.g. missing health checks do not necessarily mean failing health checks.
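Priority-based fail-over between localities can be sketched with the `locality` and `priority` fields on an Envoy cluster's endpoints. This is an illustration rather than the Istio or Square setup: the cluster name, region and zone names, hostnames, and port are all assumptions.

```yaml
clusters:
- name: backend                 # assumed cluster name
  connect_timeout: 0.25s
  type: strict_dns
  lb_policy: round_robin
  load_assignment:
    cluster_name: backend
    endpoints:
    # Priority 0: endpoints in the local zone, preferred while healthy.
    - locality:
        region: us-east-1
        zone: us-east-1a
      priority: 0
      lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.us-east-1a.example.internal  # hypothetical host
              port_value: 8080
    # Priority 1: endpoints in another region, used as priority 0 degrades.
    - locality:
        region: us-west-2
        zone: us-west-2a
      priority: 1
      lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.us-west-2a.example.internal  # hypothetical host
              port_value: 8080
```

With this layout, traffic stays on priority 0 while its endpoints are healthy and spills over to priority 1 as they become unhealthy.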
Background | Slides |
---|
Speakers
- Wayne Zhang, Google
- Yangmin Zhu, Google
Three authorization filters were covered for various use cases; they can be used together or independently:
- jwt_authn: Envoy filter for JWT tokens
- RBAC: filter for authorization inside of Envoy
- ext_authz: filter for authorization outside of Envoy
Sample jwt_authn config:

    providers:
      provider_name1:
        issuer: https://example.com
        audiences:
        - bookstore_android.apps.googleusercontent.com
        remote_jwks:
          http_uri:
            uri: https://example.com/jwks.json
            cluster: example_jwks_cluster
      provider_name2:
        issuer: https://example2.com
        local_jwks:
          inline_string: PUBLIC-KEY
        from_headers:
        - name: jwt-assertion
        forward: true
        forward_payload_header: x-jwt-payload
    rules:
    # /health doesn’t require verification
    - match:
        prefix: /health
    # /api paths use provider_name1 jwt
    - match:
        prefix: /api
      requires:
        provider_and_audiences:
          provider_name: provider_name1
          audiences:
          - Api_audience
    # all other paths use provider_name2 jwt
    - match:
        prefix: /
      requires:
        provider_name: provider_name2
Sample RBAC config:

    action: ALLOW
    policies:
      "product-viewer":
        permissions:
        - and_rules:
            rules:
            - header: { name: ":method", exact_match: "GET" }
            - header: { name: ":path", prefix_match: "/admin" }
            - destination_port: 80
        principals:
        - or_ids:
            ids:
            - authenticated:
                principal_name:
                  exact: "production"
            - metadata:
                filter: envoy.filters.http.jwt_authn
                path:
                - key: https://example.com
                - key: sub
                value:
                  string_match:
                    exact: admin
Sample ext_authz config:

    http_filters:
    - name: envoy.ext_authz
      config:
        http_service:
          server_uri:
            uri: 127.0.0.1:10003
            cluster: ext-authz
            timeout: 0.25s
        failure_mode_allow: false
        metadata_context_namespaces:
        - envoy.filters.http.jwt_authn
    clusters:
    - name: ext-authz
      connect_timeout: 0.25s
      type: logical_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: ext-authz
        endpoints:
        - # Omitted
      tls_context:
        # Omitted
Background | Slides | Grano |
---|
Speakers
- Anoop Koloth, eBay
- Hanzhang Wang, eBay
They used metrics from Envoy with Grano for machine-learning anomaly detection to determine if there are bots/attacks hitting their services, perform traffic analysis, and make decisions for scaling.
Background |
---|
Speakers
- Harvey Tuch, Google
Envoy can be configured via xDS (a gRPC API). It currently supports multiple versions of the API, which, because of a backwards-compatibility guarantee, carries some tech debt they would like to pay down.
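For context, pointing Envoy at an xDS management server over gRPC could look like the bootstrap fragment below. It is a sketch under assumptions: the cluster name `xds_cluster`, the hostname, and the port are hypothetical, and the layout follows the v2-era bootstrap format.

```yaml
dynamic_resources:
  # Aggregated Discovery Service: one gRPC stream carries all resource types.
  ads_config:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster   # assumed name, defined below
  cds_config:
    ads: {}   # fetch clusters over the ADS stream
  lds_config:
    ads: {}   # fetch listeners over the ADS stream
static_resources:
  clusters:
  - name: xds_cluster
    connect_timeout: 0.25s
    type: strict_dns
    lb_policy: round_robin
    http2_protocol_options: {}      # gRPC requires HTTP/2 to the server
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xds.example.internal   # hypothetical management server
                port_value: 18000               # assumed port
```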
Going forward, API releases will become more defined - including deprecation and removal of older API version support.
By 2020, they plan on evolving their xDS API into the Universal Data Plane API (UDPA).
Background | Contour | GitHub |
---|
Speakers
- Nick Young, VMware
Lightning talk about Contour being a Kubernetes Ingress Controller.
Background | Slides |
---|
Speakers
- Nicolas Flacco, Lyft
- Henry Yang, Lyft
- Mitch Sulaski, Workday
Redis Proxy is a simple Redis client using Envoy. It provides a single point of abstraction for working with Redis instance(s) or Redis Cluster. It maintains a connection pool to the Redis instance(s) on behalf of the client and provides load balancing.
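A minimal sketch of a listener filter chain using the Redis proxy filter follows. The stat prefix, timeout value, and the upstream cluster name `redis_cluster` are assumptions, and the filter name and `config` key follow the v2-era naming used elsewhere in these notes.

```yaml
filter_chains:
- filters:
  - name: envoy.redis_proxy
    config:
      stat_prefix: redis           # assumed prefix for emitted stats
      settings:
        op_timeout: 5s             # assumed per-operation timeout
      prefix_routes:
        catch_all_route:
          cluster: redis_cluster   # assumed upstream cluster of Redis hosts
```

Applications then connect to the Envoy listener as if it were a single Redis server, while Envoy manages the upstream connections.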
Background | Slides |
---|
Speakers
- Mitch Kelley, Solo.io
The Tap filter for Envoy is a powerful tool because it can capture entire request and response headers and bodies when the provided match criteria are met. This can be useful for debugging issues that are hard to troubleshoot.
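As an illustration of match criteria, a tap request body posted to Envoy's admin `/tap` endpoint could look like the sketch below. It assumes the HTTP tap filter is already configured with an admin `config_id` of `test_config_id`; that ID, the header name `x-debug`, and its value are hypothetical. Field names follow the v2-era tap API and may differ in newer versions.

```yaml
config_id: test_config_id        # must match the ID configured on the tap filter
tap_config:
  match_config:
    # Capture only requests carrying the (assumed) debug header.
    http_request_headers_match:
      headers:
      - name: x-debug
        exact_match: "true"
  output_config:
    sinks:
    - streaming_admin: {}        # stream captured traffic back over the admin connection
```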
Background | Slides |
---|
Speakers
- Cynthia Coan, Datawire
Clients have a lot going on and may not always have the time to upgrade to newer versions of your API, libraries, and products as fast as you would like to release updates. Additionally, communicating breaking changes can be challenging and/or problematic. SemVer can go wrong without accounting for the context of your clients.
How to fix this? You should communicate with your clients and make your upgrade process super easy. Document more than your public API, and test everything. If you have an agreement with a client for a technical detail, make sure you write a test to ensure that agreement remains fulfilled. Invest seriously in making sure your API is followed properly. If you can't do these things, give your client time.
Tech debt affects more than you, and every bit of tech debt adds complexity. When planning something out, "keep it boring" and follow the YAGNI principle to prevent accumulation of tech debt.
Your team's image and reputation are very important to having a successful relationship with your clients, and to keeping them on board with embracing your changes.
Every new feature needs something like a Definition of Ready and Definition of Done. Use a linter to automate code reviews and enforce best practices. Keep meeting-heavy days to a single day per week to avoid daily/frequent context switching. Constantly develop off of master, and your CI process should use master.