Release v0.9 · CentaurusInfra/arktos

This release focuses on doubling the throughput of large Arktos scale-out clusters, minimizing management cost, and enabling service support.

Some highlights include:

Arktos now supports 50,000 nodes in a cluster with only two tenant partitions (TP) and two resource partitions (RP), at similar pod startup latency. This significantly reduces the management cost of a 50,000-node Arktos cluster.
This release also doubles Arktos system throughput, thanks to many optimizations in the API Server, Controller Manager, and Kubelet. (Taking the 60% management cost reduction into account, each TP effectively delivers 5 times the system throughput, with a 2.5 times increase in cluster size per TP.)
Services are now supported in both the Arktos scale-out and scale-up architectures. Customers can now create and deploy services and associate pods with them in Arktos.
Pod startup latency and system throughput:

| Release | | v0.8 (June 2021) | v0.9 (September 2021) | v0.9 (September 2021) | v0.9 (September 2021) |
|---|---|---|---|---|---|
| System Scalability (Cluster Size) | | 50K (nodes in a cluster) | 50K | 50K | 25K |
| System Architecture Partition (Cost) | | 5 Tenant Partitions (TP) & 5 Resource Partitions (RP) | 5x5 | 2x2 | 1x1 |
| System Throughput (Combined QPS) | | 100 QPS in Server / 25 QPS in Client | 200/50 | 200/50 | 100/25 |
| Latency/Performance (Pod Startup Latency in seconds) | P50 | 1.8278 | 1.7879 | 1.8307 | 1.7987 |
| | P90 | 2.7846 | 2.5756 | 2.7759 | 2.6265 |
| | P99 | 5.7178 | 3.7062 | 7.3256 | 4.9631 |
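
The per-TP figures quoted in the highlights can be checked with simple division. The sketch below assumes that "throughput per TP" means the server-side combined QPS divided by the number of tenant partitions, and that cluster size per TP is the node count divided the same way. The release notes do not spell out the calculation, so treat this as one plausible reading rather than the project's official method. The dictionary layout and the `per_tp` helper are illustrative names, not part of Arktos.

```python
# Figures for the v0.8 (5x5) and v0.9 (2x2) 50K-node configurations,
# taken from the release table. The data layout is illustrative only.
v08 = {"tenant_partitions": 5, "nodes": 50_000, "server_qps": 100}
v09 = {"tenant_partitions": 2, "nodes": 50_000, "server_qps": 200}


def per_tp(release):
    """Return (server QPS per TP, nodes per TP) for one configuration.

    Assumption: load and nodes are spread evenly across tenant partitions.
    """
    tps = release["tenant_partitions"]
    return release["server_qps"] / tps, release["nodes"] / tps


qps_08, nodes_08 = per_tp(v08)
qps_09, nodes_09 = per_tp(v09)

# Fraction of tenant partitions removed going from 5 to 2.
tp_reduction = 1 - v09["tenant_partitions"] / v08["tenant_partitions"]
# Ratio of per-TP throughput and per-TP node count between releases.
throughput_gain = qps_09 / qps_08
size_gain = nodes_09 / nodes_08

print(f"TP count reduction: {tp_reduction:.0%}")
print(f"Per-TP throughput gain: {throughput_gain:.1f}x")
print(f"Per-TP cluster size gain: {size_gain:.1f}x")
```

Under these assumptions the script reproduces the 60%, 5 times, and 2.5 times figures from the highlights.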

Service support:

Scalability and performance tuning changes:

Avoid GET node for each node PATCH in kubelet (PR 835)
Refresh resource version with idle watchers upon watch session renewal (PR 1183)
Reduce pod list requests in perf test (PR 1187)
Cherry-pick performance-related community changes:
- Use watch instead of list pods in node controller (PR 1129, 1173)
- Disable watchcache for events (PR 1184)
Cherry-pick perf test changes:
- Add channel for events to PodStartupLatency (PR 1187)

Perf test tool changes:

Security fixes:

Bug fixes:

Fix a bug where the event client was created with the wrong user agent (PR 1120)
Set user agent for clients when talking to API server in another partition (PR 1125, 1186)

Others: