Release v0.9

@Sindica Sindica released this 01 Oct 20:14
306c447

This release focuses on doubling the throughput of large Arktos scale-out clusters, minimizing management cost, and enabling service support.

Some highlights include:

  • Arktos now supports 50,000 nodes in a cluster with only two tenant partitions (TP) and two resource partitions (RP), with similar pod startup latency. This significantly reduces the management cost of a 50,000-node Arktos cluster.
  • This release also doubles Arktos system throughput, thanks to many optimizations in the API Server, the Controller Manager, and the Kubelet. (Taking the 60% management cost reduction into account, each TP now delivers 5 times the system throughput while serving a cluster share 2.5 times larger.)
  • Services are now supported in both the Arktos scale-out and scale-up architectures. Customers can now create and deploy services and associate pods with them in Arktos.
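  • Pod startup latency and system throughput:

| Release                                                        | v0.8 (June 2021) | v0.9 (September 2021) | v0.9 (September 2021) | v0.9 (September 2021) |
|----------------------------------------------------------------|------------------|-----------------------|-----------------------|-----------------------|
| System Scalability (Cluster Size, nodes in a cluster)          | 50K              | 50K                   | 50K                   | 25K                   |
| System Architecture Partition (Cost), TP x RP                  | 5x5              | 5x5                   | 2x2                   | 1x1                   |
| System Throughput (Combined QPS), QPS in Server / QPS in Client | 100/25           | 200/50                | 200/50                | 100/25                |
| Pod Startup Latency P50 (seconds)                              | 1.8278           | 1.7879                | 1.8307                | 1.7987                |
| Pod Startup Latency P90 (seconds)                              | 2.7846           | 2.5756                | 2.7759                | 2.6265                |
| Pod Startup Latency P99 (seconds)                              | 5.7178           | 3.7062                | 7.3256                | 4.9631                |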
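
The 60%, 5x, and 2.5x figures quoted in the highlights can be re-derived from the partition counts, cluster sizes, and server QPS values reported above. The following is a minimal sketch of that arithmetic, not part of Arktos itself. It assumes that "management cost" is proportional to the total number of partitions (TP + RP), which is an interpretation rather than something the release notes state, and the `Config` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One benchmark configuration, with values taken from the release table."""
    name: str
    tenant_partitions: int    # number of TPs
    resource_partitions: int  # number of RPs
    nodes: int                # cluster size
    server_qps: int           # combined server-side QPS

    @property
    def partitions(self) -> int:
        # Assumption: management cost scales with the total partition count.
        return self.tenant_partitions + self.resource_partitions


v08 = Config("v0.8 5x5", 5, 5, 50_000, 100)
v09 = Config("v0.9 2x2", 2, 2, 50_000, 200)


def cost_reduction(old: Config, new: Config) -> float:
    """Fractional drop in the number of partitions to manage."""
    return 1 - new.partitions / old.partitions


def per_tp(cfg: Config, total: float) -> float:
    """Share of a cluster-wide quantity handled by each tenant partition."""
    return total / cfg.tenant_partitions


reduction = cost_reduction(v08, v09)
throughput_gain = per_tp(v09, v09.server_qps) / per_tp(v08, v08.server_qps)
size_gain = per_tp(v09, v09.nodes) / per_tp(v08, v08.nodes)

print(f"management cost reduction: {reduction:.0%}")
print(f"per-TP throughput gain:    {throughput_gain:.1f}x")
print(f"per-TP cluster size gain:  {size_gain:.1f}x")
```

Under that assumption, going from 10 partitions to 4 gives the 60% reduction, and dividing the cluster-wide QPS and node counts by the number of TPs gives the per-TP ratios.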

Features/Improvements/Bug fixes:

Service support:

  • Scale-out clusters can now use the flannel CNI plugin
  • Service support is enabled in local dev cluster by default

Scalability and performance tuning changes:

  • Avoid GET node for each node PATCH in kubelet (PR 835)
  • Refresh resource version with idle watchers upon watch session renewal (PR 1183)
  • Reduce pod list requests in perf test (PR 1187)
  • Cherry pick performance related community changes:
    • Use watch instead of list pods in node controller (PR 1129, 1173)
    • Disable watchcache for events (PR 1184)
  • Cherry pick perf test changes:
    • Add channel for events to PodStartupLatency (PR 1187)

Perf test tool changes:

  • Decouple proxy operation in kube-up and kubemark (PR 1105)
  • Fix Prometheus config to include HAProxy metrics (PR 1103)
  • Kubemark cluster starts partition servers in parallel (PR 1113)
  • Support skipping pod deletion phase in perf test (PR 1159)
  • Perf test config for large cluster (PR 1187)

Security fixes:

  • Bump gorilla/websocket to v1.4.2 (PR 1127)

Bug fixes:

  • Fix a bug where the event client was created with the wrong user agent (PR 1120)
  • Set user agent for clients when talking to API server in another partition (PR 1125, 1186)

Others: