
Computing 930 2021 Tracks


Goals

  1. Burst scheduling support
    • 40 QPS per 10K-node cluster
    • 1TP/1RP: QPS >= 40
    • 50K cluster at QPS 200: 5TP/5RP; 3TP/4RP possible at QPS < 200
  2. Minimal management cost for 50K cluster
    • Number of TP <= 5, number of RP <= 5
  3. Service Support in scale out cluster
  4. Daemonset handling in RP (on hold in favor of service support)
    • System tenant only
    • Multi-tenancy daemonset out of scope
  5. System partition pod handling raw design

Current status

Release 0.8

  1. 5TP/5RP 50K cluster 20 QPS, pod start up latency <= 6s (p99)

Current Work in Progress (9/16)

1. Burst scheduling support & minimal management cost for 50K cluster

a. 1TP/1RP maximal nodes - 1x30K

| Date | Cluster Size | QPS (saturation/latency) | p50 (s) | p90 (s) | p99 (s) | Changes / Notes |
|------|--------------|--------------------------|---------|---------|---------|-----------------|
| 8/26 | 1x25K | 100/5 | 1.82 | 2.65 | 5.00 | Reduced list/watch of pods in perf test |
| 9/01 | 1x25K | 100/5 | 1.82 | 2.65 | 4.88 | Use event receiving time as watch time in perf test |
| 9/08 | 1x25K | 100/5 | 1.81 | 2.62 | 4.51 | Index pods by label selector in perf test |
| 9/14 | 1x25K | 150/25 | 1.82 | 2.66 | 5.53 | Increased cache size for 25K cluster (previous cache size was for 10K cluster); send bookmark events to clients; saturation pod QPS 100 -> 150, latency pod QPS 5 -> 25 |
| 9/15 | 1x30K | 150/30 | 1.83 | 2.76 | 6.96 | Cache size increased for 30K cluster; latency pod QPS 30 |
| 9/16 | 1x35K | 150/35 | 1.87 | 2.89 | 8.63 | Cache size increased for 35K cluster; latency pod QPS 35; KCM 500 error |

b. 2TP/2RP 2x25K = 50K nodes

| Date | Cluster Size | QPS | p50 (s, per TP) | p90 (s, per TP) | p99 (s, per TP) | Changes |
|------|--------------|-----|-----------------|-----------------|-----------------|---------|
| 8/17 | 2x25K | 2x100 | 1.87 / 1.87 | 2.74 / 2.73 | 6.59 / 6.29 | Possibly inaccurate: a misconfiguration caused the number of watchers to be much lower than usual (use as a reference only) |
| 9/02 | 2x25K | 2x100 | 1.87 / 1.87 | 2.90 / 2.84 | 9.14 / 8.27 | Same as the 9/1 1x25K cluster run |
| 9/09 | 2x25K | 2x100 | 1.83 / 1.85 | 2.81 / 2.80 | 9.89 / 9.01 | 9/8 + increased pod cache size to accommodate a 25K cluster |
| 9/13 | 2x25K | 2x100 | 1.79 / 1.79 | 2.54 / 2.54 | 2.99 / 2.96 | 9/9 + send bookmark events to clients |

c. 930 release test plan

  • Scale-up and scale-out configurations: 1TP/1RP and 2TP/2RP at 50K nodes
  • Density / Load tests
  • Cluster QPS / latency pod QPS

d. 50K cluster tp99 improvement thoughts

  • Reduce the number of secret watchers - Yunwen (TBD)

2. Service support in arktos - Hongwei/Carl

  1. Implemented components, need to enable & verify (WIP)
    1. kubernetes service entries: must be network-specific instead of global to the cluster (done)
    2. kube-dns (in kube-system namespace) service entries: must be network-specific; each network should have its own deployment (done)
    3. Make flannel work in arktos
      1. Scale up (done)
      2. Scale out (WIP)
    4. Start dns pods in arktos
      1. Scale up (done)
      2. Scale out (WIP)
    5. Arktos network controller: whenever a tenant is created, the default network object should be created automatically, along with its kubernetes and kube-dns service entries; for flat-type networks, it should also take care of the kube-dns deployment.
  2. kubelet: when initializing a pod sandbox, it should provision /etc/resolv.conf with the proper kube-dns_{network} service IP (see the sketch after this list)
  3. Make kube-proxy multi-tenancy aware
  4. Simple on/off feature gate(s)
  5. Containerize network controller
  6. Data entry in Prometheus (and solve previous 404 issue)
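
A minimal sketch of the per-network DNS idea from item 2 above, assuming each network gets its own kube-dns_{network} service with a distinct cluster IP: the kubelet would render the sandbox's /etc/resolv.conf against that IP. The type and function names below are illustrative only, not the actual Arktos kubelet code.

```go
// Hypothetical sketch: look up the DNS settings of the tenant's network and
// render an /etc/resolv.conf that points at its per-network kube-dns service.
package main

import (
	"fmt"
	"strings"
)

// networkDNSConfig holds the DNS settings resolved for one Arktos network.
type networkDNSConfig struct {
	Network    string   // name of the tenant's network object
	ServiceIP  string   // cluster IP of the kube-dns_{network} service
	SearchPath []string // DNS search domains for the pod's namespace
}

// renderResolvConf builds the /etc/resolv.conf contents the sandbox would mount.
func renderResolvConf(cfg networkDNSConfig) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# generated for network %q\n", cfg.Network)
	fmt.Fprintf(&b, "nameserver %s\n", cfg.ServiceIP)
	if len(cfg.SearchPath) > 0 {
		fmt.Fprintf(&b, "search %s\n", strings.Join(cfg.SearchPath, " "))
	}
	fmt.Fprintln(&b, "options ndots:5")
	return b.String()
}

func main() {
	cfg := networkDNSConfig{
		Network:   "default",
		ServiceIP: "10.0.0.10", // assumed cluster IP of the default network's kube-dns
		SearchPath: []string{
			"demo-ns.svc.cluster.local", "svc.cluster.local", "cluster.local",
		},
	}
	fmt.Print(renderResolvConf(cfg))
}
```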

3. Allow new nodes to join an RP - manually set up a full arktos scale-out cluster - Carl

  1. Add a new node into a scale-up cluster, update the manual - DONE
  2. Add a new node into a scale-out cluster, update the manual - TBD

4. Security alert (TBD)

  1. Issue 1126 - github dependabot alerts
    1. github.com/gorilla/websocket to v1.4.1: code/PR ready: https://github.com/CentaurusInfra/arktos/pull/1127 - (DONE - Sonya)
    2. containerd to v1.4.8: As dependency, k8s.io/utils upgrade is necessary; tracked by issue https://github.com/CentaurusInfra/arktos/issues/924
    3. runc to v1.0.0-rc95: As dependency, k8s.io/utils upgrade is necessary; tracked by issue https://github.com/CentaurusInfra/arktos/issues/924

5. Scalability improvement thoughts

  1. Reduce perf test duration
    1. Increase QPS for latency pod creation - perhaps in cluster loader (saturation 20->100, latency 5->25?)
  2. Fine tuning
    1. Evaluate all list requests from clients that go to etcd directly (on demand); see the sketch below
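
A rough illustration of the list-path distinction behind item 2.1, using upstream client-go (the Arktos fork's client signatures may differ): a list with ResourceVersion "0" can be served from the apiserver watch cache, while an empty ResourceVersion forces a quorum read against etcd. Namespace and kubeconfig handling are simplified for the sketch.

```go
// Compare a cache-served list with an etcd quorum-read list.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// ResourceVersion "0": served from the apiserver watch cache (cheap, possibly slightly stale).
	cached, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	// Empty ResourceVersion: the apiserver performs a quorum read from etcd.
	fresh, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Printf("cached list: %d pods, quorum list: %d pods\n",
		len(cached.Items), len(fresh.Items))
}
```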

Completed Tasks

  1. Burst scheduling support
    1. 1TP/1RP 10K cluster 100 QPS - 6/21, pod start up latency <= 3s (p99)
    2. 1TP/1RP 15K cluster 100 QPS - 7/26, pod start up latency p50 1.8s, p90 2.6s, p99 4.2s
    3. 1TP/1RP 20K cluster 100 QPS - 7/29, pod start up latency p50 1.8s, p90 2.7s, p99 5.5s
    4. 1TP/1RP 25K cluster 100 QPS - 8/26, pod start up latency p50 1.8s, p90 2.7s, p99 5.0s
    5. 1.18.5 15K cluster 100 QPS (1 API server, 1 etcd) - 7/1, pod start up latency p50 1.41s, p90 2.17s, p99 5.25s
      1. Diff between pod_start_up and run_to_watch in whole seconds: 0s (9192), 1s (4961), 2s (449), 3s (211), 4s, 5s (34)
    6. 1.21 15K cluster 100 QPS - 7/15, pod start up latency p50 1.54s, p90 2.61s, p99 5.65s
    7. 1.21 20K cluster 100 QPS - 7/16. Scheduler restarted multiple times due to lost leader election
    8. 1.21 20K cluster 100 QPS - 7/22. p50 1.7s, p90 3.3s, p99 9.3s, saturation latency bad (p50 926s)
    9. Set up Prometheus for k8s 1.18.5 & 1.21 (7/12)
    10. Identify 1.18 perf improvement changes
      1. Node controller had expensive pod list calls; switched to watch - PR 1129, PR 1151 (Issue 77733)
      2. Reduce cachesize for event in apiserver (https://github.com/kubernetes/kubernetes/pull/96117)
    11. Arktos perf change
      1. Reduce kubelet getting node PR 835
      2. Increased watch timeout from 5min mean to 30 min mean (reverted - watch cannot be longer than 10 min)
      3. Reduce list pods from perf test PR 1163
      4. Add indexer to perf test PR 1169 (see the indexer sketch after this list)
      5. Increase pod cache size - YingH PR 1175
      6. Send fake bookmark event to client to reduce size of initEvents - YingH PR 1179
  2. Bug fix
    1. Fix user agent of event client - PR 1120 https://github.com/CentaurusInfra/arktos/pull/1120
  3. Minimal management cost for 50K cluster
    1. Start TP in parallel, start RP in parallel - Done 7/8 PR 1113
  4. Daemonset handling in RP
    1. Design - Hongwei (Done 7/6)
  5. Reduce perf test duration
    1. Skip garbage collection step - (Done 8/23)
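
As referenced in item 1.11.4 above, a rough sketch of indexing pods by a label using a client-go shared informer with a custom indexer, so a perf test can fetch one group's pods without listing everything. The index name and label key are invented for the example; the actual change in PR 1169 may differ.

```go
// Index pods by a label value and query them through the informer's indexer.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

const groupLabel = "perf-test-group" // hypothetical label set on latency pods

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Register an index keyed by the value of the perf-test-group label.
	err = podInformer.AddIndexers(cache.Indexers{
		"byGroup": func(obj interface{}) ([]string, error) {
			pod := obj.(*corev1.Pod)
			if v, ok := pod.Labels[groupLabel]; ok {
				return []string{v}, nil
			}
			return nil, nil
		},
	})
	if err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Constant-time lookup of one measurement group's pods instead of a full list.
	objs, err := podInformer.GetIndexer().ByIndex("byGroup", "density-run-1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("pods in group density-run-1: %d\n", len(objs))
}
```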

Parking Tasks

  1. Prometheus support for k8s perf test
    1. Automatically preserve historical Prometheus data
    2. Automatically pull profiling data on a periodic basis
  2. API server performance: log, code analysis
    1. Kubelet container died - Yunwen
    2. Pod creation event diff in 1.18 & arktos audit logs - YingH (Parking)
      1. 1.18.5 behaves the same as arktos in local cluster up
      2. 1.18.5 uses v1 for events in kube up, no audit log
      3. Arktos uses v1beta1 for events in kube up (same as local cluster up)
  3. Scan all K8s performance improvements - Carl - still necessary?
    1. Current focus on watch improvement
  4. Daemonset handling
    1. Implementation - TBD
  5. Scalability improvement thoughts
    1. Utilization of audit log (post 930)
      1. Enable apiserver audit in local dev env
      2. Automatically scan the audit log and summarize request types, resources, durations, etc. (look for existing tools)
    2. Enable api server request latency (post 930)
      1. Migrate to scalability metrics framework PR 980
    3. Start cluster in parallel
      1. Start TP/RP in parallel - needs a lot of work and does not yield significant improvement