[clusteragent/autoscaling] Impose 100 DatadogPodAutoscaler limit in cluster agent #28684

Merged · 33 commits · Oct 4, 2024
Changes from 3 commits
Commits
33 commits
8ee3911
Add creation timestamp to DatadogPodAutoscalerInternal
jennchenn Aug 20, 2024
6cd039c
Fix minor typos
jennchenn Aug 21, 2024
46ece6f
Implement heap to track number of DPA objects
jennchenn Aug 22, 2024
cabcad1
Update heap names to reduce redundancy in calls
jennchenn Aug 23, 2024
e70833e
Add basic unit test for creation of heap
jennchenn Aug 23, 2024
843d36c
fixup! Update heap names to reduce redundancy in calls
jennchenn Aug 23, 2024
b200155
Make error message shown in status more explicit
jennchenn Aug 30, 2024
c6d6431
Reorder PAI params to have metadata fields first
jennchenn Aug 30, 2024
1863ce1
Prevent adding objects with 0 timestamp to heap
jennchenn Aug 30, 2024
9ecf501
Remove hashheap from generic controller definition and add delete obs…
jennchenn Aug 30, 2024
156ecff
Register observer on store to update heap
jennchenn Sep 6, 2024
90799f2
Merge remote-tracking branch 'origin/main' into jenn/CASCL-1_autoscal…
jennchenn Sep 6, 2024
b680558
Add check for 100 DPAs to validation logic
jennchenn Sep 6, 2024
ea7979d
Refactor max heap functions to use Exists helper
jennchenn Sep 6, 2024
91e6b9f
Update DCA target test with autoscaler creation timestamp
jennchenn Sep 6, 2024
0622fb4
Remove period from error message; remove redundant return
jennchenn Sep 6, 2024
7254318
Update local test to check behavior on delete
jennchenn Sep 25, 2024
03fd859
Add RWMutex to hash heap
jennchenn Sep 25, 2024
b29cfd0
Remove test namespace hardcoded value
jennchenn Sep 26, 2024
565a3ef
Update creation timestamp for remote owner
jennchenn Sep 26, 2024
b92ac07
Rename controller hashHeap to limitHeap for better description of usage
jennchenn Sep 27, 2024
3913a97
Remove redundant variable initialization
jennchenn Sep 27, 2024
6669594
Rename test to be more descriptive
jennchenn Sep 27, 2024
44589e5
Update creation timestamp only when spec changes
jennchenn Sep 27, 2024
4ae0ab3
Simplify validation logic and pass only one param
jennchenn Sep 27, 2024
3285d31
Skip rest of DCA validation logic if self pod name cannot be found
jennchenn Sep 27, 2024
aedc9f8
fixup! Rename controller hashHeap to limitHeap for better description…
jennchenn Sep 27, 2024
0429380
Use store directly in hash heap to avoid passing obj to observer
jennchenn Sep 27, 2024
653ab74
Test hash heap on remote owner DPA objects
jennchenn Oct 1, 2024
6cdaf91
Merge remote-tracking branch 'origin/main' into jenn/CASCL-1_autoscal…
jennchenn Oct 1, 2024
b9a47f3
Pass in store to hash heap
jennchenn Oct 2, 2024
96ef6dd
Remove duplication and clean up fake pod autoscaler helper function
jennchenn Oct 3, 2024
4a317d5
Remove requeue after creation timestamp update
jennchenn Oct 3, 2024
9 changes: 6 additions & 3 deletions pkg/clusteragent/autoscaling/workload/controller.go
@@ -250,9 +250,12 @@ func (c *Controller) syncPodAutoscaler(ctx context.Context, key, ns, name string
// and compare it with the one in the PodAutoscaler. If they differ, we should update the PodAutoscaler
// otherwise store the Generation
if podAutoscalerInternal.Generation() != podAutoscaler.Generation {
- podAutoscalerInternal.UpdateCreationTimestamp(podAutoscaler.CreationTimestamp.Time)
+ if podAutoscalerInternal.CreationTimestamp().IsZero() {
+ 	podAutoscalerInternal.UpdateCreationTimestamp(podAutoscaler.CreationTimestamp.Time)
+ 	return autoscaling.Requeue, c.updateAutoscalerStatusAndUnlock(ctx, key, ns, name, nil, podAutoscalerInternal, podAutoscaler)
Contributor:
Is there any reason why we requeue here? I'd not expect it to be necessary (similar to the Generation update).
nit: I'd move this code close to the SetGeneration call for consistency.

Member Author:
Originally I set it to requeue to trigger a store update earlier and minimize the time a status could incorrectly show an error state (i.e., when the limit is not exceeded but the key has not yet been added to the heap because it had a zero creation timestamp). Moved this and removed the requeue as well.

+ }
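To make the behavior discussed above concrete, here is a minimal, self-contained sketch of the set-once semantics: the creation timestamp is copied from the Kubernetes object only while the stored value is still zero, and nothing is requeued. The type and function names below (internalState, syncMeta) are invented for illustration and are not the agent's; note also that later commits in this PR move the update again (see "Update creation timestamp only when spec changes").

package main

import (
	"fmt"
	"time"
)

// internalState is a toy stand-in for PodAutoscalerInternal.
type internalState struct {
	generation        int64
	creationTimestamp time.Time
}

// syncMeta refreshes the generation on every change, but copies the creation
// timestamp only while the stored value is still zero, without requeuing.
func syncMeta(s *internalState, k8sGeneration int64, k8sCreated time.Time) {
	if s.generation != k8sGeneration {
		s.generation = k8sGeneration
	}
	if s.creationTimestamp.IsZero() {
		s.creationTimestamp = k8sCreated
	}
}

func main() {
	s := &internalState{}
	first := time.Date(2024, 8, 20, 0, 0, 0, 0, time.UTC)

	syncMeta(s, 1, first)
	fmt.Println(s.creationTimestamp) // set on the first sync

	syncMeta(s, 2, time.Now())
	fmt.Println(s.creationTimestamp) // unchanged on later syncs
}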

- localHash, err := autoscaling.ObjectHash(podAutoscalerInternal.Spec)
+ localHash, err := autoscaling.ObjectHash(podAutoscalerInternal.Spec())
Contributor:
Good find! We need to check all usages of ObjectHash to make sure we do not have other issues.
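For context, the bug fixed on this line is a common Go pitfall: without parentheses, podAutoscalerInternal.Spec is a method value (a func), not the spec it returns, so ObjectHash ends up operating on a function value rather than the spec contents. A toy, self-contained illustration (these type names are not the agent's):

package main

import "fmt"

// podAutoscalerInternal is a toy stand-in; the real type lives in the
// datadog-agent model package.
type podAutoscalerInternal struct {
	spec string
}

// Spec returns the stored spec.
func (p podAutoscalerInternal) Spec() string { return p.spec }

func main() {
	p := podAutoscalerInternal{spec: "target: app-0"}

	fmt.Printf("%T\n", p.Spec)   // func() string: the method value
	fmt.Printf("%T\n", p.Spec()) // string: the actual spec contents
}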

if err != nil {
c.store.Unlock(key)
return autoscaling.Requeue, fmt.Errorf("Failed to compute Spec hash for PodAutoscaler: %s/%s, err: %v", ns, name, err)
@@ -420,7 +423,7 @@ func (c *Controller) deletePodAutoscaler(ns, name string) error {
func (c *Controller) validateAutoscaler(podAutoscalerInternal model.PodAutoscalerInternal) error {
// Check that we are within the limit of 100 DatadogPodAutoscalers
key := podAutoscalerInternal.ID()
- if !c.limitHeap.Exists(key) {
+ if !podAutoscalerInternal.CreationTimestamp().IsZero() && !c.limitHeap.Exists(key) {
Contributor:
Is it possible to reach this point while having a zero CreationTimestamp?

Member Author:
Good point! I can update it to check whether the value in the store is zero if necessary, but removed it for now.

return fmt.Errorf("Autoscaler disabled as maximum number per cluster reached (%d)", maxDatadogPodAutoscalerObjects)
}

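For context, the limitHeap consulted in validateAutoscaler is, judging from the fields exercised in the tests below (MaxHeap, Keys, Exists) and the commit messages, a bounded max-heap keyed on creation timestamp: the oldest entries up to the limit are kept, and anything newer than the newest kept entry is left out and fails validation. The sketch below is a simplified, self-contained stand-in for that structure, not the agent's actual autoscaling.HashHeap, which also reads from the store and guards access with an RWMutex.

package main

import (
	"container/heap"
	"fmt"
	"time"
)

// entry is one autoscaler key plus its creation timestamp.
type entry struct {
	Key       string
	Timestamp time.Time
}

// maxHeap keeps the entry with the newest creation timestamp at the root.
type maxHeap []entry

func (h maxHeap) Len() int           { return len(h) }
func (h maxHeap) Less(i, j int) bool { return h[i].Timestamp.After(h[j].Timestamp) }
func (h maxHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *maxHeap) Push(x interface{}) { *h = append(*h, x.(entry)) }
func (h *maxHeap) Pop() interface{} {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// hashHeap keeps at most maxSize entries, preferring the oldest creation
// timestamps, and mirrors membership in Keys for O(1) Exists checks.
type hashHeap struct {
	MaxHeap maxHeap
	Keys    map[string]bool
	maxSize int
}

func newHashHeap(maxSize int) *hashHeap {
	return &hashHeap{Keys: map[string]bool{}, maxSize: maxSize}
}

// Exists reports whether key is currently within the tracked limit.
func (h *hashHeap) Exists(key string) bool { return h.Keys[key] }

// Insert admits key while there is room; once full, a key older than the
// newest tracked entry evicts that entry, and anything newer is rejected.
func (h *hashHeap) Insert(key string, ts time.Time) {
	if h.Keys[key] {
		return
	}
	if h.MaxHeap.Len() >= h.maxSize {
		newest := h.MaxHeap[0]
		if !ts.Before(newest.Timestamp) {
			return // over the limit: newer than everything already kept
		}
		heap.Pop(&h.MaxHeap)
		delete(h.Keys, newest.Key)
	}
	heap.Push(&h.MaxHeap, entry{Key: key, Timestamp: ts})
	h.Keys[key] = true
}

func main() {
	h := newHashHeap(2)
	t0 := time.Now()
	h.Insert("default/dpa-1", t0)
	h.Insert("default/dpa-2", t0.Add(time.Hour))
	h.Insert("default/dpa-0", t0.Add(-time.Hour)) // older: evicts dpa-2

	fmt.Println(h.Exists("default/dpa-0")) // true
	fmt.Println(h.Exists("default/dpa-2")) // false
}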
240 changes: 229 additions & 11 deletions pkg/clusteragent/autoscaling/workload/controller_test.go
@@ -71,17 +71,29 @@ func newFixture(t *testing.T, testTime time.Time) *fixture {
}

func newFakePodAutoscaler(ns, name string, gen int64, creationTimestamp time.Time, spec datadoghq.DatadogPodAutoscalerSpec, status datadoghq.DatadogPodAutoscalerStatus) (obj *unstructured.Unstructured, dpa *datadoghq.DatadogPodAutoscaler) {
- dpa = &datadoghq.DatadogPodAutoscaler{
- 	TypeMeta: podAutoscalerMeta,
- 	ObjectMeta: metav1.ObjectMeta{
- 		Name: name,
- 		Namespace: ns,
- 		Generation: gen,
- 		UID: uuid.NewUUID(),
- 		CreationTimestamp: metav1.NewTime(creationTimestamp),
- 	},
- 	Spec: spec,
- 	Status: status,
+ if gen == -1 { // Create fake pod autoscaler for remote owner
Contributor:
nit: avoid duplication and set these fields on dpa only if gen > 0 (0 is not a valid value for an existing object IIRC).

if gen > 0 {
  dpa.UID = ...
  dpa.Generation = gen
  dpa.CreationTimestamp = ...
}
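A later commit in this PR, "Remove duplication and clean up fake pod autoscaler helper function" (96ef6dd), picks this suggestion up. One plausible consolidated form of the construction logic, sketched with the identifiers already used in this test file rather than the exact merged code (only this part changes; the rest of the helper stays as is):

	// Build the common fields once; only locally-created fakes carry a
	// generation, UID and creation timestamp, while remote-owner fakes
	// leave them unset.
	dpa = &datadoghq.DatadogPodAutoscaler{
		TypeMeta: podAutoscalerMeta,
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: ns,
		},
		Spec:   spec,
		Status: status,
	}

	if gen > 0 {
		dpa.Generation = gen
		dpa.UID = uuid.NewUUID()
		dpa.CreationTimestamp = metav1.NewTime(creationTimestamp)
	}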

+ 	dpa = &datadoghq.DatadogPodAutoscaler{
+ 		TypeMeta: podAutoscalerMeta,
+ 		ObjectMeta: metav1.ObjectMeta{
+ 			Name: name,
+ 			Namespace: ns,
+ 		},
+ 		Spec: spec,
+ 		Status: status,
+ 	}
+ } else {
+ 	dpa = &datadoghq.DatadogPodAutoscaler{
+ 		TypeMeta: podAutoscalerMeta,
+ 		ObjectMeta: metav1.ObjectMeta{
+ 			Name: name,
+ 			Namespace: ns,
+ 			Generation: gen,
+ 			UID: uuid.NewUUID(),
+ 			CreationTimestamp: metav1.NewTime(creationTimestamp),
+ 		},
+ 		Spec: spec,
+ 		Status: status,
+ 	}
}

obj, err := autoscaling.ToUnstructured(dpa)
@@ -492,3 +504,209 @@ func TestPodAutoscalerLocalOwnerObjectsLimit(t *testing.T) {
assert.Falsef(t, f.autoscalingHeap.Keys[dpa1ID], "Expected dpa-1 to not be in heap")
assert.Truef(t, f.autoscalingHeap.Keys[dpa2ID], "Expected dpa-2 to be in heap")
}

func TestPodAutoscalerRemoteOwnerObjectsLimit(t *testing.T) {
testTime := time.Now()
f := newFixture(t, testTime)

dpaSpec := datadoghq.DatadogPodAutoscalerSpec{
TargetRef: autoscalingv2.CrossVersionObjectReference{
Kind: "Deployment",
Name: "app-0",
APIVersion: "apps/v1",
},
// Remote owner means .Spec source of truth is Datadog App
Owner: datadoghq.DatadogPodAutoscalerRemoteOwner,
}

dpa1Spec := datadoghq.DatadogPodAutoscalerSpec{
TargetRef: autoscalingv2.CrossVersionObjectReference{
Kind: "Deployment",
Name: "app-1",
APIVersion: "apps/v1",
},
// Remote owner means .Spec source of truth is Datadog App
Owner: datadoghq.DatadogPodAutoscalerRemoteOwner,
}
dpa2Spec := datadoghq.DatadogPodAutoscalerSpec{
TargetRef: autoscalingv2.CrossVersionObjectReference{
Kind: "Deployment",
Name: "app-2",
APIVersion: "apps/v1",
},
// Remote owner means .Spec source of truth is Datadog App
Owner: datadoghq.DatadogPodAutoscalerRemoteOwner,
}

dpaInternal := model.FakePodAutoscalerInternal{
Namespace: "default",
Name: "dpa-0",
Spec: &dpaSpec,
}
f.store.Set("default/dpa-0", dpaInternal.Build(), controllerID)

dpaInternal1 := model.FakePodAutoscalerInternal{
Namespace: "default",
Name: "dpa-1",
Spec: &dpa1Spec,
}
f.store.Set("default/dpa-1", dpaInternal1.Build(), controllerID)

dpaInternal2 := model.FakePodAutoscalerInternal{
Namespace: "default",
Name: "dpa-2",
Spec: &dpa2Spec,
}
f.store.Set("default/dpa-2", dpaInternal2.Build(), controllerID)

// Should create object in Kubernetes
expectedStatus := datadoghq.DatadogPodAutoscalerStatus{
Conditions: []datadoghq.DatadogPodAutoscalerCondition{
{
Type: datadoghq.DatadogPodAutoscalerErrorCondition,
Status: corev1.ConditionFalse,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerActiveCondition,
Status: corev1.ConditionTrue,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerHorizontalAbleToRecommendCondition,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerVerticalAbleToRecommendCondition,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerHorizontalScalingLimitedCondition,
Status: corev1.ConditionFalse,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerHorizontalAbleToScaleCondition,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerVerticalAbleToApply,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
},
}
expectedUnstructured, _ := newFakePodAutoscaler("default", "dpa-0", -1, time.Time{}, dpaSpec, expectedStatus)
f.ExpectCreateAction(expectedUnstructured)
f.RunControllerSync(true, "default/dpa-0")

expectedUnstructured1, _ := newFakePodAutoscaler("default", "dpa-1", -1, time.Time{}, dpa1Spec, expectedStatus)
f.Actions = nil
f.ExpectCreateAction(expectedUnstructured1)
f.RunControllerSync(true, "default/dpa-1")

expectedUnstructured2, _ := newFakePodAutoscaler("default", "dpa-2", -1, time.Time{}, dpa2Spec, expectedStatus)
f.Actions = nil
f.ExpectCreateAction(expectedUnstructured2)
f.RunControllerSync(true, "default/dpa-2")
assert.Len(t, f.store.GetAll(), 3)

dpaTime := testTime.Add(-1 * time.Hour)
dpa1Time := testTime
dpa2Time := testTime.Add(1 * time.Hour)

dpa, dpaTyped := newFakePodAutoscaler("default", "dpa-0", 1, dpaTime, dpaSpec, expectedStatus)
dpa1, dpaTyped1 := newFakePodAutoscaler("default", "dpa-1", 1, dpa1Time, dpaSpec, expectedStatus)
dpa2, dpaTyped2 := newFakePodAutoscaler("default", "dpa-2", 1, dpa2Time, dpaSpec, expectedStatus)

f.Actions = nil
f.InformerObjects = append(f.InformerObjects, dpa, dpa1, dpa2)
f.Objects = append(f.Objects, dpaTyped, dpaTyped1, dpaTyped2)

// Check that DatadogPodAutoscaler object is inserted into heap
f.RunControllerSync(true, "default/dpa-1")
assert.Equal(t, 1, f.autoscalingHeap.MaxHeap.Len())
assert.Equal(t, "default/dpa-1", f.autoscalingHeap.MaxHeap.Peek().Key)
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-1"], "Expected dpa-1 to be in heap")

// Check that multiple objects can be inserted with ordering preserved
f.RunControllerSync(true, "default/dpa-2")
assert.Equal(t, 2, f.autoscalingHeap.MaxHeap.Len())
assert.Equal(t, "default/dpa-2", f.autoscalingHeap.MaxHeap.Peek().Key)
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-1"], "Expected dpa-1 to be in heap")
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-2"], "Expected dpa-2 to be in heap")

// Check that heap ordering is preserved and limit is not exceeded
f.RunControllerSync(true, "default/dpa-0")
assert.Equal(t, 2, f.autoscalingHeap.MaxHeap.Len())
assert.Equal(t, "default/dpa-1", f.autoscalingHeap.MaxHeap.Peek().Key)
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-0"], "Expected dpa-0 to be in heap")
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-1"], "Expected dpa-1 to be in heap")
assert.Falsef(t, f.autoscalingHeap.Keys["default/dpa-2"], "Expected dpa-2 to not be in heap")

// Check that when object (dpa1) is deleted, heap is updated accordingly
dpaInternal1.Deleted = true
f.store.Set("default/dpa-1", dpaInternal1.Build(), controllerID)
f.ExpectDeleteAction("default", "dpa-1")
f.RunControllerSync(true, "default/dpa-1")
assert.Len(t, f.store.GetAll(), 3)

f.InformerObjects = nil
f.Objects = nil
f.Actions = nil

f.RunControllerSync(true, "default/dpa-1")

// dpa-2 status currently has an error; it will get resolved in the next reconcile
errorStatus := datadoghq.DatadogPodAutoscalerStatus{
Conditions: []datadoghq.DatadogPodAutoscalerCondition{
{
Type: datadoghq.DatadogPodAutoscalerErrorCondition,
Status: corev1.ConditionTrue,
LastTransitionTime: metav1.NewTime(testTime),
Reason: "Autoscaler disabled as maximum number per cluster reached (100)",
},
{
Type: datadoghq.DatadogPodAutoscalerActiveCondition,
Status: corev1.ConditionTrue,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerHorizontalAbleToRecommendCondition,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerVerticalAbleToRecommendCondition,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerHorizontalScalingLimitedCondition,
Status: corev1.ConditionFalse,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerHorizontalAbleToScaleCondition,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
{
Type: datadoghq.DatadogPodAutoscalerVerticalAbleToApply,
Status: corev1.ConditionUnknown,
LastTransitionTime: metav1.NewTime(testTime),
},
},
}
dpa2, dpaTyped2 = newFakePodAutoscaler("default", "dpa-2", 0, dpa2Time, dpaSpec, errorStatus)
f.InformerObjects = append(f.InformerObjects, dpa2)
f.Objects = append(f.Objects, dpaTyped2)
f.RunControllerSync(true, "default/dpa-2")
assert.Len(t, f.store.GetAll(), 2)
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-0"], "Expected dpa-0 to be in heap")
assert.Falsef(t, f.autoscalingHeap.Keys["default/dpa-1"], "Expected dpa-1 to not be in heap")
assert.Truef(t, f.autoscalingHeap.Keys["default/dpa-2"], "Expected dpa-2 to be in heap")
}
2 changes: 1 addition & 1 deletion pkg/clusteragent/autoscaling/workload/provider.go
@@ -64,7 +64,7 @@ func StartWorkloadAutoscaling(
return nil, fmt.Errorf("Unable to start local telemetry for autoscaling: %w", err)
}

- limitHeap := autoscaling.NewHashHeap(maxDatadogPodAutoscalerObjects, autoscaling.NewStore[model.PodAutoscalerInternal]())
+ limitHeap := autoscaling.NewHashHeap(maxDatadogPodAutoscalerObjects, store)

controller, err := newController(clusterID, eventRecorder, apiCl.RESTMapper, apiCl.ScaleCl, apiCl.DynamicInformerCl, apiCl.DynamicInformerFactory, le.IsLeader, store, podWatcher, sender, limitHeap)
if err != nil {
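For context on why the real store is now passed to NewHashHeap instead of a fresh autoscaling.NewStore: per the commits "Register observer on store to update heap" and "Use store directly in hash heap to avoid passing obj to observer", the heap is kept in sync by an observer registered on the store and reads object details back from it, so it must be the same store the controller writes to. The toy sketch below illustrates that pattern only; none of these types or method signatures are the agent's.

package main

import (
	"fmt"
	"time"
)

// record is a toy stand-in for the stored PodAutoscalerInternal.
type record struct {
	creationTimestamp time.Time
}

// store is a toy key/value store that notifies observers with the key only.
type store struct {
	data      map[string]record
	observers []func(key string)
}

func (s *store) RegisterObserver(onSet func(key string)) {
	s.observers = append(s.observers, onSet)
}

func (s *store) Set(key string, r record) {
	s.data[key] = r
	for _, notify := range s.observers {
		notify(key) // observers read the details back from the same store
	}
}

func main() {
	s := &store{data: map[string]record{}}

	// The limit heap registers an observer and looks the object up in the
	// shared store instead of receiving it as an argument.
	s.RegisterObserver(func(key string) {
		fmt.Println("heap update for", key, "created at", s.data[key].creationTimestamp)
	})

	s.Set("default/dpa-0", record{creationTimestamp: time.Now()})
}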