Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

annotate all pods under our domain, not just those in the STS #120

Merged

Conversation

tekicode
Copy link
Contributor

@tekicode tekicode commented Sep 18, 2023

annotatePods() only affects pods which are running in RouterIngestor mode: Pods which declare a hashring file AND are also a member of that hashring. It works by iterating through every expected pod in the STS. This has a couple of problems:

  1. If the size of the STS is larger than the number of running pods (failed pods) there will be failures to update those pods. This is a nuisance error.
  2. If we're operating in a split setup (Router separately from Ingestor) the controller does not know to annotate the Router pods, it only annotates the discovered STS pods. This means the Routers are left to reload their CM on their fallback timeout, instead of immediately.
  3. Only StatefulSets are affected, if using Routers as Deployments instead of STS, the controller will not annotate them with updates.

This change will discover any pods with the config.StatefulSetLabel value default: controller.receive.thanos.io=thanos-receive-controller LABEL and add a controller.receive.thanos.io/lastControllerUpdate annotation to each pod when the --annotate-pods-on-change flag is used and a new CM is generated.

It is not necessary to specify the controller.receive.thanos.io/hashring label when using this feature on pods which are not part of the hashring.

@tekicode
Copy link
Contributor Author

The failures in the checks/Generate look unrelated jsonnet updates?

@tekicode
Copy link
Contributor Author

@matej-g Could you review? Thank you!

@saswatamcode
Copy link
Member

cc: @philipgough

@squat
Copy link
Member

squat commented Sep 22, 2023

@tekicode if we go down this road, then we must change the configuration flag name from --statefulset-label [0] to something different, e.g. --label because it is misleading to users to suggest that we only use this to select statefulset pods. In this case, I would suggest we continue to accept --statefulset-label flag and deprecate it in favor of a new one that takes preference over it. This will allow existing users of the thanos-receive-controller to upgrade gracefully.

Separately, I am realizing now that the description of the --annotate-pods flag is also misleading. It indicates that we will annotate the pod with the hash of the latest configuration, but that is not true: we annotate the pods with a timestamp.

[0] https://github.com/observatorium/thanos-receive-controller/blob/master/main.go#L81

squat added a commit that referenced this pull request Sep 22, 2023
This addresses the comment in:
#120 (comment).

Signed-off-by: Lucas Servén Marín <lserven@gmail.com>
@philipgough
Copy link
Collaborator

@squat aside from agreeing with all your points I actually think the --annotate-pods flag is risky. It uses the kube client to do an update but it doesn't handle failures very well. Take an example where you might have a large hashring, you go to update a Pod with the annotation and you get a conflict. It just moves on and now you can have different replicas with different understandings of how the hashring looks causing issues with replication.

@tekicode
Copy link
Contributor Author

@squat aside from agreeing with all your points I actually think the --annotate-pods flag is risky. It uses the kube client to do an update but it doesn't handle failures very well. Take an example where you might have a large hashring, you go to update a Pod with the annotation and you get a conflict. It just moves on and now you can have different replicas with different understandings of how the hashring looks causing issues with replication.

I'm not sure how to address this comment. Without using annotate-pods, kubelet will update the configmap mounted on each container at most once every minute. It's not particularly at the top of the minute, just one minute timeout. The reason for the annotate-pods is to speed that up. So, if it fails, the fallback would be the default behavior (which is not ideal, I'll admit)

I already see annotate-pods failures with the current code; so my changes here do not address the root behavior, only make it more useful.

@tekicode
Copy link
Contributor Author

@tekicode if we go down this road, then we must change the configuration flag name from --statefulset-label [0] to something different, e.g. --label because it is misleading to users to suggest that we only use this to select statefulset pods. In this case, I would suggest we continue to accept --statefulset-label flag and deprecate it in favor of a new one that takes preference over it. This will allow existing users of the thanos-receive-controller to upgrade gracefully.

Separately, I am realizing now that the description of the --annotate-pods flag is also misleading. It indicates that we will annotate the pod with the hash of the latest configuration, but that is not true: we annotate the pods with a timestamp.

[0] https://github.com/observatorium/thanos-receive-controller/blob/master/main.go#L81

Hi thanks for the review.

Slightly concerned that by changing the flag as you suggest, we would indicate to users that the controller is capable of allowing non statefulset pods into the hashring; which it is not and would not be recommended.

The intention of my change here is for those of us using a split setup, with Statefulsets as endpoint targets, and Deployments working as autoscaling routers. These deployments are never in the endpoints list.

Either way, it sounds like something that is more solved with documentation; would we rather docs say "Do not use --label to select Deployments for the hashring, as --label is only meant for statefulset selection" or "--annotate-pods will annotate ANY pod with this label, not just statefulset pods"

IMO, I believe users would be more confused by trying to use --label and believing it would add Deployments to the endpoints list.

Naming things is hard, it's more important we see the controller do the functionality than picking a name. Let me know how we can move forward!

@philipgough
Copy link
Collaborator

I'm not sure how to address this comment. Without using annotate-pods, kubelet will update the configmap mounted on each container at most once every minute. It's not particularly at the top of the minute, just one minute timeout. The reason for the annotate-pods is to speed that up. So, if it fails, the fallback would be the default behavior (which is not ideal, I'll admit)

I already see annotate-pods failures with the current code; so my changes here do not address the root behavior, only make it more useful.

You are not sure how to address it in what way? I'm aware of the reason for the flag to exist in the first place. I was merely commenting within a conversation where we were discussing issues with the existing approach.

From a technical perspective, it can likely be addressed in some way by using an event driven approach and re-enqueuing Pods on failure to update, but that would make the controller significantly more complicated too.

@squat
Copy link
Member

squat commented Sep 23, 2023

re-enqueuing Pods on failure to update

That would mean using controller-runtime tooling like the queue and caches etc. It would for sure make the controller more robust, turning it into a fully-fledged operator pretty much. Not a bad thing! It's just a trade-off: more work and complexity vs robustness and functionality.

@squat
Copy link
Member

squat commented Sep 23, 2023

Another possible approach would be to eschew pod annotation entirely and instead instruct users to install third party tools like https://github.com/wave-k8s/wave or https://github.com/stakater/Reloader of they want "instantaneous" config updates.

@tekicode
Copy link
Contributor Author

@squat @philipgough I unfortunately don't have the time to invest in making the controller more robust in the sync of the hashring file. Can we table that for another PR or issue?

@philipgough Addressing --statefulset-label vs --label I am going through right now to deprecate statefulset-label. What should the verbiage be for the use of the new flag?

The label Pods must have to be watched by the controller

The label StatefulSets must have to be watched by the controller

The label Workloads must have to be watched by the controller.

Each of these has a different implication for the end user in how the controller can function and I see issues with the language of all three.

Please let me know if this is an impasse. We've found use of this controller in our pre-production environments and it's Pretty Good(tm) just a few things here and there to make it better.

@squat
Copy link
Member

squat commented Sep 23, 2023

@tekicode i agree with doing the Bigger Refactor in a different PR :)
as for verbiage, I like The label workloads must have to be watched by the controller. (with a lowercase w).

@squat
Copy link
Member

squat commented Sep 23, 2023

Also, I fixed the generated files in #121 so that CI passes. Please rebase your PR on main :)

@tekicode
Copy link
Contributor Author

tekicode commented Oct 3, 2023

@tekicode i agree with doing the Bigger Refactor in a different PR :) as for verbiage, I like The label workloads must have to be watched by the controller. (with a lowercase w).

Thank you for the review @squat Addressed this feedback with 519d41c

Please let me know what you think!

@tekicode
Copy link
Contributor Author

tekicode commented Oct 3, 2023

Changes in 45f615c are due to a limit on statements allowed in main() according to the lint rules.

@tekicode
Copy link
Contributor Author

@squat @philipgough any chance either of you could review? ty!

@tekicode
Copy link
Contributor Author

@saswatamcode available to review? ty!

@philipgough
Copy link
Collaborator

Sorry, I will take a look at this one asap

annotations = make(map[string]string)
}
// Select pods that have a controllerLabel matching ours.
podList, err := c.klient.CoreV1().Pods(c.options.namespace).List(ctx,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this accurate? what if the controller is watching and acting on multiple StatefulSets? Won't it annotate Pods for each of those workloads even in the absence of a diff of their hashring which may be confusing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the controller differentiates when it performs hashring updates as to which pods it touches. The sync code previously updated all pods on a hashring change, regardless of which hashring.

main.go Outdated
})
if err != nil {
level.Error(c.logger).Log("msg", "failed to list pods belonging to controller", "err", err)
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this else block? assume we can just return if we fail to list Pods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed with be74ca3

main.go Outdated

annotations["annotationTimestamp"] = fmt.Sprintf("%d", time.Now().Unix())
pod.SetAnnotations(annotations)
annotations[annotationKey] = fmt.Sprintf("%d", time.Now().Unix())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of iterating over the Pods it might make more sense to calculate the timestamp in advance and set all Pods to the same value as when the hashring changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed with 0861153

main.go Outdated
}
}
}
} // for range podList
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove this commented code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0861153 Thank you so much for the review!

@tekicode
Copy link
Contributor Author

tekicode commented Nov 4, 2023

@philipgough Good to merge?

Copy link
Collaborator

@philipgough philipgough left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@philipgough philipgough merged commit c57219e into observatorium:main Nov 6, 2023
4 checks passed
@tekicode tekicode deleted the annotate-all-pods-on-change branch November 8, 2023 00:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants