docs: Add architecture document #359
Merged

# Cosmos Operator Architecture

This is a high-level overview of the architecture of the Cosmos Operator. It is intended to be a reference for developers.

## Overview

The operator was written with the [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) framework.

Kubebuilder simplifies and provides abstractions for creating a Kubernetes controller.

In a nutshell, an operator observes a [CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/). Its job is to match cluster state with the desired state in the CRD. It continually watches for changes and updates the cluster accordingly - a "control loop" pattern.

Each controller implements a Reconcile method:

```go
Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)
```
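
As a rough illustration, a Reconcile implementation typically fetches the custom resource and converges the cluster toward its spec. This is only a hedged sketch: the reconciler type, the `cosmosv1` import path, and the surrounding wiring are assumptions, not the operator's actual code.

```go
// Hypothetical sketch of a reconcile loop; names and import paths are assumed.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	cosmosv1 "github.com/strangelove-ventures/cosmos-operator/api/v1" // assumed path
)

type CosmosFullNodeReconciler struct {
	client.Client
}

func (r *CosmosFullNodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the CRD instance that triggered this reconcile.
	var crd cosmosv1.CosmosFullNode
	if err := r.Get(ctx, req.NamespacedName, &crd); err != nil {
		// The resource may have been deleted; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Compare the desired state in crd.Spec with cluster state and create,
	// update, or delete resources accordingly (omitted here).

	return ctrl.Result{}, nil
}
```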

Unlike "built-in" controllers like Deployments or StatefulSets, operator controllers are visible in the cluster - one pod backed by a Deployment under the cosmos-operator-system namespace.

A controller can watch resources outside of the CRD it manages. For example, CosmosFullNode watches for pod deletions, so it can spin up new pods if a user deletes one manually.

The watching of resources is set up in this method for each controller:

```go
SetupWithManager(ctx context.Context, mgr ctrl.Manager) error
```
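
A hedged sketch of that wiring, continuing the hypothetical reconciler above (`corev1` is `k8s.io/api/core/v1`); the operator's real setup may watch additional resources:

```go
// Hypothetical wiring sketch. Owning Pods means pod changes and deletions
// (including manual ones) enqueue a reconcile for the parent CosmosFullNode.
func (r *CosmosFullNodeReconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cosmosv1.CosmosFullNode{}). // reconcile on changes to the CRD itself
		Owns(&corev1.Pod{}).             // reconcile on changes to pods it owns
		Complete(r)
}
```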

Refer to kubebuilder docs for more info.

### Makefile

Kubebuilder generated much of the Makefile. It contains common tasks for developers.

### `api` directory

This directory contains the different CRDs.

You should run `make generate manifests` each time you change CRDs.

A CI job should fail if you forget to run this command after modifying the api structs.

### `config` directory

The config directory contains kustomize files generated by Kubebuilder. Strangelove uses these files to deploy the operator (instead of a helm chart). A helm chart is on the road map but presents challenges in keeping the kustomize and helm code in sync.

### `controllers` directory

The controllers directory contains every controller.

This directory is not unit tested. The code in controllers should act like `main()` functions, mostly wiring up dependencies from `internal`.

### `internal` directory

Almost all of the business logic lives in `internal`, which also houses the unit and integration tests.

# CosmosFullNode

This is the flagship CRD of the Cosmos Operator and contains the most complexity.

### Builder, Diff, and Control Pattern

Each resource has its own builder and controller (referred to as "control" in this context). For example, see `pvc_builder.go` and `pvc_control.go`, which manage only PVCs. All builders should have the file suffix `_builder.go` and all control objects `_control.go`.

The most complex builder is `pod_builder.go`. There may be opportunities to refactor it.

The "control" pattern was loosely inspired by Kubernetes source code.

Within the controller's `Reconcile(...)` method, the controller determines the order of operations of the separate Control objects.

On process start, each Control is initialized with a Diff and a Builder.

On each reconcile loop (a simplified sketch follows the list):

1. The Builder builds the desired resources from the CRD.
2. Control fetches a list of existing resources.
3. Control uses Diff to compute a diff of the existing resources against the desired ones.
4. Control makes changes based on what Diff reports.
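
A minimal, self-contained sketch of the diff computation (steps 3 and 4), using a simplified stand-in `resource` type rather than real Kubernetes objects; the operator's actual Diff works on `client.Object` values with revisions, as described below:

```go
// Simplified stand-in types; the real implementation diffs client.Object values.
package main

import "fmt"

type resource struct {
	Name     string
	Revision string
}

type diff struct {
	Creates, Updates, Deletes []resource
}

// computeDiff categorizes desired vs. existing resources into creates,
// updates (revision mismatch), and deletes.
func computeDiff(existing, desired []resource) diff {
	var d diff
	current := make(map[string]resource, len(existing))
	for _, have := range existing {
		current[have.Name] = have
	}
	wanted := make(map[string]bool, len(desired))
	for _, want := range desired {
		wanted[want.Name] = true
		have, ok := current[want.Name]
		switch {
		case !ok:
			d.Creates = append(d.Creates, want) // missing from the cluster
		case have.Revision != want.Revision:
			d.Updates = append(d.Updates, want) // exists but is stale
		}
	}
	for _, have := range existing {
		if !wanted[have.Name] {
			d.Deletes = append(d.Deletes, have) // no longer desired
		}
	}
	return d
}

func main() {
	existing := []resource{{Name: "pod-0", Revision: "a1"}, {Name: "pod-2", Revision: "old"}}
	desired := []resource{{Name: "pod-0", Revision: "a1"}, {Name: "pod-1", Revision: "b2"}, {Name: "pod-2", Revision: "new"}}
	// Creates pod-1, updates pod-2, leaves pod-0 alone.
	fmt.Printf("%+v\n", computeDiff(existing, desired))
}
```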

The Control tests are *integration tests* where we mock out the Kubernetes API, but not the Builder or Diff. The tests run quickly (like unit tests) because we do not make any network calls.

The Diff object (`type Diff[T client.Object] struct`) took several iterations to get right. There is probably little need to tweak it further.

The hardest problem with diffing is determining updates. Essentially, Diff looks for a `Revision() string` method on the resource and sets a revision annotation. The revision is a simple fnv hash. Diff compares `Revision` to the existing annotation; if they differ, we know it's an update. We cannot compare existing resources for equality directly because Kubernetes adds additional annotations and fields.

Builders return a `diff.Resource[T]` which Diff can use. Therefore, Control does not need to adapt resources.

The fnv hash is computed from a resource's JSON representation, which has proven to be stable.
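
A hedged sketch of how such a revision could be computed and compared; the function names and annotation key here are hypothetical, not the operator's actual identifiers:

```go
// Hypothetical revision helpers; the annotation key and names are illustrative.
package revision

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

const revisionAnnotation = "example.com/revision" // hypothetical annotation key

// compute returns a short fnv hash of the object's JSON representation.
func compute(obj any) (string, error) {
	data, err := json.Marshal(obj)
	if err != nil {
		return "", err
	}
	h := fnv.New32()
	_, _ = h.Write(data)
	return fmt.Sprintf("%x", h.Sum32()), nil
}

// needsUpdate reports whether the existing resource's recorded revision
// differs from the desired revision.
func needsUpdate(existingAnnotations map[string]string, desiredRevision string) bool {
	return existingAnnotations[revisionAnnotation] != desiredRevision
}
```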

### Special Note on Updating Status

There are several controllers that update a CosmosFullNode's [status subresource](https://book-v1.book.kubebuilder.io/basics/status_subresource):

* CosmosFullNode
* ScheduledVolumeSnapshot
* SelfHealing

Each update to the status subresource triggers another reconcile loop. We found that multiple controllers updating status caused race conditions: updates were not applied or were applied incorrectly. Some controllers read the status to take action, so it's important to preserve the integrity of the status.

Therefore, you must use the special `SyncUpdate(...)` method from `fullnode.StatusClient`. It ensures updates are performed serially per CosmosFullNode.
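
The implementation details of `SyncUpdate` aren't covered here; the sketch below only illustrates the general idea of serializing updates per object with a per-key lock (all names are assumptions, not the operator's code):

```go
// Illustrative only: serialize status updates per NamespacedName with a mutex per key.
package status

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

type serialUpdater struct {
	mu    sync.Mutex
	locks map[types.NamespacedName]*sync.Mutex
}

func newSerialUpdater() *serialUpdater {
	return &serialUpdater{locks: make(map[types.NamespacedName]*sync.Mutex)}
}

// SyncUpdate runs fn while holding the lock for the given object, so concurrent
// callers updating the same CosmosFullNode are applied one at a time.
func (s *serialUpdater) SyncUpdate(key types.NamespacedName, fn func()) {
	s.mu.Lock()
	lock, ok := s.locks[key]
	if !ok {
		lock = &sync.Mutex{}
		s.locks[key] = lock
	}
	s.mu.Unlock()

	lock.Lock()
	defer lock.Unlock()
	fn()
}
```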

### Sentries

Sentries are special because you should not include a readiness probe, due to the way Tendermint/Comet remote signing works.

The remote signer reaches out to the sentry on the privval port. This is the inverse of what you'd expect (the sentry reaching out to the remote signer).

If the sentry does not detect a remote signer connection, it crashes. The stable way to connect to a pod is through a Kube Service, so we have a chicken-and-egg problem: the sentry must be "ready" to be added to the Service, but the remote signer must connect to the sentry through the Service so it doesn't crash.

Therefore, the CosmosFullNode controller inspects Tendermint/Comet as part of its rolling update strategy - not just pod readiness state.

### CacheController

The CacheController is special in that it does not manage a CRD.

It periodically polls every pod for its Tendermint/Comet status, such as block height. The polling is done in the background. It's a controller because it needs the reconcile loop to update which pods it needs to poll.

The CacheController prevents slow reconcile loops; previously, we queried this status on every reconcile loop.

When other controllers want Comet status, they always hit the cache controller.
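
A hedged sketch of the kind of cache this implies: a background poller writes into a mutex-guarded map keyed by pod, and other controllers read from it instead of querying Comet directly (types and names are assumptions):

```go
// Illustrative in-memory status cache; not the operator's actual types.
package cometcache

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// Status is a hypothetical subset of the Comet status we care about.
type Status struct {
	Height     uint64
	CatchingUp bool
}

type Cache struct {
	mu       sync.RWMutex
	statuses map[types.NamespacedName]Status
}

func New() *Cache {
	return &Cache{statuses: make(map[types.NamespacedName]Status)}
}

// Set is called by the background poller after querying a pod.
func (c *Cache) Set(pod types.NamespacedName, s Status) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.statuses[pod] = s
}

// Get is called by other controllers during reconcile; it never blocks on the network.
func (c *Cache) Get(pod types.NamespacedName) (Status, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.statuses[pod]
	return s, ok
}
```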

# Scheduled Volume Snapshot

Scheduled Volume Snapshot takes periodic backups.

To preserve data integrity, it temporarily deletes a pod so it can capture a PVC snapshot without any process writing to it.

It uses a finite state machine pattern in the main reconcile loop.
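
As an illustration of the finite state machine idea (the phase names and progression below are assumptions, not the CRD's actual values), the reconcile loop can switch on a phase recorded in status and advance one step at a time:

```go
// Hypothetical phases; the real CRD's status fields and names may differ.
package snapshot

const (
	phaseWaitingForNext   = "WaitingForNext"
	phaseDeletingPod      = "DeletingPod"
	phaseCreatingSnapshot = "CreatingSnapshot"
	phaseRestoringPod     = "RestoringPod"
)

// nextPhase advances the state machine by one step, assuming the work for the
// current phase has completed.
func nextPhase(current string) string {
	switch current {
	case phaseWaitingForNext:
		// Time for a snapshot: remove the pod so nothing writes to the PVC.
		return phaseDeletingPod
	case phaseDeletingPod:
		// Pod is gone; take the VolumeSnapshot of the now-quiescent PVC.
		return phaseCreatingSnapshot
	case phaseCreatingSnapshot:
		// Snapshot is ready; recreate the pod.
		return phaseRestoringPod
	default:
		// Pod restored; wait for the next scheduled snapshot.
		return phaseWaitingForNext
	}
}
```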

# StatefulJob

StatefulJob runs a job on an interval (crontab is not supported yet). Its purpose is to run a job that attaches to a PVC created from a VolumeSnapshot.

It's the least developed of the CRDs.

Review comments:

> This test wasn't doing anything and, frankly, an e2e test with cosmos nodes is going to be really tough.

> Better to have a staging/dev CosmosFullNode that's continuously delivered off `main`. And we monitor it for any strangeness.