[IBCDPE-938] Deploy Signoz (OTEL Visualization) to kubernetes cluster #35

Merged 86 commits on Nov 21, 2024

Commits
0f93061
Deploy signoz
BryanFauble Aug 29, 2024
73bb182
Deploy signoz
BryanFauble Aug 29, 2024
4e3c472
Update signoz readme
BryanFauble Sep 11, 2024
6b457dd
Disable a handful of items not needed
BryanFauble Sep 12, 2024
5b694bb
Try out kong-ingress
BryanFauble Sep 30, 2024
2f0636a
Correct chart name
BryanFauble Sep 30, 2024
a002873
Update values
BryanFauble Sep 30, 2024
b3768c5
Deploy oauth2 plugin
BryanFauble Sep 30, 2024
d525af3
Correct issuer
BryanFauble Sep 30, 2024
81a3adc
Correct indent
BryanFauble Sep 30, 2024
8c89bc6
Correct revision target
BryanFauble Sep 30, 2024
23ef64e
Deploy out cert-manager
BryanFauble Sep 30, 2024
ef8bd3f
Add oauth2 plugin
BryanFauble Sep 30, 2024
502276b
update plugin
BryanFauble Sep 30, 2024
7a864e7
set openid-connect
BryanFauble Sep 30, 2024
c0e2a4c
Try out envoy gateway
BryanFauble Sep 30, 2024
6bf8506
Disable service monitor
BryanFauble Sep 30, 2024
8b8be2e
Set argocd docker registry oci
BryanFauble Sep 30, 2024
4bf72d9
Point to local argo-cd
BryanFauble Sep 30, 2024
4983fb2
Deploy dex idp
BryanFauble Sep 30, 2024
81f24ef
Correct envoy-gateway chart repo
BryanFauble Sep 30, 2024
e4f3fbb
Set google connector
BryanFauble Sep 30, 2024
75a3063
Set storage
BryanFauble Sep 30, 2024
875010b
Set issuer
BryanFauble Sep 30, 2024
41e9147
Deploy DB for dex
BryanFauble Sep 30, 2024
8814411
Deploy DB operator
BryanFauble Sep 30, 2024
0eaafed
Set dex to use postgres
BryanFauble Sep 30, 2024
6c82806
Disable ssl
BryanFauble Sep 30, 2024
74c83f3
ssl
BryanFauble Sep 30, 2024
d4fae1f
Enable cert-manager gateway api support
BryanFauble Oct 1, 2024
1dfa08a
Deploy out ingress
BryanFauble Oct 1, 2024
8dfedba
Try out on-demand node lifecycle
BryanFauble Oct 1, 2024
be902b3
Correct path
BryanFauble Oct 1, 2024
ac6a1e3
Include gateway class
BryanFauble Oct 1, 2024
bf53696
Add some notes
BryanFauble Oct 1, 2024
19190c1
Set scaling back
BryanFauble Oct 1, 2024
fd8fe4f
Run 1 replica but on-demand
BryanFauble Oct 1, 2024
2f6bae7
Remove todo comment
BryanFauble Oct 1, 2024
5ed270d
Point to correct revision
BryanFauble Oct 1, 2024
a61bdd0
Correct comment
BryanFauble Oct 1, 2024
bec8d9d
Add to readme
BryanFauble Oct 1, 2024
5b0aa64
Leave at 2 az deployment
BryanFauble Oct 1, 2024
aefa2e1
Update modules/cluster-ingress/README.md
BryanFauble Oct 1, 2024
58b2de4
Update readme
BryanFauble Oct 2, 2024
e26e7e4
Set param
BryanFauble Oct 4, 2024
a501f32
no multiple sources
BryanFauble Oct 4, 2024
f284709
Set
BryanFauble Oct 4, 2024
5aa954b
Note that the admin password is randomized
BryanFauble Oct 4, 2024
eebfca1
Update modules/signoz/README.md
BryanFauble Oct 4, 2024
835de37
Enable replication for schema migrator
BryanFauble Oct 8, 2024
e8f989f
Set back to single replica for DB init
BryanFauble Oct 8, 2024
5ac8424
Bump replica back to 2
BryanFauble Oct 8, 2024
5947065
Envoy Gateway Minimum TLS (#36)
BryanFauble Oct 15, 2024
e457ef1
Merge branch 'main' into signoz-testing
BryanFauble Oct 17, 2024
139dd6a
Shrink VPC size and create subnets specifically for worker nodes that…
BryanFauble Oct 17, 2024
f3f7647
Add back var
BryanFauble Oct 17, 2024
1dab275
Correct cidr block
BryanFauble Oct 17, 2024
63a54ad
Update cidr blocks
BryanFauble Oct 17, 2024
d4c79d7
Correct node lengths
BryanFauble Oct 18, 2024
204b2ff
Correct array slicing
BryanFauble Oct 18, 2024
34e27cb
Correct indexing
BryanFauble Oct 18, 2024
373b800
Update default eks cluster version
BryanFauble Oct 18, 2024
1b6170e
Shrink EKS control plane subnet range
BryanFauble Oct 21, 2024
3482837
Set range back
BryanFauble Oct 21, 2024
5c5654f
[IBCDPE-1095] Setup TLS/Auth0 for cluster ingress with telemetry dat…
BryanFauble Nov 5, 2024
f314bde
Remove py file
BryanFauble Nov 5, 2024
d4bf895
Update readme note
BryanFauble Nov 5, 2024
ae8eacb
Remove comments about moving to provider
BryanFauble Nov 5, 2024
fc53860
Upgrade helmchart for signoz (#46)
BryanFauble Nov 6, 2024
a295e48
Deploy SES module only when emails are provided
BryanFauble Nov 6, 2024
1db7b42
Correct output conditional
BryanFauble Nov 6, 2024
7f652ec
Create moved blocks for resources
BryanFauble Nov 6, 2024
24bc617
Correct moved blocks (Bad AI)
BryanFauble Nov 6, 2024
b8509df
Remove moved blocks as they're not needed
BryanFauble Nov 6, 2024
fe1e37b
Conditionally deploy auth0 spacelift stack
BryanFauble Nov 6, 2024
4a4da49
Don't autodeploy admin stack and do deploy auth0 for dev
BryanFauble Nov 6, 2024
5774102
Point to specific resource instance
BryanFauble Nov 6, 2024
77f2c22
Move conditional check to for_each loop
BryanFauble Nov 6, 2024
767ac82
Try list instead of map
BryanFauble Nov 6, 2024
908201b
Try `tomap` conversion
BryanFauble Nov 6, 2024
afb6f6e
Try handling dependency with depends_on
BryanFauble Nov 6, 2024
2cfcaee
Add if check within for_each loop
BryanFauble Nov 6, 2024
436908f
Remove unused moved blocks
BryanFauble Nov 6, 2024
501b1d3
[IBCDPE-1095] Use scope based authorization on telemetry upload route…
BryanFauble Nov 19, 2024
74f33bf
[SCHEMATIC-138] SigNoz cold storage and backups (#47)
BryanFauble Nov 21, 2024
dbe7f70
Correction to namespace of where ingress resources are deployed
BryanFauble Nov 21, 2024
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,4 +1,5 @@
*.tfstate*
.terraform
terraform.tfvars
settings.json
settings.json
temporary_files*
20 changes: 2 additions & 18 deletions deployments/main.tf
@@ -37,15 +37,7 @@ module "dpe-sandbox-spacelift-development" {
cluster_name = "dpe-k8-sandbox"
vpc_name = "dpe-sandbox"

vpc_cidr_block = "10.51.0.0/16"
# public_subnet_cidrs = ["10.51.1.0/24", "10.51.2.0/24", "10.51.3.0/24"]
# private_subnet_cidrs = ["10.51.4.0/24", "10.51.5.0/24", "10.51.6.0/24"]
# azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
# For now, we are only using one public and one private subnet. This is due to how
# EBS can only be mounted to a single AZ. We will need to revisit this if we want to
# allow usage of EFS ($$$$), or add some kind of EBS volume replication.
# Note: EKS requires at least two subnets in different AZs. However, we are only using
# a single subnet for node deployment.
vpc_cidr_block = "10.51.0.0/16"
public_subnet_cidrs = ["10.51.1.0/24", "10.51.2.0/24"]
private_subnet_cidrs = ["10.51.4.0/24", "10.51.5.0/24"]
azs = ["us-east-1a", "us-east-1b"]
Contributor Author:
I am leaving this as a 2-AZ deployment (since it cannot be changed after the k8s cluster is created). I did set the single-AZ option in deployments/stacks/dpe-k8s-deployments/main.tf to false (see the `single_az = false` change below) so that the Spot.io cluster autoscaler can deploy EC2 instances into either private subnet.

Contributor:
Is this an issue for Airflow?

Contributor Author:
No. This should work in conjunction with this change, which limits the deployment to nodes in a specific zone while still allowing nodes to spin up in either zone:

nodeSelector: {
  failure-domain.beta.kubernetes.io/zone: us-east-1a
}

Contributor Author:
I also created https://sagebionetworks.jira.com/browse/IBCDPE-1097 as a follow-up here: Move worker node subnet off EKS cluster subnets

From these docs: https://aws.github.io/aws-eks-best-practices/networking/subnets/#vpc-configurations

Kubernetes worker nodes can run in the cluster subnets, but it is not recommended. During cluster upgrades Amazon EKS provisions additional ENIs in the cluster subnets. When your cluster scales out, worker nodes and pods may consume the available IPs in the cluster subnet. Hence in order to make sure there are enough available IPs you might want to consider using dedicated cluster subnets with /28 netmask.
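As a rough sketch of what that follow-up could look like (all CIDR values, variable names, and resource names below are illustrative assumptions, not part of this PR):

# Hypothetical example: small /28 subnets reserved for the EKS control-plane ENIs,
# kept separate from the worker-node subnets.
resource "aws_subnet" "eks_control_plane" {
  for_each = {
    "us-east-1a" = "10.51.255.0/28"
    "us-east-1b" = "10.51.255.16/28"
  }

  vpc_id            = var.vpc_id
  availability_zone = each.key
  cidr_block        = each.value

  tags = {
    Name = "eks-control-plane-${each.key}"
  }
}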

@@ -75,15 +67,7 @@ module "dpe-sandbox-spacelift-production" {
cluster_name = "dpe-k8"
vpc_name = "dpe-k8"

vpc_cidr_block = "10.52.0.0/16"
# public_subnet_cidrs = ["10.52.1.0/24", "10.52.2.0/24", "10.52.3.0/24"]
# private_subnet_cidrs = ["10.52.4.0/24", "10.52.5.0/24", "10.52.6.0/24"]
# azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
# For now, we are only using one public and one private subnet. This is due to how
# EBS can only be mounted to a single AZ. We will need to revisit this if we want to
# allow usage of EFS ($$$$), or add some kind of EBS volume replication.
# Note: EKS requires at least two subnets in different AZs. However, we are only using
# a single subnet for node deployment.
vpc_cidr_block = "10.52.0.0/16"
public_subnet_cidrs = ["10.52.1.0/24", "10.52.2.0/24"]
private_subnet_cidrs = ["10.52.4.0/24", "10.52.5.0/24"]
azs = ["us-east-1a", "us-east-1b"]
70 changes: 65 additions & 5 deletions deployments/stacks/dpe-k8s-deployments/main.tf
@@ -6,7 +6,7 @@ module "sage-aws-eks-autoscaler" {
vpc_id = var.vpc_id
node_security_group_id = var.node_security_group_id
spotinst_account = var.spotinst_account
single_az = true
single_az = false
desired_capacity = 3
}

@@ -21,8 +21,9 @@

module "argo-cd" {
depends_on = [module.sage-aws-eks-autoscaler]
source = "spacelift.io/sagebionetworks/argo-cd/aws"
version = "0.3.1"
# source = "spacelift.io/sagebionetworks/argo-cd/aws"
# version = "0.3.1"
source = "../../../modules/argo-cd"
}
Contributor Author:
I am liking the usage of this relative source import, especially for testing.


module "victoria-metrics" {
@@ -66,10 +67,69 @@ module "postgres-cloud-native-database" {
depends_on = [module.postgres-cloud-native-operator, module.airflow, module.argo-cd]
source = "spacelift.io/sagebionetworks/postgres-cloud-native-database/aws"
version = "0.5.0"
auto_deploy = true
auto_prune = true
auto_deploy = var.auto_deploy
auto_prune = var.auto_prune
git_revision = var.git_revision
namespace = "airflow"
argo_deployment_name = "airflow-postgres-cloud-native"
}


module "signoz" {
depends_on = [module.argo-cd]
# source = "spacelift.io/sagebionetworks/postgres-cloud-native-database/aws"
# version = "0.5.0"
source = "../../../modules/signoz"
auto_deploy = var.auto_deploy
auto_prune = var.auto_prune
git_revision = var.git_revision
namespace = "signoz"
argo_deployment_name = "signoz"
}

module "envoy-gateway" {
# TODO: This is temporary until we are ready to deploy the ingress controller: https://sagebionetworks.jira.com/browse/IBCDPE-1095
count = 0
depends_on = [module.argo-cd]
# source = "spacelift.io/sagebionetworks/postgres-cloud-native-database/aws"
# version = "0.5.0"
source = "../../../modules/envoy-gateway"
auto_deploy = var.auto_deploy
auto_prune = var.auto_prune
git_revision = var.git_revision
namespace = "envoy-gateway"
argo_deployment_name = "envoy-gateway"
}

module "cert-manager" {
# TODO: This is temporary until we are ready to deploy the ingress controller: https://sagebionetworks.jira.com/browse/IBCDPE-1095
count = 0
depends_on = [module.argo-cd]
# source = "spacelift.io/sagebionetworks/postgres-cloud-native-database/aws"
# version = "0.5.0"
source = "../../../modules/cert-manager"
auto_deploy = var.auto_deploy
auto_prune = var.auto_prune
git_revision = var.git_revision
namespace = "cert-manager"
argo_deployment_name = "cert-manager"
}

module "cluster-ingress" {
# TODO: This is temporary until we are ready to deploy the ingress controller: https://sagebionetworks.jira.com/browse/IBCDPE-1095
count = 0
depends_on = [module.argo-cd]
# source = "spacelift.io/sagebionetworks/postgres-cloud-native-database/aws"
# version = "0.5.0"
source = "../../../modules/cluster-ingress"
auto_deploy = var.auto_deploy
auto_prune = var.auto_prune
git_revision = var.git_revision
namespace = "envoy-gateway"
argo_deployment_name = "cluster-ingress"

# TODO: Determine a more elegant way to fill in these values, for example if we
# have a pre-defined DNS name for the cluster (https://sagebionetworks.jira.com/browse/IT-3931)
ssl_hostname = "unknown-to-fill-in"
cluster_issuer_name = "selfsigned"
}
26 changes: 18 additions & 8 deletions modules/apache-airflow/templates/values.yaml
@@ -108,7 +108,9 @@ images:
pullPolicy: IfNotPresent

# Select certain nodes for airflow pods.
nodeSelector: {}
nodeSelector: {
failure-domain.beta.kubernetes.io/zone: us-east-1a
}
Contributor Author:
Note the changes in this values.yaml file updating replicas from 2 to 1. Running these services on on-demand instances will get us the stability we wanted; running these workloads on spot instances was not the right decision.

affinity: {}
tolerations: []
topologySpreadConstraints: []
@@ -467,7 +469,7 @@ kerberos:
# Airflow Worker Config
workers:
# Number of airflow celery workers in StatefulSet
replicas: 2
replicas: 1
# Max number of old replicasets to retain
revisionHistoryLimit: ~

@@ -636,7 +638,9 @@ workers:
extraVolumeMounts: []

# Select certain nodes for airflow worker pods.
nodeSelector: {}
nodeSelector: {
spotinst.io/node-lifecycle: "od"
}
Contributor Author:
This node selector defines how the Spot.io cluster autoscaler requests EC2 instances for the cluster. It makes sure that an on-demand instance is created and that these pods run on it.

runtimeClassName: ~
priorityClassName: ~
affinity:
@@ -721,7 +725,7 @@ scheduler:
command: ~
# Airflow 2.0 allows users to run multiple schedulers,
# However this feature is only recommended for MySQL 8+ and Postgres
replicas: 2
replicas: 1
# Max number of old replicasets to retain
revisionHistoryLimit: ~

@@ -806,7 +810,9 @@ scheduler:
extraVolumeMounts: []

# Select certain nodes for airflow scheduler pods.
nodeSelector: {}
nodeSelector: {
spotinst.io/node-lifecycle: "od"
Contributor:
Maybe later we can think about other spot setup configurations that would make Airflow more resilient, so we actually gain the benefit of the spot instances.

Contributor Author:
For Airflow I think we can take advantage of this if we move over to Kubernetes workers instead of Celery workers. The reason is that the individual tasks can run on spot instances, but the components that kick off those tasks don't handle being interrupted on spot instances well.
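A rough sketch of what that future switch could look like in the chart values (the executor name comes from the upstream Airflow chart; the spot node-lifecycle label value is an assumption, and none of this is part of this PR):

# Hypothetical future values.yaml change, not included in this PR.
executor: "KubernetesExecutor"

workers:
  # Individual task pods could then run on spot capacity while the scheduler,
  # webserver, and triggerer stay on on-demand nodes.
  nodeSelector:
    spotinst.io/node-lifecycle: "spot"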

}
affinity:
# default scheduler affinity is:
podAntiAffinity:
@@ -1248,7 +1254,7 @@ webserver:
triggerer:
enabled: true
# Number of airflow triggerers in the deployment
replicas: 2
replicas: 1
# Max number of old replicasets to retain
revisionHistoryLimit: ~

@@ -1348,7 +1354,9 @@ triggerer:
extraVolumeMounts: []

# Select certain nodes for airflow triggerer pods.
nodeSelector: {}
nodeSelector: {
spotinst.io/node-lifecycle: "od"
}
affinity:
# default triggerer affinity is:
podAntiAffinity:
@@ -1942,7 +1950,9 @@ redis:
safeToEvict: true

# Select certain nodes for redis pods.
nodeSelector: {}
nodeSelector: {
spotinst.io/node-lifecycle: "od"
}
affinity: {}
tolerations: []
topologySpreadConstraints: []
9 changes: 8 additions & 1 deletion modules/argo-cd/templates/values.yaml
@@ -501,7 +501,14 @@ configs:
# -- Repositories list to be used by applications
## Creates a secret for each key/value specified below to create repositories
## Note: the last example in the list would use a repository credential template, configured under "configs.credentialTemplates".
repositories: {}
repositories:
  docker-registry:
    url: registry-1.docker.io
    # username: "docker"
    # password: ""
    name: docker-registry
    enableOCI: "true"
    type: "helm"
Comment on lines -504 to +511
Contributor Author:
When I was testing some helm charts that use the oci:// prefix, this was needed.
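For example (illustrative only, not something in this PR), an Argo CD Application source could then pull a Docker Hub chart over OCI without the oci:// prefix, resolved through the docker-registry credential above:

sources:
  - repoURL: registry-1.docker.io/bitnamicharts
    chart: postgresql        # chart name and version are placeholder examples
    targetRevision: 16.0.0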

# istio-helm-repo:
# url: https://storage.googleapis.com/istio-prerelease/daily-build/master-latest-daily/charts
# name: istio.io
13 changes: 13 additions & 0 deletions modules/cert-manager/README.md
@@ -0,0 +1,13 @@
# Purpose
This module is used to deploy the cert-manager helm chart. cert-manager is responsible
for creating SSL certs to use within the cluster.

Resources:

- <https://cert-manager.io/docs/installation/helm/>

## Relation to envoy-gateway
The envoy-gateway is responsible for handling ingress for the kubernetes cluster.
cert-manager has an integration that watches for changes to `kind: Gateway` resources to
determine when to provision SSL certs. This integration is configured in the `values.yaml` file
of this directory under `kind: ControllerConfiguration`.
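For reference, a minimal sketch of how that integration is typically enabled through the chart values (the flag name follows the upstream cert-manager docs; this module's actual `values.yaml` may set additional options):

# Sketch only: enable cert-manager's Gateway API support via the controller
# configuration exposed by the Helm chart.
config:
  apiVersion: controller.config.cert-manager.io/v1alpha1
  kind: ControllerConfiguration
  enableGatewayAPI: true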
38 changes: 38 additions & 0 deletions modules/cert-manager/main.tf
@@ -0,0 +1,38 @@
resource "kubernetes_namespace" "cert-manager" {
metadata {
name = var.namespace
}
}

resource "kubectl_manifest" "cert-manager" {
depends_on = [kubernetes_namespace.cert-manager]

yaml_body = <<YAML
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: cert-manager
namespace: argocd
spec:
project: default
%{if var.auto_deploy}
syncPolicy:
automated:
prune: ${var.auto_prune}
%{endif}
sources:
- repoURL: 'https://charts.jetstack.io'
chart: cert-manager
targetRevision: v1.15.1
helm:
releaseName: cert-manager
valueFiles:
- $values/modules/cert-manager/templates/values.yaml
- repoURL: 'https://github.com/Sage-Bionetworks-Workflows/eks-stack.git'
targetRevision: ${var.git_revision}
ref: values
destination:
server: 'https://kubernetes.default.svc'
namespace: ${var.namespace}
YAML
}
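The manifest above references several module inputs. A minimal sketch of the corresponding variables, where only the names are taken from main.tf and the types, defaults, and descriptions are assumptions:

# Hypothetical variables.tf for this module; only the variable names mirror main.tf above.
variable "namespace" {
  description = "Namespace the cert-manager Application is deployed into"
  type        = string
  default     = "cert-manager"
}

variable "auto_deploy" {
  description = "Enable Argo CD automated sync for this Application"
  type        = bool
  default     = false
}

variable "auto_prune" {
  description = "Prune resources removed from git when automated sync is enabled"
  type        = bool
  default     = false
}

variable "git_revision" {
  description = "Git revision of eks-stack from which the Helm values file is read"
  type        = string
  default     = "main"
}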