
Commit

feat: known issues (#95)
Co-authored-by: Cas Lubbers <clubbers@akamai.com>
srodenhuis and CasLubbers authored Dec 17, 2024
1 parent 974e730 commit be73f3d
Showing 8 changed files with 132 additions and 95 deletions.
33 changes: 22 additions & 11 deletions .cspell.json
Original file line number Diff line number Diff line change
@@ -1,10 +1,19 @@
{
"version": "0.1",
"allowCompoundWords": true,
"enabledLanguageIds": ["json", "jsonc", "markdown", "yaml", "yml"],
"ignoreRegExpList": ["/'s\\b/"],
"enabledLanguageIds": [
"json",
"jsonc",
"markdown",
"yaml",
"yml"
],
"ignoreRegExpList": [
"/'s\\b/"
],
"ignoreWords": [
"ABDEFHIJZ",
"AGE-SECRET-KEY-1KTYK6RVLN5TAPE7VF6FQQSKZ9HWWCDSKUGXXNUQDWZ7XXT5YK5LSF3UTKQ",
"FPpLvZyAdAmuzc3N",
"aspinu",
"auths",
"Fzcs",
@@ -107,11 +116,13 @@
],
"language": "en",
"words": [
"ABDEFHIJZ",
"activity",
"argoproj",
"authz",
"autocd",
"bitnami",
"certutil",
"chartrepo",
"chmod",
"ciso",
@@ -123,8 +134,10 @@
"descheduler",
"devs",
"dockerconfigjson",
"falco",
"gcloud",
"gitea",
"Glasnostic",
"gogs",
"goharbor",
"hashicorp",
@@ -134,28 +147,31 @@
"initalize",
"istio",
"Istio",
"OTEL",
"jeager",
"jwks",
"kiali",
"knative",
"konstraint",
"kube",
"kubeapps",
"Kubeclarity",
"kubeconfig",
"kubectl",
"kubei",
"kubernetes",
"kubeval",
"Kustomize",
"ldap",
"msteams",
"mtls",
"nslookup",
"oaut2",
"oidc",
"onboarded",
"onprem",
"openid",
"orgs",
"OTEL",
"otomi",
"otomise",
"owasp",
@@ -174,13 +190,8 @@
"unencrypted",
"unparameterized",
"untrusted",
"urandom",
"velero",
"xlarge",
"kubei",
"falco",
"certutil",
"oaut2",
"Kubeclarity",
"Glasnostic"
"xlarge"
]
}
8 changes: 6 additions & 2 deletions docs/for-ops/how-to/change-admin-password.md
@@ -59,9 +59,13 @@ export ENV_DIR=~/workspace/values-folder
- Retrieve the SOPS_AGE_KEY from secret:
```
```bash
kubectl get secret otomi-sops-secrets -n otomi-pipelines -o jsonpath='{.data.SOPS_AGE_KEY}' | base64 -d
# Example output: AGE-SECRET-KEY-1KTYK6RVLN5TAPE7VF6FQQSKZ9HWWCDSKUGXXNUQDWZ7XXT5YK5LSF3UTKQ
```
Example output:
```bash
AGE-SECRET-KEY-1KTYK6RVLN5TAPE7VF6FQQSKZ9HWWCDSKUGXXNUQDWZ7XXT5YK5LSF3UTKQ
```
- Create the `.secrets` file in the root of the values directory with the SOPS_AGE_KEY secret. The file contents should look like this:
4 changes: 2 additions & 2 deletions docs/for-ops/sre/daily.md
@@ -4,7 +4,7 @@ title: SRE Daily Routine
sidebar_label: Daily Routine
---

As an SRE you would like to keep your daily tasks to a minimum and be automatically informed on issues. APL offers the following tooling to automate this:
As an SRE you would like to keep your daily tasks to a minimum and be automatically informed on issues. App Platform offers the following tools to automate this:

- Prometheus is the main monitoring tool, and notifications will be triggered for issues that need attention

@@ -14,7 +14,7 @@ As an SRE you would like to keep your daily tasks to a minimum and be automatica

- Prometheus BlackBox exporter is a service probing tool used by Prometheus to periodically probe services over HTTP, TCP, UDP, and ICMP. When it receives non-valid responses it will trigger an alert

APL makes use of Slack (but MS Teams and email can also be configured) as the main notifications channel. Subscribe to the configured channels.
App Platform makes use of Slack (but MS Teams and email can also be configured) as the main notifications channel. Subscribe to the configured channels.

### Steps to perform

67 changes: 67 additions & 0 deletions docs/for-ops/sre/known-issues.md
@@ -0,0 +1,67 @@
---
slug: known-issues
title: Known Issues
sidebar_label: Known Issues
---

## Installation gets stuck because of a quota exceeded exception

### Details

When provisioning App Platform for LKE in Akamai Connected Cloud, the installation can fail because of a quota exceeded exception. If the URL of the Portal Endpoint does not appear in the App Platform for LKE section after 30 minutes, this may be the cause.

In addition to the resources required for LKE, App Platform also uses a NodeBalancer and a minimum of 11 Storage Volumes. This might result in a quota exceeded exception. Linode currently does not show quota limits in your account details.

The following issue might be related to a quota exceeded exception:

Pods that require a Storage Volume get stuck in a pending state with the following message:

`pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.`

### Workaround

N.A.

### Resolution

- Remove any Storage Volumes that are Unattached.

- If you would like to know your account's limits or want to increase the number of entities you can create, open a support ticket to request that information.
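Unattached volumes can be spotted from the CLI. The sketch below is an assumption based on the Linode API's volume object (`linode_id` is `null` for unattached volumes); verify the `linode-cli` flags and field names against your CLI version. The sample JSON is for illustration only:

```shell
# Hedged sketch: list the IDs of unattached Storage Volumes.
# The real command (requires a configured linode-cli and jq) would be:
#   linode-cli volumes list --json | jq -r '.[] | select(.linode_id == null) | .id'
# Illustrated here with sample data so the jq filter can be seen in action:
echo '[{"id":111,"label":"vol-a","linode_id":null},{"id":222,"label":"vol-b","linode_id":333}]' \
  | jq -r '.[] | select(.linode_id == null) | .id'
# prints: 111
```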


## The Let’s Encrypt secret request was not successful

### Details

For each cluster with the App Platform for LKE enabled, a Let’s Encrypt certificate will be requested. If the certificate is not ready within 30 minutes, the installation of the App Platform will fail. Run the following command to see if the certificate is created:

```bash
kubectl get secret -n istio-system
```

There should be a secret named `apl-<cluster-id>-wildcard-cert`.

If this secret is not present, then the request failed.
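For example, with a hypothetical cluster ID of `12345` (a placeholder; use your own cluster's ID), the check would look like this:

```shell
# Build the expected secret name from the cluster ID (12345 is a placeholder)
CLUSTER_ID=12345
SECRET_NAME="apl-${CLUSTER_ID}-wildcard-cert"
echo "$SECRET_NAME"
# prints: apl-12345-wildcard-cert
# Then check whether the secret exists (requires cluster access):
# kubectl get secret -n istio-system "$SECRET_NAME"
```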

### Workaround

N.A.

### Resolution

- Delete the LKE cluster and create a new one with App Platform for LKE enabled.

## Argo CD does not synchronize anymore


### Details

Argo CD may occasionally stop synchronizing without a clear cause. In some instances, errors may appear in the logs, while in others, no errors are logged. This issue results in platform updates being halted.

### Workaround

N.A.

### Resolution

- Increase the resource allocation for the Argo CD Application Controller. This can be achieved by updating the resource configuration in the values repository within Gitea (`apps/argocd.yaml`). The updated configuration will automatically restart the Argo CD application.
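As an illustration only, such an override might look like the sketch below. The key names are assumptions; consult the values-schema in `apl-core` for the supported structure and adjust the numbers to your cluster's size:

```yaml
# Hypothetical sketch of apps/argocd.yaml -- key names are assumptions,
# check the values-schema for the exact structure
apps:
  argocd:
    resources:
      controller:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```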
18 changes: 10 additions & 8 deletions docs/for-ops/sre/overview.md
@@ -4,31 +4,31 @@ title: SRE Overview
sidebar_label: Overview
---

APL is a set of functions built on top of a suite of pre-configured and integrated open source applications. Instead of selecting, configuring, and integrating all the parts that are needed to securely manage containerized applications in multi- and hybrid environments, APL offers all required parts in a single package. APL can be seen as any other Kubernetes application or add-on, with the difference that APL is pre-configured and offers a higher abstraction of configuration for all the integrated solutions. All integrated applications can however be used freely, meaning that a user can benefit from the pre-configuration to start using the offered applications.
App Platform is a set of functions built on top of a suite of pre-configured and integrated open source applications. Instead of selecting, configuring, and integrating all the parts that are needed to securely manage containerized applications in multi- and hybrid environments, App Platform offers all required parts in a single package. App Platform can be seen as any other Kubernetes application or add-on, with the difference that App Platform is pre-configured and offers a higher abstraction of configuration for all the integrated solutions. All integrated applications can however be used freely, meaning that a user can benefit from the pre-configuration to start using the offered applications.

The user controls the configuration of all objects installed by APL, based on the [values schema](https://github.com/redkubes/otomi-core/blob/main/values-schema.yaml) provided by APL, and the user controls the full configuration of all Kubernetes objects deployed. Lets take a closer look:
The user controls the configuration of all objects installed by App Platform, based on the [values schema](https://github.com/redkubes/otomi-core/blob/main/values-schema.yaml) provided by App Platform, including the full configuration of all Kubernetes objects deployed. Let's take a closer look:

## Reference configuration

APL provides a reference configuration (APL Values) that can be used as a quick-start to install and configure a complete suite of integrated open source applications, an advanced ingress architecture, multi-tenancy, developer self-service, and implemented security best-practices. The reference configuration can be modified using the APL Console and APL API, based on a pre-defined value schema. SRE can change and optimize the reference configuration when needed. There are 2 supported options:
App Platform provides a reference configuration (`values` repository) that can be used as a quick-start to install and configure a complete suite of integrated open source applications, an advanced ingress architecture, multi-tenancy, developer self-service, and implemented security best-practices. The reference configuration can be modified using the App Platform Console and App Platform API, based on a pre-defined values schema. SREs can change and optimize the reference configuration when needed. There are two supported options:

- Standard, using the APL values schema to modify the configuration
- Standard, using the `values-schema` to modify the configuration

- Advanced, customization using overrides

Let's take a closer look at both options.

### Standard

Out-of-the-box, APL comes with an extensive values [schema](https://github.com/linode/apl-core/blob/main/values-changes.yaml). Most of the standard values (platform configuration) can be modified using the values editor in APL Console. Changes made through the APL Console are translated into configuration code (based on the values schema). The APL values schema supports the most common use-cases when working with Kubernetes.
Out-of-the-box, App Platform comes with an extensive values [schema](https://github.com/linode/apl-core/blob/main/values-changes.yaml). Most of the standard values (platform configuration) can be modified using the values editor in App Platform Console. Changes made through the App Platform Console are translated into configuration code (based on the values schema). The values-schema supports the most common use-cases when working with Kubernetes.

### Advanced

For advanced use-cases, configuration values of all integrated open source applications can be customized. Together with the fully integrated observability suite, SREs can proactively monitor the resource usage of the integrated open source applications (like Istio and Ingress Nginx) and optimize the configuration when needed.

The APL values can be overridden by custom configuration values using `_rawValues`. Custom configuration values can be all values supported by the upstream Helm chart of the integrated open source application in APL Core.
The values can be overridden by custom configuration values using `_rawValues`. Custom configuration values can be all values supported by the upstream Helm chart of the integrated open source application in App Platform Core (`apl-core` repo).

SRE's can use APL Console to change configuration settings (like security policies), but can also change the APL values directly using the APL values schema and by using overrides. In all cases, the configuration is stored in code (the `values` repository).
SREs can use App Platform Console to change configuration settings (like security policies), but can also change the values directly using the values-schema and by using overrides. In all cases, the configuration is stored in code (the `values` repository).

The following code shows the configuration values of the ingress-nginx chart.

@@ -46,7 +46,7 @@ charts:
error-log-level: info
```
Line 1-7 are configuration options supported in the APL values schema. Line 8-11 are used to add specific (not schema supported) configuration values using overrides (rawValues).
Lines 1-7 are configuration options supported in the values-schema. Lines 8-11 add specific (not schema-supported) configuration values using overrides (`_rawValues`).
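A minimal sketch of that pattern, with hypothetical keys (the real structure is defined by the values-schema and the upstream ingress-nginx chart):

```yaml
charts:
  ingress-nginx:
    # Keys like this one are validated by the values-schema (hypothetical example)
    autoscaling:
      enabled: true
    # Anything the schema does not cover goes under _rawValues and is
    # passed to the upstream Helm chart as-is
    _rawValues:
      controller:
        config:
          error-log-level: info
```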
## Guides & checklists
@@ -55,3 +55,5 @@ For SRE's we have created a couple of guides and checklists:
- [Daily routine](daily.md)
- [Troubleshooting](troubleshooting.md)
- [Known Issues](known-issues.md)
63 changes: 23 additions & 40 deletions docs/for-ops/sre/troubleshooting.md
@@ -4,7 +4,7 @@ title: SRE Troubleshooting Checklist
sidebar_label: Troubleshooting
---

## Pods not starting
## Pods

Pods that are unable to start do not show any log output; the issue is at the Kubernetes level. Look for a pod with status `Pending`. Most of the time this is related to resources and container component issues.

@@ -20,23 +20,7 @@ Pods that are unable to start do not show any log output, the issue is related t

- Does the cluster have enough resources available?

### Advanced

- Check affinity and node selector rules

- Is the image tag valid and compatible with the host CPU? (exec format error)

- Check namespace quotas for pod, cm or secret limits etc.

- Check service account and permissions

- Is the pod a job, deployment, daemonset or statefulset?

- Is there a limitrange configured in the namespace?

- Is the template spec in the pod matching the running container?

## Pods not running
### Pod status

Pods that are running but restart for whatever reason indicate that the container itself is having issues. Look for pod status `CrashLoopBackOff` or `OOMKilled`, or an incomplete ready status (for example `2/3`).

@@ -60,8 +44,6 @@ Pods that are running but restart for whatever reason indicate that a container

- Inspect the restart counter for the pod; a high value (32+) indicates an unstable pod

### Advanced

- Check pod's service account permissions

- Attach shell and inspect container status
@@ -72,7 +54,23 @@ Pods that are running but restart for whatever reason indicate that a container

- Check volume permissions

## Network services not working
### Advanced

- Check affinity and node selector rules

- Is the image tag valid and compatible with the host CPU? (exec format error)

- Check namespace quotas for pod, cm or secret limits etc.

- Check service account and permissions

- Is the pod a job, deployment, daemonset or statefulset?

- Is there a limitrange configured in the namespace?

- Is the template spec in the pod matching the running container?

## Services

Pods are working but a user can't connect to the service. Most HTTP-based services use an Ingress object; non-HTTP services require a service port to be defined.

@@ -106,7 +104,7 @@

- Run `istioctl analyze`

## Istio issues
## Istio

Istio sidecars manipulate the container's network to reroute traffic. A namespace can have an Istio sidecar policy indicated by a label, the same is valid for a deployment or pod. Make sure you see Istio sidecars running when applicable (indicated by the 3/3 Ready status).

@@ -128,35 +126,20 @@ Istio sidecars manipulate the container's network to reroute traffic. A namespac

- Turn on logging for a context of an istio sidecar: `ksh exec -it $container_id -c istio-proxy -- sh -c 'curl -k -X POST localhost:15000/logging?jwt=debug'`

## DNS issues
## ExternalDNS

The ExternalDNS service registers DNS names to make sure that service names are publicly resolvable.

- Make sure the `external-dns` logs indicate `All records are already up to date`

- Are the credentials configured correctly?

## Certificate issues
## Cert-manager

- Check cert-manager working
Check that cert-manager is working:

- Run `kubectl describe orders.acme.cert-manager.io -A`

- Run `kubectl describe challenges.acme.cert-manager.io -A`

- Run `kubectl describe certificates.cert-manager.io -A`
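The three checks above can be generated in one loop; this sketch just prints the commands so they can be reviewed or piped to a shell (running them requires cluster access):

```shell
# Print the cert-manager inspection commands for orders, challenges, and certificates
for crd in orders.acme.cert-manager.io challenges.acme.cert-manager.io certificates.cert-manager.io; do
  echo "kubectl describe $crd -A"
done
```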

## Storage issues

Check that the storage classes `std` and `fast` exist

### The otomi-pipeline failure

If the otomi-pipeline execution fails, carefully read the last few lines of the `PipelineRun` output.
Errors containing `unable to build kubernetes objects from release manifest: Get "https://10.32.0.1:443/openapi/v2?timeout=32s": net/http: request canceled` indicate that the kube-api was not available. An admin can restart the pipeline by triggering the webhook from the Gitea app: go to the `otomi/values` repository, click `Settings`, select the `Webhooks` tab, and click the `Test Delivery` button.

### Advanced

- Describe the PV and PVC, check whether the PVs are `rwo` or `rwx`, and look for conflicts

- Check whether the container expects an `rwo` or `rwx` PV
