bosh_troubleshooting.prolific

SSH into a running BOSH job

### What?
To SSH into a BOSH job, you'll use the `bosh` cli.

### How?
bosh-bootloader sets up a jumpbox that allows `bosh` commands to be run from your workstation
* `eval "$(bbl print-env)"` to target the director
* Now you will only need the VM name to use `bosh ssh`.

### Expected Result
Running `bosh -d cf ssh VM/GUID` opens a shell in your targeted machine.

### Resources
[Forum question: What's the distinction between an HTTP proxy, tunnel, and gateway?](http://stackoverflow.com/questions/10377679/whats-distinction-of-http-proxy-tunnel-gateway)
L: bosh operator

---

Scale the number of Diego Cells

### What?
With BOSH it is easy to scale deployments. All you need to do is modify the number of instances of that [job](https://bosh.io/docs/jobs.html) in the deployment manifest and redeploy.

What's the best way to achive that? You actually have a few options, but in this story, we're going to focus on using **[ops files](http://bosh.io/docs/cli-ops-files/#ops)**.

Ops files are a way for operators to make edits to a deployment manifest in a reproducible way. Using a special syntax ([here are some examples](https://github.com/cppforlife/go-patch/blob/master/docs/examples.md)), they include a list of mutations to make to any YAML file (although you'll mostly use them for your manifest). You can use ops-files to do the following:
- **Remove** a section of YAML
- **Replace** one section of YAML with your own YAML. Tweaks to a "replace" command can also allow you to:
  - **Add** a new section of YAML
  - **Append** to a list

### How?
1. Get your current bosh manifest by running `bosh -d cf manifest > /tmp/cf.yml`
1. Write an ops-file called `scale-diego-cells.yml` that will scale up the number of Diego cells to 3 instances. See if you can use the ops-file documentation and examples to figure out how. (Some hints: think about which spot in the manifest you want to update. What operation do you need to carry out?)
1. Vet your changes by running `bosh interpolate`:
  ```
  bosh interpolate /tmp/cf.yml \
  -o ... [any other ops-files you used to deploy orginally] ... \
  -o scale-diego-cells.yml \
  -v system_domain=$SYSTEM_DOMAIN
  ```
  Dig around in the output, and see if it matches what you expected. Specifically, is the instance count for the `diego-cell` instance group set to what you wanted?
1. Deploy:
  ```
  bosh deploy /tmp/cf.yml \
  -o ... [any other ops-files you used to deploy originally] ... \
  -o scale-diego-cells.yml \
  -v system_domain=$SYSTEM_DOMAIN
  ```
  (Also, notice the parallels between the `interpolate` and `deploy` commands.)

###  Expected Result
You should easily be able to scale the number of Diego Cells up or down. What happens to your apps at that point? Are they redistributed as soon as there is a new cell or do you have to scale the app to trigger a relocation?
cf-deployment installs [`cfdot`](https://github.com/cloudfoundry/cfdot) on the diego cells, which you can use to interrogate Diego. The `watch` and `tail` bash commands will also be your friends during this investigation.

### Resources
[Docs: BOSH Deploying, a step-by-step walk through](http://bosh.io/docs/deploying-step-by-step.html)
[Docs: BOSH Deployment Manifest docs](http://bosh.io/docs/manifest-v2/)
[Docs: Creating a new BOSH VM](http://bosh.io/docs/bosh-components.html#create-vm)
[YAML Validator](http://codebeautify.org/yaml-validator)
[JSON Patch, the inspriation for ops-files](JSON Patch](http://jsonpatch.com/)
[cf-deployment ops-files](https://github.com/cloudfoundry/cf-deployment/tree/master/operations)
L: bosh operator

---

Trigger a failing BOSH job

### What?
When a BOSH VM is healthy, it is listed with the status "running". Let's use **[Monit](https://mmonit.com/monit/)** to trigger a state change.

### How?
1. Run `watch bosh vms`. (If you don't have `watch` installed, run `brew install watch`)
1. In another buffer or tab, `bosh ssh` into one of your Diego cells
1. Run `sudo -i` to run as root
1. Run `monit summary` (need root access to do this)
1. Run `monit stop all`
1. Observe the state of the bosh VM in question.
1. Run `monit start all`

### Expected Result
The Diego job for the cell you SSHed to should be listed as `failing` as soon as you stop the Monit jobs. When you run `monit start all` it should return to the `running` state.

### Resources
[Docs: BOSH CLI health commands](https://bosh.io/docs/sysadmin-commands.html#health)
[Docs: BOSH Job Lifecycle](http://bosh.io/docs/job-lifecycle.html)
[Docs: Monit](https://mmonit.com/monit/)
L: bosh operator

---

Watch the BOSH Resurrector resurrect Bosh jobs

### What?
The Resurrector is a plugin to the BOSH Health Monitor that is responsible for automatically recreating VMs that become inaccessible. It continuously cross-references VMs expected to be running against the VMs that are sending heartbeats. When the resurrector does not receive heartbeats for a VM for a certain period of time, it will kick off a task on the Director to try to “resurrect” that VM. The Director may do one of two things:

* create a new VM if the old VM is missing
* replace a VM if the Agent on that VM is not responding to commands

### How?
1. Run `watch bosh vms` so you can keep an eye on the effect you're having on VM state.
1. Open a second terminal buffer and `bosh ssh` into one of the Diego cells.

Killing off a BOSH agent is a little harder than it looks. This is a great thing for CF operators, but less of a good thing when creating exercises to learn about the system. For instance, try killing off an agent process:

1. Run `ps aux | grep bosh-agent` to find the BOSH agent.
1. Kill it mercilessly, `kill -9 <process id>`
1. Run `ps aux` again. Grep for the agent again. Discover that phoenix-like, there is already a new agent process with a new process id. The VM's listed state won't have even flickered. Don't quote me here, but I'm pretty sure [upstart](http://upstart.ubuntu.com/) is responsible for this sorcery.

Looks like we'll have to get creative if we're ever going to see this resurrector at work.

1. While still SSHed into a VM, notice the path to `agent.json` in the output of `ps aux` and throw some un-parseable junk in there.

### Expected Result
Watch the process choke, the VM fail, and the resurrector bring it back! If it doesn't come back within a few minutes, check to see if resurrection is turned on.

### Resources
[Docs: Configuring Health Monitor](https://bosh.io/docs/hm-config.html)
[Docs: BOSH Resurrector](http://bosh.io/docs/resurrector.html)
L: bosh operator

---

Update your BOSH deployment's properties (a.k.a. let's break some stuff)

### What?
Each BOSH job can specify customizable properties in a **[spec file](http://bosh.io/docs/jobs.html#spec)**. Some will be required, others will have defaults. We're going to fiddle with the values and see what happens.

### How?
1. Run `bosh releases`
1. Select a job, your pick. Ooh, roulette. Let's get risky. Or not. This is a good time to revisit the [diego-design-notes](https://github.com/cloudfoundry/diego-design-notes) if you'd like to be incisive about this.
1. Find the spec file for the job on Github. A spec file resides at `some-release/jobs/some-job-name/spec`. For instance, the spec path of the cf-release uaa job is `https://github.com/cloudfoundry-attic/cf-release/blob/master/jobs/uaa` which points to `https://github.com/cloudfoundry/uaa-release/blob/master/jobs/uaa/spec`.
1. Choose a spec property where you expect to be able to tell if the value changes. An example in uaa might be `login.logout.redirect.url` (read the property description in the spec to learn more). A more straight-forward option might be a port. Whatever it is, establish how you plan to test the change before implementing it.
1. Update your Cloud Foundry manifest to reflect a change in that job's property. This will be under jobs: some-job-name: properties: some-property. If you don't see the property you're looking for it's because it has been left as its default value.
1. `bosh deploy` !!

### Expected Result
A few things could happen at this point. The deployment could run successfully (woot!) and you are able to demonstrate that the change took effect (double woot). The update could fail, but now you have something to diagnose (hint: it might have something to do with the user permissions of the user responsible for making changes to that job). Orrrrr you could break something! That one is a super exciting option, chalk full of investigative opportunities.

### Resources
[Docs: What is a BOSH spec file?](http://bosh.io/docs/jobs.html#spec)
[Tool: Chaos Lemur](https://github.com/strepsirrhini-army/chaos-lemur)
[YAML Validator](http://codebeautify.org/yaml-validator)
BOSH's Wikipedia entry is also good, but it has parentheses in the URL and that's well past the limits of my markdown skills at this hour of night
L: bosh operator

---

[RELEASE] BOSH Operator Troubleshooting ⇧