We use Buildkite to power our CI/CD. This repo contains the configuration for our Buildkite setup.
When you open a pull request or push a commit to main in our repositories, Buildkite parses the `.buildkite/pipeline.yml` file.
This YAML file defines the pipeline steps -- it tells Buildkite what jobs to run, e.g. run `yarn test` in the `catalogue` folder.
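For illustration, a minimal `.buildkite/pipeline.yml` might look something like this (a sketch, not copied from one of our real pipelines -- the label and command are made up):

```yaml
# Hypothetical example of a .buildkite/pipeline.yml -- not one of our real pipelines
steps:
  - label: "catalogue tests"
    command: "cd catalogue && yarn test"
```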
Buildkite distributes these jobs across one or more queues. Each queue is attached to an EC2 autoscaling group, and the EC2 instances in that group pull jobs from the queue -- these instances are called Buildkite agents.
A job runs on an instance, and the results are reported back to Buildkite, where we can see them in the dashboard.
We have several different queues, each backed by a different instance class. The instance classes are matched to the sort of work that goes on that queue: for example, we have one queue with powerful instances that runs short-lived, CPU-intensive tasks, and another queue with small instances that runs long-lived, CPU-light tasks.
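Steps are routed to a particular queue with the `agents` key on each step. As a sketch (the queue names here are hypothetical -- ours are defined in our Terraform):

```yaml
steps:
  - label: "build images"          # short-lived, CPU-intensive work
    command: "make build"
    agents:
      queue: "big-instances"       # hypothetical queue name

  - label: "integration tests"     # long-lived, CPU-light work
    command: "./run_integration_tests.sh"
    agents:
      queue: "small-instances"     # hypothetical queue name
```

Each agent advertises which queue it belongs to, and Buildkite only assigns a step to agents whose queue matches.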
All our EC2 instances are spot instances – we use spare EC2 capacity at a cheaper price, with the small risk that instances (and thus build jobs) might be stopped unexpectedly.
We use Buildkite so that we can use EC2 instances in our own account as Buildkite agents, which has several benefits:
- We can choose which instances to use. We can run as many parallel EC2 instances as we're willing to pay for, and we can choose instance types that are best suited to our particular tasks.
- The build runners are inside our AWS account. This means we can use EC2 instance roles for permissions management, rather than creating permanent credentials that we have to trust to a third-party CI provider. We're more comfortable doing powerful things in CI (e.g. deploying to prod) because it's within our account boundary.
- We can reuse the same instance in multiple builds. Because we're the only organisation using our build runners, we don't have to worry about information leaking from another tenant's builds, or our information leaking into their builds. This means we can reuse build instances, which gets us more caching and faster builds.
We use the Elastic CI Stack for AWS, which is provided by Buildkite and configured in Terraform.
We run multiple instances of the stack, mostly varying by EC2 instance type, to get a good balance of fast builds and reasonable cost.
You can see our Buildkite pipelines at https://buildkite.com/wellcomecollection.
If you're a Wellcome developer who wants to log in to Buildkite, visit https://buildkite.com/sso/wellcomecollection
- When a job starts, you may see `Waiting for agent` in the Buildkite dashboard -- this means it's waiting for an EC2 instance in the autoscaling group.
  Usually this disappears in a minute or so, once an instance becomes available in the autoscaling group, but occasionally it will hang.
  If this message doesn't disappear quickly, there may be an issue starting new EC2 instances -- check the autoscaling group in the AWS console.

  One issue we've seen several times is the current EC2 Spot price being higher than our maximum configured Spot bid -- check the Spot pricing history for the instance class in the AWS console, and compare it to the `SpotPrice` parameter in our Terraform.
  If the current Spot price is too high, consider increasing the `SpotPrice` parameter and plan/applying the Terraform.
- When a job starts, Buildkite does a health check on the agent to ensure there's enough disk space available.
  If an agent has been running for a long time, it may have a lot of images in the local Docker image cache, and the job will fail with an error like:

  ```
  Not enough disk space free, cutoff is 5242880 🚨
  Disk health checks failed
  ```

  Usually it should be enough to restart the failed job -- it will be assigned to a different agent with more disk space.
  If this error persists (which suggests it's affecting multiple agents), consider increasing the `RootVolumeSize` parameter in Terraform.