
randomize initial zone selection #583

Open · joshk wants to merge 16 commits into base: master

Conversation

@joshk (Contributor) commented Apr 1, 2019

What is the problem that this PR is trying to fix?

Workers, when deployed, are bound to a primary zone. Whenever a job (and its VM) is started, the primary zone is tried first, and an alternate zone is only selected if that fails.

The core problem is when we start hitting zone exhaustion errors for a zone. Because there are pools of workers per zone, and each worker tries its primary zone first, every retry puts more pressure on the API for the already exhausted zone, raising the risk of API rate limit issues.

What approach did you choose and why?

This is a first step towards a fix, but not yet the full fix.

This changes the concept of a primary zone: instead, a random zone is picked on each attempt. A zone can still be defined in the config, and if it is, that zone will be used every time.
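
A minimal sketch of the selection logic described above (illustrative only, not the actual backend code; pickZone and its argument names are made up for this example):

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickZone returns the zone to use for a single start-instance attempt.
// If a zone is pinned in the worker config it always wins; otherwise a
// zone is chosen at random from the zones available in the region, so
// retries spread across zones instead of always hitting a "primary" one.
func pickZone(configuredZone string, regionZones []string) string {
	if configuredZone != "" {
		return configuredZone
	}
	return regionZones[rand.Intn(len(regionZones))]
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"}
	fmt.Println(pickZone("", zones))              // random zone on every attempt
	fmt.Println(pickZone("us-central1-b", zones)) // pinned zone always used
}
```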

The follow-up work to this PR is to block an exhausted zone for 10 minutes across all workers (possibly using Redis), which will help reduce API usage and give us a better self-healing worker setup.
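
A rough sketch of what that follow-up could look like, assuming a shared Redis and the go-redis client; the key name and TTL are illustrative, and none of this is part of this PR:

```go
package zoneblock

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// blockTTL is the assumed cool-off period for an exhausted zone.
const blockTTL = 10 * time.Minute

// BlockZone marks a zone as exhausted so every worker skips it for a while.
// The key expires on its own, unblocking the zone automatically.
func BlockZone(ctx context.Context, rdb *redis.Client, zone string) error {
	return rdb.Set(ctx, "worker:blocked-zone:"+zone, "1", blockTTL).Err()
}

// ZoneBlocked reports whether any worker has recently marked the zone exhausted.
func ZoneBlocked(ctx context.Context, rdb *redis.Client, zone string) (bool, error) {
	n, err := rdb.Exists(ctx, "worker:blocked-zone:"+zone).Result()
	return n > 0, err
}
```

A worker would check ZoneBlocked before picking a zone and call BlockZone when it sees a ZONE_RESOURCE_POOL_EXHAUSTED error.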

How can you test this?

I've tested this locally by connecting it to staging, and it worked a treat.

What feedback would you like, if any?

This is for general discussion

I also need help understanding how to fix the test failures

@soulshake (Contributor) commented Apr 1, 2019

Note: this comment has been redacted; see follow-up below

The core problem is when we start hitting zone exhaustion errors for a zone.

Which zone exhaustion errors are you referring to? IIUC, most resource quotas (as well as API quotas) are per project and/or per region rather than per zone. (See GCE Quotas page)

Example from recent outage:

Error
QUOTA_EXCEEDED: Quota 'SSD_TOTAL_GB' exceeded. Limit: 800000.0 in region us-central1. 

^ note the exhaustion is in the region us-central1, not a zone like us-central1-c.

(Note that we have occasionally hit ZONE_RESOURCE_POOL_EXHAUSTED errors in the past, but that has to do with the global resource usage for that zone rather than our projects specifically. And it was not the case in the recent outage.)

Because there are pools of workers per zone, and each worker tries its primary zone first, every retry puts more pressure on the API for the already exhausted zone, raising the risk of API rate limit issues.

I think this is only the case when a zone is specified in the worker config. In our case it looks like instances are pretty well distributed across zones already, no?
[screenshot: instance distribution across zones]

I wonder if it would make more sense to have each worker create job instances in its own zone, if possible, since those are already automatically distributed by the managed instance group.
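
(For reference, a worker running on GCE can look up its own zone from the metadata server. A minimal sketch using the cloud.google.com/go/compute/metadata package, not code from this PR:)

```go
package main

import (
	"fmt"
	"log"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	if !metadata.OnGCE() {
		log.Fatal("not running on GCE")
	}
	// Zone returns just the zone name of this instance, e.g. "us-central1-c".
	zone, err := metadata.Zone()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("own zone:", zone)
}
```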

Also, not all resource types (in particular GPUs) are available in all zones. So I'm not sure it makes sense to make zone pinning an all-or-nothing thing, because I believe that will cause problems when GPUs (and perhaps some specific CPUs) are specified.

Disclaimer: I still don't understand what actually triggered the recent outage, beyond the issue of not being able to delete instances because we had exceeded the API rate limits, thus causing the resource exhaustion. We still don't know what exactly caused us to exceed the API limits in the first place, do we?

@soulshake (Contributor)

Edit: I take it all back, I see there were plenty of these errors during the last outage:

Mar 29 23:32:53 production-2-worker-org-gce-4mr9 travis-worker-wrapper: 
time="2019-03-30T04:32:53Z" level=error msg="couldn't start instance, attempting 
requeue" err="code=ZONE_RESOURCE_POOL_EXHAUSTED location= 
message=The zone 'projects/travis-ci-prod-2/zones/us-central1-c' does not have 
enough resources available to fulfill the request.  Try a different zone, or try again later." 
job_id=123456798 job_path=xyz/xyz/jobs/123456798 pid=1 
processor=ed8d48ea-5209-4f2b-b595-bfda4c06ce13@1.production-2-worker-org-gce-4mr9 
repository=xyz/xyz self=step_start_instance start_timeout=8m0s uuid=84c6f99e-f021-4b13-baf1-ba101c22e3ab

@joshk (Contributor, Author) commented Apr 1, 2019 via email

@soulshake (Contributor)

Thanks for the feedback AJ. I'm just about to hit the hay, but I thought I would also add that our Google reps recommended temporarily not using a zone when we hit these errors. I'll write up a more detailed reply when I wake up.

This definitely makes more sense now that I know we had hit ZONE_RESOURCE_POOL_EXHAUSTED errors during the last incident. 👍

"NETWORK": fmt.Sprintf("network name (default %q)", defaultGCENetwork),
"PREEMPTIBLE": "boot job instances with preemptible flag enabled (default false)",
"PREMIUM_MACHINE_TYPE": fmt.Sprintf("premium machine type (default %q)", defaultGCEPremiumMachineType),
"PROJECT_ID": "[REQUIRED] GCE project id",
"PROJECT_ID": "[REQUIRED] GCE project id (will try to auto detect it if not set)",
Review comment (Contributor):

In that case, [REQUIRED] can be dropped

backend/gce.go (Outdated)
@@ -119,7 +119,7 @@ var (
  "WARMER_URL":            "URL for warmer service",
  "WARMER_TIMEOUT":        fmt.Sprintf("timeout for requests to warmer service (default %v)", defaultGCEWarmerTimeout),
  "WARMER_SSH_PASSPHRASE": fmt.Sprintf("The passphrase used to decipher instace SSH keys"),
- "ZONE":                  fmt.Sprintf("zone name (default %q)", defaultGCEZone),
+ "ZONE":                  "zone in which to deploy job instaces into (default is to use all zones in the region)",
Review comment (Contributor):

typo (instaces)

@ghost left a comment:

Found some fixes!

P.S. share your ideas, feedbacks or issues with us at https://github.com/fixmie/feedback (this message will be removed after the beta stage).

joshk and others added 3 commits June 25, 2019 19:16
Co-Authored-By: fixmie[bot] <44270338+fixmie[bot]@users.noreply.github.com>
@joshk removed the request for review from emdantrim June 25, 2019 07:30