Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: make pod spec choice consistent #9531

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

carolinaecalderon
Copy link
Contributor

@carolinaecalderon carolinaecalderon commented Jun 17, 2024

Ticket

Description

We choose what pod spec to apply to a resource pool/experiment in two places, and two different ways. Let's try to harmonize this choice such that we will (more) consistently choose the same pod spec to define a pool and for experiments requested to run on it.

For example, if you're running a CPU cluster, you would intuitively want to define a cpuPodSpec for pods that experiments run on, and pods included in resource pools. Right now, GetAgents chooses which pod spec to use based on which one is defined AND device type (CPU vs CUDA). However, MergeExpConf chooses which pod spec to use based on how many slots an experiment wants (<=1 slot defaults to GPU pod spec). This makes it impossible to run experiments requiring more than 1 slot on a CPU cluster with a CPU podSpec defined.

Right now I'm merging this into my node selectors feature branch, but I can merge this into main after some rebasing. I will wait to merge #9428 until this PR is merged.

Test Plan

Pass CircleCI + i'll write a test (in progress)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label Jun 17, 2024
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jun 17, 2024
@determined-ci determined-ci requested a review from a team June 17, 2024 14:17
Copy link

netlify bot commented Jun 17, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 7280765
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66704551ac39b800080ecac5
😎 Deploy Preview https://deploy-preview-9531--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@carolinaecalderon carolinaecalderon changed the base branch from main to carolinac/rm-255 June 17, 2024 14:17
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Jun 17, 2024
Copy link

codecov bot commented Jun 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 51.36%. Comparing base (01c9b1d) to head (829c3f3).

Additional details and impacted files
@@                 Coverage Diff                  @@
##           carolinac/rm-255    #9531      +/-   ##
====================================================
+ Coverage             49.30%   51.36%   +2.06%     
====================================================
  Files                  1242      750     -492     
  Lines                161471   112195   -49276     
  Branches               2868     2867       -1     
====================================================
- Hits                  79606    57631   -21975     
+ Misses                81693    54392   -27301     
  Partials                172      172              
Flag Coverage Δ
backend 34.25% <ø> (-9.70%) ⬇️
harness 63.61% <ø> (-0.20%) ⬇️
web 44.87% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 496 files with indirect coverage changes

@carolinaecalderon carolinaecalderon marked this pull request as ready for review June 17, 2024 14:22
@carolinaecalderon carolinaecalderon requested a review from a team as a code owner June 17, 2024 14:22
@carolinaecalderon carolinaecalderon requested review from AmanuelAaron and removed request for a team June 17, 2024 14:22
Base automatically changed from carolinac/rm-255 to main June 17, 2024 16:47
@carolinaecalderon carolinaecalderon requested a review from a team as a code owner June 17, 2024 16:47
@carolinaecalderon carolinaecalderon marked this pull request as draft June 17, 2024 16:47
- WebUI: Allow resource pool slot counts to reflect the state of the entire cluster. Allow slot
counts and scheduling to respect node selectors and affinities. This impacts Determined clusters
deployed on Kubernetes with multiple resource pools defined in terms of node selectors and/or
affinities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:orphan:

New Features

  • Kubernetes Configuration: Allow Cluster administrators to define Determined resource pools on
    Kubernetes clusters using node selectors and/or affinities. This feature allows a single cluster to be divided into multiple resource pools using node labels. To configure resource pools, modify the default pod spec settings under task_container_defaults.cpu_pod_spec or
    task_container_defaults.gpu_pod_spec.

  • WebUI: Reflect the resource pool slot counts for the entire Kubernetes cluster. This ensures that slot counts and scheduling decisions respect the defined node selectors and affinities. This impacts Determined clusters
    deployed on Kubernetes with multiple resource pools defined using node selectors and/or affinities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants