
GPU/TPU-resource-usage-collect-addon #20

Conversation

z1ens
Contributor

@z1ens z1ens commented Jul 9, 2024

  • Add a GPU/TPU AddOnPlacementScore for each node in the managed clusters (a sketch of the published resource follows this description).
  • Develop the add-on in addontemplate mode.
  • Create a comprehensive user guide and documentation.
  • Use the SDK to generate scores and remove redundant code.
  • Change the Docker image name and remove the template tag.
  • Minimize the RBAC permissions in the cluster role and add-on template.
  • Put the placement in a separate file.
  • Delete the role and rolebinding in addontemplate.go.

Update on Aug 15th:

  • Add node-scope and cluster-scope scores; for an explanation and usage-scenario details, please refer to the README.
  • Refine the documentation.
  • Fix the problem of completed pods not being filtered out.

REF: open-cluster-management-io/ocm#369
Old PR: #16 (comment)
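A minimal sketch of the AddOnPlacementScore shape this add-on publishes, as described above; the object name "resource-usage-score", the helper name, and the score values are illustrative assumptions, not taken from this PR:

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"
)

// buildPlacementScore sketches the AddOnPlacementScore the add-on creates in
// each managed cluster's namespace on the hub (names are hypothetical).
func buildPlacementScore(clusterName string, gpuScore, tpuScore int32) *clusterv1alpha1.AddOnPlacementScore {
	return &clusterv1alpha1.AddOnPlacementScore{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "resource-usage-score", // hypothetical object name
			Namespace: clusterName,            // one score object per managed cluster
		},
		Status: clusterv1alpha1.AddOnPlacementScoreStatus{
			Scores: []clusterv1alpha1.AddOnPlacementScoreItem{
				{Name: "gpuAvailable", Value: gpuScore},
				{Name: "tpuAvailable", Value: tpuScore},
			},
		},
	}
}
```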

@z1ens z1ens changed the title GPU/TPU-resource-usage-collect-addon [WIP]GPU/TPU-resource-usage-collect-addon Jul 9, 2024
@mikeshng
Member

mikeshng commented Jul 9, 2024

/assign @zhujian7 @haoqing0110 @qiujian16

@z1ens z1ens force-pushed the gpu-tpu-addon-template branch 2 times, most recently from caa40ee to 83f79ab Compare July 10, 2024 08:16
}

func (s *Score) calculateClusterAllocateable(resourceName clusterv1.ResourceName) (float64, error) {
// Iterate every node, find the node with maximum allocatable resource, return the number and node name.
func (s *Score) calculateClusterAllocatable(resourceName string) (float64, string, error) {
Member

@haoqing0110 haoqing0110 Jul 11, 2024


calculateMaxAllocatableNode?
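For context, a minimal sketch of the iteration this function performs, assuming the Score struct carries a client-go NodeLister (the s.nodeLister field is an assumption, as is the exact float conversion):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// Iterate every node, find the node with the maximum allocatable amount of
// the given resource, and return that value and the node name. Sketch only.
func (s *Score) calculateClusterAllocatable(resourceName string) (float64, string, error) {
	nodes, err := s.nodeLister.List(labels.Everything()) // assumed lister field
	if err != nil {
		return 0, "", err
	}
	maxValue, maxNode := 0.0, ""
	for _, node := range nodes {
		if quantity, ok := node.Status.Allocatable[corev1.ResourceName(resourceName)]; ok {
			if v := quantity.AsApproximateFloat64(); v > maxValue {
				maxValue, maxNode = v, node.Name
			}
		}
	}
	return maxValue, maxNode, nil
}
```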


After the deployment is complete, addon will create an addonplacementscore in its own namespace for each managedcluster in the hub.
On the hub cluster, you can see the `addonTemplate`, and check the `managedClusterAddon` status.
Member


AddOnTemplate, ManagedClusterAddon.

Contributor Author


Yes, I have changed that.

	},
	{
		Name:  "tpuAvailable",
		Value: tpuScore,
	},
Member

@haoqing0110 haoqing0110 Jul 11, 2024


Hmm, I'm thinking: should we keep the current xxxAvailable as the cluster-level score and add a new score based on the max available node?

Member

@haoqing0110 haoqing0110 Jul 11, 2024


Or at least explain in the README that the score is based on the cluster's max available node.

Contributor Author

@z1ens z1ens Jul 12, 2024


Yes, I should add an explanation in the documentation.
As for an extra score, I think we can do that too; for example, for each cluster we could have cpuClusterAvailable, cpuNodeAvailable, gpuClusterAvailable, gpuNodeAvailable, etc.
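A quick sketch of what that combined score list could look like, with both scopes per resource; the function and parameter names here are hypothetical:

```go
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// buildScoreItems pairs each resource with a cluster-scope and a node-scope
// entry, following the naming suggested above (all inputs hypothetical).
func buildScoreItems(cpuCluster, cpuNode, gpuCluster, gpuNode int32) []clusterv1alpha1.AddOnPlacementScoreItem {
	return []clusterv1alpha1.AddOnPlacementScoreItem{
		{Name: "cpuClusterAvailable", Value: cpuCluster},
		{Name: "cpuNodeAvailable", Value: cpuNode},
		{Name: "gpuClusterAvailable", Value: gpuCluster},
		{Name: "gpuNodeAvailable", Value: gpuNode},
	}
}
```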

}

func (s *Score) calculateNodeResourceUsage(nodeName string, resourceName string) (float64, error) {
	list, err := s.podLister.List(labels.Everything())
Member


Completed pods should be filtered out. E.g.:

NAME                   READY   STATUS      RESTARTS   AGE
demo1-job8n84t-bgzhs   0/1     Completed   0          19m
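One way to apply this, as a sketch: skip pods whose phase means they no longer hold node resources. The helper name is illustrative; inside calculateNodeResourceUsage, each pod failing this check would simply be skipped with continue:

```go
import corev1 "k8s.io/api/core/v1"

// isPodConsumingResources reports whether a pod still occupies node
// resources; Succeeded (shown as Completed) and Failed pods should be
// excluded when summing usage. Sketch only; the helper name is an assumption.
func isPodConsumingResources(pod *corev1.Pod) bool {
	return pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed
}
```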

Contributor Author

@z1ens z1ens Aug 15, 2024


I have fixed that.

@haoqing0110
Member

@qiujian16 could you help take a final review?

@qiujian16
Member

/approve
need DCO sign-off for each commit


openshift-ci bot commented Aug 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qiujian16, z1ens

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ach managed clusters, develop the add-on in addontemplate mode.

Signed-off-by: z1ens <xxtale02591@gmail.com>
@haoqing0110
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 28, 2024
@haoqing0110
Member

/unhold

@z1ens
Contributor Author

z1ens commented Aug 28, 2024

/unhold

@z1ens z1ens changed the title [WIP]GPU/TPU-resource-usage-collect-addon GPU/TPU-resource-usage-collect-addon Aug 28, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit ebebf50 into open-cluster-management-io:main Aug 28, 2024
2 checks passed
z1ens added a commit to z1ens/addon-contrib that referenced this pull request Aug 28, 2024
…ach managed clusters, develop the add-on in addontemplate mode. (open-cluster-management-io#20)

Signed-off-by: z1ens <xxtale02591@gmail.com>
z1ens added a commit to z1ens/addon-contrib that referenced this pull request Aug 28, 2024
…ach managed clusters, develop the add-on in addontemplate mode. (open-cluster-management-io#20)

Signed-off-by: z1ens <xxtale02591@gmail.com>