HCCL-Controller.EN

Description

  • A controller tracks at least one Kubernetes resource type. These objects have a spec field that represents the desired state. The controllers for that resource are responsible for making the current state come closer to the desired state.
  • Controller Manager is the management and control center in a cluster. It consists of multiple controllers, each responsible for a different resource type, which together manage resources such as nodes and pods in the cluster.
  • Controller Manager provides the event dispatching capability. Each controller only needs to register its handlers and then wait to receive and process events.
  • Each specific resource is maintained and managed by a specific controller to retain the desired state.

Figure 1 Controller interaction process
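
The handler registration described above can be illustrated with a minimal sketch. It uses only the Go standard library and is not the Kubernetes or client-go API: the names Event, Handler, Dispatcher, Register, and Dispatch are hypothetical and exist only for this example.

```go
package main

import "fmt"

// EventType is a hypothetical label for the kind of change observed.
type EventType string

const (
	Added   EventType = "added"
	Updated EventType = "updated"
	Deleted EventType = "deleted"
)

// Event describes one change to a resource. The fields are illustrative only.
type Event struct {
	Kind string // resource kind, for example "Pod"
	Name string
	Type EventType
}

// Handler processes one event and returns a short description of what it did.
type Handler func(Event) string

// Dispatcher routes events to the handlers registered for each resource kind.
type Dispatcher struct {
	handlers map[string][]Handler
}

// NewDispatcher returns a dispatcher with no handlers registered.
func NewDispatcher() *Dispatcher {
	return &Dispatcher{handlers: make(map[string][]Handler)}
}

// Register attaches a handler to a resource kind.
func (d *Dispatcher) Register(kind string, h Handler) {
	d.handlers[kind] = append(d.handlers[kind], h)
}

// Dispatch calls every handler registered for the event's kind and
// returns their results in registration order.
func (d *Dispatcher) Dispatch(e Event) []string {
	var out []string
	for _, h := range d.handlers[e.Kind] {
		out = append(out, h(e))
	}
	return out
}

func main() {
	d := NewDispatcher()
	d.Register("Pod", func(e Event) string {
		return fmt.Sprintf("pod %s %s", e.Name, e.Type)
	})
	for _, msg := range d.Dispatch(Event{Kind: "Pod", Name: "train-0", Type: Added}) {
		fmt.Println(msg)
	}
	// No handler is registered for "ConfigMap", so this event is dropped.
	ignored := d.Dispatch(Event{Kind: "ConfigMap", Name: "example", Type: Updated})
	fmt.Println("handlers called for ConfigMap:", len(ignored))
}
```

In a real controller the dispatching is done by the informer machinery. The sketch only shows the division of labor: the dispatcher owns event delivery and each controller supplies its handlers.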

HCCL-Controller

HCCL-Controller is a Huawei-developed component used for NPU training jobs. It uses the Kubernetes informer mechanism to continuously watch NPU training jobs and pod events, read the NPU information of the pods, and generate the corresponding ConfigMap. The ConfigMap contains the hccl.json configuration file that NPU training jobs require to coordinate and schedule work on the underlying Ascend 910 AI Processors.

HCCL-Controller Process

Figure 1 shows the HCCL-Controller process.

Figure 1 HCCL-Controller process

  1. Ascend Device Plugin periodically reports the DeviceID and health status of the Ascend 910 AI Processor node by using the list-and-watch API.

  2. After receiving a training job request, the scheduler creates a job and a ConfigMap. The Volcano scheduler then selects the node on which the job is to be deployed.

  3. The scheduler sends the pod creation information to the kubelet of the selected node.

  4. On the selected node, Ascend Device Plugin receives a device allocation request from kubelet and returns information, such as DeviceID, Volume, and environment variables, to kubelet. Kubelet allocates resources to the pod.

  5. Ascend Device Plugin writes the Ascend 910 AI Processor NIC IP address and the DeviceID allocated to the pod into the annotation field of the pod.

  6. HCCL-Controller continuously monitors changes to the volcano job and its pods. When a new pod is created, HCCL-Controller reads the annotation value from the pod. After the information of all pods of the volcano job is obtained, HCCL-Controller updates the rings-config ConfigMap.

  7. The training job container in the pod continuously checks the status of the ConfigMap. Once the status is complete, the hccl.json file can be generated from the ConfigMap.

HCCL-Controller Service Rules

HCCL-Controller is a component that generates the hccl.json file for all pods of a training job. This component is dedicated to Kubernetes clusters of Atlas 800 training servers.

  • For training jobs, the label ring-controller.atlas: ascend-910 must be set on pods and ConfigMaps. HCCL-Controller filters data using this label to distinguish the Ascend 910 scenario from other scenarios.
  • A volcano job is mapped to its ConfigMap through the volume named ascend-910-config in the volcano job.yaml: the ConfigMap referenced by that volume is the ConfigMap of the volcano job.
  • HCCL-Controller continuously monitors changes to the volcano job, pod, and ConfigMap (only objects that carry the label are monitored). The volcano job and ConfigMap of the same training job are associated through the volume (ascend-910-config). When a new pod is created, HCCL-Controller reads the value of the annotation (ascend.kubectl.kubernetes.io/ascend-910-configuration) from the pod and creates a data cache information table for the volcano job. After the information of all instances of the volcano job is obtained, HCCL-Controller updates the corresponding rings-config ConfigMap.
  • The default file name of rings-config in the ConfigMap is hccl.json, and the default mounting path is /user/serverid/devindex/config.
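
The label filter and the volume-based mapping in the rules above can be sketched as follows. The example deliberately avoids the Kubernetes API types to stay self-contained: the Volume struct, the functions HasAscendLabel and ConfigMapForJob, and the ConfigMap name rings-config-example-job are hypothetical. Only the label key and value and the volume name come from the rules.

```go
package main

import "fmt"

const (
	labelKey   = "ring-controller.atlas" // label key from the service rules
	labelValue = "ascend-910"            // label value from the service rules
	volumeName = "ascend-910-config"     // volume that links a job to its ConfigMap
)

// Volume is a simplified stand-in for a job volume backed by a ConfigMap.
type Volume struct {
	Name          string
	ConfigMapName string // empty when the volume is not backed by a ConfigMap
}

// HasAscendLabel reports whether an object's labels mark it as part of
// the Ascend 910 scenario.
func HasAscendLabel(labels map[string]string) bool {
	return labels[labelKey] == labelValue
}

// ConfigMapForJob returns the name of the ConfigMap referenced by the
// volume named ascend-910-config, if the job declares one.
func ConfigMapForJob(volumes []Volume) (string, bool) {
	for _, v := range volumes {
		if v.Name == volumeName && v.ConfigMapName != "" {
			return v.ConfigMapName, true
		}
	}
	return "", false
}

func main() {
	podLabels := map[string]string{labelKey: labelValue}
	volumes := []Volume{
		{Name: "code"},
		{Name: volumeName, ConfigMapName: "rings-config-example-job"},
	}
	if !HasAscendLabel(podLabels) {
		fmt.Println("object ignored: label not set")
		return
	}
	if name, ok := ConfigMapForJob(volumes); ok {
		fmt.Println("ConfigMap for this job:", name)
	}
}
```

Objects without the label are skipped before any further processing, so a missing or misspelled label is a likely first thing to check when no hccl.json content appears.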

Building HCCL-Controller

  1. Install the Go compilation environment and configure GOPROXY.

  2. Run the following commands to build HCCL-Controller:

    cd build

    bash build.sh

    The files generated after building are stored in the output directory in the root directory of the source code, as shown in Table 1.

    Table 1 Files generated after building

    File                              Description
    hccl-controller                   HCCL-Controller binary file
    Dockerfile                        Image building text file for HCCL-Controller
    hccl-controller-{version}.yaml    HCCL-Controller startup configuration file

    NOTE:

    • {version}: indicates the version number. Set it based on the actual situation.
    • The binary dependency of ARM is different from that of x86. Therefore, compilation needs to be performed on the corresponding architecture.

Prerequisites

Perform operations described in all sections except "Preparing Software Packages" in section "Preparing for Installation" in the MindX DL User Guide.

For details, see "Installation and Deployment > Preparations Before Installation" in the MindX DL User Guide.

Installing HCCL-Controller

For details, see "Installation and Deployment > Installing MindX DL > Installing HCCL-Controller" in the MindX DL User Guide.

Environment Dependencies

  • Kubernetes 1.16 or later
  • Go 1.13 or later

Directory Structure

hccl-controller                                              # HCCL-Controller component
├── build                                                  # Build folder
│   ├── build.sh
│   ├── Dockerfile
│   ├── hccl-controller.yaml
│   ├── rbac.yaml
│   └── test.sh
├── doc
│   └── images
│       ├── Controller-interaction-process.png
│       ├── HCCL-Controller-process.png
│       ├── icon-caution.gif
│       ├── icon-danger.gif
│       ├── icon-note.gif
│       ├── icon-notice.gif
│       ├── icon-tip.gif
│       └── icon-warning.gif
├── go.mod
├── go.sum
├── main.go
├── output
├── pkg                                                    # Source code file
│   ├── hwlog
│   │   └── logger.go
│   ├── resource-controller
│   │   └── signals
│   │       ├── signal.go
│   │       ├── signal_posix.go
│   │       └── signal_windows.go
│   └── ring-controller
│       ├── agent
│       │   ├── businessagent.go
│       │   ├── businessagent_test.go
│       │   ├── deploymentworker.go
│       │   ├── deploymentworker_test.go
│       │   ├── types.go
│       │   ├── vcjobworker.go
│       │   └── vcjobworker_test.go
│       ├── controller
│       │   ├── controller.go
│       │   ├── controller_test.go
│       │   └── types.go
│       ├── model
│       │   ├── deployment.go
│       │   ├── deployment_test.go
│       │   ├── types.go
│       │   ├── vcjob.go
│       │   └── vcjob_test.go
│       └── ranktable
│           ├── v1
│           │   ├── ranktable.go
│           │   ├── ranktable_test.go
│           │   └── types.go
│           └── v2
│               ├── ranktable.go
│               ├── ranktable_test.go
│               └── types.go
├── README_EN.md
└── README.md

Version Updates

Version     Date          Description
v2.0.2      2021-07-15    Added the interaction information check with Kubernetes.
v2.0.1      2021-03-30    Added support for Deployment.
v20.2.0     2020-12-30    Updated the Directory Structure section.
v20.1.0     2020-09-30    This is the first official release.