Building a GPU Cluster with Kubernetes - the First Steps

Introduction

About a month ago, I wrote an article about building a GPU cluster with Kubernetes. At that time, the GPU pod stayed in the Pending state and did not work. I have since managed to get it working thanks to advice from a certain person, so I summarize the working procedure here.

In my environment there is still a problem: the GPU pod does not start unless a particular node is up. I also have not yet been able to do what I originally intended, such as requesting a GPU on a given node or pinning a pod to a specific node, so I am calling this "the first steps".

<!--more-->

Sources

  1. [What you need to know to migrate from NVIDIA Docker (NVIDIA Container Toolkit) to nvidia-container-runtime + containerd](https://blog.inductor.me/entry/2020/12/13/042319): the page a certain person pointed me to, which is what got the Pending GPU pod working.

  2. NVIDIA/k8s-device-plugin: NVIDIA's device plugin repository for Kubernetes.

  3. NVIDIA Kubernetes Device Plugin: the page where I got information on the latest version of the device plugin.

Conclusion

The following summary is based on my own experience, so there is no guarantee that the following will apply in every environment.

  • “SystemdCgroup = false” in /etc/containerd/config.toml, set in the procedure of Step 1: Install a Container Engine, should be left as false and not changed to true.
  • The NVIDIA GPU Operator procedure did not work.
  • I changed the procedure in Step 4: Setup NVIDIA Software: instead of using Helm, I installed the device plugin with “kubectl apply -f”. It worked with the latest version (v0.12.3).

Installation procedure

Step 1: Install Container Engine

(1) Install necessary packages
$ sudo apt-get update \
> && sudo apt-get install -y apt-transport-https \
> ca-certificates curl software-properties-common

In my environment, all of these packages were already at their latest versions.

(2) set to load overlay, br_netfilter (kernel) modules
$ cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
> overlay
> br_netfilter
> EOF
$ sudo modprobe overlay \
> && sudo modprobe br_netfilter
(3) Set sysctl parameters in conf file
$ cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
> net.bridge.bridge-nf-call-iptables = 1
> net.ipv4.ip_forward = 1
> net.bridge.bridge-nf-call-ip6tables = 1
> EOF

$ sudo sysctl --system
(4) Configure Docker repository
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key --keyring /etc/apt/trusted.gpg.d/docker.gpg add -

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
> $(lsb_release -cs) \
> stable"
(5) install containerd
$ sudo apt-get update \
> && sudo apt-get install -y containerd.io

In my environment, containerd.io had the latest version (1.6.7-1) already installed.
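If you want to check which version of containerd.io is installed before continuing, one quick way (not part of the original procedure) is apt-cache policy:

$ apt-cache policy containerd.io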

(6) Set default parameters for containerd by (creating) config.toml
$ sudo mkdir -p /etc/containerd \
> && sudo containerd config default | sudo tee /etc/containerd/config.toml

Since containerd.io was already installed, the /etc/containerd directory and config.toml already existed, so I renamed the existing config.toml and kept it as a backup.
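One way to save the existing file before running the command above (the backup name config.toml.org is just an example):

$ sudo mv /etc/containerd/config.toml /etc/containerd/config.toml.org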

(7) Change config.toml so that (containerd) uses systemd cgroup driver.

As mentioned in the conclusion, “SystemdCgroup = false” should be left as it is. No changes are made to /etc/containerd/config.toml here.
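To confirm that the setting is unchanged, you can simply grep the file; with the default config generated above it should still show false:

$ grep SystemdCgroup /etc/containerd/config.toml
            SystemdCgroup = false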

(8) Restart containerd daemon
$ sudo systemctl restart containerd

Step 2: Install Kubernetes components

(1) Install some dependencies
$ sudo apt-get update \
> && sudo apt-get install -y apt-transport-https curl

In my environment, the latest versions were already installed.

(2) Add repository key.
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
(3) Add repository
$ cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
> deb https://apt.kubernetes.io/ kubernetes-xenial main
> EOF
(4) Install kubelet
$ sudo apt-get update \
> && sudo apt-get install -y -q kubelet kubectl kubeadm
(5) Note 1: Configure cgroup driver for kubelet

This corresponds to Note 1 in the NVIDIA documentation.

At this point, 10-kubeadm.conf already exists under /etc/systemd/system/kubelet.service.d.

$ sudo cat << EOF | sudo tee /etc/systemd/system/kubelet.service.d/0-containerd.conf
> [Service]
> Environment="KUBELET_EXTRA_ARGS=--container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock --cgroup-driver='systemd'"
> EOF
(6) Note 2: Restart kubelet
$ sudo systemctl daemon-reload \
> && sudo systemctl restart kubelet
(7) Disable swap
$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 2G 0B -2
$ sudo swapoff -a
$ swapon --show
$

The above operation is temporary; swap is enabled again when the server is restarted. To disable it permanently, comment out any line containing “swap” in /etc/fstab by putting # at the beginning of the line. (In my environment, the line starts with /swapfile.)
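For illustration, the commented-out line in /etc/fstab might look like this (the exact fields depend on how the swap file was set up):

# /swapfile    none    swap    sw    0    0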

(8) Run kubeadm init
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16
[init] Using Kubernetes version: v1.24.3
... Omitted.
Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.3:6443 --token 7ffwr1.xm119vzqvmhqgevl \
	--discovery-token-ca-cert-hash sha256:5d2f3065e38020b668ba1b766d95aea197182e35143511db7062f247f12c81d3 

Make a note of this “kubeadm join … sha256…” part. A machine can join the cluster as a worker node by executing the following on it:

$ sudo kubeadm join 192.168.11.3:6443 --token 7ffwr1.xm119vzqvmhqgevl \
> --discovery-token-ca-cert-hash sha256:5d2f3065e38020b668ba1b766d95aea197182e35143511db7062f247f12c81d3 
[preflight] Running pre-flight checks
... Omitted...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
(9) Copy authentication files under $HOME
$ mkdir -p $HOME/.kube \
> && sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config \
> && sudo chown $(id -u):$(id -g) $HOME/.kube/config

Step 3: Configure the network

(1) Configure network in Calico
$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
(2) Assign the worker role to master as well

As the NVIDIA documentation says “GPU Pods can be scheduled on the simplest single-node clusters”, it is possible to schedule a Pod on the master (control plane) node as well.

$ kubectl taint nodes --all node-role.kubernetes.io/master-

In my environment, I did not want to schedule a pod on the master node, so I did not run this.

At this point there is one master (control plane) node (from the kubeadm init operation) and one worker node (from the kubeadm join operation). The state of the nodes in my environment is as follows.

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
europe Ready control-plane 2m57s v1.25.0
saisei Ready <none> 21s v1.25.0

In the previous article, another node was also added to the cluster, but I did not add it this time.

Step 4: Configure the NVIDIA software

In my environment, the NVIDIA driver was already installed, since a GPU is plugged into the control-plane node. As mentioned in the conclusion, I did not use Helm to install the software.
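As a quick sanity check (not part of the NVIDIA procedure), you can confirm that the driver is working on a GPU node with nvidia-smi:

$ nvidia-smi   # should list the GPU together with the driver and CUDA versions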

The specific steps are as follows.

(1) (for installing nvidia-container-runtime package) Set up nvidia-docker repository
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
> && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
> && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
(2) Install nvidia-container-runtime package
$ sudo apt-get update \
> && sudo apt-get install -y nvidia-container-runtime
(3) Edit config.toml

Edit /etc/containerd/config.toml as follows. (The diff below compares the edited file against the saved original, so the "<" lines are from the edited file.)

79c79
<       default_runtime_name = "nvidia"
---
>       default_runtime_name = "runc"
125,132d124
<             SystemdCgroup = true
<        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
<           privileged_without_host_devices = false
<           runtime_engine = ""
<           runtime_root = ""
<           runtime_type = "io.containerd.runc.v1"
<           [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
<             BinaryName = "/usr/bin/nvidia-container-runtime"

Then restart the containerd daemon.

$ sudo systemctl restart containerd
(4) Install NVIDIA Device Plugin
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
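After applying the manifest, you can check that the device plugin DaemonSet is running and that the GPU node now advertises nvidia.com/gpu as a resource (saisei is the GPU worker node in my environment); for example:

$ kubectl get daemonset -n kube-system | grep nvidia
$ kubectl describe node saisei | grep nvidia.com/gpu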

Step 5: Check

Start the GPU pod, check its status and logs, and confirm that it works, as shown below.

$ cat gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
         nvidia.com/gpu: 1
         
$ kubectl apply -f gpu-pod.yaml
pod/gpu-operator-test created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gpu-operator-test 0/1 Completed 0 8s
$ kubectl logs gpu-operator-test
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Summary

With this, the GPU pod described above now works. However, when another node with a GPU (mokusei) is added to the cluster, the pod does not start on that node and remains in the Pending state.

The pod works while saisei is running, but if I shut down saisei and start the GPU pod with only the mokusei node running, it does not work, as shown below.

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
europe Ready control-plane 68m v1.25.0
mokusei Ready <none> 66m v1.25.0
saisei NotReady <none> 9m51s v1.25.0

$ kubectl apply -f gpu-pod.yaml
pod/gpu-operator-test created

$ kubectl get pods gpu-operator-test
NAME READY STATUS RESTARTS AGE
gpu-operator-test 0/1 Pending 0 2m50s

I plan to resolve the above issues in due course.

To solve this problem, I will also need to look into requesting a GPU and specifying a particular node at the same time (see the sketch below).
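As a starting point for that, a minimal sketch of pinning the GPU pod to a particular node with a nodeSelector (using the node name mokusei from my environment; I have not verified this yet):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  nodeSelector:
    kubernetes.io/hostname: mokusei   # unverified: pin the pod to the mokusei node
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1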
