GPU cluster construction with Kubernetes (not yet completed)

Motivation

In order to study Kubernetes, I took on the challenge of building a GPU cluster out of several workstations I have set up at home.

The biggest hurdle for me, as a Kubernetes beginner, was working out from the documentation which operations are needed only at installation time and which are needed when building (and operating) the cluster.

As it stands, the GPU cluster is not operational! This document is incomplete.

Sources

Installation Summary

Policies

Proceed with the installation based on the NVIDIA documentation above. There are some options to choose from, but I decided to install Kubernetes with “kubeadm” and to use the “NVIDIA GPU Operator” for the NVIDIA-related software. The NVIDIA driver and NVIDIA Container Toolkit were already installed in my environment.

Because the structure (indentation) of the NVIDIA documentation makes it hard to see how the steps relate to each other, the following outline shows the whole installation process step by step. (Each number roughly corresponds to one command in the NVIDIA documentation.)

Environment

The environment in which kubernetes is installed is shown in the following figure.

[Figure: System configuration]

Installation Instructions

Step 1: Install Container Engine

(1) Install some prerequisite packages for containerd
(2) Configure the overlay and br_netfilter kernel modules to load
(3) Set sysctl parameters in a conf file
(4) Configure the Docker repository
(5) Install containerd
(6) Set containerd default parameters in config.toml (create it)
(7) Modify config.toml so that containerd uses the systemd cgroup driver
(8) Restart the containerd daemon

Step 2: Install Kubernetes components

(1) Install some dependencies
(2) Add the package repository keys
(3) Add the repository
(4) Install kubelet, kubectl, and kubeadm
(5) Note no. 1: Configure the cgroup driver for kubelet
(6) Note no. 2: Restart kubelet
(7) Disable swap
(8) Run kubeadm init
(9) Copy authentication files under $HOME

Step 3: Configure network

(1) Configure network in Calico
(2) Assign the worker role to master as well

Step 4: Set up NVIDIA software (using NVIDIA GPU Operator)

(1) Install helm
(2) Add NVIDIA Helm repository
(3) Install GPU Operator

Now run the “Containerd” part of “Bare-metal/Passthrough with pre-installed drivers and NVIDIA Container Toolkit”.

1) Edit config.toml
2) Run helm install
(4) Confirm GPU Operator installation
(5) Run sample GPU application

Separate master (control plane) node and worker node

The following shows which node to execute each step of the installation described in the previous section.

  • Step 1 (1) through Step 2 (7): perform on every node, both master and worker nodes.

  • Step 2 (8) and Step 2 (9): execute only on the master node; on the worker nodes, execute “kubeadm join” instead.

  • Step 3 and Step 4: perform only on the master node.

The installation procedure also includes the steps for building the cluster with kubeadm join; these are described later.

Actual operations

Step 1: Install Container Engine

(1) Install required packages
$ sudo apt-get update \
> && sudo apt-get install -y apt-transport-https \
> ca-certificates curl software-properties-common

In my environment, all were the latest versions.

(2) Configure the overlay and br_netfilter kernel modules to load
$ cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
> overlay
> br_netfilter
> EOF
$ sudo modprobe overlay \
> && sudo modprobe br_netfilter
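
As an optional check (not part of the NVIDIA procedure), you can confirm that both modules are loaded:
$ lsmod | grep -E 'overlay|br_netfilter'    # both module names should appear in the output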
(3) Set sysctl parameters in conf file
$ cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
> net.bridge.bridge-nf-call-iptables = 1
> net.ipv4.ip_forward = 1
> net.bridge.bridge-nf-call-ip6tables = 1
> EOF

$ sudo sysctl --system
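
Optionally (this check is not in the NVIDIA documentation), the values can be verified after reloading:
$ sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward net.bridge.bridge-nf-call-ip6tables    # each should report 1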
(4) Configure Docker repository
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key --keyring /etc/apt/trusted.gpg.d/docker.gpg add -

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
> $(lsb_release -cs) \
> stable"
(5) Install containerd
$ sudo apt-get update \
> && sudo apt-get install -y containerd.io

In my environment, containerd.io had the latest version (1.6.7-1) already installed.

(6) Set default parameters for containerd by (creating) config.toml
$ sudo mkdir -p /etc/containerd \
> && sudo containerd config default | sudo tee /etc/containerd/config.toml

Since containerd.io was already installed, the /etc/containerd directory and config.toml already existed. config.toml was renamed and saved.
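
For example, a backup like the following can be taken before regenerating the file (the backup file name here is arbitrary, not necessarily the one I used):
$ sudo cp /etc/containerd/config.toml /etc/containerd/config.toml.org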

(7) Modify config.toml so that (containerd) uses systemd cgroup driver

Modify /etc/containerd/config.toml created above as follows. (The diff below compares the modified file against the original, so the "<" lines show the new contents.)

125c125
< SystemdCgroup = true
---
> SystemdCgroup = false
(8) Restart containerd daemon
$ sudo systemctl restart containerd
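
As a quick sanity check (not an NVIDIA step), confirm that containerd came back up:
$ systemctl is-active containerd    # should print "active"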

Step 2: Install Kubernetes components

(1) Install some dependencies
$ sudo apt-get update \
> && sudo apt-get install -y apt-transport-https curl

The latest version was installed in my environment.

(2) Add repository key.
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
(3) Add repository
$ cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
> deb https://apt.kubernetes.io/ kubernetes-xenial main
> EOF
(4) Install kubelet
$ sudo apt-get update \
> && sudo apt-get install -y -q kubelet kubectl kubeadm
(5) Note 1: Configure the cgroup driver for kubelet

This corresponds to item 1 of the Note in the NVIDIA documentation.

At this point, 10-kubeadm.conf already exists under /etc/systemd/system/kubelet.service.d.

$ sudo cat << EOF | sudo tee /etc/systemd/system/kubelet.service.d/0-containerd.conf
> [Service]
> Environment="KUBELET_EXTRA_ARGS=--container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock --cgroup-driver='systemd'"
> EOF
(6) Note 2: Restart kubelet
$ sudo systemctl daemon-reload \
> && sudo systemctl restart kubelet
(7) Disable swap
$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 2G 0B -2
$ sudo swapoff -a
$ swapon --show
$

The above operation is temporary, and swap is enabled again when the server is restarted. To disable it permanently, comment out (insert # at the beginning of) any line in /etc/fstab that contains “swap”. (In my environment, that line starts with /swapfile.)
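
One way to do this (a sketch, not necessarily the exact commands I ran; sed -i.bak keeps a backup of the original fstab):
$ sudo sed -i.bak '/swap/ s/^/#/' /etc/fstab
$ grep swap /etc/fstab    # the swap line should now start with #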

(8) Run kubeadm init
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16
[init] Using Kubernetes version: v1.24.3
... Omitted. ...
Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.3:6443 --token 7ffwr1.xm119vzqvmhqgevl \
	--discovery-token-ca-cert-hash sha256:5d2f3065e38020b668ba1b766d95aea197182e35143511db7062f247f12c81d3 
$

Make a note of this “kubeadm join … sha256…” part. You can join the cluster as a worker node by executing the following on the worker node:

$ sudo kubeadm join 192.168.11.3:6443 --token 7ffwr1.xm119vzqvmhqgevl \
> --discovery-token-ca-cert-hash sha256:5d2f3065e38020b668ba1b766d95aea197182e35143511db7062f247f12c81d3 
[preflight] Running pre-flight checks
... Omitted...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
(9) Copy authentication files under $HOME
$ mkdir -p $HOME/.kube \
> && sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config \
> && sudo chown $(id -u):$(id -g) $HOME/.kube/config
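
At this point kubectl should be able to talk to the cluster; as a simple check (not part of the NVIDIA procedure):
$ kubectl cluster-info
$ kubectl get nodes    # the master node appears here (typically NotReady until the network add-on is installed in Step 3)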

Step 3: Configure network

(1) Configure network in Calico
$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
(2) Assign the worker role to master as well

As the NVIDIA documentation says, “GPU Pods can be scheduled on the simplest single-node clusters”, so Pods can also be scheduled on the master (control plane) node by removing its taint as follows.

$ kubectl taint nodes --all node-role.kubernetes.io/master-

In my environment, I did not want to schedule a pod on the master node, so I did not run this.

At this point there is one master (control plane) node (from the kubeadm init operation) and one worker node (from the kubeadm join operation). The state of the nodes in my environment is as follows.

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
jupiter Ready control-plane 109m v1.24.3
saisei Ready <none> 2m42s v1.24.3

After one more kubeadm join operation, the status is as follows.

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
jupiter Ready control-plane 168m v1.24.3
mokusei Ready <none> 2m3s v1.24.3
saisei Ready <none> 62m v1.24.3

Step 4: Set up NVIDIA software (using NVIDIA GPU Operator)

(1) Install helm
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
> && chmod 700 get_helm.sh \
> && ./get_helm.sh
(2) Add NVIDIA Helm repository
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
> && helm repo update
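
Optionally, the repository registration can be confirmed (a general helm check, not an NVIDIA step):
$ helm repo list    # the "nvidia" repository should be listed
$ helm search repo nvidia/gpu-operator    # shows the chart versions available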
(3) Install GPU Operator

As mentioned in the overview section, run the “Containerd” part of “Bare-metal/Passthrough with pre-installed drivers and NVIDIA Container Toolkit”.

1) Edit config.toml

Edit /etc/containerd/config.toml as follows. (As before, the "<" lines show the new contents.)

79c79
< default_runtime_name = "nvidia"
---
> default_runtime_name = "runc"
125,132d124
< SystemdCgroup = true
< [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
< privileged_without_host_devices = false
< runtime_engine = ""
< runtime_root = ""
< runtime_type = "io.containerd.runc.v1"
< [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
< BinaryName = "/usr/bin/nvidia-container-runtime"

Then restart the containerd daemon.

$ sudo systemctl restart containerd
2) Run helm install
$ helm install --wait --generate-name \
> -n gpu-operator --create-namespace \
> nvidia/gpu-operator \
> --set driver.enabled=false \
> --set toolkit.enabled=false
(4) Confirm GPU Operator installation
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-1660956347-node-feature-discovery-master-78498fqrv 0/1 ContainerCreating 0 22m
gpu-operator-1660956347-node-feature-discovery-worker-d7z25 0/1 ContainerCreating 0 66m
gpu-operator-569d9c8cb-r5d6x 0/1 ContainerCreating 0 70m

Compared to the output shown in the NVIDIA documentation, there is no pod named nvidia-*.
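
To investigate why these pods are stuck, standard kubectl troubleshooting commands such as the following can be used (they are not part of the NVIDIA procedure; the pod name is taken from the output above):
$ kubectl -n gpu-operator describe pod gpu-operator-569d9c8cb-r5d6x    # the Events section shows why the container cannot be created
$ kubectl -n gpu-operator get events --sort-by=.lastTimestamp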

(5) Run sample GPU app.
$ cat sample-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
         nvidia.com/gpu: 1
$ kubectl apply -f sample-gpu.yaml
pod/cuda-vectoradd created
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cuda-vectoradd 0/1 Pending 0 28s
nginx-deployment-6595874d85-88x8d 1/1 Running 0 10m
nginx-deployment-6595874d85-nctbg 1/1 Running 0 10m
nginx-deployment-6595874d85-v7x4n 1/1 Running 0 10m

As shown above, the sample gpu pod, cuda-vectoradd, is still Pending.
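
The reason for the Pending status can be checked with kubectl describe (a general troubleshooting step, not from the NVIDIA documentation); the Events section at the end shows why the scheduler cannot place the pod, for example insufficient nvidia.com/gpu if the device plugin is not running:
$ kubectl describe pod cuda-vectoradd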

Incidentally, the pods created by the nginx Deployment (with replicas=3) were in Running status, as shown below.

$ kubectl apply -f nginx-deployment.yaml
deployment.apps/nginx-deployment created
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-deployment-6595874d85-88x8d 1/1 Running 0 2m55s
nginx-deployment-6595874d85-nctbg 1/1 Running 0 2m55s
nginx-deployment-6595874d85-v7x4n 1/1 Running 0 2m55s
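
The nginx-deployment.yaml file itself is not shown above; for reference, a roughly equivalent manifest (a sketch, not necessarily my exact file) can be generated with kubectl:
$ kubectl create deployment nginx-deployment --image=nginx --replicas=3 \
>   --dry-run=client -o yaml > nginx-deployment.yaml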

Summary

As explained so far, I have reached the point where I can install Kubernetes with kubeadm and build a cluster, but I have not been able to build the GPU cluster that was the initial goal. Even after deploying the sample GPU pod, it stays in the Pending state rather than Running!

Once resolved, I’ll re-update the article with the fix.
