Motivation
I have been interested in distributed training for about a year, and I have been experimenting with a distributed training framework called Horovod on multiple machines equipped with TITAN V GPUs. I finally got a distributed training sample working, so I am posting it here.
Sources
- Horovod in Docker: the original manual. Its command lines for running docker are outdated, but it is still a valuable source.
- Docker manual (Japanese): the Rootless Docker manual, since my Docker environment was built in rootless mode.
- Docker manual (English): the English version of the above manual.
Overview / Background
I started this project in August this year. Because Horovod depends on a lot of related middleware, my policy was to run it in a Docker container instead of installing it directly on a physical machine. Judging from the Horovod in Docker manual, this looked very easy.
As I mentioned in the sources, my Docker environment was set up as Rootless Docker. This is also why it took me a long time to get the sample working with Horovod in Docker.
Conditions for ssh operation in Horovod in Docker
In Horovod in Docker, the first machine (hereinafter referred to as “Primary”) must be able to log in to the second machine (hereinafter referred to as “Secondary”) over ssh without a password. To do this, run ssh-keygen on the Primary and distribute the public key (id_rsa.pub) to the Secondaries. The specific method is described below.
How to start Horovod in Docker
Initially, I ran Horovod on two machines, one as Primary and the other as Secondary, as follows.
As described in source 1, I placed the ssh keys on NFS and used -v to make them visible as /root/.ssh in the container. The specific command lines are shown below. docker is not run with sudo because this is a Rootless Docker environment.
Secondary
On the Secondary machine, start the Horovod container and wait for ssh connections on port 12345 as follows:
$ docker run -it --gpus all --net=host -v /mnt/nfs2/ssh:/root/.ssh horovod/horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
Primary
On the Primary machine, start the horovod container and run the code written for horovod (pytorch_mnist.py) with horovodrun.
$ docker run -it --gpus all --net=host -v /mnt/nfs2/ssh:/root/.ssh horovod/horovod:latest
root@ganymede:/horovod/examples/pytorch# horovodrun -np 2 -H 192.168.11.4:1,192.168.11.3:1 -p 12345 python pytorch_mnist.py
(output omitted)
raise RuntimeError('could not connect to some hosts via ssh')
RuntimeError: could not connect to some hosts via ssh
As shown above, ssh could not connect, and this is where my struggle with ssh began. As the horovodrun command shows, I did not modify /etc/hosts in the container; the Primary and Secondary machines are specified directly by IP address.
Findings
I installed the iproute2 package in the Secondary container and used the ss command (the successor to netstat) to confirm that the container is listening on port 12345. In this run, ganymede is the Secondary.
root@ganymede:/horovod/examples# apt update
root@ganymede:/horovod/examples# apt install iproute2
root@ganymede:/horovod/examples# apt install iputils-ping
root@ganymede:/horovod/examples# ip -br -4 address
root@ganymede:/etc/ssh# ss -nltu
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
udp UNCONN 0 0 *:7946 *:*
tcp LISTEN 0 128 0.0.0.0:12345 0.0.0.0:*
tcp LISTEN 0 128 *:7946 *:*
tcp LISTEN 0 128 *:2377 *:*
tcp LISTEN 0 128 [::]:12345 [::]:*
On the ganymede host itself, I ran ss again to check port 12345 and found that the host was not listening on it.
In summary, the container listens on port 12345, but the host does not.
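The same check can also be made from another machine without logging in to the host. The following is a minimal sketch in Python, not part of the original setup; the function name is_port_open is hypothetical. It only tests whether a TCP connection to the given address and port is accepted.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port is accepted within timeout.

    Hypothetical helper for illustration. A True result only means that
    something accepted the connection, not that sshd is answering.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling is_port_open("192.168.11.4", 12345) from the Primary would be one way to see whether the Secondary's address accepts connections on the sshd port before running horovodrun.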
What I found out from the manual
At this point, I carefully reread sources 2 and 3 and found the following statement:
--net=host doesn't listen ports on the host network namespace
This is an expected behavior, as the daemon is namespaced inside RootlessKit's network namespace. Use docker run -p instead.
Connecting host and container ports with -p instead of --net=host
Following the manual, I started the container without --net=host and used -p to map the ports:
$ docker run -it --gpus all -p12345:22 -v /mnt/nfs2/ssh:/root/.ssh horovod/horovod:latest
When I connected via ssh from another host, I got the following error.
$ ssh 192.168.11.4 -p 12345
kex_exchange_identification: Connection closed by remote host
Connection closed by 192.168.11.4 port 12345
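The kex_exchange_identification message indicates that the connection was closed before the client received the server's SSH identification string. As a rough way to see this without the ssh client, the sketch below connects to a port and reads the first line the server sends. The function name read_ssh_banner is hypothetical, and the code is a simplification that only inspects the first line.

```python
import socket
from typing import Optional


def read_ssh_banner(host: str, port: int, timeout: float = 3.0) -> Optional[str]:
    """Return the server's SSH identification line, or None if none arrives.

    Hypothetical helper for illustration. Simplification: a server may send
    other lines before the "SSH-" line; only the first line is checked here.
    """
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Read until the first newline, the size limit, or the peer closes.
            while b"\n" not in data and len(data) < 256:
                chunk = sock.recv(256)
                if not chunk:
                    break
                data += chunk
    except OSError:
        # Connection refused, reset, or timed out.
        return None
    line = data.split(b"\n", 1)[0].strip().decode("ascii", errors="replace")
    return line if line.startswith("SSH-") else None
```

A return value of None for a port that does accept connections would match the symptom above: something accepts the connection, but no sshd answers behind it.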
At this point I ran out of steam.
I finally gave up on Rootless Docker and uninstalled it from all machines.
The following was run in a rootful Docker environment.
Run Horovod in Docker
Create ssh public and private keys
Create the public and private keys in the NFS-mounted area as follows. Leave the passphrase empty so that login works without a password.
# ssh-keygen -t rsa
Enter file in which to save the key (/root/.ssh/id_rsa): /mnt/nfs2/ssh/id_rsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
(remaining output omitted)
# cat id_rsa.pub >> authorized_keys
# chmod 600 authorized_keys
# ls -l authorized_keys
-rw------- 1 root root 566 Oct 13 15:33 authorized_keys
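Because the key directory lives on NFS and is mounted into every container, it is worth confirming that the permissions survive the mount: the OpenSSH client refuses a private key that group or others can access, and sshd with its default StrictModes setting rejects an authorized_keys file that group or others can write to. The sketch below is an assumption-laden helper (the name is_owner_only is hypothetical) that applies the stricter owner-only rule to any file.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True if group and others have no permission bits set on path.

    Hypothetical helper for illustration. This is stricter than what sshd
    requires for authorized_keys, but matches the 600 mode used above.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0
```

For the files in this post, is_owner_only("/mnt/nfs2/ssh/id_rsa") and is_owner_only("/mnt/nfs2/ssh/authorized_keys") would both be expected to return True after the chmod 600 step.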
Start Horovod in Docker
Secondaries
$ sudo docker run -it --gpus all --net=host -v /mnt/nfs2/ssh:/root/.ssh horovod/horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
Primary
$ sudo docker run -it --gpus all --net=host -v /mnt/nfs2/ssh:/root/.ssh horovod/horovod:latest
# cd pytorch
# horovodrun -np 2 -H 192.168.11.3:1,192.168.11.4:1 -p 12345 python pytorch_mnist.py
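In the horovodrun commands in this post, each host is given as address:slots, and -np equals the sum of the slots. When the host list grows, building the command line by hand becomes error-prone, so a small helper like the following sketch may help. The function name horovodrun_args is hypothetical; only the flags already used in this post (-np, -H, -p) appear in it.

```python
from typing import List, Sequence, Tuple


def horovodrun_args(hosts: Sequence[Tuple[str, int]], ssh_port: int, script: str) -> List[str]:
    """Build a horovodrun argument list from (address, slots) pairs.

    Hypothetical helper for illustration. -np is set to the sum of the
    slots, as in the commands in this post.
    """
    total_processes = sum(slots for _, slots in hosts)
    host_spec = ",".join(f"{address}:{slots}" for address, slots in hosts)
    return [
        "horovodrun",
        "-np", str(total_processes),
        "-H", host_spec,
        "-p", str(ssh_port),
        "python", script,
    ]


# Reproduces the two-machine command shown above.
print(" ".join(horovodrun_args([("192.168.11.3", 1), ("192.168.11.4", 1)], 12345, "pytorch_mnist.py")))
```

The resulting list could be passed to subprocess.run inside the Primary container, although in this post the command was simply typed at the prompt.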
Measure execution time
After starting the Horovod containers on the Secondaries (with the command line above), I measured the execution time (real) on one to four machines as follows.
# time horovodrun -np 1 -H localhost:1 python pytorch_mnist.py
(output omitted)
[1,0]<stdout>:Test set: Average loss: 0.0551, Accuracy: 98.38%
[1,0]<stdout>:
real 5m19.646s
user 6m14.697s
sys 0m28.747s
# time horovodrun -np 2 -H 192.168.11.4:1,192.168.11.3:1 -p 12345 python pytorch_mnist.py
(output omitted)
[1,0]<stdout>:Test set: Average loss: 0.0546, Accuracy: 98.26%
[1,0]<stdout>:
real 2m54.234s
user 3m46.072s
sys 0m36.676s
# time horovodrun -np 3 -H 192.168.11.4:1,192.168.11.3:1,192.168.11.5:1 -p 12345 python pytorch_mnist.py
(output omitted)
[1,0]<stdout>:Test set: Average loss: 0.0576, Accuracy: 98.03%
[1,0]<stdout>:
real 2m7.043s
user 2m52.191s
sys 0m40.745s
# time horovodrun -np 4 -H 192.168.11.4:1,192.168.11.3:1,192.168.11.5:1,192.168.11.6:1 -p 12345 python pytorch_mnist.py
(output omitted)
[1,0]<stdout>:Test set: Average loss: 0.0542, Accuracy: 98.27%
[1,0]<stdout>:
real 1m34.386s
user 2m33.140s
sys 1m3.089s
The following graph shows the results of the two measurements.
Summary - For the future
This time, I ran into trouble trying to use Docker’s host network in a Rootless Docker environment. However, I learned about ssh authentication, ssh debug logging (ssh -vvv IP address -p port number), and network commands (ss, etc.).
My evaluation of the measured results is as follows. Because the four machines have different CPUs and GPUs, I cannot rigorously discuss the relationship between the number of machines and execution time, but running on four machines (about 1m34s) was clearly faster than running on one machine (about 5m20s).
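To put numbers on this, the sketch below converts the real times reported by the time command into seconds and computes the speedup relative to one machine, along with the speedup divided by the number of machines. The function and variable names are hypothetical; the times are copied from the measurements above. Since the machines are not identical, the ratios are only a rough indication.

```python
import re
from typing import Dict

# "real" values copied from the time output above, keyed by machine count.
REAL_TIMES: Dict[int, str] = {
    1: "5m19.646s",
    2: "2m54.234s",
    3: "2m7.043s",
    4: "1m34.386s",
}


def to_seconds(real: str) -> float:
    """Convert a time string such as '5m19.646s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unexpected time format: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedups(times: Dict[int, str]) -> Dict[int, float]:
    """Return the speedup of each run relative to the one-machine run."""
    base = to_seconds(times[1])
    return {n: base / to_seconds(t) for n, t in times.items()}


for n, s in speedups(REAL_TIMES).items():
    print(f"{n} machine(s): speedup {s:.2f}x, per-machine efficiency {s / n:.0%}")
```

With the times above, this gives speedups of roughly 1.83x, 2.52x, and 3.39x for two, three, and four machines.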
In the future, I would like to learn how to write code for Horovod, take on problems that take somewhat longer to run, and write and run my own Horovod code.