Running Docker containers with OpenMPI on multiple nodes

Motivation

As I mentioned in a previous post, I am working toward the goal of running Athena++ on multiple nodes. As a preliminary step, I tried running Docker containers with OpenMPI configured on multiple nodes. I had some difficulties along the way, so I am posting a summary of what I did in case it is helpful to others.

Sources

  1. Try Horovod in Docker: my own article, posted a few months ago. Horovod also uses OpenMPI and connects between nodes without a password; that is where I learned how to set up passwordless connections for OpenMPI.
  2. Horovod in Docker: the original manual, which I also referred to in the article above.
  3. Parallel Computing with mpi: the sample program used here was adapted from this page. It also describes how to set up passwordless ssh for node-to-node communication with OpenMPI.
  4. Horovod with MPI: this page describes how to specify the ssh port on the mpirun command line, which was the most difficult part of this project.
  5. mpirun(1) man page (version 4.1.6): I referred to this page for the description of the mpirun command line.

Procedure

Creating a Docker container

As you can see in the following Dockerfile, OpenSSH and OpenMPI are installed from the Ubuntu 22.04 packages: openssh is 8.9p1 and openmpi is 4.1.2. libopenmpi-dev must be installed; without it, compiling with mpicc fails with the error “mpi.h: No such file or directory”.

# Based on the Dockerfile for creating a Docker image that JupyterLab can use,
# create a Docker image that can be used for Athena++ development and execution.

# Based on the latest version of ubuntu 22.04.
FROM ubuntu:jammy-20240111

# Set bash as the default shell
ENV SHELL=/bin/bash

# Build with some basic utilities
RUN apt update && apt install -y \
    build-essential \
    python3-pip apt-utils vim \
    git git-lfs \
    curl unzip wget gnuplot \
    openmpi-bin libopenmpi-dev \
    openssh-client openssh-server

# alias python='python3'
RUN ln -s /usr/bin/python3 /usr/bin/python

# Install the required Python packages
RUN pip install -U pip setuptools \
        && pip install numpy scipy h5py mpmath

# The following is derived from Horovod in Docker.
# Allow OpenSSH to talk to containers without asking for confirmation
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# Equivalent to passing --allow-run-as-root to mpirun
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

# Create a working directory
WORKDIR /workdir

# command prompt
CMD ["/bin/bash"]

The part that modifies /etc/ssh/ssh_config was taken from the Dockerfile in Horovod in Docker.

I also set, in the Dockerfile, the environment variables needed to run OpenMPI as root. Without them, running mpirun as root (which is what you get by default in a Docker container) produces the following message; the command-line alternative it mentions is shown right after it.

--------------------------------------------------------------------------
mpirun has detected an attempt to run as root.

Running as root is *strongly* discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

We strongly suggest that you run mpirun as a non-root user.

You can override this protection by adding the --allow-run-as-root option
to the cmd line or by setting two environment variables in the following way:
the variable OMPI_ALLOW_RUN_AS_ROOT=1 to indicate the desire to override this
protection, and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 to confirm the choice and
add one more layer of certainty that you want to do so.
We reiterate our advice against doing so - please proceed at your own risk.
--------------------------------------------------------------------------
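
As the message itself explains, the same override can be given on the command line instead of through environment variables; using the sample program compiled later in this post as an example:

# mpirun --allow-run-as-root -n 4 ./hello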

The following command was used to build the container image.

$ sudo docker build -t openmpi ./

Execution Environment

Here is a brief description of the execution environment.

Primary

On the primary node (hostname: europe), the key pair created with ssh-keygen is stored under /ext/nfs/ssh, and id_rsa.pub is appended to authorized_keys in the same directory. The permissions of authorized_keys must be 600.
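
For reference, a minimal sketch of that preparation, assuming the key pair is generated directly into /ext/nfs/ssh with no passphrase:

$ mkdir -p /ext/nfs/ssh
$ ssh-keygen -t rsa -N "" -f /ext/nfs/ssh/id_rsa
$ cat /ext/nfs/ssh/id_rsa.pub >> /ext/nfs/ssh/authorized_keys
$ chmod 600 /ext/nfs/ssh/authorized_keys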

The /ext/nfs directory, including /ext/nfs/ssh and the /ext/nfs/athena++ directory created for this article, is NFS-mounted from the other nodes.
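
For illustration, exporting /ext/nfs on the primary node requires an /etc/exports entry roughly like the one below; the client range and options are placeholders and depend on your environment:

/ext/nfs    xxx.xxx.xx.0/24(rw,sync,no_subtree_check)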

The program for testing is placed in /ext/nfs/athena++/tmp, which is NFS-mounted and accessible from other nodes.

A container is started from the image built above as follows.

sudo docker run -it --rm --net=host \
-v /ext/nfs/ssh:/root/.ssh \
-v /ext/nfs/athena++:/workdir \
openmpi:latest

Secondary

The following line is added to /etc/fstab on the secondary node (hostname: ganymede). It mounts /ext/nfs on the primary node as /mnt/nfs2, where xxx.xxx.xx.xx is the IP address of the primary node.

xxx.xxx.xx.xx:/ext/nfs	/mnt/nfs2	nfs
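
After adding this entry, the share can be mounted right away (without waiting for a reboot):

$ sudo mkdir -p /mnt/nfs2
$ sudo mount /mnt/nfs2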

On the secondary node, the container is started as follows.

sudo docker run -it --rm --net=host \
-v /mnt/nfs2/ssh:/root/.ssh \
-v /mnt/nfs2/athena++:/workdir \
openmpi:latest

On both the primary and secondary nodes, /etc/hosts contains the IP address and hostname of every connected node. Since the containers are started with --net=host, the physical machine's network is visible to the container as-is.
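
For illustration, the relevant /etc/hosts entries look like this, with the addresses masked as before:

xxx.xxx.xx.xx    europe
xxx.xxx.xx.yy    ganymede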

Compile the sample program

Create the following sample program (hello.c) in /ext/nfs/athena++/tmp on the primary node in advance.

#include <stdio.h>
#include "mpi.h"
 
int main( int argc, char *argv[] )
{
    int     rank, size, len;
    char    name[MPI_MAX_PROCESSOR_NAME];
 
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Get_processor_name( name, &len );
    name[len] = '\0';
 
    printf( "Hello world: rank %d of %d running on %s\n", rank, size, name );
 
    MPI_Finalize();
    return 0;
}

Start the container, move to tmp, and compile the sample program.

# pwd
/workdir/tmp
# mpicc -o hello hello.c

First, let’s run it on the Primary node only.

# mpirun -n 4 ./hello
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Hello world: rank 1 of 4 running on europe
Hello world: rank 2 of 4 running on europe
Hello world: rank 3 of 4 running on europe
Hello world: rank 0 of 4 running on europe

Run the sample program on two nodes

Secondary

Start the container on the secondary node and start sshd as follows. The port number is 12345, the same as in Horovod in Docker.

# /usr/sbin/sshd -p 12345
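
Back on the primary node, it is worth checking from inside the container that passwordless ssh to the secondary node works on that port before invoking mpirun; the following should print the remote hostname without asking for a password:

# ssh -p 12345 ganymede hostname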

Next, create the list of machines MPI should connect to in hosts.txt.

# cat hosts.txt
europe
ganymede
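
As an aside, OpenMPI's hostfile format also accepts a slots field to limit how many processes are launched on each host; I did not use it here, but it would look like this:

europe slots=4
ganymede slots=4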

Then, execute the following.

# mpirun -hostfile hosts.txt -mca plm_rsh_args "-p 12345" -n 8 ./hello
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Hello world: rank 2 of 8 running on europe
Hello world: rank 3 of 8 running on europe
Hello world: rank 0 of 8 running on europe
Hello world: rank 1 of 8 running on europe
Hello world: rank 5 of 8 running on ganymede
Hello world: rank 6 of 8 running on ganymede
Hello world: rank 7 of 8 running on ganymede
Hello world: rank 4 of 8 running on ganymede

I had a hard time arriving at the option -mca plm_rsh_args "-p 12345" in the command line above.

In Horovod in Docker, the port number is specified on the horovodrun command line, so I was looking for an equivalent way to do this with mpirun and came across source 4.
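
For reference, OpenMPI MCA parameters can also be set through environment variables named OMPI_MCA_<parameter>, so the same port specification can be written without the -mca option:

# export OMPI_MCA_plm_rsh_args="-p 12345"
# mpirun -hostfile hosts.txt -n 8 ./hello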

The future

As described above, I can now launch Docker containers on multiple nodes and run a program distributed across the containers. Next, I would like to run Athena++ with distributed processing over MPI. Before that, I plan to investigate the performance of distributed processing with OpenMPI.

The experience with Horovod in Docker was very useful this time.