Running an OpenMPI-equipped Docker container across multiple nodes

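Motivation

As I described in an earlier post, my goal is to run Athena++ on multiple nodes. As a preliminary step, I tried running a Docker container with OpenMPI set up across multiple nodes. I ran into a few difficulties along the way, and since this may be useful to others, I am writing up what I did.

Sources

1. Trying Horovod in Docker: an article I posted a few months ago. Horovod also uses OpenMPI, and what I learned then about passwordless ssh connections proved useful again this time.
2. Horovod in Docker: the official manual, which I also consulted for the article above.
3. mpiによる並列計算 (Parallel computing with MPI): I reused the sample program from this page. It also describes how to set up passwordless ssh for inter-node communication when using OpenMPI.
4. Horovod with MPI: describes how to specify the ssh port on the mpirun command line, which was the hardest part this time.
5. mpirun(1) man page (version 4.1.6): consulted for the mpirun command-line options.

Procedure

Creating the Docker container

As the following Dockerfile shows, ssh and openmpi were installed from the Ubuntu 22.04 packages. This installed openssh 8.9p1 and openmpi 4.1.2. Without libopenmpi-dev, compiling with mpicc fails with an error that mpi.h is missing.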

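# Based on a Dockerfile for building a JupyterLab-capable Docker image,
# this creates a Docker container for Athena++ development and execution.

# Base image: Ubuntu 22.04 (jammy), pinned to the 20240111 snapshot.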
FROM ubuntu:jammy-20240111

# Set bash as the default shell
ENV SHELL=/bin/bash

# Build with some basic utilities
RUN apt update && apt install -y \
    build-essential \
    python3-pip apt-utils vim \
    git git-lfs \
    curl unzip wget gnuplot \
    openmpi-bin libopenmpi-dev \
    openssh-client openssh-server

# alias python='python3'
RUN ln -s /usr/bin/python3 /usr/bin/python

# Install the required Python packages
RUN pip install -U pip setuptools \
        && pip install numpy scipy h5py mpmath

# The following stuff is derived for horovod in docker.
# Allow OpenSSH to talk to containers without asking for confirmation
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

# --allow-run-as-root
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

# Create a working directory
WORKDIR /workdir

# command prompt
CMD ["/bin/bash"]

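The part that modifies /etc/ssh/ssh_config follows the Dockerfile from Horovod in Docker.

The Dockerfile also sets the environment variables needed to run OpenMPI as root. Without them, starting mpirun as root (inside the Docker container) produces the following message.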

--------------------------------------------------------------------------
mpirun has detected an attempt to run as root.

Running as root is *strongly* discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

We strongly suggest that you run mpirun as a non-root user.

You can override this protection by adding the --allow-run-as-root option
to the cmd line or by setting two environment variables in the following way:
the variable OMPI_ALLOW_RUN_AS_ROOT=1 to indicate the desire to override this
protection, and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 to confirm the choice and
add one more layer of certainty that you want to do so.
We reiterate our advice against doing so - please proceed at your own risk.
--------------------------------------------------------------------------

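I built the container image as follows. The image is tagged openmpi so that it matches the openmpi:latest name used in the docker run commands later on.

$ sudo docker build -t openmpi ./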

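Execution environment

Here is a brief description of the execution environment.

Primary

On the primary node (hostname: europe), /ext/nfs/ssh holds the key pair created with ssh-keygen, together with an authorized_keys file to which id_rsa.pub has been appended. Note that the permissions on authorized_keys must be 600.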
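
The key setup described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands used for this article: the helper name setup_shared_keys is hypothetical, and an RSA key with an empty passphrase is assumed because the text mentions id_rsa.pub and passwordless login.

```shell
# Hypothetical helper: prepare a directory that is later mounted as /root/.ssh
# in every container. Assumes ssh-keygen (openssh-client) is installed.
setup_shared_keys() {
    local dir="$1"
    mkdir -p "$dir"
    chmod 700 "$dir"
    if [ ! -f "$dir/id_rsa" ]; then
        # Generate an RSA key pair with an empty passphrase.
        ssh-keygen -q -t rsa -N "" -f "$dir/id_rsa"
        # Append the public key so containers sharing this directory can log in.
        cat "$dir/id_rsa.pub" >> "$dir/authorized_keys"
    fi
    # Make sure the file exists and is readable/writable by the owner only.
    touch "$dir/authorized_keys"
    chmod 600 "$dir/authorized_keys"
}

# Example (on the primary node): setup_shared_keys /ext/nfs/ssh
```

Because the same directory is mounted as /root/.ssh on every node, a single key pair serves both as the client identity and as the authorized key.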

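/ext/nfs, which contains /ext/nfs/ssh and the /ext/nfs/athena++ directory created in this article, is NFS-mounted by the other nodes.

The test program is placed in /ext/nfs/athena++/tmp. Because that directory is NFS-mounted, the other nodes can access it as well.

The container is started as follows.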

sudo docker run -it --rm --net=host \
-v /ext/nfs/ssh:/root/.ssh \
-v /ext/nfs/athena++:/workdir \
openmpi:latest

Secondary

The following line has been added to /etc/fstab on the secondary node (hostname: ganymede). It NFS-mounts /ext/nfs of the primary node as /mnt/nfs2. xxx.xxx.xx.xx is the IP address of the primary node.

xxx.xxx.xx.xx:/ext/nfs	/mnt/nfs2	nfs

On the secondary node, the container is started as follows.

sudo docker run -it --rm --net=host \
-v /mnt/nfs2/ssh:/root/.ssh \
-v /mnt/nfs2/athena++:/workdir \
openmpi:latest

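On both the primary and secondary nodes, /etc/hosts lists the IP addresses and host names of all connected nodes. Because the containers are started with --net=host, the physical machine's network is exposed to the container as is.

Compiling the sample program

Create the following sample program (hello.c) in /ext/nfs/athena++/tmp on the primary node beforehand.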

#include <stdio.h>
#include "mpi.h"
 
int main( int argc, char *argv[] )
{
    int     rank, size, len;
    char    name[MPI_MAX_PROCESSOR_NAME];
 
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Get_processor_name( name, &len );
    name[len] = '\0';
 
    printf( "Hello world: rank %d of %d running on %s\n", rank, size, name );
 
    MPI_Finalize();
    return 0;
}

Start the container, move to tmp, and compile the sample program.

# pwd
/workdir/tmp
# mpicc -o hello hello.c

First, run it on the primary node only.

# mpirun -n 4 ./hello
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Hello world: rank 1 of 4 running on europe
Hello world: rank 2 of 4 running on europe
Hello world: rank 3 of 4 running on europe
Hello world: rank 0 of 4 running on europe

Running the sample program on two nodes

Secondary

Start the container on the secondary node, then start sshd as follows. The port number is 12345, the same as in the Horovod in Docker setup.

# /usr/sbin/sshd -p 12345
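
Before involving mpirun, it can help to confirm from the primary container that key-based login to this sshd works. The helper below is a hypothetical sketch (check_ssh is not part of OpenMPI or of this setup); it relies only on standard OpenSSH client options.

```shell
# Hypothetical helper: verify passwordless ssh to a host on a custom port.
check_ssh() {
    local host="$1"
    local port="${2:-12345}"   # default matches the sshd port used in this article
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a broken key setup shows up as an error rather than a hang.
    if ssh -p "$port" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "ok: $host:$port"
    else
        echo "FAILED: $host:$port" >&2
        return 1
    fi
}

# Example (inside the primary container): check_ssh ganymede 12345
```

If this check fails, mpirun would most likely fail to launch on that host as well, so it narrows the problem down to ssh rather than MPI.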

Primary

First, create hosts.txt, the list of machines that MPI connects to.

# cat hosts.txt
europe
ganymede
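
The hostfile above lists only host names. Open MPI hostfiles also accept a slots=N entry per host, which states explicitly how many ranks each node should receive. A minimal sketch follows; the helper name make_hostfile is hypothetical, and the value of 4 slots per node is an assumption chosen to match the 4 + 4 split seen in the run below.

```shell
# Hypothetical helper: print an Open MPI hostfile with the same slot count for every host.
# Usage: make_hostfile SLOTS HOST...
make_hostfile() {
    local slots="$1"
    shift
    local host
    for host in "$@"; do
        # One line per host in the form "<hostname> slots=<n>".
        printf '%s slots=%s\n' "$host" "$slots"
    done
}

# Example: make_hostfile 4 europe ganymede > hosts.txt
```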

Then run it as follows.

# mpirun -hostfile hosts.txt -mca plm_rsh_args "-p 12345" -n 8 ./hello
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Hello world: rank 2 of 8 running on europe
Hello world: rank 3 of 8 running on europe
Hello world: rank 0 of 8 running on europe
Hello world: rank 1 of 8 running on europe
Hello world: rank 5 of 8 running on ganymede
Hello world: rank 6 of 8 running on ganymede
Hello world: rank 7 of 8 running on ganymede
Hello world: rank 4 of 8 running on ganymede
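
Because the ranks print in no particular order, a quick tally per host makes it easier to confirm how they were distributed. The function name count_ranks_per_host is hypothetical; it only parses the "Hello world" lines printed by hello.c and ignores everything else.

```shell
# Hypothetical helper: count how many ranks reported from each host.
# Reads the program output on stdin and prints "<host> <count>" per host.
count_ranks_per_host() {
    # The host name is the last field of each "Hello world: ..." line.
    awk '/^Hello world:/ { n[$NF]++ } END { for (h in n) print h, n[h] }' | sort
}

# Example:
# mpirun -hostfile hosts.txt -mca plm_rsh_args "-p 12345" -n 8 ./hello | count_ranks_per_host
```

For the run shown above, this would report 4 ranks on europe and 4 on ganymede.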

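It took some effort to arrive at the -mca plm_rsh_args "-p 12345" part of the command line above.

With Horovod in Docker, the port number was specified on the horovodrun command line, so I assumed mpirun had some equivalent and kept searching until I found the fourth source listed above (Horovod with MPI).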
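
For reference, plm_rsh_args passes extra arguments to the ssh command that mpirun uses to launch on remote nodes, which is how "-p 12345" ends up selecting the ssh port. Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>, so the same setting can presumably be written as below. This is a sketch that was not run on the setup in this article; run_on_cluster and SSH_PORT are hypothetical names.

```shell
SSH_PORT=12345   # assumed to match the port sshd listens on in the secondary container

# Hypothetical wrapper: run mpirun with the ssh port supplied through the environment
# instead of "-mca plm_rsh_args" on the command line.
run_on_cluster() {
    OMPI_MCA_plm_rsh_args="-p ${SSH_PORT}" mpirun -hostfile hosts.txt "$@"
}

# Example: run_on_cluster -n 8 ./hello
```

Keeping the port in one variable avoids repeating it on every mpirun invocation, at the cost of making the setting less visible than the explicit -mca form.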

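Next steps

As shown above, Docker containers can now be started on multiple nodes, and a program inside them can run in a distributed fashion. Next, I would like to move on to distributed processing with MPI in Athena++. Before that, I also want to look into the performance of distributed processing with OpenMPI.

The experience with Horovod in Docker was a great help this time.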