Motivation
As I wrote in yesterday's post, I was able to run an OpenMPI program in Docker containers spanning multiple nodes. I wanted to find out how much performance I could gain with OpenMPI, so I decided to run a benchmark. I ran into some difficulties this time as well, and I hope that part is helpful to others.
Sources
- Himeno Benchmark: the home page of the Himeno Benchmark.
- Horovod with MPI: this page answered the points I struggled with this time.
- Solution to the problem of OpenMPI (v3.1.3) not being able to parallelize across multiple nodes: I believe this describes the same phenomenon and the same solution, although on physical machines rather than in containers.
Download, Running on a single machine
Download
Download the “C+MPI, static allocate version” from the download page. The archive is lzh-compressed, so install the jlha-utils package and extract the downloaded lzh file.
$ sudo apt install jlha-utils
$ lha x cc_himenobmtxp_mpi.lzh
From extracting the archive to running on a single machine, follow the instructions in the “How to use” section of the download page.
I copied the extracted files to /ext/nfs/athena++/himeno and worked from there. As explained yesterday, this directory is NFS-mounted from the other nodes.
From here on, the work was done inside a Docker container. paramset.sh was given execute permission.
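Before going further, a note on paramset.sh. Judging from the download page and the invocations below, its arguments are the problem size followed by the three decomposition counts, whose product must equal the number of MPI processes:
./paramset.sh <size> <I-decomp> <J-decomp> <K-decomp>    # e.g. 'M 1 1 2' for size M with -np 2; I x J x K must equal -np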
# pwd
/workdir/himeno
# mv param.h param.h.org
# ./paramset.sh M 1 1 2
# cp Makefile.sample Makefile
# ls -l
total 44
-rw-rw-r-- 1 1000 1000 251 Feb 22 12:24 Makefile
-rw-rw-r-- 1 1000 1000 251 Feb 22 12:20 Makefile.sample
-rw-rw-r-- 1 1000 1000 4414 Feb 22 12:19 cc_himenobmtxp_mpi.lzh
-rw-rw-r-- 1 1000 1000 13099 Feb 22 12:20 himenoBMTxps.c
-rw-rw-r-- 1 root root 196 Feb 22 12:23 param.h
-rw-rw-r-- 1 1000 1000 202 Feb 22 12:20 param.h.org
-rwxrwxr-x 1 1000 1000 2079 Feb 22 12:20 paramset.sh
# make
make prints some warnings, but the executable bmt is created and can be run on a single machine.
# mpirun -np 2 ./bmt
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Sequential version array size
mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
mimax = 129 mjmax = 129 mkmax = 131
imax = 128 jmax = 128 kmax =129
I-decomp = 1 J-decomp = 1 K-decomp =2
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 9847.792258 time(s): 0.041767 1.667103e-03
Now, start the actual measurement process.
The loop will be excuted in 4309 times
This will take about one minute.
Wait for a while
cpu : 64.531007 sec.
Loop executed for 4309 times
Gosa : 1.978532e-04
MFLOPS measured : 9155.072245
Score based on Pentium III 600MHz : 110.515116
Having come this far, and with yesterday's experience behind me, I thought it would be easy to run the benchmark on multiple nodes.
More Processes Running Concurrently
Now that I was able to run two processes on a single machine, I decided to run four. To do so, I ran the following commands. For the meaning of the paramset.sh parameters, refer to the download page (and the note above).
# cat hosts.txt
europe
ganymede
jupiter
# ./paramset.sh M 1 2 2
# make
# mpirun --hostfile hosts.txt -mca plm_rsh_args "-p 12345" -np 4 $(pwd)/bmt
The results are as follows.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Invalid number of PE
Please check partitioning pattern or number of PE
To find out what the PE count was at this point (the expected value is 1 x 2 x 2 = 4), I added the following printf to the part of the code that prints the error.
printf("PE=%d\n",npe);
Then I noticed it when I ran make: even though param.h had been updated, himenoBMTxps.c was not recompiled (himenoBMTxps.o was not updated), so bmt was linked from the stale object file. The following sequence works.
# ./paramset.sh M 1 2 2
# make clean
# make
# mpirun --hostfile hosts.txt -mca plm_rsh_args "-p 12345" -np 4 $(pwd)/bmt
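make clean does the job, but if I understand the cause correctly, a lighter fix would be to declare the missing dependency so that make recompiles himenoBMTxps.c whenever param.h changes. A sketch (untested here) is to append one prerequisite-only rule to the Makefile:
# echo 'himenoBMTxps.o: param.h' >> Makefile
With GNU make this simply adds param.h to the prerequisites of himenoBMTxps.o, so a plain make after paramset.sh should rebuild bmt.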
Running on multiple nodes
I decided to run the benchmark while varying the number of processes over 2, 4, 8, 16, 32, and 64. The computation size was set to L.
First, the number of processes was set to 8, the middle of the range. I ran ./paramset.sh L 2 2 2 to update param.h, then ran make clean and make. The myhosts file at this point had the same contents as the hosts.txt shown above.
# mpirun -np 8 --hostfile myhosts -mca plm_rsh_args "-p 12345" -oversubscribe $(pwd)/bmt
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
[ganymede][[22394,1],4][../../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[22394,1],7]
I searched for “received unexpected process identifier” and found the page listed as the third source above. Its solution is to add --mca btl_tcp_if_include eth0 to the runtime options. Reading that, I remembered the second source (Horovod's advice for Docker, which instead excludes the loopback and Docker bridge interfaces from MPI's TCP transport) and was able to solve the problem as follows.
# mpirun -np 8 --hostfile myhosts -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_exclude lo,docker0 $(pwd)/bmt
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Sequential version array size
mimax = 257 mjmax = 257 mkmax = 513
Parallel version array size
mimax = 131 mjmax = 131 mkmax = 259
imax = 129 jmax = 129 kmax =257
I-decomp = 2 J-decomp = 2 K-decomp =2
Start rehearsal measurement process.
Measure the performance in 3 times.
MFLOPS: 19912.780454 time(s): 0.168541 8.841949e-04
Now, start the actual measurement process.
The loop will be excuted in 1067 times
This will take about one minute.
Wait for a while
(Omitted below)
Again, the Horovod in Docker experience was helpful.
Performance measurements
The following measurements were taken on five machines. The hostfile at this time was as follows.
# cat myhosts
europe
jupiter
ganymede
saisei
mokusei
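For each process count, the same recipe as before applies: update param.h with a decomposition whose product equals np, rebuild, and launch. A sketch of one round (placeholders in angle brackets; -oversubscribe, as used earlier, is needed once np exceeds the available slots):
# ./paramset.sh L <I> <J> <K>    # I x J x K must equal <np>
# make clean && make
# mpirun -np <np> --hostfile myhosts -mca plm_rsh_args "-p 12345" -mca btl_tcp_if_exclude lo,docker0 -oversubscribe $(pwd)/bmt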
No slots specified
np | MFLOPS |
---|---|
2 | 9,177 |
4 | 11,267 |
8 | 20,501 |
16 | 20,111 |
32 | 20,607 |
64 | 8,411 |
I thought it was strange that performance saturates at np = 8. As you can see from the table below (an excerpt of each machine's specs), the total number of physical cores across the five machines (# of CPUs x # of cores per CPU) is 4 + 6 + 4 + 8 + 4 = 26, so I expected the peak to be around np = 16.
| | jupiter | ganymede | saisei | mokusei | europe |
|---|---|---|---|---|---|
| CPU | Xeon(R) CPU E5-1620 @ 3.60GHz | Xeon(R) CPU E5-2620 @ 2.00GHz | Xeon(R) CPU E5-2643 @ 3.30GHz | Xeon(R) CPU E5-2609 @ 2.40GHz | Xeon(R) CPU E3-1270 v5 @ 3.60GHz |
| # of CPUs | 1 | 1 | 1 | 2 | 1 |
| # of cores per CPU | 4 | 6 | 4 | 4 | 4 |
With slots specified
I decided to list the number of physical cores of each machine as slots in the hostfile.
# cat myhosts
europe slots=4
jupiter slots=4
ganymede slots=6
saisei slots=4
mokusei slots=8
The results are as follows, and they are in line with the expected trend.
np | MFLOPS |
---|---|
2 | 9,171 |
4 | 11,234 |
8 | 20,413 |
16 | 29,589 |
32 | 23,018 |
64 | 9,216 |
The measurement results are graphed as follows.
Summary
I expected performance to improve as the number of parallel processes increased, up to the total number of physical cores, and that is roughly what I observed. I do not know why the peak occurs at a different process count depending on whether slots are specified in the hostfile.
I used computation size “L” this time; in the future I would like to investigate how the computation size affects the results.