Measuring OpenMPI performance again using the HIMENO benchmark

Introduction

In an earlier article, I changed the hostfile that determines the order of OpenMPI execution nodes and re-measured OpenMPI performance with the Himeno benchmark. After posting it, I thought about it again and decided to order the nodes using objective figures rather than my own judgment based on CPU model and clock speed.

So this time, I measured the performance of each individual workstation (node), ordered the hostfile according to the results, and then ran the MPI measurement again.

Performance measurements for each workstation

For the single-node measurement, I used the C, static-allocation version of the Himeno benchmark with calculation size L, so that the conditions match those of the MPI version.

The results are shown below. They have also been added to the list of workstations in this post.

Node      CPU                               # of CPUs  # of cores (per CPU)  MFLOPS
jupiter   Xeon(R) CPU E5-1620 @ 3.60GHz     1          4                     4,775
ganymede  Xeon(R) CPU E5-2620 @ 2.00GHz     1          6                     3,312
saisei    Xeon(R) CPU E5-2643 @ 3.30GHz     1          4                     4,616
mokusei   Xeon(R) CPU E5-2609 @ 2.40GHz     2          4                     3,109
europe    Xeon(R) CPU E3-1270 v5 @ 3.60GHz  1          4                     5,438

hostfile

Based on these results, I reordered the hostfile from the fastest node to the slowest:

# cat myhosts
europe slots=4
jupiter slots=4
saisei slots=4
ganymede slots=6
mokusei slots=8
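
The ordering rule can be sketched in a few lines of Python. This is only a minimal sketch, not the procedure actually used for this article: the `nodes` dictionary and the `make_hostfile` helper are hypothetical names, the figures are copied from the workstation table above, and slots are assumed to be the number of CPUs times the cores per CPU.

```python
# Single-node results from the workstation table:
# name -> (number of CPUs, cores per CPU, MFLOPS).
nodes = {
    "jupiter":  (1, 4, 4775),
    "ganymede": (1, 6, 3312),
    "saisei":   (1, 4, 4616),
    "mokusei":  (2, 4, 3109),
    "europe":   (1, 4, 5438),
}

def make_hostfile(nodes):
    """Return hostfile lines, fastest node first (hypothetical helper)."""
    # Sort by the measured MFLOPS value, highest first.
    ordered = sorted(nodes.items(), key=lambda item: item[1][2], reverse=True)
    # Assumption: slots = number of CPUs x cores per CPU.
    return [f"{name} slots={cpus * cores}" for name, (cpus, cores, _) in ordered]

hostfile_lines = make_hostfile(nodes)
print("\n".join(hostfile_lines))
```

With the figures above, this reproduces the order europe, jupiter, saisei, ganymede, mokusei.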

Measurement Results

The number of launched processes (np) and the MFLOPS values measured without and with the slots specification in the hostfile were as follows.

np  MFLOPS (without slots)  MFLOPS (with slots)
2   9,186                   9,179
4   11,294                  11,260
8   20,411                  20,433
16  36,907                  36,056
32  23,041                  21,068
64  8,734                   9,040
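
To make the table easier to read, the following sketch computes the relative difference between the two runs for each np and picks the np with the highest throughput. The variable names are my own, and the figures are copied from the table above.

```python
# np -> (MFLOPS without slots, MFLOPS with slots), copied from the table.
results = {
    2:  (9186, 9179),
    4:  (11294, 11260),
    8:  (20411, 20433),
    16: (36907, 36056),
    32: (23041, 21068),
    64: (8734, 9040),
}

# Relative change of the "with slots" run against the "without slots" run, in percent.
diff_percent = {
    np_: 100.0 * (with_s - without_s) / without_s
    for np_, (without_s, with_s) in results.items()
}

# np that gives the highest MFLOPS in each column.
best_without = max(results, key=lambda n: results[n][0])
best_with = max(results, key=lambda n: results[n][1])

for np_, d in diff_percent.items():
    print(f"np={np_:2d}  diff={d:+.1f}%")
print("peak np (without, with):", best_without, best_with)
```

With these figures, both columns peak at np=16, and the two runs differ by less than 1% up to np=8. The gap is larger at 16, 32, and 64 processes.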

ejsgm

The graph below compares the results of the three measurements so far. This time's results are labeled ejsgm, after the initials of the nodes in hostfile order (europe, jupiter, saisei, ganymede, mokusei).

comparison

Summary

Compared with last time (the March 17 article), the results are essentially unchanged overall, although the figures at 16 processes are a few percent better.

This makes me wonder whether the result measured at 16 processes in the first measurement (the February 23 article) was incorrect for some reason.

Re-measurement for verification (added at a later date)

Above, I suspected that the 16-process result in the February 23 article had not been measured correctly, so I decided to measure it again.

First, I changed the hostfile as follows.

# cat myhosts
europe slots=4
jupiter slots=4
ganymede slots=6
saisei slots=4
mokusei slots=8

I then varied np from 2 to 64 and recorded the reported MFLOPS. The results are as follows, with the February 23 values listed for reference.

np  MFLOPS (Feb. 23 article)  MFLOPS (currently measured)  prev/curr
2   9,171                     9,170                        1.000
4   11,234                    11,200                       1.003
8   20,413                    20,437                       0.999
16  29,589                    29,660                       0.998
32  23,018                    23,179                       0.993
64  9,216                     8,290                        1.112

As the table shows, the values at 64 processes differ by about 11%, but for 2 to 32 processes they agree to within 1%. I therefore judge that the data in the article posted on February 23 is reliable.
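
The prev/curr column is the February 23 value divided by the current value. A short sketch that recomputes it from the two MFLOPS columns follows. The variable names are my own, and the data is copied from the table above.

```python
# np -> (MFLOPS in the Feb. 23 article, MFLOPS measured this time).
measurements = {
    2:  (9171, 9170),
    4:  (11234, 11200),
    8:  (20413, 20437),
    16: (29589, 29660),
    32: (23018, 23179),
    64: (9216, 8290),
}

# Ratio of the previous value to the current value for each np.
ratios = {np_: prev / curr for np_, (prev, curr) in measurements.items()}

for np_, r in ratios.items():
    print(f"np={np_:2d}  prev/curr={r:.3f}")

# Largest deviation from 1.0 among np = 2 to 32.
max_dev = max(abs(ratios[n] - 1.0) for n in ratios if n <= 32)
print(f"max deviation for np <= 32: {max_dev:.3%}")
```

For np up to 32 the deviation stays below 1%, which is the basis for treating the February 23 data as reproducible. Only the 64-process run stands out.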