Saturday, January 28, 2017

nwchem 6.6 compile log

ccuf1: Debian 7.7
Intel Compiler Version 17.0.1.132 Build 20161005
Python v2.7.3

. /pkg1/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
echo $MKLROOT

/pkg1/intel/compilers_and_libraries_2017.1.132/linux/mkl

cd /pkg1
tar jxf /f01/source/chem/nwchem/Nwchem-6.6.revision27746-src.2015-10-20.tar.bz2
cd nwchem-6.6/src

# Compilation setup (fresh start)
export NWCHEM_TOP=/pkg1/chem/nwchem-6.6
export FC=ifort
export CC=icc
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_PYTHONCONFIG=y
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export BLASOPT="-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential \ 
       -lpthread -lm"
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core \        -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export NWCHEM_MODULES="all python"
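
With these variables set, the build on ccuf1 follows the same two steps used on ccuf0 below (a minimal sketch; run from the nwchem-6.6/src directory):
make nwchem_config >& nwchem.config.log &
make -j 8 >& make.log &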



ccuf0: Debian 4.0
Intel Compiler Version 14.0.2.144 Build 20140120
Python v2.4.4

. /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64
echo $MKLROOT

/opt/intel/composer_xe_2013_sp1.2.144/mkl

cd /temp
tar jxf /f01/source/chem/nwchem/Nwchem-6.6.revision27746-src.2015-10-20.tar.bz2
cd nwchem-6.6/src

# Compilation setup (fresh start)
export NWCHEM_TOP=/temp/nwchem-6.6
export PATH="/pkg/x86_64/openmpi-1.6.5-i14/bin:$PATH"
export FC=ifort
export CC=icc
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_PYTHONCONFIG=y
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export MPI_LOC=/pkg/x86_64/openmpi-1.6.5-i14
export MPI_LIB=/pkg/x86_64/openmpi-1.6.5-i14/lib
export MPI_INCLUDE=/pkg/x86_64/openmpi-1.6.5-i14/include
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -lpthread"
# export BLASOPT="-mkl -openmp"
export BLASOPT="-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential \ 
       -lpthread -lm"
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core \
       -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export NWCHEM_MODULES="all python"

# Now start compilation 
make nwchem_config >& nwchem.config.log &
make -j 8 >& make.log &

# Finally, manually resolve the remaining undefined references by linking explicitly against the OpenMPI Fortran libraries and python2.4:
ifort -i8 -align -fpp -vec-report6 -fimf-arch-consistency=true -finline-limit=250 -O2 -g -fp-model source  -Wl,--export-dynamic  -L/temp/nwchem-6.6/lib/LINUX64 -L/temp/nwchem-6.6/src/tools/install/lib  -L$MPI_LIB -o /temp/nwchem-6.6/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lsolvation -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -lnwpython -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lmm -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -larmci -lpeigs -lperfm -lcons -lbq -lnwcutil -lmpi_f90 -lmpi_f77 -lmpi -lpthread -lpython2.4 \ 
-L/opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential

# See also /pkg/x86_64/chem/README.nwchem-6.6 at ccuf0 for running setup.




q183 Phi: CentOS 6.6 (Final)
Intel Compiler Version 17.0.1.132 Build 20161005
Python v2.6.6

. /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
echo $MKLROOT

/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl

cd /phi
tar jxf /f01/source/chem/nwchem/Nwchem-6.6.revision27746-src.2015-10-20.tar.bz2
cd nwchem-6.6/src

# apply patches

# Compilation setup for Phi
export NWCHEM_TOP=/phi/nwchem-6.6
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_PYTHONCONFIG=y
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib/:$LD_LIBRARY_PATH
export PATH=/usr/lib64/openmpi/bin/:$PATH
export FC=ifort
export CC=icc
export USE_OPENMP=1
export USE_OFFLOAD=1

export BLASOPT="-mkl -qopenmp   -lpthread -lm"
export SCALAPACK="-mkl -qopenmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export BLASOPT="-mkl -openmp   -lpthread -lm"
export NWCHEM_MODULES="all python"
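
The make steps for the Phi build are not repeated in this log; presumably the same two-step sequence as on ccuf0 applies (sketch, run from the nwchem-6.6/src directory):
make nwchem_config >& nwchem.config.log &
make -j 8 >& make.log &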

smash 2.1.0 DFT @ Intel Phi

smash-2.1.0 DFT on q183 Phi: CentOS 6.6 (Final)

Environment

Intel Compiler Version 17.0.1.132 Build 20161005
Python v2.6.6

. /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
. /opt/intel/impi/2017.1.132/bin64/mpivars.sh intel64

echo $MKLROOT
/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl

echo $MIC_LD_LIBRARY_PATH
/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/ipp/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/tbb/lib/mic


Compilation

Compiling with the Intel compiler against both the OpenMP and Intel MPI (impi) interfaces produces a single binary whose parallelism can be controlled through OMP_NUM_THREADS and mpirun.
First, compile on the host for the intel64 architecture:
cd /phi/pkg/smash-2.1.0
cp Makefile Makefile.mpiifort    

Edit  Makefile.mpiifort and set
F90 = mpiifort -DILP64            # <--- note: it is "mpiifort" (two i's), not "mpifort"!
LIB = -mkl=parallel 
OPT = -qopenmp -i8 -xHOST -ilp64 -O3

Then compile with  
make -f Makefile.mpiifort

Do not use parallel make (-j). After a successful build, rename the resulting binary:
mv /phi/pkg/smash-2.1.0/bin/smash /phi/pkg/smash-2.1.0/bin/smash.intel64.impi

Clean up the object files for the next build: 
make -f Makefile.mpiifort clean

Compile another binary for mic0 (native mode) using -mmic:
cp Makefile.mpiifort Makefile.mpiifort.mic

Edit  Makefile.mpiifort.mic and set
F90 = mpiifort -DILP64
LIB = -mkl=parallel 
OPT = -qopenmp -i8 -xHOST -ilp64 -O3 -mmic

Then compile with  
make -f Makefile.mpiifort.mic
mv /phi/pkg/smash-2.1.0/bin/smash /phi/pkg/smash-2.1.0/bin/smash.mic.impi

Now we have binaries for both architectures under /phi/pkg/smash-2.1.0/bin
ls -al /phi/pkg/smash-2.1.0/bin/smash*.impi
-rwxr-xr-x 1 jsyu ccu 5469540 Jan 27 02:35 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi
-rwxr-xr-x 1 jsyu ccu 7612438 Jan 28 02:46 /phi/pkg/smash-2.1.0/bin/smash.mic.impi

Run the test molecule (taxol, C47H51NO14) from the example file
/phi/pkg/smash-2.1.0/example/large-memory.inp, but switch from MP2 to DFT:
cp /phi/pkg/smash-2.1.0/example/large-memory.inp  large-memory-b3.inp
Edit the input file large-memory-b3.inp: in the first line, change method=MP2 to method=B3LYP and reduce the memory setting to memory=7GB.
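
A one-line way to make both edits (a sketch only; it assumes the keywords appear in the file exactly as method=MP2 and memory=<value>GB, so check the file first):
sed -i -e 's/method=MP2/method=B3LYP/' -e 's/memory=[0-9.]*GB/memory=7GB/' large-memory-b3.inp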


OpenMP Run

Using OpenMP parallel on host (using 20 threads):
OMP_NUM_THREADS=20 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi large-memory-b3.inp > large-memory-b3.q183.openmp &
After it finishes, grep the timing data:
grep -A 1 "Step CPU :" large-memory-b3.q183.openmp
The third step computes the B3LYP/STO-3G energy from the Hückel guess; its timing is the third block below:
 Step CPU :       8.3, Total CPU :       8.3 of Master node
 Step Wall :      0.2, Total Wall :      0.2 at Sat Jan 28 14:30:30 2017
--
 Step CPU :     354.0, Total CPU :     362.3 of Master node
 Step Wall :      9.1, Total Wall :      9.3 at Sat Jan 28 14:30:39 2017
--
 Step CPU :    3286.7, Total CPU :    3649.0 of Master node
 Step Wall :     84.3, Total Wall :     93.6 at Sat Jan 28 14:32:03 2017
--
 Step CPU :       2.4, Total CPU :    3651.5 of Master node
 Step Wall :      0.1, Total Wall :     93.6 at Sat Jan 28 14:32:03 2017
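
Using the same CPU-to-Wall ratio computed for the MIC runs below, the third (largest) step here gives roughly:
==> 3286.7÷84.3 ≈ 39.0x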

Using OpenMP parallel on mic0 (using 60 threads), native mode:
Log in to mic0 and set the environment:
export LD_LIBRARY_PATH="/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/mic"
ulimit -s unlimited

OMP_NUM_THREADS=60 /phi/pkg/smash-2.1.0/bin/smash.mic.impi large-memory-b3.inp > large-memory-b3.mic0.openmp.60 &
Timing data of 60 OMP threads in large-memory-b3.mic0.openmp.60  
 Step CPU :      66.3, Total CPU :      66.3 of Master node
 Step Wall :      1.2, Total Wall :      1.2 at Sat Jan 28 14:12:06 2017
--
 Step CPU :    1759.9, Total CPU :    1826.1 of Master node
 Step Wall :     29.3, Total Wall :     30.5 at Sat Jan 28 14:12:35 2017
--
 Step CPU :   15690.6, Total CPU :   17516.7 of Master node
 Step Wall :    263.2, Total Wall :    293.7 at Sat Jan 28 14:16:59 2017
--
 Step CPU :       6.7, Total CPU :   17523.4 of Master node
 Step Wall :      0.1, Total Wall :    293.8 at Sat Jan 28 14:16:59 2017


Timing data of 240 OMP threads in large-memory-b3.mic0.openmp.240  
 Step CPU :     488.2, Total CPU :     488.2 of Master node
 Step Wall :      2.3, Total Wall :      2.3 at Sat Jan 28 04:37:43 2017
--
 Step CPU :    4645.1, Total CPU :    5133.3 of Master node
 Step Wall :     19.6, Total Wall :     21.8 at Sat Jan 28 04:38:03 2017
--
 Step CPU :   43990.2, Total CPU :   49123.5 of Master node
 Step Wall :    184.6, Total Wall :    206.4 at Sat Jan 28 04:41:07 2017
--
 Step CPU :      55.5, Total CPU :   49179.0 of Master node
 Step Wall :      0.2, Total Wall :    206.6 at Sat Jan 28 04:41:08 2017
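
The same ratio for the third step of the 240-thread run:
==> 43990.2÷184.6 ≈ 238.3x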

Intel MPI Run (impi)

@q183
Using impi parallel on host (1 process x 20 threads):
OMP_NUM_THREADS=20 mpiexec.hydra -np 1 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.1p20t &
Timing data of the 1 process x 20 threads run in large-memory-b3.q183.impi.1p20t:
 Step CPU :       2.4, Total CPU :       2.4 of Master node
 Step Wall :      0.1, Total Wall :      0.1 at Sat Jan 28 19:31:21 2017
--
 Step CPU :     224.1, Total CPU :     226.4 of Master node
 Step Wall :     11.2, Total Wall :     11.3 at Sat Jan 28 19:31:33 2017
--
 Step CPU :    1947.4, Total CPU :    2173.8 of Master node
 Step Wall :     97.6, Total Wall :    108.9 at Sat Jan 28 19:33:10 2017
--
 Step CPU :       0.2, Total CPU :    2174.0 of Master node
 Step Wall :      0.0, Total Wall :    108.9 at Sat Jan 28 19:33:10 2017

Using impi parallel on host (1 process x 40 threads):
mpiexec.hydra -np 1 -ppn 1 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.1p1t &

Timing data of the 1 process x 40 threads run in large-memory-b3.q183.impi.1p1t:
 Step CPU :       7.8, Total CPU :       7.8 of Master node
 Step Wall :      0.2, Total Wall :      0.2 at Sat Jan 28 15:25:09 2017
--
 Step CPU :     355.3, Total CPU :     363.1 of Master node
 Step Wall :      9.1, Total Wall :      9.3 at Sat Jan 28 15:25:18 2017
--
 Step CPU :    3241.8, Total CPU :    3604.9 of Master node
 Step Wall :     85.4, Total Wall :     94.8 at Sat Jan 28 15:26:43 2017
--
 Step CPU :       1.3, Total CPU :    3606.2 of Master node
 Step Wall :      0.0, Total Wall :     94.8 at Sat Jan 28 15:26:43 2017

Using impi parallel on host (2 processes x 20 threads):
OMP_NUM_THREADS=20 mpiexec.hydra -np 2 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.2p20t &

Timing data of the 2 processes x 20 threads run in large-memory-b3.q183.impi.2p20t:
 Step CPU :       4.0, Total CPU :       4.0 of Master node
 Step Wall :      0.2, Total Wall :      0.2 at Sat Jan 28 19:36:12 2017
--
 Step CPU :     177.0, Total CPU :     181.0 of Master node
 Step Wall :      8.9, Total Wall :      9.1 at Sat Jan 28 19:36:21 2017
--
 Step CPU :    1643.5, Total CPU :    1824.6 of Master node
 Step Wall :     82.2, Total Wall :     91.4 at Sat Jan 28 19:37:43 2017
--
 Step CPU :       1.2, Total CPU :    1825.7 of Master node
 Step Wall :      0.1, Total Wall :     91.4 at Sat Jan 28 19:37:43 2017

Using impi parallel on host (20 processes x 2 threads):
OMP_NUM_THREADS=2 mpiexec.hydra -np 20 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.20p2t &

Timing data of the 20 processes x 2 threads run in large-memory-b3.q183.impi.20p2t:
 Step CPU :       0.5, Total CPU :       0.5 of Master node
 Step Wall :      0.3, Total Wall :      0.3 at Sat Jan 28 19:49:06 2017
--
 Step CPU :      18.8, Total CPU :      19.3 of Master node
 Step Wall :      9.7, Total Wall :      9.9 at Sat Jan 28 19:49:16 2017
--
 Step CPU :     167.2, Total CPU :     186.6 of Master node
 Step Wall :     83.8, Total Wall :     93.7 at Sat Jan 28 19:50:40 2017
--
 Step CPU :       0.0, Total CPU :     186.6 of Master node
 Step Wall :      0.0, Total Wall :     93.7 at Sat Jan 28 19:50:40 2017


@mic
Using impi parallel on mic0 (1 process x 244 threads): 
You must export I_MPI_MIC=1 (or I_MPI_MIC=enable) before running!
Submit the job to the MIC from the host, not from the MIC itself!
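For example, in the host shell before launching:
export I_MPI_MIC=enable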
mpiexec.hydra -np 1 -host mic0 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0.impi.1 &

Timing data of the 1 process x 244 threads run in large-memory-b3.mic0.impi.1:
 Step CPU :     523.1, Total CPU :     523.1 of Master node
 Step Wall :      2.5, Total Wall :      2.5 at Sat Jan 28 14:38:16 2017
--
 Step CPU :    4845.5, Total CPU :    5368.6 of Master node
 Step Wall :     20.3, Total Wall :     22.8 at Sat Jan 28 14:38:37 2017
--
 Step CPU :   34261.4, Total CPU :   39630.0 of Master node
 Step Wall :    141.5, Total Wall :    164.3 at Sat Jan 28 14:40:58 2017
--
 Step CPU :      49.1, Total CPU :   39679.1 of Master node
 Step Wall :      0.2, Total Wall :    164.6 at Sat Jan 28 14:40:58 2017
==> 34261.4÷141.5=242.1x

Using impi parallel on mic0 (61 processes x 4 threads): 
mpiexec.hydra -np 61 -host mic0 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0.impi.61x4 &

Timing data of the 61 processes x 4 threads run in large-memory-b3.mic0.impi.61x4:
 Step CPU :       8.0, Total CPU :       8.0 of Master node
 Step Wall :      2.9, Total Wall :      2.9 at Sat Jan 28 16:26:52 2017
--
 Step CPU :      72.3, Total CPU :      80.4 of Master node
 Step Wall :     18.7, Total Wall :     21.6 at Sat Jan 28 16:27:11 2017
--
 Step CPU :     555.5, Total CPU :     635.9 of Master node
 Step Wall :    140.5, Total Wall :    162.1 at Sat Jan 28 16:29:31 2017
--
 Step CPU :       0.7, Total CPU :     636.5 of Master node
 Step Wall :      0.2, Total Wall :    162.3 at Sat Jan 28 16:29:32 2017

Using impi parallel on mic0 (244 processes x 1 thread): 
mpiexec.hydra -np 244 -host mic0 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0.impi.244x1 &
The run died, probably out of memory?

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 10328 RUNNING AT mic0
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================


Using impi parallel on mic0+mic1,  2x(1 process x 244 threads)  
Create ./hostfile containing two lines,
mic0
mic1
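One way to create it:
printf "mic0\nmic1\n" > hostfile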
Then run with
I_MPI_FABRICS=shm:tcp mpiexec.hydra -machinefile hostfile -ppn 244 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0+1.impi.2x1x244 &

Timing data of the 2 x (1 process x 244 threads) run in large-memory-b3.mic0+1.impi.2x1x244:
 Step CPU :     547.2, Total CPU :     547.2 of Master node
 Step Wall :      2.6, Total Wall :      2.6 at Sat Jan 28 17:13:18 2017
--
 Step CPU :    3335.4, Total CPU :    3882.6 of Master node
 Step Wall :     14.7, Total Wall :     17.3 at Sat Jan 28 17:13:32 2017
--
 Step CPU :   19142.3, Total CPU :   23024.8 of Master node
 Step Wall :     79.5, Total Wall :     96.7 at Sat Jan 28 17:14:52 2017
--
 Step CPU :      49.9, Total CPU :   23074.8 of Master node
 Step Wall :      0.2, Total Wall :     97.0 at Sat Jan 28 17:14:52 2017
==> 19142.3÷79.5=240.8x


Hybrid mode CPU+Phi, CPU(1 process x 40 threads) + Phi 2x(1 process x 244 threads)
I_MPI_MIC=enable I_MPI_FABRICS=shm:tcp mpiexec.hydra -np 1 -host q183 -ppn 40 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi : -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -env OMP_NUM_THREADS 244 -np 1 -host mic0 /phi/pkg/smash-2.1.0/bin/smash.mic.impi : -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -env OMP_NUM_THREADS 244 -np 1 -host mic1 /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.q183+mic01.impi.488+40 &

Timing data of the hybrid run (1 x 40 threads on the host + 2 x 244 threads on the MICs) in large-memory-b3.q183+mic01.impi.488+40:
 Step CPU :      20.7, Total CPU :      20.7 of Master node
 Step Wall :      1.7, Total Wall :      1.7 at Sun Jan 29 00:24:45 2017
--
 Step CPU :     233.1, Total CPU :     253.8 of Master node
 Step Wall :     11.2, Total Wall :     12.9 at Sun Jan 29 00:24:56 2017
--
 Step CPU :    1430.7, Total CPU :    1684.5 of Master node
 Step Wall :     58.6, Total Wall :     71.4 at Sun Jan 29 00:25:55 2017
--
 Step CPU :       3.0, Total CPU :    1687.5 of Master node
 Step Wall :      0.1, Total Wall :     71.5 at Sun Jan 29 00:25:55 2017



Concluding Remarks (tentative)
Maybe ~80 seconds of wall time is the parallelization limit for a problem of this size?
Or maybe two Intel Phi cards perform roughly on par with two E5-2670v2@2.3GHz CPUs?

To be continued.....
1. Try mpitune and the Intel Trace Analyzer (refs 2, 3)
2. Play with KMP_AFFINITY (ref 4)
3. Explore other options in addition to I_MPI_FABRICS=shm:tcp


================================== NOTES ========================================
Debug log:

1. Edit /etc/hosts and add
172.31.1.254    phi
This resolves error messages similar to:
HYDU_getfullhostname (../../utils/others/others.c:146): getaddrinfo error (hostname: phi, error: Name or service not known)

2. Set OMP_STACKSIZE=1G if problems occur before the SCF iterations. This applies to both OpenMP and MPI runs.
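For example, for the 60-thread native OpenMP run shown above:
OMP_STACKSIZE=1G OMP_NUM_THREADS=60 /phi/pkg/smash-2.1.0/bin/smash.mic.impi large-memory-b3.inp > large-memory-b3.mic0.openmp.60 &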


References:

  1. https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/542161
  2. https://software.intel.com/en-us/node/528811
  3. http://slidegur.com/doc/76099/software-and-services-group
  4. https://software.intel.com/en-us/node/522691