Saturday, January 28, 2017

nwchem 6.6 compile log

ccuf1: Debian 7.7
Intel Compiler Version 17.0.1.132 Build 20161005
Python v2.7.3

. /pkg1/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
echo $MKLROOT

/pkg1/intel/compilers_and_libraries_2017.1.132/linux/mkl

cd /pkg1
tar jxf /f01/source/chem/nwchem/Nwchem-6.6.revision27746-src.2015-10-20.tar.bz2
cd nwchem-6.6/src

# Compilation setup (fresh start)
export NWCHEM_TOP=/pkg1/chem/nwchem-6.6
export FC=ifort
export CC=icc
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_PYTHONCONFIG=y
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export BLASOPT="-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential \ 
       -lpthread -lm"
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core \        -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export NWCHEM_MODULES="all python"
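
With these variables set, the build on ccuf1 follows the same two steps used on ccuf0 below (a minimal sketch; run from the nwchem-6.6/src directory):
make nwchem_config >& nwchem.config.log &
make -j 8 >& make.log &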



ccuf0: Debian 4.0
Intel Compiler Version 14.0.2.144 Build 20140120
Python v2.4.4

. /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64
echo $MKLROOT

/opt/intel/composer_xe_2013_sp1.2.144/mkl

cd /temp
tar jxf /f01/source/chem/nwchem/Nwchem-6.6.revision27746-src.2015-10-20.tar.bz2
cd nwchem-6.6/src

# Compilation setup (fresh start)
export NWCHEM_TOP=/temp/nwchem-6.6
export PATH="/pkg/x86_64/openmpi-1.6.5-i14/bin:$PATH"
export FC=ifort
export CC=icc
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_PYTHONCONFIG=y
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export MPI_LOC=/pkg/x86_64/openmpi-1.6.5-i14
export MPI_LIB=/pkg/x86_64/openmpi-1.6.5-i14/lib
export MPI_INCLUDE=/pkg/x86_64/openmpi-1.6.5-i14/include
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -lpthread"
# export BLASOPT="-mkl -openmp"
export BLASOPT="-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential \ 
       -lpthread -lm"
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core \
       -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export NWCHEM_MODULES="all python"

# Now start compilation 
make nwchem_config >& nwchem.config.log &
make -j 8 >& make.log &

# Finally, manually resolve the remaining undefined references by linking explicitly against the OpenMPI Fortran libraries and python2.4:
ifort -i8 -align -fpp -vec-report6 -fimf-arch-consistency=true -finline-limit=250 -O2 -g -fp-model source  -Wl,--export-dynamic  -L/temp/nwchem-6.6/lib/LINUX64 -L/temp/nwchem-6.6/src/tools/install/lib  -L$MPI_LIB -o /temp/nwchem-6.6/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lsolvation -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -lnwpython -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lmm -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -larmci -lpeigs -lperfm -lcons -lbq -lnwcutil -lmpi_f90 -lmpi_f77 -lmpi -lpthread -lpython2.4 \ 
-L/opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential

# See also /pkg/x86_64/chem/README.nwchem-6.6 at ccuf0 for running setup.




q183 Phi: CentOS 6.6 (Final)
Intel Compiler Version 17.0.1.132 Build 20161005
Python v2.6.6

. /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
echo $MKLROOT

/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl

cd /phi
tar jxf /f01/source/chem/nwchem/Nwchem-6.6.revision27746-src.2015-10-20.tar.bz2
cd nwchem-6.6/src

# apply patches

# Compilation setup for Phi
export NWCHEM_TOP=/phi/nwchem-6.6
export USE_MPI=y
export NWCHEM_TARGET=LINUX64
export USE_PYTHONCONFIG=y
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib/:$LD_LIBRARY_PATH
export PATH=/usr/lib64/openmpi/bin/:$PATH
export FC=ifort
export CC=icc
export USE_OPENMP=1
export USE_OFFLOAD=1

export BLASOPT="-mkl -qopenmp   -lpthread -lm"
export SCALAPACK="-mkl -qopenmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export BLASOPT="-mkl -openmp   -lpthread -lm"
export NWCHEM_MODULES="all python"
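
The make steps for the Phi build are not repeated in this log; presumably the same two-step sequence as on ccuf0 applies (sketch, run from the nwchem-6.6/src directory):
make nwchem_config >& nwchem.config.log &
make -j 8 >& make.log &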

smash 2.1.0 DFT @ Intel Phi

smash-2.1.0 DFT on q183 Phi: CentOS 6.6 (Final)

Environment

Intel Compiler Version 17.0.1.132 Build 20161005
Python v2.6.6

. /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
. /opt/intel/impi/2017.1.132/bin64/mpivars.sh intel64

echo $MKLROOT
/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl

echo $MIC_LD_LIBRARY_PATH
/opt/intel/compilers_and_libraries_2017.1.132/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/ipp/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/tbb/lib/mic


Compilation

Compiling with the Intel compiler against both the OpenMP and Intel MPI (impi) interfaces produces a single binary whose parallelism can be controlled through OMP_NUM_THREADS and mpirun.
First, compile on the host for the intel64 architecture:
cd /phi/pkg/smash-2.1.0
cp Makefile Makefile.mpiifort    

Edit  Makefile.mpiifort and set
F90 = mpiifort -DILP64            # <--- note: it is "mpiifort" (two i's), not "mpifort"!
LIB = -mkl=parallel 
OPT = -qopenmp -i8 -xHOST -ilp64 -O3

Then compile with  
make -f Makefile.mpiifort

Do not use parallel make (-j). After a successful build, rename the resulting binary:
mv /phi/pkg/smash-2.1.0/bin/smash /phi/pkg/smash-2.1.0/bin/smash.intel64.impi

Clean up the object files for the next build: 
make -f Makefile.mpiifort clean

Compile another binary for mic0 (native mode) using -mmic:
cp Makefile.mpiifort Makefile.mpiifort.mic

Edit  Makefile.mpiifort.mic and set
F90 = mpiifort -DILP64
LIB = -mkl=parallel 
OPT = -qopenmp -i8 -xHOST -ilp64 -O3 -mmic

Then compile with  
make -f Makefile.mpiifort.mic
mv /phi/pkg/smash-2.1.0/bin/smash /phi/pkg/smash-2.1.0/bin/smash.mic.impi

Now we have binaries for both architectures under /phi/pkg/smash-2.1.0/bin
ls -al /phi/pkg/smash-2.1.0/bin/smash*.impi
-rwxr-xr-x 1 jsyu ccu 5469540 Jan 27 02:35 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi
-rwxr-xr-x 1 jsyu ccu 7612438 Jan 28 02:46 /phi/pkg/smash-2.1.0/bin/smash.mic.impi

Run the test molecule (taxol, C47H51NO14) from the example file
/phi/pkg/smash-2.1.0/example/large-memory.inp, but switch from MP2 to DFT:
cp /phi/pkg/smash-2.1.0/example/large-memory.inp  large-memory-b3.inp
Edit the input file large-memory-b3.inp: in the first line, change method=MP2 to method=B3LYP and reduce the memory setting to memory=7GB.
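
A one-line way to make both edits (a sketch only; it assumes the keywords appear in the file exactly as method=MP2 and memory=<value>GB, so check the file first):
sed -i -e 's/method=MP2/method=B3LYP/' -e 's/memory=[0-9.]*GB/memory=7GB/' large-memory-b3.inp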


OpenMP Run

Using OpenMP parallel on host (using 20 threads):
OMP_NUM_THREADS=20 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi large-memory-b3.inp > large-memory-b3.q183.openmp &
After it finishes, grep the timing data:
grep -A 1 "Step CPU :" large-memory-b3.q183.openmp
The third step computes the B3LYP/STO-3G energy from the Hückel guess; its timing is the third block below:
 Step CPU :       8.3, Total CPU :       8.3 of Master node
 Step Wall :      0.2, Total Wall :      0.2 at Sat Jan 28 14:30:30 2017
--
 Step CPU :     354.0, Total CPU :     362.3 of Master node
 Step Wall :      9.1, Total Wall :      9.3 at Sat Jan 28 14:30:39 2017
--
 Step CPU :    3286.7, Total CPU :    3649.0 of Master node
 Step Wall :     84.3, Total Wall :     93.6 at Sat Jan 28 14:32:03 2017
--
 Step CPU :       2.4, Total CPU :    3651.5 of Master node
 Step Wall :      0.1, Total Wall :     93.6 at Sat Jan 28 14:32:03 2017
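
Using the same CPU-to-Wall ratio computed for the MIC runs below, the third (largest) step here gives roughly:
==> 3286.7÷84.3 ≈ 39.0x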

Using OpenMP parallel on mic0 (using 60 threads), native mode:
Log in to mic0 and set the environment:
export LD_LIBRARY_PATH="/opt/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/mic:/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/mic"
ulimit -s unlimited

OMP_NUM_THREADS=60 /phi/pkg/smash-2.1.0/bin/smash.mic.impi large-memory-b3.inp > large-memory-b3.mic0.openmp.60 &
Timing data of 60 OMP threads in large-memory-b3.mic0.openmp.60  
 Step CPU :      66.3, Total CPU :      66.3 of Master node
 Step Wall :      1.2, Total Wall :      1.2 at Sat Jan 28 14:12:06 2017
--
 Step CPU :    1759.9, Total CPU :    1826.1 of Master node
 Step Wall :     29.3, Total Wall :     30.5 at Sat Jan 28 14:12:35 2017
--
 Step CPU :   15690.6, Total CPU :   17516.7 of Master node
 Step Wall :    263.2, Total Wall :    293.7 at Sat Jan 28 14:16:59 2017
--
 Step CPU :       6.7, Total CPU :   17523.4 of Master node
 Step Wall :      0.1, Total Wall :    293.8 at Sat Jan 28 14:16:59 2017


Timing data of 240 OMP threads in large-memory-b3.mic0.openmp.240  
 Step CPU :     488.2, Total CPU :     488.2 of Master node
 Step Wall :      2.3, Total Wall :      2.3 at Sat Jan 28 04:37:43 2017
--
 Step CPU :    4645.1, Total CPU :    5133.3 of Master node
 Step Wall :     19.6, Total Wall :     21.8 at Sat Jan 28 04:38:03 2017
--
 Step CPU :   43990.2, Total CPU :   49123.5 of Master node
 Step Wall :    184.6, Total Wall :    206.4 at Sat Jan 28 04:41:07 2017
--
 Step CPU :      55.5, Total CPU :   49179.0 of Master node
 Step Wall :      0.2, Total Wall :    206.6 at Sat Jan 28 04:41:08 2017
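
The same ratio for the third step of the 240-thread run:
==> 43990.2÷184.6 ≈ 238.3x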

Intel MPI Run (impi)

@q183
Using impi parallel on host (1 process x 20 threads):
OMP_NUM_THREADS=20 mpiexec.hydra -np 1 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.1p20t &
Timing data of the 1 process x 20 threads run in large-memory-b3.q183.impi.1p20t:
 Step CPU :       2.4, Total CPU :       2.4 of Master node
 Step Wall :      0.1, Total Wall :      0.1 at Sat Jan 28 19:31:21 2017
--
 Step CPU :     224.1, Total CPU :     226.4 of Master node
 Step Wall :     11.2, Total Wall :     11.3 at Sat Jan 28 19:31:33 2017
--
 Step CPU :    1947.4, Total CPU :    2173.8 of Master node
 Step Wall :     97.6, Total Wall :    108.9 at Sat Jan 28 19:33:10 2017
--
 Step CPU :       0.2, Total CPU :    2174.0 of Master node
 Step Wall :      0.0, Total Wall :    108.9 at Sat Jan 28 19:33:10 2017

Using impi parallel on host (1 process x 40 threads):
mpiexec.hydra -np 1 -ppn 1 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.1p1t &

Timing data of the 1 process x 40 threads run in large-memory-b3.q183.impi.1p1t:
 Step CPU :       7.8, Total CPU :       7.8 of Master node
 Step Wall :      0.2, Total Wall :      0.2 at Sat Jan 28 15:25:09 2017
--
 Step CPU :     355.3, Total CPU :     363.1 of Master node
 Step Wall :      9.1, Total Wall :      9.3 at Sat Jan 28 15:25:18 2017
--
 Step CPU :    3241.8, Total CPU :    3604.9 of Master node
 Step Wall :     85.4, Total Wall :     94.8 at Sat Jan 28 15:26:43 2017
--
 Step CPU :       1.3, Total CPU :    3606.2 of Master node
 Step Wall :      0.0, Total Wall :     94.8 at Sat Jan 28 15:26:43 2017

Using impi parallel on host (2 processes x 20 threads):
OMP_NUM_THREADS=20 mpiexec.hydra -np 2 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.2p20t &

Timing data of the 2 processes x 20 threads run in large-memory-b3.q183.impi.2p20t:
 Step CPU :       4.0, Total CPU :       4.0 of Master node
 Step Wall :      0.2, Total Wall :      0.2 at Sat Jan 28 19:36:12 2017
--
 Step CPU :     177.0, Total CPU :     181.0 of Master node
 Step Wall :      8.9, Total Wall :      9.1 at Sat Jan 28 19:36:21 2017
--
 Step CPU :    1643.5, Total CPU :    1824.6 of Master node
 Step Wall :     82.2, Total Wall :     91.4 at Sat Jan 28 19:37:43 2017
--
 Step CPU :       1.2, Total CPU :    1825.7 of Master node
 Step Wall :      0.1, Total Wall :     91.4 at Sat Jan 28 19:37:43 2017

Using impi parallel on host (20 processes x 2 threads):
OMP_NUM_THREADS=2 mpiexec.hydra -np 20 -host q183 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi < large-memory-b3.inp > large-memory-b3.q183.impi.20p2t &

Timing data of the 20 processes x 2 threads run in large-memory-b3.q183.impi.20p2t:
 Step CPU :       0.5, Total CPU :       0.5 of Master node
 Step Wall :      0.3, Total Wall :      0.3 at Sat Jan 28 19:49:06 2017
--
 Step CPU :      18.8, Total CPU :      19.3 of Master node
 Step Wall :      9.7, Total Wall :      9.9 at Sat Jan 28 19:49:16 2017
--
 Step CPU :     167.2, Total CPU :     186.6 of Master node
 Step Wall :     83.8, Total Wall :     93.7 at Sat Jan 28 19:50:40 2017
--
 Step CPU :       0.0, Total CPU :     186.6 of Master node
 Step Wall :      0.0, Total Wall :     93.7 at Sat Jan 28 19:50:40 2017


@mic
Using impi parallel on mic0 (1 process x 244 threads): 
You must export I_MPI_MIC=1 (or I_MPI_MIC=enable) before running!
Submit the job to the MIC from the host, not from the MIC itself!
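For example, in the host shell before launching:
export I_MPI_MIC=enable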
mpiexec.hydra -np 1 -host mic0 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0.impi.1 &

Timing data of the 1 process x 244 threads run in large-memory-b3.mic0.impi.1:
 Step CPU :     523.1, Total CPU :     523.1 of Master node
 Step Wall :      2.5, Total Wall :      2.5 at Sat Jan 28 14:38:16 2017
--
 Step CPU :    4845.5, Total CPU :    5368.6 of Master node
 Step Wall :     20.3, Total Wall :     22.8 at Sat Jan 28 14:38:37 2017
--
 Step CPU :   34261.4, Total CPU :   39630.0 of Master node
 Step Wall :    141.5, Total Wall :    164.3 at Sat Jan 28 14:40:58 2017
--
 Step CPU :      49.1, Total CPU :   39679.1 of Master node
 Step Wall :      0.2, Total Wall :    164.6 at Sat Jan 28 14:40:58 2017
==> 34261.4÷141.5=242.1x

Using impi parallel on mic0 (61 processes x 4 threads): 
mpiexec.hydra -np 61 -host mic0 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0.impi.61x4 &

Timing data of the 61 processes x 4 threads run in large-memory-b3.mic0.impi.61x4:
 Step CPU :       8.0, Total CPU :       8.0 of Master node
 Step Wall :      2.9, Total Wall :      2.9 at Sat Jan 28 16:26:52 2017
--
 Step CPU :      72.3, Total CPU :      80.4 of Master node
 Step Wall :     18.7, Total Wall :     21.6 at Sat Jan 28 16:27:11 2017
--
 Step CPU :     555.5, Total CPU :     635.9 of Master node
 Step Wall :    140.5, Total Wall :    162.1 at Sat Jan 28 16:29:31 2017
--
 Step CPU :       0.7, Total CPU :     636.5 of Master node
 Step Wall :      0.2, Total Wall :    162.3 at Sat Jan 28 16:29:32 2017

Using impi parallel on mic0 (244 processes x 1 thread): 
mpiexec.hydra -np 244 -host mic0 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0.impi.244x1 &
The run died, probably out of memory?

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 10328 RUNNING AT mic0
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================


Using impi parallel on mic0+mic1,  2x(1 process x 244 threads)  
Create ./hostfile containing two lines,
mic0
mic1
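One way to create it:
printf "mic0\nmic1\n" > hostfile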
Then run with
I_MPI_FABRICS=shm:tcp mpiexec.hydra -machinefile hostfile -ppn 244 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.mic0+1.impi.2x1x244 &

Timing data of the 2 x (1 process x 244 threads) run in large-memory-b3.mic0+1.impi.2x1x244:
 Step CPU :     547.2, Total CPU :     547.2 of Master node
 Step Wall :      2.6, Total Wall :      2.6 at Sat Jan 28 17:13:18 2017
--
 Step CPU :    3335.4, Total CPU :    3882.6 of Master node
 Step Wall :     14.7, Total Wall :     17.3 at Sat Jan 28 17:13:32 2017
--
 Step CPU :   19142.3, Total CPU :   23024.8 of Master node
 Step Wall :     79.5, Total Wall :     96.7 at Sat Jan 28 17:14:52 2017
--
 Step CPU :      49.9, Total CPU :   23074.8 of Master node
 Step Wall :      0.2, Total Wall :     97.0 at Sat Jan 28 17:14:52 2017
==> 19142.3÷79.5=240.8x


Hybrid mode CPU+Phi, CPU(1 process x 40 threads) + Phi 2x(1 process x 244 threads)
I_MPI_MIC=enable I_MPI_FABRICS=shm:tcp mpiexec.hydra -np 1 -host q183 -ppn 40 /phi/pkg/smash-2.1.0/bin/smash.intel64.impi : -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -env OMP_NUM_THREADS 244 -np 1 -host mic0 /phi/pkg/smash-2.1.0/bin/smash.mic.impi : -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -env OMP_NUM_THREADS 244 -np 1 -host mic1 /phi/pkg/smash-2.1.0/bin/smash.mic.impi < large-memory-b3.inp > large-memory-b3.q183+mic01.impi.488+40 &

Timing data of the hybrid run (1 x 40 threads on the host + 2 x 244 threads on the MICs) in large-memory-b3.q183+mic01.impi.488+40:
 Step CPU :      20.7, Total CPU :      20.7 of Master node
 Step Wall :      1.7, Total Wall :      1.7 at Sun Jan 29 00:24:45 2017
--
 Step CPU :     233.1, Total CPU :     253.8 of Master node
 Step Wall :     11.2, Total Wall :     12.9 at Sun Jan 29 00:24:56 2017
--
 Step CPU :    1430.7, Total CPU :    1684.5 of Master node
 Step Wall :     58.6, Total Wall :     71.4 at Sun Jan 29 00:25:55 2017
--
 Step CPU :       3.0, Total CPU :    1687.5 of Master node
 Step Wall :      0.1, Total Wall :     71.5 at Sun Jan 29 00:25:55 2017



Concluding Remarks (tentative)
Maybe ~80 seconds of wall time is the parallelization limit for a problem of this size?
Or maybe two Intel Phi cards perform roughly on par with two E5-2670v2@2.3GHz CPUs?

To be continued.....
1. Try mpitune and the Intel Trace Analyzer (refs 2, 3)
2. Play with KMP_AFFINITY (ref 4)
3. Explore other options in addition to I_MPI_FABRICS=shm:tcp


================================== NOTES ========================================
Debug log:

1. Edit /etc/hosts and add
172.31.1.254    phi
This resolves error messages similar to:
HYDU_getfullhostname (../../utils/others/others.c:146): getaddrinfo error (hostname: phi, error: Name or service not known)

2. Set OMP_STACKSIZE=1G if problems occur before the SCF iterations. This applies to both OpenMP and MPI runs.
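For example, for the 60-thread native OpenMP run shown above:
OMP_STACKSIZE=1G OMP_NUM_THREADS=60 /phi/pkg/smash-2.1.0/bin/smash.mic.impi large-memory-b3.inp > large-memory-b3.mic0.openmp.60 &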


References:

  1. https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/542161
  2. https://software.intel.com/en-us/node/528811
  3. http://slidegur.com/doc/76099/software-and-services-group
  4. https://software.intel.com/en-us/node/522691