3.2 Parallel Runs

Some of the TURBOMOLE modules are parallelized using the message passing interface (MPI) for distributed and shared memory machines or with OpenMP or multi-threaded techniques for shared memory and multi-core machines.

Generally there are two hardware scenarios which determine the kind of parallelization that can be used: distributed memory systems such as clusters of nodes connected by a network, on which only the MPI version can be used, and shared memory multi-core machines, on which both the MPI and the OpenMP or multi-threaded versions can be used.

The parallelized modules and the type of parallelization available for each of them are described in the following subsections.

Additional keywords necessary for parallel runs with the MPI binaries are described in Chapter 20. However, those keywords do not have to be set by the user. When using the parallel version of TURBOMOLE, scripts replace the binaries: they prepare the usual input, run the necessary preparation steps and automatically start the parallel programs. The user just has to set a few environment variables, see Sec. 3.2.1 below.

To use the OpenMP parallelization, only an environment variable needs to be set. But to use this parallelization efficiently, a few additional points, e.g. memory usage, should be considered; they are described in Sec. 3.2.2.

3.2.1 Running Parallel Jobs — MPI case

The parallel MPI version of TURBOMOLE runs on all supported systems.

Setting up the parallel MPI environment

In addition to the installation steps described in Section 2 (see page 47) you just have to set the variable PARA_ARCH to MPI, i.e. in sh/bash/ksh syntax:

export PARA_ARCH=MPI

This will cause sysname to append the string _mpi to the system name and the scripts like jobex will take the parallel binaries by default. To call the parallel versions of the programs ridft, rdgrad, dscf, grad, ricc2, or mpgrad from your command line without explicit path, expand your $PATH environment variable to:

export PATH=$TURBODIR/bin/`sysname`:$PATH

The usual binaries are replaced now by scripts that prepare the input for a parallel run and start mpirun (or poe on IBM) automatically. The number of CPUs that shall be used can be chosen by setting the environment variable PARNODES:

export PARNODES=8

The default for PARNODES is 2.
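Putting these settings together, a minimal MPI setup in bash might look like the following sketch (the installation path and the choice of 8 CPUs and a jobex -ri geometry optimization are assumptions; adapt them to your system and job):

export TURBODIR=/whereis/TURBOMOLE        # your installation path
export PATH=$TURBODIR/scripts:$PATH       # make sysname, jobex, etc. available
export PARA_ARCH=MPI                      # select the MPI binaries
export PATH=$TURBODIR/bin/`sysname`:$PATH
export PARNODES=8                         # number of CPUs to use
jobex -ri > jobex.out &                   # start the parallel job as usual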

Finally, the user can set a default scratch directory that must be available on all nodes. Writing scratch files to local directories is highly recommended; otherwise the scratch files will be written over the network to the same directory where the input is located. The path to the local disk can be set with

export TURBOTMPDIR=/scratch/username

This setting is automatically recognized by the parallel ridft and ricc2 programs. Note:

On all systems TURBOMOLE uses the MPI library that has been shipped with your operating system.

On Linux for PCs and Windows systems IBM Platform MPI (formerly known as HP-MPI) is used — see the IBM Platform MPI documentation.

COSMOlogic ships TURBOMOLE with an IBM Platform MPI Community Edition. TURBOMOLE users do not have to install or license IBM Platform MPI themselves. Parallel binaries will run out of the box on the fastest interconnect that is found: InfiniBand, Myrinet, TCP/IP, etc.

The binaries that initialize MPI and start the parallel binaries (mpirun) are located in the $TURBODIR/mpirun_scripts/em64t-unknown-linux-gnu_mpi/PMPI/ directory.

Note: the parallel TURBOMOLE modules (except ricc2) need an extra server running in addition to the clients. This server is included in the parallel binaries and it will be started automatically—but this results in one additional task that usually does not need any CPU time. So if you are setting PARNODES to N, N+1 tasks will be started.

If you are using a queuing system or if you give a list of hosts on which TURBOMOLE jobs shall run (see below), make sure that the number of supplied nodes matches $PARNODES — e.g. if you are using 4 CPUs via a queuing system, make sure that $PARNODES is set to 4.

In some older versions of the LoadLeveler on IBM systems the total number of tasks must be set to  $PARNODES + 1  (except for ricc2).

Starting parallel jobs

After setting up the parallel environment as described in the previous section, parallel jobs can be started just like the serial ones. If the input is a serial one, it will be prepared automatically for the parallel run.

The parallel versions of the programs dscf and grad need an integral statistics file as input, which is generated by a parallel statistics run. This preparation step is done automatically by the scripts dscf and grad that are called in the parallel version. In this preparation step the size of the file twoint, which holds the 2e-integrals for semi-direct calculations, is recalculated and reset. It is highly recommended to set the path of this twoint file to a local scratch directory of each node by changing the line

 unit=30 size=?????  file=twoint

to

 unit=30 size=?????  file=/local_scratchdir/twoint
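One way to make this change without editing the control file by hand is a simple text substitution, e.g. with sed (the scratch path is an assumption; adapt it to your nodes):

sed -i 's|file=twoint|file=/local_scratchdir/twoint|' control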

For the additional mandatory or optional input for parallel runs with the ricc2 program see Section 10.6.

Running calculations on different nodes

If TURBOMOLE is supposed to run on a cluster, we highly recommend the usage of a queuing system like PBS. The parallel version of TURBOMOLE will automatically recognize that it is started from within the PBS environment and the binaries will run on the machines PBS provides for each job.

Important: Make sure that the input files are located on a network directory like an NFS disk which can be accessed on all nodes that participate in the calculation.

A file that contains a list of machines has to be created, each line containing one machine name:

node1  
node1  
node2  
node3  
node4  
node4

The environment variable $HOSTS_FILE has to be set to the location of that file:

export HOSTS_FILE=/nfshome/username/hostsfile

Note: Do not forget to set $PARNODES to the number of lines in $HOSTS_FILE.
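Since the two must stay consistent, a small sketch that derives PARNODES directly from the hosts file (assuming a bash shell):

export HOSTS_FILE=/nfshome/username/hostsfile
export PARNODES=`wc -l < $HOSTS_FILE`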

Note: In general the stack size limit has to be raised to a reasonable fraction of the available memory (or to unlimited). In the serial version the user can set this with ulimit -s unlimited on bash/sh/ksh shells or limit stacksize unlimited on csh/tcsh shells. However, for the parallel version that is not sufficient if several nodes are used, and the /etc/security/limits.conf files on all nodes might have to be changed. Please see the TURBOMOLE User Forum for details.
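As an illustration, the stack limit could be raised system-wide on each node with entries along the following lines in /etc/security/limits.conf (the domain and values are assumptions; adapt them to your site policy):

# raise the stack size limit for all users on this node
*    soft    stack    unlimited
*    hard    stack    unlimited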

Testing the parallel binaries

The binaries ridft, rdgrad, dscf, grad, and ricc2 can be tested by the usual test suite: go to $TURBODIR/TURBOTEST and call TTEST

Note: Some of the tests are very small and will only pass properly if at most 2 CPUs are used. Therefore TTEST will not run any test if $PARNODES is set to a value higher than 2.

If you want to run some of the larger tests with more CPUs, you have to edit the DEFCRIT file in $TURBODIR/TURBOTEST and change the $defmaxnodes option.

Linear Algebra Settings

The number of CPUs used for the linear algebra part of TURBOMOLE and the algorithm it employs depend on the setting of $parallel_platform:

cluster
– for clusters with TCP/IP interconnect. Communication is avoided by using an algorithm that involves only one or a few CPUs.
MPP
– for clusters with a fast interconnect like InfiniBand or Myrinet. The number of CPUs that take part in the linear algebra routines depends on the size of the input and on the number of nodes used.
SMP
– all CPUs are used and ScaLAPACK (see http://www.netlib.org/scalapack/) routines are involved.

The scripts in $TURBODIR/mpirun_scripts automatically set this keyword depending on the output of sysname. All options can be used on all systems, but especially the SMP setting can slow down the calculation if used on a cluster with high latency or low bandwidth.
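The keyword itself is a data group in the control file; a hypothetical excerpt for a cluster with a fast interconnect could look as follows (normally the scripts write this entry for you, so editing it by hand is rarely needed):

$parallel_platform MPP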

Sample simple PBS start script
#!/bin/sh  
# Name of your run :  
#PBS -N turbomole  
#  
# Number of nodes to run on:  
#PBS -l nodes=4  
#  
# Export environment:  
#PBS -V  
 
# Set your TURBOMOLE paths:
 
######## ENTER YOUR TURBOMOLE INSTALLATION PATH HERE ##########  
export TURBODIR=/whereis/TURBOMOLE  
###############################################################  
 
export PATH=$TURBODIR/scripts:$PATH  
 
## set locale to C  
unset LANG  
unset LC_CTYPE  
 
# set stack size limit to unlimited:  
ulimit -s unlimited  
 
# Count the number of nodes  
PBS_L_NODENUMBER=`wc -l < $PBS_NODEFILE`
 
# Check if this is a parallel job  
if [ $PBS_L_NODENUMBER -gt 1 ]; then  
##### Parallel job  
# Set environment variables for a MPI job  
    export PARA_ARCH=MPI  
    export PATH="${TURBODIR}/bin/‘sysname‘:${PATH}"  
    export PARNODES=‘expr $PBS_L_NODENUMBER‘  
else  
##### Sequentiel job  
# set the PATH for Turbomole calculations  
    export PATH="${TURBODIR}/bin/‘sysname‘:${PATH}"  
fi  
 
#VERY important is to tell PBS to change directory to where  
#     the input files are:  
 
cd $PBS_O_WORKDIR  
 
######## ENTER YOUR JOB HERE ##################################  
jobex -ri > jobex.out  
###############################################################
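Assuming the script above is saved as, say, turbo.pbs (the file name is an arbitrary example), it is submitted to the queue in the usual PBS way:

qsub turbo.pbs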

3.2.2 Running Parallel Jobs — SMP case

The SMP version of TURBOMOLE currently combines three different parallelization schemes which all use shared memory: OpenMP parallelization, multi-thread parallelization based on fork() and Unix sockets, and parallelization based on the Global Arrays toolkit. They are described in the following subsections.

Setting up the parallel SMP environment

In addition to the installation steps described in Section 2 (see page 47) you just have to set the variable PARA_ARCH to SMP, i.e. in sh/bash/ksh syntax:

export PARA_ARCH=SMP

This will cause sysname to append the string _smp to the system name and the scripts like jobex will take the parallel binaries by default. To call the parallel versions of the programs ridft, rdgrad, dscf, riper, ricc2, aoforce, escf or egrad from your command line without explicit path, expand your $PATH environment variable to:

export PATH=$TURBODIR/bin/`sysname`:$PATH

The usual binaries are replaced now by scripts that prepare the input for a parallel run and start the job automatically. The number of CPUs that shall be used can be chosen by setting the environment variable PARNODES:

export PARNODES=8

The default for PARNODES is 2.

NOTE: Depending on what you are going to run, some care has to be taken that system settings like memory limits, etc. will not prevent the parallel versions from running. See the following sections.
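Putting the above together, a minimal SMP setup in bash might look like the following sketch (the installation path and the core count of 8 are assumptions; adapt them to your machine):

export TURBODIR=/whereis/TURBOMOLE        # your installation path
export PATH=$TURBODIR/scripts:$PATH       # make sysname, jobex, etc. available
export PARA_ARCH=SMP                      # select the SMP binaries
export PATH=$TURBODIR/bin/`sysname`:$PATH
export PARNODES=8                         # number of threads/cores to use
ulimit -s unlimited                       # raise the stack size limit
jobex -ri > jobex.out &                   # start the job as usual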

OpenMP parallelization of dscf, ricc2, ccsdf12, pnoccsd, and riper

The OpenMP parallelization does not need any special program startup. The binaries can be invoked in exactly the same manner as for sequential (non-parallel) calculations. The only difference is that before the program is started the environment variable PARNODES has to be set to the number of threads that should be used by the program; the scripts will set OMP_NUM_THREADS to the same value and start the OpenMP binaries. The number of threads is essentially the maximum number of CPU cores the program will try to utilize. To exploit e.g. all eight cores of a machine with two quad-core CPUs set

export PARNODES=8

(for csh and tcsh use setenv PARNODES 8).

Presently the OpenMP parallelization of ricc2 comprises all functionalities apart from the O(N⁴)-scaling LT-SOS-RI functionalities (which are only parallelized with MPI) and expectation values for Ŝ² (not parallelized). Note that for OpenMP-parallel calculations the memory specified with $maxcor is the maximum amount of memory that will be dynamically allocated by all threads together. To use your computational resources efficiently, it is recommended to set this value to about 75% of the physical memory available for your calculations, but to at most 16000 megabytes. (Due to the use of integer*4 arithmetic the ricc2 program is presently limited to 16 Gbytes.)
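For example, on a node with 16 GB of memory available for the calculation, 75% corresponds to roughly 12000 MB; a control file entry along the following lines would request that amount (the value is an illustrative assumption):

$maxcor 12000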

In the dscf program the OpenMP parallelization presently covers only the Hartree-Fock Coulomb and exchange contributions to the Fock matrix in fully integral-direct mode and is mainly intended to be used in combination with OpenMP-parallel runs of ricc2. Nevertheless, the OpenMP parallelization can also be used in DFT calculations, but the numerical integration for the DFT contribution to the Fock matrix will only use a single thread (CPU core) and thus the overall speedup will be smaller.

Localized Hartree-Fock calculations (dscf program) are parallelized using OpenMP. In this case an almost ideal speedup is obtained, because the most expensive parts of the calculation, the evaluation of the Fock matrix and of the Slater potential, are both well parallelized. The calculation of the correction term of the grid will use a single thread.

The OpenMP parallelization of riper covers all contributions to the Kohn-Sham matrix and nuclear gradient. Hence an almost ideal speedup is obtained.

Restrictions:

Multi-thread parallelization of aoforce, escf and egrad 

The parallelization of these modules is described in [26] and is based on fork() and Unix sockets. Apart from setting PARNODES, which triggers the environment variable SMPCPUS, nothing else has to be set. Alternatively, the binaries can be called with the -smpcpus <N> command line option or with the keyword $smp_cpus in the control file.
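As an illustration, a force constant calculation on four cores could be started in either of the following ways (the module and core count are arbitrary examples):

export PARNODES=4
aoforce > aoforce.out

# or, equivalently, using the command line option:
aoforce -smpcpus 4 > aoforce.out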

Multi-thread parallelization of dscf, grad, ridft and rdgrad 

Instead of the default binaries used in the SMP version, setting

export TM_PAR_FORK=on

will cause the TURBOMOLE scripts to use the multi-threaded versions of dscf, grad, ridft and rdgrad. The efficiency of the parallelization is usually similar to that of the default version, but for ridft and rdgrad RI-K is not parallelized. If the density convergence criterion ($denconv) is switched on in ridft and no RI-K is used, the multi-threaded version should be used.
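A minimal sketch of such a run in bash, assuming the SMP environment from the beginning of this section is already set up (the thread count of 8 is an arbitrary example):

export TM_PAR_FORK=on      # switch to the multi-threaded binaries
export PARNODES=8          # number of threads to use
ridft > ridft.out          # run as usual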

Global Arrays parallelization of ridft and rdgrad

ridft and rdgrad are parallelized with MPI using the Global Arrays toolkit.

Those versions are automatically used when setting PARA_ARCH=SMP. Due to the explicit usage of shared memory on an SMP system, the user has to be allowed to use sufficient shared memory: