Platform: Linux
Applies to: COMSOL Multiphysics®, COMSOL Server™
Versions: 6.0, 5.6

Problem Description

I am observing that distributed cluster jobs on Linux are not starting up. I am receiving error messages from MPI.

Solution

The underlying reason for COMSOL not working on a Linux cluster might be that the network interface and fabrics are not detected correctly. COMSOL 6.0 is shipped with Intel MPI 2021.2 on Linux. You can investigate whether there is an incompatibility with Intel MPI using the following steps:

If Intel MPI does not appear to work on your cluster, first make sure that your submission script is configured correctly. Then run the MPI test by calling

comsol hydra mpitest -nn 2 -f hostfile

or, e.g. with Slurm,

#SBATCH --nodes=2  
#SBATCH --ntasks-per-node=1 
...
comsol hydra mpitest -nn 2 -nnhost 1 

to confirm that MPI is actually the issue. You can add the switch '-mpidebug 10' to get additional debug output.
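The Slurm variant of the test, with the debug switch added, could be wrapped in a job script along the following lines. This is a minimal sketch, not a definitive submission script: the helper name mpitest_cmd and the log file name are made up for illustration, and the script assumes that comsol is on the PATH of the compute nodes.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Hypothetical helper: print the MPI test command for a given node count.
# An optional second argument adds -mpidebug with that level.
mpitest_cmd() {
    nodes="$1"
    debug_level="${2:-}"
    cmd="comsol hydra mpitest -nn ${nodes} -nnhost 1"
    if [ -n "${debug_level}" ]; then
        cmd="${cmd} -mpidebug ${debug_level}"
    fi
    printf '%s\n' "${cmd}"
}

# SLURM_JOB_NUM_NODES is set by Slurm inside a job; fall back to 2 elsewhere.
test_cmd="$(mpitest_cmd "${SLURM_JOB_NUM_NODES:-2}" 10)"
echo "MPI test command: ${test_cmd}"

# Run the test only when comsol is available, and keep the output in a log file.
if command -v comsol >/dev/null 2>&1; then
    ${test_cmd} 2>&1 | tee "mpitest_${SLURM_JOB_ID:-local}.log"
fi
```

Keeping the debug output in a log file makes it easier to compare runs before and after trying the suggestions below.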

To resolve the problem, you can try suggestions A and B. Even if A works for you, you should also try B, as this option can offer better performance.

A. Fall back to TCP

Export the environment variable FI_PROVIDER and set it to 'sockets'. With Slurm, this can be done by means of

#SBATCH --export=FI_PROVIDER=sockets

Otherwise, in a Bourne-style shell such as bash, you can use

export FI_PROVIDER=sockets

or, in csh or tcsh,

setenv FI_PROVIDER sockets

and make sure that this environment variable is handed over to your cluster job.

The downside of this approach is that the communication falls back to TCP, which might be slow if you have a faster fabric available.
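To check that the variable really reaches the cluster job, a small test at the top of the job script can help. The following is a minimal sketch; the function name check_provider is made up for illustration. The export line is included only to make the example self-contained. In practice the variable would be set at submission time and only the check would run inside the job.

```shell
# Request the TCP-based provider, as described above.
export FI_PROVIDER=sockets

# Hypothetical helper: report whether the TCP fallback is visible
# in the environment of the current shell.
check_provider() {
    if [ "${FI_PROVIDER:-}" = "sockets" ]; then
        echo "FI_PROVIDER=sockets"
        return 0
    fi
    echo "FI_PROVIDER is '${FI_PROVIDER:-unset}', expected 'sockets'" >&2
    return 1
}

# Stop early if the variable was not handed over to this shell.
check_provider || exit 1
```

With Slurm, note that according to the sbatch documentation, listing specific variables in --export without ALL propagates only those variables and the SLURM_* variables. If the job also needs the rest of your environment, --export=ALL,FI_PROVIDER=sockets may be the safer form.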

B. Install a later Intel MPI

Download Intel MPI 2021.6 from Intel and install it. You can install it in your home directory if you don't have admin rights on the cluster.

Launch COMSOL with the additional switch

-mpiroot <Intel 2021.6 installation directory>/intel/oneapi/mpi/2021.6.0 

On Slurm, you can call for example

#SBATCH --nodes=2  
#SBATCH --ntasks-per-node=1 
...
comsol hydra mpitest -nn 2 -nnhost 1 -mpiroot <Intel 2021.6 installation directory>/intel/oneapi/mpi/2021.6.0
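Before passing a directory to -mpiroot, it can be useful to verify that it looks like an Intel MPI installation. The following is a minimal sketch under two assumptions that are not part of the instructions above: that an Intel MPI root contains an executable bin/mpiexec.hydra, and that the installation prefix is held in a hypothetical variable INTEL_PREFIX. The function name check_mpiroot is also made up for illustration.

```shell
# Hypothetical helper: return 0 if the directory looks like an Intel MPI root.
# Assumption: an Intel MPI root contains an executable bin/mpiexec.hydra.
check_mpiroot() {
    dir="$1"
    if [ ! -d "${dir}" ]; then
        echo "not a directory: ${dir}" >&2
        return 1
    fi
    if [ ! -x "${dir}/bin/mpiexec.hydra" ]; then
        echo "no executable bin/mpiexec.hydra in: ${dir}" >&2
        return 1
    fi
    return 0
}

# Hypothetical prefix; replace it with your Intel 2021.6 installation directory.
INTEL_PREFIX="${INTEL_PREFIX:-$HOME}"
MPIROOT="${INTEL_PREFIX}/intel/oneapi/mpi/2021.6.0"

# Run the MPI test against the newer Intel MPI only if both checks pass.
if check_mpiroot "${MPIROOT}" && command -v comsol >/dev/null 2>&1; then
    comsol hydra mpitest -nn 2 -nnhost 1 -mpiroot "${MPIROOT}"
else
    echo "skipping MPI test: check ${MPIROOT} and that comsol is on PATH"
fi
```

A check like this catches a mistyped path before the job is queued, rather than after it fails on the compute nodes.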

Remarks:

  • You can also point to other MPICH2-based MPI installations (but not to Open MPI, for example).
  • In COMSOL 5.6, you can point to Intel MPI 2021.6 via -mpiroot as well.