Updates to this page have ended.

For the latest information, please see the TSUBAME3.0 computing service web pages.

The document describing how to migrate data from TSUBAME2.5 to TSUBAME3.0 is available here.

What should I do when an MPI job fails?

Several MPI environments (libraries) are available, so first check which MPI environment your program uses. Typical execution errors for each MPI environment are shown below.
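
A quick way to check which MPI library a program was built against is to look
at the shared libraries it links. This is only a rough sketch: it assumes the
program is dynamically linked, and the library names in the comments are
typical rather than confirmed for every installation ("./check" is the example
executable that appears in the messages below).

    # Inspect the MPI-related shared libraries the binary links.
    ldd ./check | grep -i -e mpi -e ibumad
    # libmpi.so from an openmpi directory     -> probably built with openmpi
    # libmpich.so (and often libibumad.so)    -> probably mvapich2 or mpich2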

1. Using openmpi
  The following error messages are printed:
    /var/spool/PBS/mom_priv/jobs/1662.t2zpbs03.SC: line 10: c: command not found
    --------------------------------------------------------------------------
    orterun was unable to launch the specified application as it could not access
    or execute an executable:

     Executable: ./check
     Node: t2a001121-vm1

     while attempting to start process rank 0.
    --------------------------------------------------------------------------

 (a) The program may have been compiled with mvapich2 or mpich2.
 (b) The program may have been run on V nodes.
     (V nodes run on virtual machines, where only mpich2 works properly.)

 2. Using mvapich2
  With mvapich2, the following error messages are printed:

     /var/spool/PBS/mom_priv/jobs/1663.t2zpbs03.SC: line 10: c: command not found
     /usr/apps/mvapich2/1.5.1/pgi/bin/mpirun_rsh: error while loading shared libraries: libibumad.so.3: cannot open shared object file: No such file or directory

 (a) The program may have been built with openmpi or mpich2.
 (b) The program may have been run on V nodes.
     (V nodes run on virtual machines, where only mpich2 works properly;
      a rough hostname check is sketched below.)
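
One rough way to tell whether a job landed on a V node is the hostname: the
V node in the error output above is called t2a001121-vm1, so a "-vm" suffix
suggests a virtual machine. This naming pattern is an assumption drawn from
that log, not an official rule.

    # Hypothetical check based on the "-vm" suffix seen in the logs above.
    case "$(hostname)" in
      *-vm*) echo "V node (virtual machine): use mpich2" ;;
      *)     echo "physical node" ;;
    esac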

 3. Using mpich2
 (a) If the error messages are as follows, the program was built with openmpi.
    --------------------------------------------------------------------------
    It looks like opal_init failed for some reason; your parallel process is
    likely to abort.  There are many reasons that a parallel process can
    fail during opal_init; some of which are due to configuration or
    environment problems.  This failure appears to be an internal failure;
    here's some additional information (which may only be relevant to an
    Open MPI developer):

      opal_carto_base_select failed
      --> Returned value -13 instead of OPAL_SUCCESS
    --------------------------------------------------------------------------
    [t2a001121-vm1:31986] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 77
    --------------------------------------------------------------------------
      :

  (b) If the error messages are as follows, the program was built with mvapich2.

     ./check: error while loading shared libraries: libibumad.so.3: cannot open shared object file: No such file or directory
     ./check: error while loading shared libraries: libibumad.so.3: cannot open shared object file: No such file or directory
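
Whichever mismatch applies, the usual fix is the same: rebuild the program
with the compiler wrappers of the MPI you actually intend to run under, and
launch it with the mpirun from that same installation. The sketch below is
illustrative only; it assumes the compiler wrappers live in the same bin
directory as the launcher shown in the mvapich2 error above, and the source
file name is a placeholder.

    # Put the intended MPI's bin directory first in PATH (example path taken
    # from the error message above), then rebuild and re-check the linkage.
    export PATH=/usr/apps/mvapich2/1.5.1/pgi/bin:$PATH
    mpicc -o check check.c          # or mpif90 for Fortran sources
    ldd ./check | grep -i mpi       # confirm the intended MPI library is linked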

4. Other errors
[[54475,1],0][btl_openib_component.c:3224:handle_wc] from t2a001021 to:  
t2a001023 error polling LP CQ with status RETRY EXCEEDED ERROR status number  
12 for wr_id 187335168 opcode 0  vendor error 129 qp_idx 0
--------------------------------------------------------------------------

 or,

 [t2a000015][[14969,1],134][btl_openib_component.c:1492:init_one_device] error
 obtaining device context for mlx4_1 errno says Device or resource busy
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   t2a000015
    Local device: mlx4_1

If you see errors like these, the InfiniBand (IB) hardware may be faulty.
Normally a faulty node is removed from operation, but when that cannot be
done immediately (for example on weekends and public holidays), the
following workarounds can be considered.

 (a) Use only one of the IB rails (a full command line is sketched after this list).
    mpirun -n 2 --mca btl_openib_max_btls 1 -machinefile  ...
   Communication performance drops somewhat, but the job may still run.

 (b) Log in to a machine that is working normally and use it directly.
   One set of nodes goes to waste, but execution becomes possible.
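
A sketch of workaround (a): only the --mca option comes from the text above;
the machinefile name and the executable are placeholders.

    # Limit openmpi's openib BTL to a single module (one IB rail/HCA).
    mpirun -n 2 --mca btl_openib_max_btls 1 -machinefile ./machines ./check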

5. When mpirun is invoked directly, the following errors may appear.

> mpirun -np 4 ./sample5a
Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1306): MPI_Bcast(buf=0x6afbc8, count=4, MPI_CHARACTER, root=1, MPI_COMM_WORLD) failed
PMPI_Bcast(1268): Invalid root (value given was 1)
  :
> mpirun -np 4 -hostfile ./host ./sample5a
mpirun_rsh: PMI key 'PMI_process_mapping' not found.[cli_0]: readline failed
[cli_1]: readline failed
[cli_2]: readline failed
  :
> mpirun -np 4 ./sample5a
[t2a006179:18372] *** An error occurred in MPI_Bcast
[t2a006179:18372] *** on communicator MPI_COMM_WORLD
[t2a006179:18372] *** MPI_ERR_ROOT: invalid root
[t2a006179:18372] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
      :

In this case, the MPI used to launch the program differs from the MPI it was
compiled with (for example, the program was compiled with mpich2 but launched
with openmpi's mpirun). When the launcher and the library do not match, each
process typically starts as an independent single-process job, so
MPI_COMM_WORLD has size 1 and a root of 1 is out of range, as in the errors
above. Please check the MPI environment.
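
When launching by hand, it also helps to confirm which launcher the shell
actually picks up; the installation path usually reveals whether it belongs
to openmpi, mvapich2, or mpich2 (compare this with the ldd check shown
earlier).

    # Show which mpirun is found first in PATH and where it is installed.
    which mpirun
    ls -l "$(which mpirun)"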

6. Problems like the following occur when zombie processes are left behind.
We regularly check for zombie processes and kill them; in such a case,
please resubmit your job after they have been removed.
The cleanup program runs at 10 minutes past every hour.

(a) mvapich2
   Error in init phase...wait for cleanup! (1/8 mpispawn connections)
   Failed in initilization phase, cleaned up all the mpispawn!

(b) openmpi
   orterun was unable to cleanly terminate the daemons on the nodes shown
   below. Additional manual cleanup may be required - please refer to the
   "orte-clean" tool for assistance.