Opened 7 years ago

Last modified 2 years ago

#1422 new bug

Error messages in MPI_Barrier with a large number of processes

Reported by: wangraying@… Owned by: wbland
Priority: major Milestone: mpich-3.3
Component: mpich Keywords:
Cc:

Description (last modified by balaji)

I ran some tests, and it seems that MPICH fails when I start a large number of processes, say 2000.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <mpi.h>

int main(int argc, char **argv)
{
        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
}

The error messages are as follows:

[proxy:0:0] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:79): assert (!closed) failed
[proxy:0:0] fn_get (./pm/pmiserv/pmip_pmi_v1.c:351): error sending PMI response
[proxy:0:0] pmi_cb (./pm/pmiserv/pmip_cb.c:326): PMI handler returned error
[proxy:0:0] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0] main (./pm/pmiserv/pmip.c:208): demux engine error waiting for event
[mpiexec] control_cb (./pm/pmiserv/pmiserv_cb.c:150): assert (!closed) failed
[mpiexec] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:206): error waiting for event
[mpiexec] main (./ui/mpich/mpiexec.c:404): process manager error waiting for completion

I used the following command:

mpirun -np 2000 -machinefile mfile -disable-auto-cleanup ./barrier

Change History (8)

comment:1 Changed 7 years ago by fhr@…

We are also seeing this exact same error. In our case we start only 20 processes. Some of the processes reach a collective operation, e.g. MPI_File_set_view, while others throw an exception before reaching it. The exceptions cause MPI_Abort to be called, and then the error in this ticket occurs on a couple of nodes. The job finishes, but those processes remain hung on the nodes; attaching to them shows stack traces stuck in the collective operation.

comment:2 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.3.4
  • Owner set to balaji
  • Status changed from new to assigned

Can you attach the test program that shows this error? I'm not able to reproduce this.

comment:3 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.4 to mpich2-1.4.1

Milestone mpich2-1.3.4 deleted

comment:4 Changed 5 years ago by balaji

  • Milestone changed from mpich2-1.5 to mpich-3.0

Fault tolerance (FT) is not a priority for the 1.5 release. Moving to be revisited for the 3.0 release.

comment:5 Changed 5 years ago by balaji

  • Owner changed from balaji to buntinas

comment:6 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:7 Changed 5 years ago by balaji

  • Description modified (diff)
  • Milestone changed from mpich-3.1 to mpich-3.2
  • Owner changed from buntinas to wbland
  • Status changed from assigned to new

comment:8 Changed 2 years ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted
