Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#1421 closed bug (fixed)

MPI functions hang during faults for large-message communication

Reported by: wangraying@…
Owned by: buntinas
Priority: major
Milestone: mpich2-1.3.2
Component: mpich
Keywords:
Cc:

Description

The following 2-process test program hangs if the receiver (rank 1) is killed.

#include <stdio.h>
#include <unistd.h>   /* getpid() */
#include "mpi.h"

#define BUFLEN (1024 * 1024)

int main(int argc, char **argv)
{
    int rank, i;
    char buf[BUFLEN];

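    MPI_Init(NULL, NULL);
    /* The default handler, MPI_ERRORS_ARE_FATAL, aborts instead of returning
       an error code, so the checks below need MPI_ERRORS_RETURN. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);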
    printf("[%d] pid: %d\n", rank, getpid());

    for (i = 0; i < 1000000; i++) {
        if (rank == 0) {
            if (MPI_Send(buf, BUFLEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD)) {
                printf("got an error; breaking out\n");
                break;
            }
        }
        else {
            if (MPI_Recv(buf, BUFLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)) {
                printf("got an error; breaking out\n");
                break;
            }
        }
    }

    MPI_Finalize();

    return 0;
}

I used the following mpiexec options:

mpiexec -disable-auto-cleanup -n 2 ./a.out

Change History (6)

comment:1 in reply to: ↑ description Changed 3 years ago by wangraying@…

I did some more tests. It also hangs if the sender (rank 0) is killed.

comment:2 Changed 3 years ago by buntinas

  • Status changed from new to accepted

I have some ideas about what's wrong and how to fix it, but I don't think I'll make it in time for the next release.

comment:3 Changed 3 years ago by buntinas

  • Resolution set to fixed
  • Status changed from accepted to closed

Rui,

I committed some things that should fix this in [eea2dda84f884f04a01de03d9500bea0225f4ef4]. Let us know if this doesn't work.

Thanks,
-d

comment:4 Changed 3 years ago by wangraying@…

It works well now!

Regards, Rui

comment:5 Changed 3 years ago by ilaguna@…

Hi,

Could you guys give a little more detail about what a hang means here, and about the expected behavior of the -disable-auto-cleanup option?

Does a hang mean that the peer process continues running and eventually exits, or that it continues running but never exits?

If I use the -disable-auto-cleanup option, what is the expected behavior if a process is killed? I have been reading the code, and my understanding is that processes are signaled with SIGUSR1 (from what I see in a loop in src/pm/hydra/pm/pmiserv/pmip_cb.c). From what I read in the README file, the application can catch SIGUSR1 (maybe to do something before aborting); is this correct? But what happens with a program like the test above, which has no SIGUSR1 signal handler? Does MPICH2 have a default SIGUSR1 signal handler?

Thank you in advance for your answers. I work on an LLNL research project to build debugging tools for large-scale clusters.

comment:6 Changed 3 years ago by balaji

With -disable-auto-cleanup, when a process dies, the process manager does not automatically clean up the remaining processes.

The hang was caused by a bug in MPICH2: communication with the dead process waited indefinitely instead of returning an error. This has been fixed in [eea2dda84f884f04a01de03d9500bea0225f4ef4]. You can try out the latest nightly snapshot for this fix:

http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk

Note that this fix has not been backported to the 1.3.x branch, AFAICT. So, it will not be available in the upcoming 1.3.2 release.

With respect to SIGUSR1: it is there in case you need to be notified when some process in the system dies. In most cases you do not need such notification, because the communication operation will return a code other than MPI_SUCCESS, which tells you that a process is dead. Also, note that SIGUSR1 signaling is a temporary hack and will disappear in the future. Please see the README in 1.3.2rc1 for more details:

https://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/tags/release/mpich2-1.3.2rc1/README.vin#L738

As pointed out in the README, please DO NOT use the SIGUSR1 handler unless you really need to. This is temporary, and will almost certainly disappear in a future release.
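
For the rare case where such notification really is needed, the following is a minimal sketch of catching SIGUSR1 in plain POSIX C. It is not taken from MPICH2 or its README: the helper names install_failure_notifier and failure_was_signaled are hypothetical, and the sketch assumes only that the process manager delivers SIGUSR1 to the surviving processes as described above. The handler does nothing but set a flag, since only async-signal-safe work is allowed inside a signal handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set asynchronously by the handler; polled from normal program flow. */
static volatile sig_atomic_t peer_failure_seen = 0;

/* Signal handler: record the notification and return immediately. */
static void on_sigusr1(int signo)
{
    (void)signo;
    peer_failure_seen = 1;
}

/* Hypothetical helper: install the SIGUSR1 handler.
   Returns 0 on success, -1 on failure (as sigaction() does). */
int install_failure_notifier(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Hypothetical helper: returns 1 if a notification arrived since the
   last call (and clears the flag), 0 otherwise. */
int failure_was_signaled(void)
{
    if (peer_failure_seen) {
        peer_failure_seen = 0;
        return 1;
    }
    return 0;
}
```

A program would call install_failure_notifier() once near startup and poll failure_was_signaled() between communication calls. Checking the return code of each MPI call remains the primary way to detect a dead peer; the flag is only a supplement.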
