Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#1418 closed bug (fixed)

sub-communicator operation sometimes fails if node failure occurs

Reported by: wangraying@… Owned by: buntinas
Priority: major Milestone: mpich2-1.4
Component: mpich Keywords:
Cc:

Description

Point-to-point communication sometimes fails in a new communicator split from MPI_COMM_WORLD.
The test I used is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
	int		rank, rc = MPI_SUCCESS, tag = 99, buf, len;
	char		string[MPI_MAX_ERROR_STRING];
	int		myrow, mycol, npcol = 4;
	MPI_Comm	col_comm;
	MPI_Status	status;

	MPI_Init(&argc, &argv);
	MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	myrow = rank / npcol;
	mycol = rank - myrow * npcol;
	MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);
	MPI_Comm_set_errhandler(col_comm, MPI_ERRORS_RETURN);

	if (rank == 2)		/* simulate a node failure */
		exit(0);
	sleep(1);
/*	MPI_Barrier(MPI_COMM_WORLD); */
	if (myrow == 2)
	{
		buf = rank;
		rc = MPI_Send(&buf, 1, MPI_INT, 0, tag, col_comm);
		MPI_Error_string(rc, string, &len);
		printf("P%d: rc=%d, error string=%s\n", rank, rc, string);
	}
	else if (myrow == 0)
	{
		rc = MPI_Recv(&buf, 1, MPI_INT, 2, tag, col_comm, &status);
		MPI_Error_string(rc, string, &len);
		printf("P%d: rc=%d, error string=%s, buf=%d\n", rank, rc, string, buf);
	}

	MPI_Finalize();
	return 0;
}

Running the test with 16 processes produced the following results.

P11: rc=0, error string=No MPI error
P3: rc=0, error string=No MPI error, buf=11
P8: rc=0, error string=No MPI error
P9: rc=0, error string=No MPI error
P10: rc=0, error string=No MPI error
P1: rc=0, error string=No MPI error, buf=9
P0: rc=0, error string=No MPI error, buf=8
P9: rc=0, error string=No MPI error
P1: rc=0, error string=No MPI error, buf=9
P8: rc=0, error string=No MPI error
P10: rc=0, error string=No MPI error
P11: rc=0, error string=No MPI error
*** glibc detected *** ./ft: free(): invalid next size (fast): 0x0000000000d16950 ***
*** glibc detected *** ./ftP0: rc=134847759, error string=Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x7fff522e8134, count=1, MPI_INT, src=2, tag=99, comm=0x84000004, status=0x7fff522e7d10) failed
dequeue_and_set_error(596): Communication error with rank 8, buf=0
Unexpected state MPIDI_VC_STATE_MORIBUND in vc 0x975c60 (expecting MPIDI_VC_STATE_ACTIVE)
Assertion failed in file ch3u_handle_connection.c at line 318: vc->state == MPIDI_VC_STATE_ACTIVE
internal ABORT - process 0
P3: rc=940154127, error string=Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x7fff8bfcb1c4, count=1, MPI_INT, src=2, tag=99, comm=0x84000002, status=0x7fff8bfcada0) failed
dequeue_and_set_error(596): Communication error with rank 11, buf=0
Unexpected state MPIDI_VC_STATE_MORIBUND in vc 0xe179938 (expecting MPIDI_VC_STATE_ACTIVE)
Assertion failed in file ch3u_handle_connection.c at line 318: vc->state == MPIDI_VC_STATE_ACTIVE
internal ABORT - process 3
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
P8: rc=0, error string=No MPI error
P9: rc=0, error string=No MPI error
P11: rc=0, error string=No MPI error
P0: rc=0, error string=No MPI error, buf=8
P1: rc=0, error string=No MPI error, buf=9
P3: rc=0, error string=No MPI error, buf=11
P10: rc=0, error string=No MPI error
Unexpected state MPIDI_VC_STATE_MORIBUND in vc 0x17cafd80 (expecting MPIDI_VC_STATE_ACTIVE)
Assertion failed in file ch3u_handle_connection.c at line 318: vc->state == MPIDI_VC_STATE_ACTIVE
internal ABORT - process 7
*** glibc detected *** ./ft: free(): invalid next size (fast): 0x000000000370ba60 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3779c722ef]
/lib64/libc.so.6(cfree+0x4b)[0x3779c7273b]
./ft[0x421d8c]
./ft[0x42961f]
./ft[0x4297f7]
./ft[0x4298ae]
./ft[0x408b30]
./ft[0x40217a]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3779c1d994]
./ft[0x401f29]
======= Memory map: ========
00400000-004b1000 r-xp 00000000 08:03 46172340                           /home/wr/tests/ft
006b0000-006b2000 rw-p 000b0000 08:03 46172340                           /home/wr/tests/ft
006b2000-006d9000 rw-p 006b2000 00:00 0 
03704000-03725000 rw-p 03704000 00:00 0                                  [heap]
3779800000-377981c000 r-xp
P1: rc=0, error string=No MPI error, buf=9
P3: rc=0, error string=No MPI error, buf=11
P9: rc=0, error string=No MPI error
P11: rc=0, error string=No MPI error
P8: rc=0, error string=No MPI error
P10: rc=0, error string=No MPI error
P0: rc=537500943, error string=Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x7fff3883f274, count=1, MPI_INT, src=2, tag=99, comm=0x84000004, status=0x7fff3883ee50) failed
dequeue_and_set_error(596): Communication error with rank 8, buf=0

If I replace the sleep(1) statement with MPI_Barrier(MPI_COMM_WORLD), the result is:

P8: rc=0, error string=No MPI error
P0: rc=0, error string=No MPI error, buf=8
P9: rc=0, error string=No MPI error
P10: rc=474065679, error string=Other MPI error, error stack:
MPI_Send(173)..................: MPI_Send(buf=0x7fffd6085b04, count=1, MPI_INT, dest=0, tag=99, comm=0x84000002) failed
MPIDI_EagerContigShortSend(262): failure occurred while attempting to send an eager message
MPIDI_CH3_iStartMsg(36)........: Communication error with rank 2
P11: rc=0, error string=No MPI error
P1: rc=0, error string=No MPI error, buf=9
P3: rc=0, error string=No MPI error, buf=11

I think this result is correct, since P10 is the only process that tries to communicate with the dead process P2.
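The grid arithmetic bears this claim out. A quick sketch (Python, not part of the original ticket) recomputes the test's rank-to-grid mapping for 16 ranks with npcol = 4; since the split key is myrow, each myrow == 2 sender is paired with the myrow == 0 rank of its own column, and rank 10's partner turns out to be the dead rank 2:

```python
# Recompute the test's rank -> (row, col) mapping for 16 ranks, npcol = 4,
# to confirm which sender is paired with the dead rank 2.
npcol = 4
grid = {rank: (rank // npcol, rank % npcol) for rank in range(16)}

# In the test, the myrow == 2 ranks send to the myrow == 0 rank of the
# same column (local rank 0 within col_comm, since the split key is myrow).
senders = [r for r, (row, col) in grid.items() if row == 2]
receivers = [r for r, (row, col) in grid.items() if row == 0]
pairs = {s: next(r for r in receivers if grid[r][1] == grid[s][1])
         for s in senders}

print(pairs)  # {8: 0, 9: 1, 10: 2, 11: 3}
```

So of the four send/receive pairs, only the 10 -> 2 pair involves the exited process, which matches the single error reported by P10.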

Change History (6)

comment:1 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.3.2
  • Owner set to buntinas
  • Status changed from new to assigned

Setting this to 1.3.2 for consideration, but it'll likely be pushed to 1.3.3.

comment:2 Changed 7 years ago by buntinas

  • Milestone changed from mpich2-1.3.2 to mpich2-1.3.3

comment:3 Changed 7 years ago by buntinas

I'm not seeing the same errors. One thing I noticed is that sometimes proc 2 exits before all processes have completed the comm_split, which leaves an invalid col_comm on some nodes. Check the return code of comm_split to detect that.

I added a barrier between the comm_split and the exit. That ensured that the comm_split would succeed. (Sometimes the barrier returned an error, but that's OK).
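For reference, the change described above amounts to something like the following (an illustrative sketch, not the exact patch tested; the error-handling details are my own):

```c
/* Sketch of the workaround: check comm_split's return code, then barrier
 * before any rank exits, so every process finishes building col_comm
 * before the simulated failure.  Requires an MPI implementation to build. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
	int rank, rc, npcol = 4, myrow, mycol;
	MPI_Comm col_comm;

	MPI_Init(&argc, &argv);
	MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	myrow = rank / npcol;
	mycol = rank % npcol;

	/* If a rank dies before the split completes, col_comm may be
	 * invalid on some nodes -- so check the return code. */
	rc = MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);
	if (rc != MPI_SUCCESS) {
		fprintf(stderr, "P%d: comm_split failed, rc=%d\n", rank, rc);
		MPI_Abort(MPI_COMM_WORLD, 1);
	}
	MPI_Comm_set_errhandler(col_comm, MPI_ERRORS_RETURN);

	/* Barrier between the split and the exit: guarantees the split
	 * completed everywhere.  After a failure the barrier itself may
	 * return an error; that is expected and OK. */
	MPI_Barrier(MPI_COMM_WORLD);

	if (rank == 2)		/* simulated node failure, as in the test */
		exit(0);

	/* ... rest of the test (column sends/receives) as before ... */

	MPI_Finalize();
	return 0;
}
```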

Can you try this again with the latest trunk and let us know if you still get this error?

Thanks,
-d

comment:4 Changed 7 years ago by wangraying@…

Yep, that was the problem. It works well after adding a barrier.

Best Regards,
Rui

comment:5 Changed 7 years ago by buntinas

  • Resolution set to fixed
  • Status changed from assigned to closed

Great! Keep looking for more bugs! :-)

-d

comment:6 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.3 to mpich2-1.4

Milestone mpich2-1.3.3 deleted
