Opened 7 years ago

Last modified 22 months ago

#1419 new bug

hangs and error messages in r7750 point-to-point communication

Reported by: wangraying@…
Owned by: wbland
Priority: major
Milestone: mpich-3.3
Component: mpich
Keywords:
Cc:

Description (last modified by balaji)

The test program:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <mpi.h>

#define I_RECV 0
#define I_SEND 1
void rollN(
    int lengthR,
    int lengthS,
    int partner,
    MPI_Comm comm)
{
    int N = 465, LDU = 232, rank;
    MPI_Datatype type[2];
    MPI_Request request;
    MPI_Status status;
    int Cmsgid = 100, ierr = MPI_SUCCESS;
    double buf[1000000];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (lengthR > 0)
    {
        if (ierr == MPI_SUCCESS)
            ierr = MPI_Type_vector(N, lengthR, LDU, MPI_DOUBLE,
                                   &type[I_RECV]);
        if (ierr == MPI_SUCCESS)
            ierr = MPI_Type_commit(&type[I_RECV]);
        if (ierr == MPI_SUCCESS)
        {
            ierr = MPI_Irecv(buf, 1, type[I_RECV],
                             partner, Cmsgid, comm, &request);
            printf("P%d: recv from %d, ierr = %d, recv_size=%d, send_size=%d\n",
                   rank, partner, ierr, lengthR, lengthS);
        }
    }
    if (rank == 9)
        exit(0);

    if (lengthS > 0)
    {
        if (ierr == MPI_SUCCESS)
            ierr = MPI_Type_vector(N, lengthS, LDU, MPI_DOUBLE,
                                   &type[I_SEND]);
        if (ierr == MPI_SUCCESS)
            ierr = MPI_Type_commit(&type[I_SEND]);

        if (ierr == MPI_SUCCESS)
        {
            ierr = MPI_Send(buf, 1, type[I_SEND],
                            partner, Cmsgid, comm);
            printf("P%d: rollN, send to %d, ierr =%d\n",
                   rank, partner, ierr);
        }
        if (ierr == MPI_SUCCESS)
            ierr = MPI_Type_free(&type[I_SEND]);
    }
    if (lengthR > 0)
    {
        if (ierr == MPI_SUCCESS)
        {
            MPI_Wait(&request, &status);
            printf("P%d: rollN,recv wait ok! ierr = %d\n",
                   rank, ierr);
        }
        if (ierr == MPI_SUCCESS)
            ierr = MPI_Type_free(&type[I_RECV]);
    }

}
int main(int argc, char **argv)
{
    int rank, col_rank, size, rc = MPI_SUCCESS, ret = 0, i = 0;
    int len;
    char string[MPI_MAX_ERROR_STRING];
    double t1, t2;
    int myrow, mycol, npcol = 4;
    MPI_Comm col_comm;

    MPI_Status status;
    MPI_Request req;
    MPI_Datatype type;

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    myrow = rank / npcol;
    mycol = rank - myrow * npcol;
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

    if (myrow == 0)
    {
        rollN(60, 70, 2, col_comm);

    }
    else if (myrow == 2)
    {
        rollN(70, 60, 0, col_comm);
    }

    MPI_Finalize();
    return 1;
}

I ran some tests. Sometimes, but not every time, the program hangs and prints error messages such as the following:

...
Assertion failed in file /home/wr/backup/mpich2-trunk-[7750]/src/util/wrappers/mpiu_shm_wrappers.h at line 437: *hnd_ptr
internal ABORT - process 6
...
Unexpected state MPIDI_VC_STATE_MORIBUND in vc 0x27a4f88 (expecting MPIDI_VC_STATE_ACTIVE)
Assertion failed in file ch3u_handle_connection.c at line 318: vc->state == MPIDI_VC_STATE_ACTIVE
internal ABORT - process 3
...
Assertion failed in file /home/wr/backup/mpich2-trunk-[7750]/src/util/wrappers/mpiu_shm_wrappers.h at line 309: MPIU_SHMW_Hnd_is_init(hnd)
internal ABORT - process 1
...


Change History (9)

comment:1 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.3.2
  • Owner set to buntinas
  • Status changed from new to assigned

Setting this to 1.3.2 for consideration, but it'll likely be pushed to 1.3.3.

comment:2 Changed 7 years ago by buntinas

  • Milestone changed from mpich2-1.3.2 to mpich2-1.3.3

comment:3 follow-up: Changed 7 years ago by buntinas

How many processes did you use?
-d

comment:4 in reply to: ↑ 3 Changed 7 years ago by wangraying@…

I started 16 processes, and not all of them had error messages.

comment:5 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.3 to mpich2-1.4

Milestone mpich2-1.3.3 deleted

comment:6 Changed 5 years ago by balaji

  • Milestone changed from mpich2-1.5 to mpich-3.0

FT is not a priority for the 1.5 release. Moving this to 3.0.

comment:7 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:8 Changed 4 years ago by balaji

  • Description modified (diff)
  • Milestone changed from mpich-3.1 to mpich-3.2
  • Owner changed from buntinas to wbland
  • Status changed from assigned to new

comment:9 Changed 22 months ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted
