Opened 4 years ago

Last modified 3 years ago

#2146 new bug

Don't destroy requests on MPI_IRECV error with MPI_ANY_SOURCE

Reported by: wbland
Owned by: wbland
Priority: major
Milestone: future
Component: ulfm
Keywords:
Cc:

Description

Right now, if an error/failure occurs before or during an MPI_IRECV with MPI_ANY_SOURCE as the source, the request is marked as failed before returning to the application. According to the spec, the request should not be marked as failed. Instead, the corresponding MPI_WAIT should return MPI_ERR_PROC_FAILED_PENDING until the user calls MPI_COMM_FAILURE_ACK.
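To make the intended user-visible behavior concrete, here is a minimal toy model in plain C. It is only a sketch: every name in it (sketch_request, sketch_comm, sketch_wait, sketch_failure_ack, and the SKETCH_* codes) is hypothetical and stands in for the MPI calls named above. It is not MPICH code, and it assumes that once failures are acknowledged a surviving sender's message eventually arrives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the return codes named in the ticket. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PROC_FAILED_PENDING = 1 };

/* Toy model of a nonblocking any-source receive; not an MPICH type. */
typedef struct {
    bool active;   /* the request handle is still valid */
    bool matched;  /* a message from a live sender has been matched */
} sketch_request;

/* Toy model of a communicator that tracks unacknowledged failures. */
typedef struct {
    bool unacked_failure;
} sketch_comm;

/* Models MPI_COMM_FAILURE_ACK: acknowledge all failures seen so far. */
void sketch_failure_ack(sketch_comm *comm)
{
    comm->unacked_failure = false;
}

/* Models MPI_WAIT on the any-source request. */
int sketch_wait(sketch_comm *comm, sketch_request *req)
{
    assert(req->active);
    if (!req->matched && comm->unacked_failure) {
        /* Hand control back to the user, but leave the request intact. */
        return SKETCH_ERR_PROC_FAILED_PENDING;
    }
    /* Either a message already matched, or (assumption of this toy model)
       blocking for a surviving sender's message always succeeds. */
    req->matched = true;
    req->active = false;
    return SKETCH_SUCCESS;
}
```

In this model, repeated waits keep returning the pending error and the request stays active. Only after the acknowledgment does a wait complete the receive and retire the request.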

Change History (2)

comment:1 Changed 4 years ago by wbland

This is going to be trickier than it originally seemed. It has two parts that are going to be nasty:

  1. The MPIDI_CH3U_Recvq_FDU_or_AEP function will have to be able to tell the difference between an MPI_IRECV and an MPI_RECV. If it's called by an MPI_IRECV and there is a process failure, it must not set the request as completed. Instead, it has to flag the request somehow so that MPI_WAIT/TEST/etc. return without destroying the request (see the next bullet).
  2. When the existing MPIDI_CH3U_Complete_disabled_anysources function cleans up MPI_ANY_SOURCE operations, it also has to do so in a way that returns control to the user without destroying the request. This is new and a little weird. In the past, a request was either being worked on by the progress engine (not completed) or given back to the user (completed/failed/etc.). Now we need another way to give a request back to the user, one that says the request is neither completed nor failed, but "paused", in the same way we mark all requests as MPI_ERR_PENDING when some other request in a list has an error during an MPI_WAITALL.

This is probably going to require some rather intrusive changes at the CH3 level (and maybe even higher), so I'm happy to get feedback here on suggested plans of attack. One option would be to have a flag in the request object that says not to release refcounts on the request while it is in the limbo state. We'd still have to figure out how to decrement and then re-increment the completion counter, though, in order to get the request to bubble up in the progress engine.
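A rough sketch of the flag option, in self-contained C. The struct layout, field names, and function names here (limbo_request, limbo_park, limbo_complete, limbo_wait) are all hypothetical and do not correspond to actual CH3 code. The model assumes a request starts with two references (one for the user, one for the library) and a completion counter of 1.

```c
#include <assert.h>
#include <stdbool.h>

enum { LIMBO_WAIT_DONE = 0, LIMBO_WAIT_PENDING = 1 };

/* Hypothetical request layout; field names are illustrative only. */
typedef struct {
    int  ref_count;         /* references held by the user and the library */
    int  completion_count;  /* 0 means "hand this request to the waiter" */
    bool limbo;             /* parked because of an unacknowledged failure */
    bool freed;             /* set once the last reference is released */
} limbo_request;

void limbo_request_init(limbo_request *r)
{
    r->ref_count = 2;
    r->completion_count = 1;
    r->limbo = false;
    r->freed = false;
}

/* Progress engine: normal completion drops the library's reference. */
void limbo_complete(limbo_request *r)
{
    r->completion_count -= 1;
    r->ref_count -= 1;
}

/* Progress engine: a failure was seen while the any-source receive was
   posted. Decrement the counter so the request bubbles up, but release
   no reference. */
void limbo_park(limbo_request *r)
{
    r->limbo = true;
    r->completion_count -= 1;
}

/* Wait path, entered once the completion counter has reached zero. */
int limbo_wait(limbo_request *r)
{
    assert(r->completion_count == 0);
    if (r->limbo) {
        /* Re-arm the request so the progress engine keeps working on it. */
        r->limbo = false;
        r->completion_count += 1;
        return LIMBO_WAIT_PENDING;
    }
    /* Genuine completion: drop the user's reference and free if last. */
    r->ref_count -= 1;
    if (r->ref_count == 0) {
        r->freed = true;
    }
    return LIMBO_WAIT_DONE;
}
```

The point of the sketch is that the parked path leaves ref_count untouched and restores the completion counter, so the same request can later complete normally. Whether the re-increment belongs in the wait path or elsewhere is an open design question that this model does not settle.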

comment:2 Changed 3 years ago by balaji

  • Milestone changed from mpich-3.2 to future