Opened 5 years ago

Last modified 5 years ago

#1799 assigned bug

Hydra: hostname propagation for localhost

Reported by: balaji
Owned by: balaji
Priority: major
Milestone: future
Component: mpich
Keywords:
Cc: bradc@…

Description

Note from Brad --

Hi Pavan and Rajeev --

This is a low-priority issue, but one of my students ran into it, so I said that I'd check with you guys to see whether it was a bug or not (apologies for not using the Trac system to check -- it seemed I had to make an account even to read the existing bugs, and that was a bigger barrier than I was up for).

The issue seems to happen when running between two machines, say they're named mach01 and mach02.  If we launch from mach01 using:

    mpirun -np 2 -host mach01,mach02

then things work as you expect.  If, instead, we use:

    mpirun -np 2 -host mach02,localhost

then we get a fatal error:

Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....:
MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....:
MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 1

My armchair diagnosis would be that using 'localhost' causes a different launch mechanism to be used than naming a hostname explicitly, and that the two mechanisms are somehow not compatible.
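
Another possible reading, suggested by the ticket title but not confirmed anywhere in this ticket, is that the literal name 'localhost' gets propagated to the remote process, where it resolves to that machine's own loopback address rather than to the launching host. Under that assumption, a launch wrapper could flag host lists that mix loopback names with real hostnames before calling mpirun. The sketch below is hypothetical: `is_loopback`, `check_host_list`, and the `LOOPBACK_NAMES` set are illustrative names and are not part of Hydra or MPICH.

```python
import ipaddress
from typing import List

# Names conventionally bound to the loopback interface (an assumed list,
# not taken from Hydra).
LOOPBACK_NAMES = {"localhost", "localhost.localdomain",
                  "ip6-localhost", "ip6-loopback"}


def is_loopback(host: str) -> bool:
    """Return True if host is a known loopback name or a loopback IP literal."""
    name = host.strip().lower()
    if name in LOOPBACK_NAMES:
        return True
    try:
        return ipaddress.ip_address(name).is_loopback
    except ValueError:
        # Not an IP literal, so treat it as an ordinary hostname.
        return False


def check_host_list(host_arg: str) -> List[str]:
    """Return the loopback entries of a comma-separated -host value,
    but only when the list also names at least one non-loopback host."""
    hosts = [h.strip() for h in host_arg.split(",") if h.strip()]
    loopback = [h for h in hosts if is_loopback(h)]
    if loopback and len(loopback) < len(hosts):
        return loopback
    return []
```

With the two host lists from the report, `check_host_list("mach01,mach02")` returns an empty list, while `check_host_list("mach02,localhost")` returns `["localhost"]`, which a wrapper could replace with the launching machine's real hostname.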



Again, this is not at all holding us up now that we have diagnosed it.  I just wanted to pass it along in case it was still an issue and to get your take on it.

Thanks,
-Brad

Change History (3)

comment:1 Changed 5 years ago by balaji

  • Milestone set to mpich-3.0.3
  • Owner set to balaji
  • Status changed from new to assigned

comment:2 Changed 5 years ago by balaji

  • Cc bradc@… added

comment:3 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.1 to future