Opened 9 years ago

Closed 5 years ago

#927 closed bug (wontfix)

Spawn() fails on remote node with nemesis on windows

Reported by: jayesh Owned by: jayesh
Priority: major Milestone: future
Component: mpich Keywords:
Cc: lradev@…

Description (last modified by balaji)

It works locally but fails remotely with the nemesis channel. As it turns out, the issue is unrelated to C++ and Boost.

Consider this program, named "tm":

#include "mpi.h"

int main(int argc, char* argv[])
{
        int supported;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &supported);

        MPI_Comm parent, children;
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
                const int NHOST = 2;
                int nproc[NHOST] = {1, 1};
                char* hosts[NHOST] = {"lradev-w02", "lradev-w03"};
                char* progs[NHOST] = {"c:/pub/tm",  "c:/pub/tm"};
                MPI_Info infos[NHOST];
                for (int i=0; i < NHOST; ++i) {
                        MPI_Info_create(&infos[i]);
                        MPI_Info_set(infos[i], "host", hosts[i]);
                }
                MPI_Comm_spawn_multiple(NHOST, progs, NULL, nproc, infos, 0, MPI_COMM_WORLD, &children, NULL);
        }

        MPI_Finalize();
        return 0;
}

lradev-w02 is my localhost on which the program is being run, and lradev-w03 is the remote host.

The program runs fine with NHOST==1, i.e. only locally: it spawns a copy of itself and exits.

However, when run with NHOST==2, it freezes after spawning one local and one remote copy: locally I can observe 2 processes named "tm.exe" (plus mpiexec), and one "tm.exe" process on the remote host (plus mpiexec). These apparently eat all the CPU available to them and have to be killed to stop.

With the sock channel it works fine both locally and remotely, obviously in MPI_THREAD_SINGLE mode. It crashes with the mt and ssm channels (due to an unhandled Win32 exception).

I have your private build installed on both hosts.

Change History (20)

comment:1 Changed 9 years ago by jayesh

Hi,

Is "tm" a non-MPI program?

(PS: Looks like SMPD is not correctly dealing with spawned non-MPI programs)
Regards,
Jayesh

comment:2 Changed 9 years ago by Lubomir Radev <lradev@…>

If by "non-MPI" you mean a program started directly and not via mpiexec, then yes.

comment:3 Changed 9 years ago by balaji

  • Description modified (diff)

In the example above, tm is an MPI program. Note, btw, that the MPI standard does not allow you to spawn non-MPI programs (MPI-2.2 spec, pg. 310, lines 8-10).

comment:4 Changed 9 years ago by jayesh

I wanted to make sure that you are spawning the same program (not another version that is a non-MPI program) on all the hosts.

-Jayesh

comment:5 Changed 9 years ago by Lubomir Radev <lradev@…>

Yes, it is the same program. Just name the above code "tm", replace the hostnames and prog filepath, and you could use it to reproduce the issue in your environment.

comment:6 Changed 9 years ago by jayesh

Hi,

Can you try the program below (it contains bug fixes for the code you provided) and let us know if it works for you? The code works for me on two machines (I changed the program name and the machine names). It works with the multithreaded channel (mt) and with the nemesis channel, both for singleton-init programs (launched without mpiexec) and for programs launched with mpiexec.
Also try specifying the complete hostname (e.g. lradev-w02.domain.company.com) or the IP addresses of the machines.
Also try the release candidate for 1.2.1 (1.2.1rc1), available at http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads, which contains the fix for ticket 891.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define NHOST 2

int main(int argc, char* argv[]) {

    int supported;
    int rank, size;
    MPI_Comm parent, intercomm; 
    
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &supported);

    if(supported != MPI_THREAD_MULTIPLE){
        printf("The library does not support MPI_THREAD_MULTIPLE\n");
        exit(-1);
    }

    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL){
        int i;
        int nproc[NHOST] = {1, 1};
        char* hosts[NHOST] = {"lradev-w02", "lradev-w03"};
        char* progs[NHOST] = {"c:\\pub\\tm",  "c:\\pub\\tm"};

        MPI_Info infos[NHOST];
        
        for (i=0; i < NHOST; i++) {
            MPI_Info_create(&infos[i]);
            MPI_Info_set(infos[i], "host", hosts[i]);
        }
        MPI_Comm_spawn_multiple(NHOST,
            progs, MPI_ARGVS_NULL, nproc, infos,
            0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        for (i=0; i < NHOST; i++) {
            MPI_Info_free(&infos[i]);
        }
    }
    else{
        intercomm = parent;
    }
    MPI_Comm_rank(intercomm, &rank);
    MPI_Comm_size(intercomm, &size);
    printf("[%d/%d] Hello world\n", rank, size); fflush(stdout);

    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
} 

Let us know the results.

Regards,
Jayesh

comment:7 Changed 9 years ago by Lubomir Radev <lradev@…>

Unfortunately, the issue still persists for me.

I installed 1.2.1rc1 on both machines and then tried your corrected program, with exactly the same outcome as before: with nemesis it hangs on both the local and remote hosts, eating all the CPU it can; I can see 2 tm.exe processes on the local host and 1 on the remote host. I also tried FQDN hostnames and IP addresses, to no avail. I don't think there is a resolver problem, though, since it is able to spawn the program both locally and remotely; it just hangs somewhere *after* establishing the communication channel and spawning itself. Again, it works for me *only* with the sock channel.

If this helps, I'm running WinXP Pro SP3 on both machines. There's no firewall on either one.

comment:8 Changed 9 years ago by Lubomir Radev <lradev@…>

Correction - it also works with the mt channel - but not with nemesis.

comment:9 Changed 9 years ago by jayesh

Hi,

How did the multithreaded sock channel (mt) start working? (As per your previous email it did not work for you before.)

-Jayesh

comment:10 Changed 9 years ago by jayesh

Hi,

Also make sure that you have registered your username/password with mpiexec before running your program.
Try running the program with mpiexec (mpiexec -n 1 -channel nemesis -machinefile mf.txt remote_spawn.exe) to make sure that MPICH2 can launch jobs on the remote host.
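For reference, a minimal machinefile for this test could look like the following. This is a sketch using the hostnames from this ticket; the file name mf.txt and the hostnames are placeholders to substitute for your own environment:

    rem mf.txt -- one host per line
    lradev-w02
    lradev-w03

    rem launch through smpd with the nemesis channel
    mpiexec -n 1 -channel nemesis -machinefile mf.txt remote_spawn.exe

If this command launches and completes cleanly, smpd and the registered credentials are working, and any remaining hang would be specific to the spawn path.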

Regards,
Jayesh

comment:11 Changed 9 years ago by Lubomir Radev <lradev@…>

I guess the mt channel started working as a result of some patch you made in rc1 after you sent me the private build, because with that build it didn't work and with rc1 it does.

I have my credentials registered, and I can launch remote programs with mpiexec on nemesis. However, that's not what I'm interested in: the code that will be launching my workers won't be a stand-alone program but part of a library that client apps (including GUIs) will link against, so mpiexec isn't an option for me.

comment:12 Changed 9 years ago by jayesh

I am assuming that the machines are not heterogeneous (MPICH2 does not currently support heterogeneous systems; e.g., you cannot run a job across 32-bit and 64-bit machines). Let us know the OS on each machine (type "winver" at the command prompt).

Run smpd in debug mode on both the machines and provide us with the outputs. To run smpd in debug mode follow the steps below,

# Stop any instances of smpd running on the machines by using the "smpd -stop" command.

# Run smpd in debug mode using the "smpd -d > smpd_MACHINENAME.log" command.

# Run remote_spawn.exe on one of the machines.

# Without killing (Ctrl-C) the program (if it hangs), copy the smpd debug log to another file (copy smpd_MACHINENAME.log smpd_MACHINENAME_tosend.log)

# Now kill the program (Ctrl-C) - I am assuming that the program is hanging.

# Provide us with the smpd debug logs from both the machines
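The steps above amount to the following command sequence on each machine (MACHINENAME is a placeholder for that machine's actual hostname):

    smpd -stop
    smpd -d > smpd_MACHINENAME.log
    rem ... run remote_spawn.exe on one machine and wait for the hang ...
    copy smpd_MACHINENAME.log smpd_MACHINENAME_tosend.log
    rem ... then Ctrl-C the hung program and send both logs ...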

Regards,
Jayesh

comment:13 Changed 9 years ago by Lubomir Radev <lradev@…>

No, both machines run 32-bit XP:

Version 5.1 (Build 2600.xpsp_sp3_gdr.090804-1435 : Service Pack 3)

I ran your corrected code above. It hung with nemesis. I sent you the logs via email.

comment:14 Changed 9 years ago by jayesh

Hmmm... Now I know why your program crashed with my custom build: since the version numbers had changed in the trunk, you had to re-install (not just restart) smpd. However, I am unable to reproduce the hang you mentioned with 1.2.1rc1 + nemesis.

I want to make sure that you have a valid MPICH2 install. Please follow the steps below and let us know the results.

# Uninstall MPICH2 on both the machines

# Make sure that you don't have any stale libs/binaries in your systems.

  • On both the machines delete the MPICH2 installation directory
  • On both the machines delete the MPICH2 dlls if they are around ("del c:\windows\system32\mpich2mpi.dll" "del c:\windows\system32\mpich2mt.dll" "del c:\windows\system32\mpich2nemesis.dll")

# Install MPICH2 1.2.1rc1 on both the machines

# Re-compile your code (remote_spawn) on both the machines

# Re-run your code (remote_spawn).

(PS: Even with a version mismatch your program should not have crashed. That was due to a bug in the code, which I have a patch for; it will be added to the trunk soon. Thank you for reporting the bug.)

Regards,
Jayesh

comment:15 Changed 9 years ago by Lubomir Radev <lradev@…>

I already did all that when installing rc1 - your msi package won't install over an older installation; it requires you to uninstall first and then do a clean install, which is exactly what I did. I double-checked the DLLs in the system directory, and they are all rc1, not leftovers from the private build. I even rebooted both machines.

I see you've released 1.2.1 stable today, but unless you confirm there were relevant patches since rc1, I don't see the point of repeating the whole exercise.

comment:16 Changed 8 years ago by jayesh

  • Milestone changed from mpich2-1.3 to mpich2-1.3.1

Verified that remote spawn works with nemesis (1.3rc1).
However, there is still a bug where remote spawn fails with nemesis on Windows for nodes with multiple network interfaces (nemesis picks the wrong interface address).

Regards,
Jayesh

comment:17 Changed 7 years ago by jayesh

  • Milestone changed from mpich2-1.3.2 to mpich2-1.3.3

comment:18 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.3 to mpich2-1.4

Milestone mpich2-1.3.3 deleted

comment:19 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.4 to future

comment:20 Changed 5 years ago by balaji

  • Description modified (diff)
  • Resolution set to wontfix
  • Status changed from new to closed