Opened 8 years ago

Closed 5 years ago

#1025 closed bug (wontfix)

closesocket failed, sock 536, error 10093

Reported by: tony.garratt@…
Owned by: jayesh
Priority: major
Milestone: future
Component: mpich
Keywords:
Cc:

Description (last modified by balaji)

We are getting sporadic socket error messages at the very end of our MPI run. We are running on win32 and win64 using MPICH2. Is this a known bug or something we are doing incorrectly?

We are using mpiexec -localonly -n <2,3 or 4>

Change History (30)

comment:1 Changed 8 years ago by jayesh

  • Owner set to jayesh
  • Status changed from new to assigned

Hi,

Which version of MPICH2 are you using? The latest version of MPICH2 is available at http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads .

(PS: There was a bug that existed in previous releases related to sockets+localonly option. That bug has already been fixed.)

Regards,
Jayesh

comment:2 Changed 8 years ago by buntinas

Also, double check that all processes in your application call MPI_Finalize before exiting. Exiting without calling it could cause the error messages you're seeing.

-Darius
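
(For reference, a minimal self-contained sketch of the shutdown sequence being described here; the program body is a placeholder, not the reporter's code:)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* ... application work ... */
    printf("rank %d done\n", rank);
    MPI_Finalize();   /* every process must reach this before it exits */
    return 0;
}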

comment:3 Changed 8 years ago by tony.garratt@…

Thank you for getting back in touch. We are using MPICH2 1.0.8. Do you think the bug existed in that version and will be fixed in the most recent 1.1?

Thank You!
Tony

comment:4 Changed 8 years ago by jayesh

Hi,

The bug was in SMPD (it showed up on valid/correct MPI programs) and should be fixed in the latest stable release (1.2.1p1).
Let us know if it works for you.

Regards,
Jayesh

comment:5 Changed 8 years ago by tony.garratt@…

Thank you so much - this appears to work for us. The error is somewhat sporadic, so we will continue testing for a few days, but at the moment things look good!

Thank you,
tony

comment:6 Changed 8 years ago by tony.garratt@…

Agh! I think I spoke too soon! On win64 we get the following error (see below).

Error posting writev, A request to send or receive data was disallowed because the socket had already been shut down in that direction with a previous shutdown call.(10058)
unable to post a write for the next command,
sock error: Error = 10058

unable to post a write of the closed_request command to the parent context.
unable to close the stdin context.
state machine failed.

comment:7 Changed 8 years ago by jayesh

Do you have a sample test program that reproduces the error?
Meanwhile, why are you using the "-localonly" option? (The MPI processes are launched on the local host by default if you don't specify the "-host" or "-machinefile" option.)

Regards,
Jayesh
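
(For illustration, with a placeholder executable name, the two launch styles being compared here look like this; as the reporter notes later in the ticket, -localonly is used to skip the authentication step:)

mpiexec -n 4 myapp.exe
mpiexec -localonly -n 4 myapp.exe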

comment:8 Changed 8 years ago by tony.garratt@…

We had a lot of problems generally with MPICH2 and we were running on VMs, so we decided to try -localonly to see if it would make any difference.

As it turns out, we made an error in our upgrade to the newer MPICH2, and since we have fixed that, we are not getting any socket errors any more. Sorry for any confusion. We are going to continue to monitor our test suite for a few days, but at the moment things look good with the latest MPICH2.

Thanks so much for your help.

Tony

comment:9 Changed 8 years ago by jayesh

Hi,

Good to know MPICH2 is working for you now. Let us know if you have any problems.

(PS: What kind of problems were you facing with MPICH2 + VM without the -localonly option?)
-Jayesh

comment:10 Changed 8 years ago by tony.garratt@…

We were basically getting socket error messages at the end of the run. The run actually completed, so it seems the socket errors (failure to close messages, etc.) were happening during shutdown of the processes. And it was sporadic too - they didn't always happen, and the type of socket error was not always the same.

Tony

comment:11 Changed 8 years ago by thakur

  • Resolution set to worksforme
  • Status changed from assigned to closed

comment:12 Changed 8 years ago by tony.garratt@…

  • Resolution worksforme deleted
  • Status changed from closed to reopened

Unfortunately, further testing has revealed that we are still getting socket error messages (see below).

What is happening in our code is that the Fortran main program calls a C subroutine at the end of the run. The C routine writes some messages to standard output and then calls exit; it does not return to the Fortran main program. Immediately after the message written by the C routine we see the socket error message (though it could be going to standard error rather than standard output). The socket error is sporadic - it does not happen every time - and we were wondering if there is something we are doing that might explain why we sometimes get the error.

Thank you
Tony

Error posting writev, A request to send or receive data was disallowed because the socket had already been shut down in that direction with a previous shutdown call.(10058)
unable to post a write for the next command,
sock error: Error = 10058

unable to post a write of the closed_request command to the parent context.
unable to close the stdin context.
state machine failed.
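
(A minimal sketch of the C-side shutdown routine described above, with placeholder names; the point raised in the next comment is that MPI_Finalize should be called before exit():)

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Placeholder for the C routine the Fortran main program calls at the end
 * of the run; it writes its messages and exits without returning. */
void finish_run(void)
{
    printf("run complete\n");
    fflush(stdout);     /* flush output before tearing the process down */
    MPI_Finalize();     /* finalize MPI before calling exit()           */
    exit(0);            /* never returns to the Fortran caller          */
}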

comment:13 Changed 8 years ago by thakur

Is MPI_Finalize being called at all in this case?

comment:14 Changed 8 years ago by tony.garratt@…

We are, but it's a good question to ask - I will ask our programmer to make absolutely sure that is the case.

comment:15 Changed 8 years ago by jayesh

Do you have a sample test case that reproduces the error? It would be really helpful to be able to recreate the problem in our lab.

Regards,
Jayesh

comment:16 Changed 8 years ago by tony.garratt@…

It would be very difficult to give you a sample test case - we have only seen it on large simulations, and IPR issues prevent us from sharing source code. We are going to try to do a bit more debugging to see if we can isolate it, and I will let you know how we get on.

comment:17 Changed 8 years ago by jayesh

Any further info would be helpful. It might be useful to test on multicore machines with a sample program of similar characteristics run in a loop from a batch script, as sketched below.

-Jayesh
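
(A sketch of the kind of batch-script loop meant here, with a placeholder program name and process count:)

@echo off
rem Launch the test program repeatedly to catch the sporadic shutdown error.
for /L %%i in (1,1,100) do (
    echo run %%i
    mpiexec -localonly -n 2 mytest.exe
)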

comment:18 Changed 8 years ago by tony.garratt@…

Hi!

Well, we have been battling with this one for a few weeks now. I can reproduce it with our code on a dual-core win64 XP machine using the latest MPICH2 release, but it is sporadic - it doesn't always happen. Our code does file I/O from both C and Fortran, but we are pretty sure all files are closed before the finalize is called in each process. It reproduces on my dual-core machine using mpiexec -n 2.

I would love to be able to send you code to reproduce it, but IPR/NDA issues do not allow this, unless we went through a process of an NDA.

Anyhow, we are mystified by this one and are just going to ignore the error message for now (it appears at the end of the run and everything completes OK - we just get this darn error message from time to time - I would estimate about 5% of the time).

I just wanted to give you an update.

comment:19 Changed 8 years ago by jayesh

Hi,

Thanks. The bug is in the close protocol of the process manager and hence should not affect your MPI program execution.
We will try to recreate the bug here in the lab next week and let you know the results.

Regards,
Jayesh

comment:20 Changed 8 years ago by tony.garratt@…

Thank you so much. FYI, we are on win64 and using -localonly; I am going to remove -localonly and see if the problem goes away.

comment:21 Changed 8 years ago by jayesh

  • Milestone set to future

comment:22 Changed 8 years ago by tony.garratt@…

Hi!

Does this mean this is being regarded as a bug to be fixed in a future version?

Thank you!
Tony

comment:23 Changed 8 years ago by jayesh

Yes. Since the bug is not easily reproducible we might not be able to fix it for the upcoming 1.3 release. But this is indeed a bug that we need to take a look at, and it is on my radar.

Regards,
Jayesh

comment:24 Changed 7 years ago by ben.held@…

Does anyone have a timeline on a fix for this issue? We are seeing it 100% of the time (when we use -localonly). This is a critical issue for us and may cause us to look at alternative implementations of MPI for Windows.

comment:25 Changed 7 years ago by thakur

Does it happen even with the latest release, 1.4.1p1?

comment:26 Changed 7 years ago by ben.held@…

Yes, it does.

comment:27 Changed 7 years ago by ben.held@…

What we notice is that our program prints out its last line of output and then hangs for about 5-10 seconds before the Windows crash dialogs show up. If we instead run with -np, it exits immediately without crashing.

We really need to use -localonly to skip the authentication step when used from our UI. Any suggestions?

comment:28 Changed 7 years ago by jayesh

The bug (as mentioned above) is in the close protocol of the SMPD state machine. We haven't been able to recreate it for some time now.
We would like to fix it, but unfortunately we don't have enough developer cycles at the moment.

Regards,
Jayesh

comment:29 Changed 7 years ago by buntinas

You can try using Microsoft's MPI. They completely rewrote the SMPD code.

comment:30 Changed 5 years ago by balaji

  • Description modified (diff)
  • Resolution set to wontfix
  • Status changed from reopened to closed