Opened 4 years ago

Last modified 2 years ago

#2104 new bug

pathscale-noshared failure on mpich-master-special-tests

Reported by: jczhang
Owned by: raffenet
Priority: major
Milestone: future
Component: mpich
Keywords: pathscale, noshared
Cc: raffenet

Description

See failure report at https://jenkins.mpich.org/job/mpich-master-special-tests/compiler=pathscale,jenkins_configure=noshared,label=ubuntu64/

I can reproduce the problem on stomp with the configure options --disable-shared --enable-nemesis-dbg-localoddeven CC=pathcc CXX=pathCC FC=pathf90 F77=pathf90

I can successfully build MPICH and then use mpicc to compile a program, but the generated executable segfaults at runtime. I can reproduce the error even with this very simple program.

/* test.c */
#include <mpi.h>
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
}

Disassembling the executable in gdb, I find the segfault is caused by "callq 0x0" instructions, which are supposed to be the calls to MPI_Init and MPI_Finalize.

0x4006a2 <main+34>              mov    %edi,-0x10(%rbp)
0x4006a5 <main+37>              mov    %rsi,-0x8(%rbp)
0x4006a9 <main+41>              lea    -0x10(%rbp),%rdi
0x4006ad <main+45>              lea    -0x8(%rbp),%rsi
0x4006b1 <main+49>              callq  0x0
0x4006b6 <main+54>              callq  0x0

Looking at the symbols in the executable with nm test:

                 w MPI_Init
                 w MPI_Init_thread
                 w MPI_Initialized
0000000000400680 T main

The executable has no text-section addresses for these weak symbols; the linker resolves an undefined weak symbol to address 0, which is why the call sites become "callq 0x0". It looks like a Pathscale bug. All other compilers, such as gcc, clang, icc, and pgi, are happy with this configuration.
I will report it to support@…
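
For context, MPICH's profiling interface exposes each MPI_* function as a weak alias of the corresponding PMPI_* implementation when the compiler supports it. A minimal sketch of that pattern (simplified for illustration, not the actual MPICH source) looks like this:

/* weak.c -- sketch of the weak-symbol aliasing MPICH relies on */
extern int MPI_Init(int *argc, char ***argv);  /* public entry point, weak */
#pragma weak MPI_Init = PMPI_Init              /* alias it to the real implementation */
int PMPI_Init(int *argc, char ***argv)
{
    (void) argc; (void) argv;
    return 0;
}

If the compiler honors the pragma, nm on the object shows MPI_Init as a weak text symbol ("W") at the same address as PMPI_Init. In the failing build it stays an undefined weak symbol ("w"), the static linker resolves it to address 0, and every call site becomes the "callq 0x0" shown above.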

Attachments (1)

weak.tgz (576 bytes) - added by jczhang 4 years ago.
A test to reproduce a Pathscale bug on weak symbols


Change History (24)

comment:1 Changed 4 years ago by balaji

It's better to reproduce it with a non-MPI program before reporting the bug.

comment:2 Changed 4 years ago by jczhang

Contacted C Bergström from Pathscale. After several email exchanges, it seems pathcc reported false positives during the autoconf checks and then generated wrong code. They will provide a fix in a couple of days.

Changed 4 years ago by jczhang

A test to reproduce a Pathscale bug on weak symbols

comment:3 Changed 4 years ago by jczhang

Attached a very small test that reproduces the problem. The test is based on the MPICH configure results using ekopath-2013-11-15. Type "make; make run" and you will see the segfault. If you use gcc instead, gcc reports a compilation error. Obviously, pathscale reported false positives during the MPICH configure checks for weak symbol support.

I sent this test to Pathscale and it seems they have now fixed the bug in the latest nightly build. I rebuilt MPICH with the current Pathscale compiler, and the problem described in this ticket disappears.

BTW, to download the latest Pathscale build, we need to change the date in the URL provided on the Pathscale website. For example, use http://c591116.r16.cf2.rackcdn.com/ekopath/nightly/Linux/ekopath-2014-06-09-installer.run to get the 2014-06-09 version.

comment:4 Changed 4 years ago by jczhang

  • Cc raffenet added
  • Resolution set to fixed
  • Status changed from new to closed

Added Ken to the Cc so he can update Pathscale on the MCS machines.

comment:5 Changed 4 years ago by raffenet

The pathscale compilers are updated on MCS machines as of this morning.

comment:6 Changed 4 years ago by raffenet

I've reverted the pathscale update for the time being. Each recent snapshot that I've tried (6/4, 6/9) causes all MPICH tests to time out, either on Jenkins or on my laptop.

comment:7 Changed 4 years ago by jczhang

I tested on stomp with the nofast configuration; it only reported "2 tests failed out of 833". But I noticed ./cxx/io/testlist took a long time.

comment:8 follow-up: Changed 4 years ago by raffenet

Which nightly build did you use?

comment:9 in reply to: ↑ 8 Changed 4 years ago by jczhang

Replying to raffenet:

Which nightly build did you use?

The 2014-06-09 one.

comment:10 Changed 4 years ago by raffenet

Apparently --disable-fast is the secret to making things run fast... Try a default configuration and you'll see timeouts.
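
For reference, the two configurations being compared are roughly the following (my reconstruction from the options mentioned in this ticket; the exact Jenkins invocations may differ):

  ./configure CC=pathcc CXX=pathCC FC=pathf90 F77=pathf90                  # default configuration, tests time out
  ./configure --disable-fast CC=pathcc CXX=pathCC FC=pathf90 F77=pathf90   # the "nofast" configuration, 2/833 failures above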

comment:11 Changed 4 years ago by jczhang

  • Resolution fixed deleted
  • Status changed from closed to reopened

pathcc reports errors when compiling the following code on Linux, whereas gcc compiles it fine. On macOS, both clang and gcc report errors, since macOS does not support weak symbols, but pathcc should compile this code on Linux.

Reported to C Bergström from Pathscale. Waiting for his fix.

    extern int PFoo(int);
    #pragma weak PFoo = Foo
    int Foo(int a) { return a; }
    int main () { return PFoo(1); }

comment:12 Changed 4 years ago by raffenet

Once Toni and I finish up #2002, MPICH will support function attributes for weak symbols again. I tested the current review branch and pathcc works with this method. We can effectively work around the pragma bug at that point.
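
For reference, the function-attribute form of the same aliasing looks roughly like this (a sketch of the general GCC-style mechanism, not the exact MPICH macros):

/* attr_weak.c -- same test as the #pragma weak version in comment:11,
 * expressed with the weak/alias function attributes instead */
int Foo(int a) { return a; }
extern int PFoo(int) __attribute__((weak, alias("Foo")));
int main(void) { return PFoo(1) == 1 ? 0 : 1; }  /* exits 0 when the alias works */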

comment:13 Changed 4 years ago by balaji

#2002 is fixed, but this problem appears to still be there.

comment:14 Changed 4 years ago by raffenet

Pathscale configurations now use weak symbols instead of building a separate libpmpi, though. Using a newer Pathscale snapshot (from 7-12-2014) still causes all tests to time out.

comment:15 Changed 4 years ago by jczhang

In his email Bergström wrote: "The code around the area you hit before is a bit brittle and we probably should write a bunch of test cases for it." I think we'd better wait for Pathscale's fix for the small test I reported above before digging into it on our side.

comment:16 Changed 4 years ago by balaji

I agree. I think we have correctly identified the issue and reported it to Pathscale. There's nothing more we can do on our end.

In that case, can you disable the corresponding jenkins test and set this ticket to "future"? Once Pathscale fixes their compiler, we can revisit this ticket.

Please also add a note to the RELEASE_NOTES about this issue and point it to this ticket.

comment:17 Changed 4 years ago by jczhang

  • Milestone changed from mpich-3.1.2 to future

comment:18 Changed 4 years ago by jczhang

Since it is bad to say in a release note that Pathscale is unusable even with the default configure option (though it is true), I looked at the issue more closely. I think that with the recent MPICH updates, the weak symbol problem is fixed. This time, it is an incorrect code generation problem. I am pasting the mail I sent to Pathscale below.


From: Junchao Zhang <jczhang@…>
Date: Tue, Jul 15, 2014 at 4:17 PM
Subject: Re: Errors when compiling MPICH with EKOPath
To: "C. Bergström" <cbergstrom@…>

Hello,

I downloaded the 2014-07-14 nightly build of Pathscale and built MPICH with it. But when I used mpicc to compile a test, the test hung. I dug into it a little and it seems pathcc generates wrong code (?). Here is my experiment.

$cat test.c
typedef struct { volatile int v; } OPA_int_t;
static inline int OPA_load_int(const OPA_int_t *ptr) { return ptr->v; }

typedef struct MPID_nem_barrier
{
    OPA_int_t val;
    OPA_int_t wait;
} MPID_nem_barrier_t;

static int sense;
MPID_nem_barrier_t barrier;

int MPID_nem_barrier(void)
{
    while (OPA_load_int(&barrier.wait) == sense);
    return 0;
}

  $pathcc -c test.c
  $objdump -d test.o

test.o:     file format elf64-x86-64
Disassembly of section .text:

0000000000000000 <MPID_nem_barrier>:
   0:	83 3d 00 00 00 00 00 	cmpl   $0x0,0x0(%rip)        # 7 <MPID_nem_barrier+0x7>
   7:	74 07                	je     10 <MPID_nem_barrier+0x10>
   9:	31 c0                	xor    %eax,%eax
   b:	c3                      retq
   c:	0f 1f 40 00          	nopl   0x0(%rax)
  10:	eb fe                   jmp    10 <MPID_nem_barrier+0x10>

The jmp instruction at offset 10 is suspicious: it jumps to itself, resulting in an infinite loop whenever the while(...) condition is true on entry, because the loop body never re-reads the volatile field barrier.wait, so the condition can never change. I also did the experiment with gcc:

 $ gcc -c test.c -O3
 $ objdump -d test.o

test.o:     file format elf64-x86-64
Disassembly of section .text:

0000000000000000 <MPID_nem_barrier>:
   0:	8b 05 00 00 00 00    	mov    0x0(%rip),%eax        # 6 <MPID_nem_barrier+0x6>
   6:	85 c0                	test   %eax,%eax
   8:	74 f6                	je     0 <MPID_nem_barrier>
   a:	31 c0                	xor    %eax,%eax
   c:	c3                      retq

We can see that the instructions look good: the volatile field is re-read on every iteration of the loop. I also tried with the 2013-11-15 Pathscale build; the problem does not happen there. Could you verify the experiment?
Thank you.

--Junchao Zhang

Last edited 4 years ago by jczhang

comment:19 Changed 4 years ago by balaji

The pathscale compiler seems to work fine with the default build on jenkins. Only the --disable-shared build seems to be breaking.

comment:20 Changed 4 years ago by balaji

FYI, I'm not suggesting we say in the RELEASE_NOTES that MPICH won't work with the Pathscale compilers at all, just with that specific configure option.

comment:21 Changed 4 years ago by jczhang

For an old Pathscale version, i.e., 5.0.1, MPICH won't work with one specific configure option (--disable-shared). The symptom is that compiled programs segfault.

For the latest Pathscale, i.e., 5.0.5, most MPICH configure options fail, one of them being the default. The only exception I know of is --enable-fast=O0. The symptom is that compiled programs hang during MPI_Init.

The Jenkins results look good because Jenkins still uses the old Pathscale.

Last edited 4 years ago by jczhang

comment:22 Changed 4 years ago by balaji

Please add the following comment:

Due to bugs in the Pathscale compiler suite, some configurations of MPICH do not build correctly.
 * v5.0.1: When the --disable-shared configure option is passed, MPICH will fail to configure.
 * v5.0.5: Unless you pass the --enable-fast=O0 configure flag to MPICH, applications will hang.

Then disable the jenkins test for --disable-shared.

comment:23 Changed 2 years ago by raffenet

  • Owner changed from jczhang to raffenet
  • Status changed from reopened to new