Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#1590 closed bug (fixed)

Problem with shared libraries and Fortran common

Reported by: gropp Owned by: goodell
Priority: major Milestone: mpich2-1.5
Component: mpich Keywords:
Cc:

Description

I'm having a problem with some of the Spawn tests in Fortran 90 when running on my Mac, and after some investigation, I discovered that MPI_ARGV_NULL in Fortran 77 and in Fortran 90 were not the same location in memory. Unfortunately, the spawn code relies on the initialization of the value of this location, and only one is set (there are probably other problems with tests for things like MPI_IN_PLACE).
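To illustrate why two copies of the sentinel break the check, here is a minimal C sketch. It assumes the sentinel is recognized by comparing the caller's argument address with a single address recorded at initialization. That mechanism and every name below (register_argv_null, is_argv_null, f77_sentinel, f90_sentinel) are hypothetical and are not taken from the MPICH source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two copies of the sentinel: one as seen
 * by the first library, one as seen by the second library. */
static char *f77_argv_null_copy;
static char *f90_argv_null_copy;

/* Address recorded at initialization time; only one copy is registered. */
static const void *registered_argv_null;

void register_argv_null(const void *addr)
{
    assert(addr != NULL);
    registered_argv_null = addr;
}

/* Returns 1 when the caller passed the registered sentinel, else 0. */
int is_argv_null(const void *argv)
{
    return argv == registered_argv_null;
}

/* Accessors that hand out the address of each copy. */
const void *f77_sentinel(void) { return &f77_argv_null_copy; }
const void *f90_sentinel(void) { return &f90_argv_null_copy; }
```

After register_argv_null(f77_sentinel()), the call is_argv_null(f77_sentinel()) returns 1 while is_argv_null(f90_sentinel()) returns 0, which mirrors the symptom described above: the second copy is never recognized as the sentinel.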

My guess is that since the common block that contains this variable is present in two shared libraries, the linker on the Mac isn't resolving them to the same address but instead making them unique only within their library. Picking static linking isn't an option on Macs as critical system libraries only exist in shared form, and -static only works if everything is available in static form.

I'm using gfortran as the F77 and F90 compiler, so there are no compatibility problems between the Fortran compilers.

One fix might be to force these into a single Fortran library, rather than separate ones for Fortran and Fortran 90, particularly when the compilers are the same. There might be other fixes that make common blocks behave correctly with shared libraries (the behavior makes sense when building components for which the common is supposed to be private to the component; unfortunately, that's not what is expected in the Fortran standard).

Bill

Change History (9)

comment:1 Changed 3 years ago by goodell

  • Owner set to goodell
  • Status changed from new to accepted

Bill, does your link line contain -Wl,-commons,use_dylibs? Lines 5974-6003 of configure.ac are intended to add that flag on darwin. AFAIK that flag should cause the right thing to happen here.

It's been a little while since I did an --enable-shared build, but I can try it again today.

comment:2 Changed 3 years ago by gropp

Yes, I did a -show, and it includes -Wl,-commons,use_dylibs

comment:3 Changed 3 years ago by goodell

Never mind, this option doesn't quite do what I thought it did. It does cause common symbols in object files (applications) to be matched up correctly with defined (non-common) symbols in a dylib. The darwin linker doesn't seem to allow actual common symbols (" C " in nm output) in dylibs. So that linker option won't cause two dylibs to share a symbol as common in the same way we would have with static archives.

Combining libmpichf90, libpmpich, and libmpich would make sense to me, but as you implicitly point out there may be a problem when the f77 and f90 compilers differ. This might be a good time to discuss whether keeping distinct f77 and f90 compilers makes sense. I think the OMPI folks are moving to a single Fortran compiler.

We could try switching back to -Wl,-flat_namespace, although that might still suffer from various problems. I'll play around with it some more.

comment:4 Changed 3 years ago by goodell

So it looks like -Wl,-flat_namespace will solve the problem:

% for i in *.c ; do echo "---8<--- $i ---8<---" ; cat $i ; done
---8<--- bar.c ---8<---
int common_sym;

void *barfn(void)
{
    return &common_sym;
}
---8<--- baz.c ---8<---
int other_common_sym;
int common_sym;

void *bazfn(void)
{
    return &common_sym;
}
---8<--- foo.c ---8<---
#include <stdio.h>
int common_sym;
extern void *barfn(void);
extern void *bazfn(void);

int main(void)
{
    printf("addr of foo's common_sym=%p\n", &common_sym);
    printf("addr of bar's common_sym=%p\n", barfn());
    printf("addr of baz's common_sym=%p\n", bazfn());
    return 0;
}

% gcc -dynamiclib -o bar.dylib bar.c ; gcc -dynamiclib -o baz.dylib baz.c ; gcc foo.c -L. bar.dylib baz.dylib
% ./a.out
addr of foo's common_sym=0x102668078
addr of bar's common_sym=0x10266e000
addr of baz's common_sym=0x102675000

% gcc -Wl,-flat_namespace -dynamiclib -o bar.dylib bar.c ; gcc -Wl,-flat_namespace -dynamiclib -o baz.dylib baz.c ; gcc -Wl,-flat_namespace foo.c -L. bar.dylib baz.dylib
% ./a.out
addr of foo's common_sym=0x106fe3080
addr of bar's common_sym=0x106fe3080
addr of baz's common_sym=0x106fe3080

It's now a matter of passing that to libtool correctly and hoping that we don't run into other problems with flat namespaces (I think I hit one in the past, but don't remember what it was specifically).
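A test program could automate the comparison instead of relying on someone reading the printed addresses. A minimal sketch follows; all_same_address is a hypothetical helper and is not part of MPICH or of the test program above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every pointer in addrs[0..n-1] is identical, else 0.
 * An empty or single-element list counts as consistent. */
int all_same_address(const void *const *addrs, size_t n)
{
    assert(addrs != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (addrs[i] != addrs[0])
            return 0;   /* at least one library has its own copy */
    }
    return 1;
}
```

In foo.c, main could collect &common_sym, barfn(), and bazfn() into an array and return a nonzero exit status when all_same_address reports 0, so a regression shows up as a failed test rather than as three different hex values.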

comment:5 Changed 3 years ago by goodell

I think we run into some risks if an application mixes two-level libraries/frameworks and a set of -Wl,-flat_namespace libraries. I think there are potentially complicated issues at play for programs that include C++ because of exception handling:

http://lists.apple.com/archives/darwin-dev/2011/Feb/msg00002.html

Let me play with the "-U" ld(1) option and see if we can get two-level namespaces to work by having one defined version of all of the mpifcmb symbols that exists only in libmpich.dylib. If we can suppress the definition of the mpifcmb symbols as BSS symbols in libmpichf90 then we should be relatively safe.
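In C terms, the layout this is aiming for is one real definition plus declaration-only references everywhere else. The sketch below puts both sides in one file for brevity. The symbol name mpifcmb_sketch and the functions owner_addr, client_addr, and check_single_copy are hypothetical, and whether the darwin linker can be made to produce the same layout for Fortran common blocks is exactly the open question here.

```c
#include <assert.h>

/* Owner side (would live only in the library that owns the storage).
 * The explicit initializer makes this a real definition rather than a
 * tentative one, so compilers that emit tentative definitions as common
 * symbols will not do so here. */
int mpifcmb_sketch = 0;

const void *owner_addr(void)
{
    return &mpifcmb_sketch;
}

/* Client side (would live in every other library). The extern
 * declaration reserves no storage; it only refers to the owner's copy. */
extern int mpifcmb_sketch;

const void *client_addr(void)
{
    return &mpifcmb_sketch;
}

/* Self-check: both sides must agree on where the storage lives. */
void check_single_copy(void)
{
    assert(owner_addr() == client_addr());
}
```

With only one definition in play, both accessors return the same address no matter which namespace mode the libraries are linked with.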

comment:6 Changed 3 years ago by goodell

The "-U" option didn't do what I was hoping it would do. I was hoping it would take " C " (common) symbols from .o files and turn them into " U " (undefined) symbols when creating a dylib. Instead it seems to have no effect and they are instead defined as " B " (bss) symbols in the dylib.

comment:7 Changed 3 years ago by balaji

  • Milestone set to mpich2-1.5

comment:8 Changed 3 years ago by goodell

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:9 Changed 3 years ago by goodell

Note we'll need to twiddle the ABI version string for the upcoming release again because of this change.
