Opened 5 years ago

Closed 3 years ago

Last modified 3 years ago

#1877 closed bug (fixed)

MPI_BOTTOM translation between C/Fortran is "wrong"

Reported by: wbland Owned by: jczhang
Priority: major Milestone: future
Component: mpich Keywords:
Cc:

Description

The implementation of translating addresses from Fortran to C works in most cases, but it has been pointed out that there are potential cases where it is incorrect. Rather than subtracting MPI_BOTTOM inside MPI_GET_ADDRESS and manipulating that address, there needs to be a check in each function where a buffer is passed in to make sure the address is valid and then subtract out MPI_BOTTOM if necessary.

This will obviously have performance implications so it's possible that we don't want to fix this as it has been working up to now.

Attachments (1)

aintf.f (2.6 KB) - added by jczhang 4 years ago.
The test fails with MPICH

Download all attachments as: .zip

Change History (33)

comment:1 Changed 5 years ago by balaji

  • Milestone set to mpich-3.1

comment:2 Changed 5 years ago by balaji

  • Owner set to wbland

comment:3 Changed 5 years ago by wbland

I got some clarification on this from Jim:

The issue is that Fortran MPI_Aint values are relative to a different, non-zero MPI_BOTTOM value. The way we handle this in MPICH Fortran bindings right now (which are C and assume that BOTTOM is 0) is probably broken.

comment:4 Changed 5 years ago by thakur

MPI_BOTTOM is something we define. Can we not define it to be 0 even in Fortran?

comment:5 Changed 5 years ago by wbland

I really don't know a lot about this, but my understanding is that it can't always be set to 0, though we actually do set it to 0 explicitly most of the time. However looking through the code, the value is initialized to:

common /MPIFCMB3/

in setbotf.f.in line 18

I couldn't find where this value pointed to, but conceivably it might not be 0. Maybe someone who knows more about this can shed some more light on this and help decide if this is actually a bug that needs fixing.

comment:6 follow-up: Changed 5 years ago by thakur

It is set to 0 at least in one place (src/binding/f77/setbot.c)
void *MPIR_F_MPI_BOTTOM = 0;
but that is under an #ifndef USE_POINTER_FOR_BOTTOM

If you do a "print *, MPI_BOTTOM" from a Fortran program, it prints 0.

Bill may know more.

comment:7 Changed 5 years ago by gropp

Fortran doesn't have address-valued variables. So "MPI_BOTTOM" is a name for a Fortran variable; it could be a pointer to a Fortran object, but a pointer is not necessarily an address. Thus, in general, we define "MPI_BOTTOM" as a memory location in Fortran, and then check for that specific location when its address is passed in. We can't (in a portable way) place MPI_BOTTOM at address zero. The non-portable path is the USE_POINTER_FOR_BOTTOM, which is why you see this.

comment:8 follow-up: Changed 5 years ago by thakur

So does anything need to be fixed as this ticket says?

comment:9 in reply to: ↑ 8 Changed 5 years ago by wbland

Replying to thakur:

So does anything need to be fixed as this ticket says?

Unless I hear from someone that says that this is a real problem that needs to be fixed, I don't think this is a blocker for 3.1. We've been doing things this way for a long time and apparently there haven't been any complaints.

comment:10 Changed 5 years ago by wbland

  • Milestone changed from mpich-3.1 to future

comment:11 in reply to: ↑ 6 Changed 4 years ago by jczhang

Replying to thakur:

It is set to 0 at least in one place (src/binding/f77/setbot.c)
void *MPIR_F_MPI_BOTTOM = 0;

MPIR_F_MPI_BOTTOM is a C pointer variable, which points to Fortran's MPI_BOTTOM. This line of code simply initializes the pointer to NULL. Later, when mpirinitc_ is called, it will be updated.

but that is under an #ifndef USE_POINTER_FOR_BOTTOM

If you do a "print *, MPI_BOTTOM" from a Fortran program, it prints 0.

MPI_BOTTOM in Fortran is an integer in common block, while MPI_BOTTOM in C is a pointer.

Bill may know more.

comment:12 Changed 4 years ago by wbland

  • Owner changed from wbland to jczhang

comment:13 Changed 4 years ago by thakur

Bill,
Will Example 17.12 on pg 651 of MPI-3 work currently in MPICH? It uses Get_address in Fortran to create a struct type, which is passed to C to create another struct type that is passed to MPI_Send using MPI_BOTTOM. Also, ln 10 of the same page says "The function MPI_GET_ADDRESS returns the same value in all languages." Is that true in MPICH if MPIR_F_MPI_BOTTOM is subtracted from the address returned in Fortran?

comment:14 Changed 4 years ago by gropp

Yes, I believe that the current code is incorrect; it probably dates from the MPI 1 era when language interoperability was weaker. However, note that the MPICH code has a USE_POINTER_FOR_BOTTOM check, though it looks like the related configure tests were lost (there used to be a separate configure for the Fortran bindings). I did a quick test, and with gfortran on my Mac, and with the Cray and Portland Group compilers on Blue Waters, the following can be used to create an MPI_BOTTOM whose value, as a pointer, is 0 (NULL):

real, pointer :: MPI_BOTTOM
nullify( MPI_BOTTOM )

So it may be possible to create a version that will provide an MPI_BOTTOM that looks like the C version. I'm not sure that this is portable, but may be workable with the appropriate configure test that would also set USE_POINTER_FOR_BOTTOM.

Last edited 4 years ago by gropp (previous) (diff)

comment:15 Changed 4 years ago by thakur

And if we don't do that, every data transfer function (pt2pt, collective, RMA, I/O) would have to do a check for MPI_BOTTOM, right?

comment:16 follow-up: Changed 4 years ago by gropp

Yes, in Fortran, they'd have to do that test against MPI_BOTTOM.

comment:17 Changed 4 years ago by jczhang

If MPI_BOTTOM is declared and initialized as Bill said, what will be passed to MPI_Send(MPI_BOTTOM, ....)? In C, do I get value of MPI_BOTTOM, i.e., NULL, or address of MPI_BOTTOM?

comment:18 follow-up: Changed 4 years ago by jczhang

My experiment shows Bill's approach works.

integer, target :: i
integer, pointer :: MPI_BOTTOM

MPI_BOTTOM => i
!nullify(MPI_BOTTOM)
call c_func(MPI_BOTTOM)

c_func is implicit, as in F77.

void c_func_(void *buf) {
    printf("buf = %p\n", buf);
}

I got 0x0 if MPI_BOTTOM is nullified, otherwise the address of i.

Last edited 4 years ago by jczhang (previous) (diff)

comment:19 in reply to: ↑ 16 Changed 4 years ago by thakur

Replying to gropp:

Yes, in Fortran, they'd have to do that test against MPI_BOTTOM.

Another option might be to use the non-null MPI_BOTTOM in Fortran as the value of MPI_BOTTOM in C as well. Then MPI_Get_address can subtract the same MPI_BOTTOM in both languages, and no change is needed in data transfer functions.

But, that would be an ABI change :-)

comment:20 in reply to: ↑ 18 Changed 4 years ago by thakur

Replying to jczhang:

My experiment shows Bill's approach works.

Can we use this solution then?

comment:21 Changed 4 years ago by jczhang

MPI-3 recommends using a non-null value for MPI_BOTTOM in C, so that it can be differentiated with NULL pointer. If we do that in MPICH, surely it will break ABI.

comment:22 Changed 4 years ago by thakur

That text has been there since MPI-2 (see pg 55 of MPI 2.0). We may have just ignored it since it is only a recommendation. (As a result, we can't have a check for null pointer, though.)

comment:23 Changed 4 years ago by jczhang

MPICH's F90 binding relies on its F77 binding. Pointer is introduced in Fortran 90/2003. I'm not sure if I can use this approach in F77 files. But I will test that. Basically, I will do the following changes

mpif.h:
   INTEGER, POINTER :: MPI_BOTTOM

setbotf.f:
   integer, pointer :: bt => NULL()

Even if it works, will we assume we always have F90+ compilers?

I hope I can use this trick in F08 binding. But considering MPI_BOTTOM is passed to an assumed-type, assumed-rank choice buffer, I am not sure if the Cray compiler can prepare a correct C descriptor for NULL. Let me check that with Cray guys.

I look at TS29113, it has this text.

"void * base_addr; If the object is an unallocated allocatable variable or a pointer that is disassociated, the value is a null pointer."

So I believe it is legal to use this trick in F08.

Last edited 4 years ago by jczhang (previous) (diff)

comment:24 Changed 4 years ago by thakur

It should work with most compilers. But can we do it based on a configure test and switch to the current, buggy method on compilers that don't support it?

comment:25 Changed 4 years ago by gropp

Yes - that's just what I said. The current code has #ifdefs for USE_POINTER_FOR_BOTTOM. In the cases where the Fortran compiler accepts this and where null/nullify gives 0 for the pointer, this is probably the way to go. A simple test in configure can check on this. For now, leaving the current buggy code in place when the Fortran compiler doesn't do what we need won't break any current user's program, though it isn't completely correct.

Changed 4 years ago by jczhang

The test fails with MPICH

comment:26 Changed 3 years ago by jczhang

I implemented the MPI_BOTTOM patch. It works as expected. One issue I want to point out is that it breaks ABI of MPICH F77/90 bindings.

Previously, in mpif.h, MPI_BOTTOM, MPI_IN_PLACE etc are integers in a common block named MPIPRIV1.

INTEGER MPI_BOTTOM MPI_IN_PLACE, MPI_UNWEIGHTED
COMMON /MPIPRIV1/ MPI_BOTTOM, MPI_IN_PLACE, MPI_STATUS_IGNORE

With the patch, it is changed to

INTEGER, POINTER :: MPI_BOTTOM
INTEGER MPI_IN_PLACE, MPI_UNWEIGHTED
COMMON /MPIPRIV1/ MPI_BOTTOM, MPI_IN_PLACE, MPI_STATUS_IGNORE

Pointers has different calling convention and storage size than integers. For instance, on a platform, integer MPI_BOTTOM has 4 bytes, while pointer MPI_BOTTOM has 8 bytes. Suppose with old mpif.h, we compiled a Fortran program which referenced MPI_IN_PLACE. The compiler used address mpipriv1_+4 for this symbol. However, with the new library, MPI_IN_PLACE's address is mpipriv1_ + 8. This mismatch will cause the runtime unable to recognize MPI_IN_PLACE from the user program.


I compiled test/mpi/f77/coll/inplacef.f with MPICH-3.2b2, then ran it with new libraries produced by this patch. It was broken. I debugged it and found the reason was exactly as explained.

I suggest we accumulate ABI broken patches to a point we think it is OK to break it.

comment:27 Changed 3 years ago by raffenet

What happens in the case where the compiler does not support null pointers? Will the test case still fail?

comment:28 Changed 3 years ago by jczhang

Yes. It will fail. Null pointers were introduced in Fortran 90 in 1991. After 20+ years, almost all Fortran compilers support that (I am not aware of a compiler that did not support it). If a compiler does not, we just let it fall back to the buggy MPICH code.

comment:29 Changed 3 years ago by raffenet

I see you refer to tedious address checking and adjusting. How much work is that? I agree that your fix is correct, but I do not see us breaking ABI for this release. Just want to get an idea if a workaround is possible in the near term. Thanks.

comment:30 Changed 3 years ago by Junchao Zhang <jczhang@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 0970b5849e2900595330ccac81956748d41753e4:

Adjust MPI_BOTTOM to zero in F77/90 bindings

Change the bindings to make it as if MPI_BOTTOM in Fortran is
at address zero as it in C. See discussion on p.652 of MPI-3.0

Fixes #1877

Signed-off-by: Ken Raffenetti <raffenet@…>

comment:31 Changed 3 years ago by Junchao Zhang <jczhang@…>

In fd32dce74ddaa62ae537dc7a88e43a02a9db11a0:

Add a test for MPI_BOTTOM and absolute datatypes in C/Fortran mixed code

Refs #1877

Signed-off-by: Ken Raffenetti <raffenet@…>

comment:32 Changed 3 years ago by Ken Raffenetti <raffenet@…>

In 65945a437838d8716da01ea12a9c58e664ee8bf8:

Code cleanup to remove USE_POINTER_FOR_BOTTOM

The configure test was dropped long time ago in MPICH, and to keep ABI compatible,
we don't want to restore it.

Refs #1877

Signed-off-by: Ken Raffenetti <raffenet@…>

Note: See TracTickets for help on using tickets.