Opened 7 months ago

Closed 7 months ago

#2089 closed bug (fixed)

ROMIO and darray types

Reported by: robl
Owned by: Rob Latham <robl@…>
Priority: major
Milestone: mpich-3.1.1
Component: mpich
Keywords:
Cc:

Description

From the OpenMPI list:

http://www.open-mpi.org/community/lists/users/2014/05/24286.php

In the test case I first initialise an array of 25 doubles (which will be a 5x5 grid), then I create a datatype representing a 5x5 matrix distributed in 3x3 blocks over a 2x2 process grid. As a reference I use MPI_Pack to pull out the block-cyclic array elements local to each process (which I think is correct). Then I write the original array of 25 doubles to disk, use MPI-IO to read it back in (performing the Open, Set_view, and Read_all), and compare the result to the reference.

Running this with OMPI, the two match on all ranks.

mpirun -mca io ompio -np 4 ./darr_read.x

=== Rank 0 === (9 elements)

Packed: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
Read: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0

=== Rank 1 === (6 elements)

Packed: 15.0 16.0 17.0 20.0 21.0 22.0
Read: 15.0 16.0 17.0 20.0 21.0 22.0

=== Rank 2 === (6 elements)

Packed: 3.0 4.0 8.0 9.0 13.0 14.0
Read: 3.0 4.0 8.0 9.0 13.0 14.0

=== Rank 3 === (4 elements)

Packed: 18.0 19.0 23.0 24.0
Read: 18.0 19.0 23.0 24.0
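For reference, the Packed lists above can be reproduced without MPI. The following is a minimal sketch, not the attached darr_read.c. The helper names `rank_to_coords` and `local_elements` are hypothetical, and the sketch assumes the darray uses Fortran (column-major) ordering, which is consistent with the output shown but is not stated in the report. Ranks are mapped onto the process grid in row-major order, as MPI_Type_create_darray does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a rank to its coordinates in an np0 x np1
 * process grid, using row-major rank ordering. */
void rank_to_coords(int rank, int np1, int *p0, int *p1)
{
    assert(np1 > 0 && p0 != NULL && p1 != NULL);
    *p0 = rank / np1;
    *p1 = rank % np1;
}

/* Hypothetical helper: collect the linear indices of the elements of an
 * n0 x n1 array owned by process (p0, p1) under a block-cyclic
 * distribution with b0 x b1 blocks over an np0 x np1 process grid.
 * Assumes column-major storage: dimension 0 varies fastest, so element
 * (i0, i1) lives at linear index i0 + n0 * i1.
 * Returns the number of indices written to out. */
size_t local_elements(int n0, int n1, int b0, int b1,
                      int np0, int np1, int p0, int p1, int *out)
{
    size_t count = 0;
    assert(b0 > 0 && b1 > 0 && np0 > 0 && np1 > 0 && out != NULL);
    for (int i1 = 0; i1 < n1; i1++) {
        if ((i1 / b1) % np1 != p1)
            continue;               /* column block belongs to another process */
        for (int i0 = 0; i0 < n0; i0++) {
            if ((i0 / b0) % np0 != p0)
                continue;           /* row block belongs to another process */
            out[count++] = i0 + n0 * i1;
        }
    }
    return count;
}
```

Calling `local_elements(5, 5, 3, 3, 2, 2, p0, p1, out)` with the coordinates of ranks 0 through 3 yields 9, 6, 6 and 4 indices respectively, in the same order as the Packed lines.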

However, using ROMIO the two differ on two of the ranks:

mpirun -mca io romio -np 4 ./darr_read.x

=== Rank 0 === (9 elements)

Packed: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
Read: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0

=== Rank 1 === (6 elements)

Packed: 15.0 16.0 17.0 20.0 21.0 22.0
Read: 0.0 1.0 2.0 0.0 1.0 2.0

=== Rank 2 === (6 elements)

Packed: 3.0 4.0 8.0 9.0 13.0 14.0
Read: 3.0 4.0 8.0 9.0 13.0 14.0

=== Rank 3 === (4 elements)

Packed: 18.0 19.0 23.0 24.0
Read: 0.0 1.0 0.0 1.0

My interpretation is that the behaviour with OMPIO is correct. Interestingly everything matches up using both ROMIO and OMPIO if I set the block shape to 2x2.

This was run on OS X using 1.8.2a1r31632. I have also run this on Linux with OpenMPI 1.7.4, and OMPIO is still correct, but using ROMIO I just get segfaults.

Attachments (2)

darr_read.c (2.2 KB) - added by robl 7 months ago.
test case for darray issue
darr_read.2.c (2.6 KB) - added by robl 7 months ago.
updated test case with MPI error checking and cleaning up of resources


Change History (14)

Changed 7 months ago by robl

test case for darray issue

comment:1 Changed 7 months ago by robl

MPICH from master gives yet another answer, though once again it is ranks 1 and 3 that show errors:

=== Rank 0 === (9 elements) 
Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0 
Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0 

=== Rank 1 === (6 elements) 
Packed: 15.0 16.0 17.0 20.0 21.0 22.0 
Read:   15.0 16.0 17.0  0.0  0.0  0.0 

=== Rank 2 === (6 elements) 
Packed:  3.0  4.0  8.0  9.0 13.0 14.0 
Read:    3.0  4.0  8.0  9.0 13.0 14.0 

=== Rank 3 === (4 elements) 
Packed: 18.0 19.0 23.0 24.0 
Read:   18.0 19.0  0.0  0.0 

Changed 7 months ago by robl

updated test case with MPI error checking and cleaning up of resources

comment:2 Changed 7 months ago by thakur

I get the right answer if I run with current MPICH source on my Mac laptop.

comment:3 Changed 7 months ago by robl

fascinating. define 'current'?

comment:4 follow-up: Changed 7 months ago by thakur

The files in the install directory have a date of March 11.

comment:5 Changed 7 months ago by robl

flattened representation of this type:

rank  flattened offset-length tuples                 status
0     (0, 24) (40, 24) (80, 24) (200, 0)             correct
1     (0, 0) (120, 24) (200, 0)                      INCORRECT
2     (0, 0) (24, 16) (64, 16) (104, 16) (200, 0)    correct
3     (0, 0) (144, 16) (200, 0)                      INCORRECT

The "consistent datatype assertions" I wanted in #2073 would have caught this discrepancy -- ranks 1 and 3 are missing one offset-length tuple. The upper and lower bounds are correct but the size of the type is not.
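The size mismatch can be checked mechanically from the tuples. This is a minimal sketch, assuming 8-byte doubles and the element counts printed by the test case (9, 6, 6, 4). The `flat_tuple` type and both function names are hypothetical and are not ROMIO's flattening structures or the assertions proposed in #2073.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical representation of one flattened (offset, length) pair,
 * both in bytes. */
typedef struct {
    long offset;
    long length;
} flat_tuple;

/* Sum of the lengths of all tuples: the number of bytes the flattened
 * type actually describes. Zero-length tuples contribute nothing. */
long flattened_size(const flat_tuple *t, size_t n)
{
    long total = 0;
    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += t[i].length;
    return total;
}

/* Returns 1 if the flattened size equals nelems * elem_size, else 0. */
int flattened_size_matches(const flat_tuple *t, size_t n,
                           long nelems, long elem_size)
{
    return flattened_size(t, n) == nelems * elem_size;
}
```

With the tuples above, rank 0 sums to 72 bytes (9 doubles) and rank 2 to 48 bytes (6 doubles), so both pass. Rank 1 sums to 24 bytes against an expected 48, and rank 3 to 16 bytes against an expected 32, which matches the elements that fail to be read.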


comment:6 Changed 7 months ago by thakur

But it must be correct in the March 11 code that is giving the right answers. Did we introduce any bug in the flattening code between then and now, which may also be the cause of other errors we are seeing, such as resized?

comment:7 in reply to: ↑ 4 Changed 7 months ago by robl

Replying to thakur:

The files in the install directory have a date of March 11.

Huh. I wonder what could possibly be different between your environment and mine.

src/mpi/romio/common/ad_darray.c has seen no significant changes since the conversion to SVN

src/mpid/common/datatype/dataloop/darray_support.c has likewise seen no significant changes since the conversion to SVN

comment:8 Changed 7 months ago by robl

Michael Raymond from SGI replied to me off-list:

FWIW I got even worse breakage with his test using the ROMIO inside MPT. I was getting memory corruption because the value returned by ADIOI_Count_contiguous_blocks() was too small and thus the blocklens and indices arrays were too small. If I artificially bumped flat->count to a much higher value, the code no longer crashed but did get the wrong results. I tracked this down to MPT not marking HVECTORs / STRUCTs with 0-sized counts as contiguous. Once I changed this, the memory corruption and the data mismatches both went away.

comment:9 Changed 7 months ago by thakur

With MPICH from March 11, I get this result: (no errors)

mpiexec -n 4 a.out
=== Rank 0 === (9 elements) 
Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0 
Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0 

=== Rank 1 === (6 elements) 
Packed: 15.0 16.0 17.0 20.0 21.0 22.0 
Read:   15.0 16.0 17.0 20.0 21.0 22.0 

=== Rank 2 === (6 elements) 
Packed:  3.0  4.0  8.0  9.0 13.0 14.0 
Read:    3.0  4.0  8.0  9.0 13.0 14.0 

=== Rank 3 === (4 elements) 
Packed: 18.0 19.0 23.0 24.0 
Read:   18.0 19.0 23.0 24.0 

With today's MPICH, I get this result: (errors in last few elements on ranks 1 and 3)

% mpiexec -n 4 a.out
=== Rank 0 === (9 elements) 
Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0 
Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0 

=== Rank 1 === (6 elements) 
Packed: 15.0 16.0 17.0 20.0 21.0 22.0 
Read:   15.0 16.0 17.0  0.0  0.0  0.0 

=== Rank 2 === (6 elements) 
Packed:  3.0  4.0  8.0  9.0 13.0 14.0 
Read:    3.0  4.0  8.0  9.0 13.0 14.0 

=== Rank 3 === (4 elements) 
Packed: 18.0 19.0 23.0 24.0 
Read:   18.0 19.0  0.0  0.0 

comment:10 Changed 7 months ago by robl

It was my recent commit to fix #2073! [50f3d580], which I intended to touch only indexed and hindexed types, also interacts with darray.

comment:11 Changed 7 months ago by robl

You remember how I said I was really nervous about mucking with struct types and so claimed I did not do that in my patch? Turns out I missed a spot! Patch pending.

comment:12 Changed 7 months ago by Rob Latham <robl@…>

  • Owner set to Rob Latham <robl@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 97114ec5b135538a43fabb45b0d6be9a830b623e:

Got a bit carried away zapping zero-length blocks

A partial revert of the portion of commit
50f3d5806e5cf3934ef991eef2d7d238846380d6 : I did not mean to modify
anything in the struct case. I did, though, and that modification
caused a bug in darray datatypes. The zero-length blocklens in the
struct case indicate upper bound and lower bounds and must be respected.

Closes: #2089

No Reviewer
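To illustrate why the zero-length entries have to be kept, here is a minimal sketch that measures the byte span covered by a flattened list with and without them. It is not the ROMIO code touched by the commit. The `flat_block` type and `flat_span` function are hypothetical, and the sketch assumes the span of the flattened list stands in for the distance between the type's lower and upper bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened block: byte offset and byte length. */
typedef struct {
    long offset;
    long length;
} flat_block;

/* Distance from the lowest block start to the highest block end.
 * If keep_zero_length is 0, zero-length blocks are ignored, which
 * mimics what happens when they are stripped from the list. */
long flat_span(const flat_block *blocks, size_t n, int keep_zero_length)
{
    int seen = 0;
    long lo = 0, hi = 0;
    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        long start = blocks[i].offset;
        long end = start + blocks[i].length;
        if (blocks[i].length == 0 && !keep_zero_length)
            continue;               /* marker dropped */
        if (!seen || start < lo)
            lo = start;
        if (!seen || end > hi)
            hi = end;
        seen = 1;
    }
    return seen ? hi - lo : 0;
}
```

For rank 2's tuples from comment:5, (0, 0) (24, 16) (64, 16) (104, 16) (200, 0), the span is 200 bytes with the markers, which is the full 25 doubles, and 96 bytes without them. A type that loses its markers would therefore no longer cover the whole array.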
