Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add replicatedDimCount to ESMF_ArrayGet, and improve some unit tests of replicated dimensions #187

Merged
merged 12 commits into from
Oct 12, 2023

Conversation

billsacks
Copy link
Member

@billsacks billsacks commented Oct 10, 2023

This started with my work towards addressing #184 . That turned out to be a bigger problem than I anticipated, but some of the work towards addressing that issue seemed worth preserving.

This PR:

(1) Introduces a method to get the replicated dimension count of an array and uses this method in place of some previously-duplicated code (in both ESMCI_Array.C and ESMCI_IO.C).

(2) Adds a replicatedDimCount optional output argument to ESMF_ArrayGet, with associated unit tests.

(3) Improves unit tests of Arrays with replicated dimensions in ESMF_ArrayIOUTest.F90 - most notably, outputting the results of different tests to different files to aid later examination, and making the values in the Array vary within each PET to make the tests more rigorous.

This allows us to remove some duplicated code to count the number of
replicated dimensions, both within and outside of ESMCI_Array.C. The
motivation for doing this now is that I anticipate adding another place
where we need to get this count in ESMCI_IO.C.

In order to test this new function (which is not exposed via the Fortran
API), I have added a new C++ unit test class, ESMCI_ArrayUTest.C.
Previously, two files were getting overwritten by subsequent tests. This
made it impossible to examine the results. I am introducing new files to
avoid this overwriting.
Add more variability to the arrays.

This started as an attempt to fix what appear to be messed up values in
some of the I/O of Arrays with replicated dimensions, especially this
one:

netcdf Array_repli1_first {
dimensions:
	myData_dim001 = 10 ;
	myData_dim002 = 3 ;
variables:
	double myData(myData_dim002, myData_dim001) ;
data:

 myData =
  0, _, 12345, _, 24690, _, 37035, _, 49380, 61725,
  0, _, 12345, _, 24690, _, 37035, _, 49380, 61725,
  0, _, 12345, _, 24690, _, 37035, _, 49380, 61725 ;
}

But this pattern of missing data persists no matter how I fill the
arrays. So I'm giving up on trying to fix this pattern of missing data,
but am leaving in place this slight improvement in the setting of
values (so that there's more variability from one point to another).
These tests are now completely redundant with the recently-added tests
at the Fortran API level.
@billsacks billsacks requested a review from theurich October 10, 2023 15:50
@billsacks billsacks self-assigned this Oct 10, 2023
Copy link
Member Author

@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theurich - see a couple of inline questions I have about this PR.

Also, note that I ended up removing the C++ unit test class (db668ee): its tests were completely redundant with the new Fortran-level tests, so it seemed like more of a burden than a help to have this.

@@ -385,6 +385,8 @@ namespace ESMCI {
const int *getTotalLBound() const {return totalLBound;}
const int *getTotalUBound() const {return totalUBound;}
int getTensorCount() const {return tensorCount;}
int getReplicatedDimCount() const
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I don't have a return code (rc) for this function, since nothing can currently lead to it failing. I was thinking of this as just a small step beyond a simple getter. Does that seem reasonable, or should this have a place-holder rc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this fine. New code on the C++ side should rely on exception throwing anyway in case of error, instead of explicit RC return.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I would go a step further, and implement the actual determination of the replicated dim count right here under int getReplicatedDimCount().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

int getReplicatedDimCount()             const {
  int replicatorCount = 0;
  for (auto i=0; i<distgrid->getDimCount(); i++) 
    if (distgridToArrayMap[i] == 0) ++replicatorCount;
  return replicatorCount;
}

Then get rid of the getReplicatedDimCountImpl() altogether.

@@ -129,7 +129,7 @@ contains
! !INTERFACE:
! Private name; call using ESMF_ArrayGet()
subroutine ESMF_ArrayGetDefault(array, keywordEnforcer, arrayspec, typekind, &
rank, localarrayList, indexflag, distgridToArrayMap, &
rank, replicatedDimCount, localarrayList, indexflag, distgridToArrayMap, &
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this seem like a good place to insert this new optional output argument? Another possible position is after dimCount, though the stuff there seems related solely to the distGrid, so that didn't feel quite as appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about putting it /after/ the dimCount variable further back in the list. Seems a bit more appropriate to me. But all pretty subjective.

Copy link
Member

@theurich theurich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bill, please see my comments in the code. My sense is that the implementation can be simplified. Everything else looks good to me.

@@ -385,6 +385,8 @@ namespace ESMCI {
const int *getTotalLBound() const {return totalLBound;}
const int *getTotalUBound() const {return totalUBound;}
int getTensorCount() const {return tensorCount;}
int getReplicatedDimCount() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this fine. New code on the C++ side should rely on exception throwing anyway in case of error, instead of explicit RC return.

@@ -385,6 +385,8 @@ namespace ESMCI {
const int *getTotalLBound() const {return totalLBound;}
const int *getTotalUBound() const {return totalUBound;}
int getTensorCount() const {return tensorCount;}
int getReplicatedDimCount() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I would go a step further, and implement the actual determination of the replicated dim count right here under int getReplicatedDimCount().

@@ -385,6 +385,8 @@ namespace ESMCI {
const int *getTotalLBound() const {return totalLBound;}
const int *getTotalUBound() const {return totalUBound;}
int getTensorCount() const {return tensorCount;}
int getReplicatedDimCount() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

int getReplicatedDimCount()             const {
  int replicatorCount = 0;
  for (auto i=0; i<distgrid->getDimCount(); i++) 
    if (distgridToArrayMap[i] == 0) ++replicatorCount;
  return replicatorCount;
}

Then get rid of the getReplicatedDimCountImpl() altogether.

@@ -503,6 +505,8 @@ namespace ESMCI {
static void superVecParam(Array *array, int localDeCount,
bool superVectorOkay, int superVecSizeUnd[3], int *superVecSizeDis[2],
int &vectorLength);
private:
static int getReplicatedDimCountImpl(int dimCount, int *distgridToArrayMap);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is need of a static implementation of this feature. It seems like a perfect instance method to me. See my other comments.

@@ -129,7 +129,7 @@ contains
! !INTERFACE:
! Private name; call using ESMF_ArrayGet()
subroutine ESMF_ArrayGetDefault(array, keywordEnforcer, arrayspec, typekind, &
rank, localarrayList, indexflag, distgridToArrayMap, &
rank, replicatedDimCount, localarrayList, indexflag, distgridToArrayMap, &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about putting it /after/ the dimCount variable further back in the list. Seems a bit more appropriate to me. But all pretty subjective.

//-----------------------------------------------------------------------------
#undef ESMC_METHOD
#define ESMC_METHOD "ESMCI::Array::getReplicatedDimCountImpl()"
int Array::getReplicatedDimCountImpl(int dimCount, int *distgridToArrayMap) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope this can be removed.

int replicatorCount = 0; // initialize
for (int i=0; i<dimCount; i++)
if (distgridToArrayMapArray[i] == 0) ++replicatorCount;
int replicatorCount = getReplicatedDimCountImpl(dimCount, distgridToArrayMapArray);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to getReplicatedDimCount().

int replicatorCount = 0; // initialize
for (int i=0; i<dimCount; i++)
if (distgridToArrayMapArray[i] == 0) ++replicatorCount;
int replicatorCount = getReplicatedDimCountImpl(dimCount, distgridToArrayMapArray);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to getReplicatedDimCount().

@billsacks
Copy link
Member Author

@theurich - thanks a lot for your comments here. I only have a minute right now, so have just skimmed them for now, but I wanted to explain the need (I think) for getReplicatedDimCountImpl: this is called from Array::create as well as from the public getReplicatedDimCount. The call from Array::create is done before the Array object is constructed. This led me to think that I should create a static method to do the work, at least if I wanted to remove the duplicated code from all places. Does that make sense, or do you still think I should remove this Impl method? It's quite possible I'm missing something here.

@theurich
Copy link
Member

Yes, being functional inside Array::create() is an issue. My preference in this case would be to implement just a local utility routine to cover that phase of Array construction. Public facing, both on the ESMCI level as well as the ESMF access is through the instance method. But let's discuss tomorrow on the core call.

@theurich
Copy link
Member

And on second thought, it is really a DistGrid method. Still fine to route it through the ESMCI::Array and ESMF_Array API for convenience (as you have it), but I think the implementation should be an instance method of ESMCI::DistGrid.

@billsacks
Copy link
Member Author

@theurich - I made your suggested change to the placement of the replicatedDimCount argument.

Yes, let's talk tomorrow about your other comments. I thought that it didn't make sense to put this new function in the DistGrid class because it's mainly working with distgridToArrayMap, which I think is specific to the Array, not the DistGrid. I'm happy to make this a helper function rather than a static method (I see this is done elsewhere in ESMCI_Array.C, e.g., sparseMatMulStoreNbVectors), but I don't understand your whole vision, e.g., making this helper method public-facing. Anyway, let's talk more tomorrow so I can better understand your thinking here.

@theurich
Copy link
Member

Yes, the DistGrid method was a dumb idea, with half my brain out the door already yesterday afternoon. Just ignore it please.
My thinking this morning - hopefully with more brain engaged this time - is to add a member variable replicatedDimCount in the Array class. It would be set during Array::create(). Probably still worth having an internal utility routine that implements it, since used in more than one place. But not worth a static class method. Finally the public accessor getReplicatedDimCount() is simply returning the replicatedDimCount member.

@billsacks
Copy link
Member Author

I have checked the new API documentation and confirmed that it looks good.

@billsacks
Copy link
Member Author

@theurich - I think I have made the changes you suggested (making a member variable for number of replicated dimensions); can you look this over and let me know if it looks good to you now?

However, I didn't see where / how to change the calculation of buffer size in the serialize / deserialize. If that's needed, can you point me to what I need to do for it? (Unit tests currently pass, so maybe nothing more is needed?)

Copy link
Member

@theurich theurich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good, except for one small change needed (answering your question about the serialize buffer size inquire part).
Code at

ip += 2 + 2*tensorCount + 2*distgrid->getDimCount () +

ip += 2 + 2*tensorCount + 2*distgrid->getDimCount () +
      rank + delayout->getDeCount ();

needs to be changed to

ip += 3 + 2*tensorCount + 2*distgrid->getDimCount () +
      rank + delayout->getDeCount ();

i.e. bump up the first number by one, because there is one more int member now.

@billsacks
Copy link
Member Author

Thanks @theurich ! I made that change, reran local testing (all good) and have pushed it. Once you approve, I'll merge this.

Copy link
Member

@theurich theurich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bill, this looks good to me now. Please merge at your convenience. Thanks!

@billsacks billsacks merged commit ea64eb3 into develop Oct 12, 2023
2 checks passed
@billsacks billsacks deleted the replicated_dims_cleanup branch October 12, 2023 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants