Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Templated MD arrays #39

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

Draft: Templated MD arrays #39

wants to merge 16 commits into from

Conversation

mauneyc-LANL
Copy link
Collaborator

@mauneyc-LANL mauneyc-LANL commented Oct 2, 2023

PR Summary

This is a prototype for "templated MD", that is the reduction of code of the type

PORTABLE_FUNCTION PortableMDArray(T *data, int nx1) noexcept
      : pdata_(data), nx1_(nx1), nx2_(1), nx3_(1), nx4_(1), nx5_(1), nx6_(1) {}
  PORTABLE_FUNCTION
  PortableMDArray(T *data, int nx2, int nx1) noexcept
      : pdata_(data), nx1_(nx1), nx2_(nx2), nx3_(1), nx4_(1), nx5_(1), nx6_(1) {
  }
  PORTABLE_FUNCTION
  PortableMDArray(T *data, int nx3, int nx2, int nx1) noexcept
      : pdata_(data), nx1_(nx1), nx2_(nx2), nx3_(nx3), nx4_(1), nx5_(1),
        nx6_(1) {}

To

  template <typename... NXs>
  PORTABLE_FUNCTION PortableMDArray(T *p, NXs... nxs) noexcept
      : PortableMDArray(p, nx_arr(nxs...), sizeof...(NXs)) {}

The single-value dimension variables int nxN_ have been replaced with a staticly sized container std::array<size_t, MAXDIM> nxs_. Also added PortableMDArray::rank_ member to avoid some reevaluations, though I'm not sure it's necessary.

Currently WIP, so documentation is wanting and a few function names/namespaces are silly/inconsistent/gibberish. I know there can be some reluctance about this coding style, so I wanted to get a "at least compiles" version up and gauge interest in iterating on it. (Actually passes a few tests! but haven't tried spiner with this yet).

It was working fine, why change it?

  • Update code to more modern style
  • Code is easier to maintain and diagnose
  • Allows for forward planning when C++ standards are updated
  • No change in down-facing API
  • For developers, the code is easier to read/glance

What are the downsides?

  • Some extra boilerplate to handle C++14's constrains on variadic programming (e.g. no std::apply(), no std::tuple, no fold expressions). C++14 can be "recursion heavy" in this regard, which may degrade performance.
  • API not "explicit" w.r.t. index ordering
  • (possibly) Longer compile times
  • For downstream users, code or errors passing through/originating in expanded template types can be agitating

Any suggestions, comments, questions welcome! @jonahm-LANL @chadmeyer @dholladay00 @jhp-lanl @jdolence

PR Checklist

  • Any changes to code are appropriately documented.
  • Code is formatted.
  • Install test passes.
  • Docs build.
  • If preparing for a new release, update the version in cmake.

// set_value end-case
template <std::size_t Ind, typename NX>
constexpr void set_value(narr &ndat, NX value) {
ndat[Ind] = value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is operator[] of std::array marked constexpr? I thought this was something that preventing more widespread usage.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I intended to replace all these with appropriate PORTABLE_ s but missed a few

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you probably meant something else. We can use a raw array, if std::array is messing with things

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check that it runs on device, but I thought std::array worked with --expt-relaxed-constexpr but I'm not sure

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mauneyc-LANL you may be able to use std::get<Ind>(arr) = value;. But we really need to check and make sure it works on device.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using --expt-relaxed-constexpr this runs smoothly using volta-x86 partition, kokkos+cuda cuda_arch=70 +wrapper, cmake .. -DPORTS_OF_CALL_BUILD_TESTING -DPORTABILITY_STRATEGY_KOKKOS=ON

PORTABLE_FORCEINLINE_FUNCTION int GetSize() const {
return nx1_ * nx2_ * nx3_ * nx4_ * nx5_ * nx6_;
return std::accumulate(nxs_.cbegin(), nxs_.cend(), 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this on GPUs? I know Brendan had an issue and we had to write a custom accumulate. One could still use binary operators provided by the STL, but the accumulate itself wouldn't run. Perhaps we did something wrong, but I want to make sure this is still GPU callable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be concerned about std::accumulate. Might be better to hardcode this one, or write a a recursive thing ourselves.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It "worked" but gave warnings (even with --expt-relaxed-constexpr). I rewrote it as an explicit loop.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice cleanup. Before merging, I'd like to:
a) Get a sense of compile time differences
b) Know for sure it all works on device

// maximum number of dimensions
constexpr std::size_t MAXDIM = 6;
// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't love that this is lower-case. I'd find this type easier to interpret if it were, e.g.,

Suggested change
using narr = std::array<std::size_t, MAXDIM>;
using Narr_t = std::array<std::size_t, MAXDIM>;

Comment on lines 65 to 78
// compute_index base case, i.e. fastest moving index
template <std::size_t Ind>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
const size_t index) {
return index;
}

// compute_index general case, computing slower moving index strides
template <std::size_t Ind, typename... Tail>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
const size_t index,
const Tail... tail) {
return index * nd[Ind] + compute_index<Ind + 1>(nd, tail...);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice---assuming the array on device issues work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests complete on volta-x86, though we need to put together some more tests for ports-of-call

// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;

namespace detail {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter very much, by why detail over impl?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just habit


target_compile_options(${POCLIB}
INTERFACE
$<${with_cxx}:$<${ps_kokkos}:--expt-relaxed-constexpr>>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event of +kokkos~cuda, wouldn't this result in a compile error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it, but POC doesn't have a USE_CUDA or similar like singularity-eos does, so I need to probe which compiler is being used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how I've done it in other projects:

  get_target_property(kokkos_link_libs Kokkos::kokkoscore
                      INTERFACE_LINK_LIBRARIES)
  string(REGEX MATCH "CUDA"
         kokkos_has_cuda_libs "${kokkos_link_libs}")
  ...
  if(kokkos_has_cuda_libs)
    do cuda stuff...
  endif()

@mauneyc-LANL
Copy link
Collaborator Author

I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which lead to the elimination of some constexpr initialization (not that it couldnt be done, just saved time to get a working code).

One question to ask is how much time should PortableMDArray devote to reconstructing internal layout configuration data - in particular strides. These are the offset widths associated with a particlar dimensional spec, eg.

// ordering based on what MDArray is doing
dims = {NT, NZ, NY, NX};
strides = {NZ * NY * NX, NY * NX, NX, 1};

The options are:

  1. at each compute_index call, recompute the strides. This obviously costs some FLOPS but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to rank, rather than compute/store values up to MAXDIM
  2. (current commit) at each construction or reshaping, recompute the strides. This is the "natural" path but does introduce more data to move around, to a first approximation doubling the data that PortableMDArray needs to move from between contexts. Contra to (1), this array needs MAXDIM storage which is likely going to be mostly dead weight.

@Yurlungur
Copy link
Collaborator

I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which lead to the elimination of some constexpr initialization (not that it couldnt be done, just saved time to get a working code).

One question to ask is how much time should PortableMDArray devote to reconstructing internal layout configuration data - in particular strides. These are the offset widths associated with a particlar dimensional spec, eg.

// ordering based on what MDArray is doing
dims = {NT, NZ, NY, NX};
strides = {NZ * NY * NX, NY * NX, NX, 1};

The options are:

1. at each `compute_index` call, recompute the strides. This obviously costs some FLOPS but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to `rank`, rather than compute/store values up to `MAXDIM`

2. (current commit) at each construction or reshaping, recompute the strides. This is the "natural" path but does introduce more data to move around, to a first approximation doubling the data that `PortableMDArray` needs to move from between contexts. Contra to (1), this array needs `MAXDIM` storage which is likely going to be mostly dead weight.

I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?

@mauneyc-LANL
Copy link
Collaborator Author

I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?

Would be work considering the context that PortableMDArray is used in. Does it reshape often, does it get move on/off device frequently, ect. (2.) would be "best" for host, since it avoids recomputation, but (1.) may be better for device since it would minimize data transfer (tho also it's like 48 bytes extra so probably over-optimizing here).

I've included a simple test with the PR. As originally there wasn't very much being tested directly with poc we may need to add some more for checking these details. I glanced over spiner tests for ideas but most of what is there looks more numerics based.

@Yurlungur
Copy link
Collaborator

Yeah fair questions:

Does it reshape often?

No, almost never.

does it get move on/off device frequently

Also no.

(2.) would be "best" for host, since it avoids recomputation, but (1.) may be better for device since it would minimize data transfer (tho also it's like 48 bytes extra so probably over-optimizing here).

Like you said, 48 bytes is basically nothing, so I lean towards (2).

@dholladay00
Copy link
Collaborator

Previously we recalculated at every access it seems. It seems like the number of adds will be the same and the number of multiplies saved will only be noticeable with high rank arrays. Performance data for ranks 2-6 would helpful in this case.

@dholladay00
Copy link
Collaborator

The more I think about this the more I think we should recompute the strides. Even so, I'd like to see performance differences. Integer operations can be noticeable on GPUs so pre-computed may be more performant, but a larger object may use more registers. There are a lot of competing factors so data is the best.

@dholladay00
Copy link
Collaborator

now that singularity-eos 1.9.0 release has been cut, I think we should up to c++17 and get this merged in prep for the next release.

@Yurlungur
Copy link
Collaborator

now that singularity-eos 1.9.0 release has been cut, I think we should up to c++17 and get this merged in prep for the next release.

I am in favor of moving to C++17 at this time. However, I would prefer to decouple the shift to 17 from this MR. Let's just update the cmake builds to require 17 and then merge other MRs when they are ready.

@cmauney
Copy link
Contributor

cmauney commented Aug 14, 2024

@jonahm-LANL @dholladay00 I've pushed some benchmarks using catch2, right now just doing index computation through contiguous memory - (i, j, k) to n. I haven't tested this on device yet but I would be interested to see the results. I intend to put in the same benchmarks on random (rather than contiguous) memory.

I can take off Draft and move forward with review if you want to get this in. There are things I'd like to modify and add but there's a danger of code-creep and I don't want to hold up, the PR 'as-is' should be ready to go.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this---it's a significant cleanup from the original implementation, which was quite low-level. I want @dholladay00 's build system concerns addressed before merge. Also, just to confirm, this doesn't change the API forfunctionality at all right? It's just more general.

return r;
}
PORTABLE_FORCEINLINE_FUNCTION
decltype(auto) vp_prod() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the decltype needed? Why not just an auto return value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, I'm returning a generic lambda and that's my habbit. But it's not necessary here (tho I don't think it's harmful either)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the decltype adds anything and it's a bit harder to read but I won't push back too much on this.

PORTABLE_FORCEINLINE_FUNCTION auto make_nxs_array(NX... nxs) {
std::array<std::size_t, MAXDIM> a;
std::array<std::size_t, N> t{static_cast<std::size_t>(nxs)...};
for (auto i = 0; i < N; ++i) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna insist this be int or size_t. IMO this is not a style thing but a self-documenting thing. It's an index it's not a double.

@cmauney
Copy link
Contributor

cmauney commented Aug 15, 2024

If C++17 gets in before this is merged, then I do want to do another draft - it will make things cleaner and clearer.

@mauneyc-LANL mauneyc-LANL mentioned this pull request Aug 15, 2024
3 tasks
@Yurlungur
Copy link
Collaborator

@mauneyc-LANL let us know when this is ready for re-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants