Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow dimensions in NXdata to be lists of strings #1246

Merged
merged 15 commits into from
Dec 20, 2023
Merged

Conversation

phyy-nx
Copy link
Contributor

@phyy-nx phyy-nx commented Mar 13, 2023

Two changes here:

  • Add new datatype, NX_CHAR_OR_NUMBER, the union of NX_CHAR and NX_NUMBER. As proposed by @prjemian.
  • Change AXISNAME in NXdata to allow a list of strings using this new datatype

With this change, part of the channel use case proposed by @soph-dec in #940 can be supported. Quoting the documentation for the current FileWriter2 proposal by Dectris with slight modification (the dimensions of data) as an example:

data: NXdata
  @signal = "data"
  @axes = ["image_id", "channel", ".", "."]
  @image_id_indices = 0
  @channel_indices = 1
  image_id = [1, ..., nP]
  channel = ["threshold_1", "threshold_2", "difference"]
  data = uint[nP, nC, i, j]

Where nP is the number of images and nC is the number of channels. Normally the channel field, aka AXISNAME, could only be of type NX_NUMBER, similar to the image_id field, but a set of English names is preferred to identify the channels.

Note, while the change in AXISNAME was approved by vote in nexusformat/NIAC#97, I do not know what effect this has on existing plotting software. For example the DIALS image viewer would need to know what channel to display by default. Also, this could affect the implementation of plot viewing in NeXpy.

We have a Telco tomorrow, so I hope we can discuss those issues then.

Closes #945 and nexusformat/NIAC#97. Partially addresses #940.

@phyy-nx phyy-nx added the telco label Mar 13, 2023
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Mar 13, 2023

Anyone know how to fix the warning that's being generated? :)

@prjemian
Copy link
Contributor

Anyone know how to fix the warning that's being generated? :)

It's a new data type included in the PR.

@prjemian
Copy link
Contributor

As stated in #945:

This needs a branch and pull request since the NIAC has already approved.

Which slice of data to show in a plot by default. This is useful especially for datasets with more than 2 dimensions.
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Mar 17, 2023

Hi all, I had a meeting today about this PR and we have some changes (thanks @graeme-winter, @yayahjb, @soph-dec, and @kalcutter).

First, for 2D images, we note that if the data has the dimensions listed below, then existing general case NeXus readers/plotters should work:

  • [i,j]: single image, single panel
  • [i,j,k]: single image, multi-panel
  • [nP,i,j]: multiple images, single panel
  • [nP,i,j,k]: multiple images, multi-panel

Note the number of dimensions above: 2, 3, 3, and 4. Distinguishing between the two degenerate cases is done by the presence of the NXdetector_module class, which has the fields data_origin and data_size that can be expanded from 2 to three dimensions to handle the multi-panel case.

OK! So here are the four new use-cases that are not compatible with existing general-case plot making tools:

  • [nC, i,j]: single image, multi-channel, single panel
  • [nC,i,j,k]: single image, multi-channel, multi-panel
  • [nP,nC,i,j]: multiple images, multi-channel, single panel
  • [nP,nC,i,j,k]: multiple images, multi-channel, multi-panel

With the number of dimensions 3,4,4,5. Now we have another layer of degeneracy and no way to know how to plot the 2D image.

So the new proposal is to add an attribute @default_slice. You can see the docstring I've proposed in this PR, but briefly, it allows the user to reduce the channel degeneracy, or any degeneracy where there are more dimensions that the plotter can handle, by specifying which subset of the data to show.

In the below example, we combine both using strings as array indices and default_slice:

data: NXdata
  @signal = "data"
  @axes = ["image_id", "channel", ".", "."]
  @image_id_indices = 0
  @channel_indices = 1
  @default_slice = [".", "difference", ".", "."]
  image_id = [1, ..., nP]
  channel = ["threshold_1", "threshold_2", "difference"]
  data = uint[nP, nC, i, j]

So, what do folks think of this alternative?

Note to @mkoennecke, some feedback we received during this meeting was that since the proposed use case of multi-channel data already breaks existing viewers, there's no harm in extending the data type of AXISNAME. Existing software will still read data that doesn't use this feature just fine, but any software that wants to use this feature will need to understand that indices can be strings and which slice to show by default anyway.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 5, 2023

Any comments on this PR? Including @default_slice?

Also, I'm afraid I don't know enough about the build system to fix the current build error. Can someone lend a hand? Thanks!

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 5, 2023

(updated above comment to indicate it's the data type of AXISNAME that is being extended to include strings, not AXISNAME_indices)

@phyy-nx phyy-nx mentioned this pull request Apr 6, 2023
@soph-dec
Copy link
Contributor

Also, I'm afraid I don't know enough about the build system to fix the current build error. Can someone lend a hand? Thanks!

Adding nxdl:NX_CHAR_OR_NUMBER to the members of primitiveType solved the build error for me locally.

phyy-nx and others added 2 commits April 18, 2023 14:32
Co-authored-by: Sophie Hotz sophie.hotz@dectris.com
Co-authored-by: soph-dec <73243774+soph-dec@users.noreply.github.com>
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Apr 18, 2023

Thanks @soph-dec! Checks now pass.

phyy-nx and others added 2 commits September 19, 2023 20:23
Co-authored-by: Sophie Hotz <sophie.hotz@dectris.com>

Co-authored-by: soph-dec <73243774+soph-dec@users.noreply.github.com>
@woutdenolf
Copy link
Contributor

woutdenolf commented Sep 28, 2023

Is this correct?

Should be an array of strings of length equal to the number of dimensions in the data, with the following possible values:

    * ".": All the data in this dimension should be included
    * Integer: Only this slice should be used.
    * String: Only this slice should be used. Use if ``AXISNAME`` is a string array.

Isn't the idea to always have integers but saved as NX_CHAR because we need to support "."? Why does it depend on the type of AXISNAME? That won't work btw because you can have multiple axes that span a single data dimension, and they could be of type NX_NUMBER and NX_CHAR.

Edit: maybe we should use -1 instead of "." so @default_slice can be a list of integers?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Oct 19, 2023

@woutdenolf I've added the type NXCHAR_OR_NUMBER to @default_slice which I think resolves your concern. This matches the type of of AXISNAME. Note I think hdf5 not allow you to mix ints and chars in an attribute array, but I don't think we need to call that out here.

@woutdenolf
Copy link
Contributor

Ok thanks for the clarification. So basically default_slice has as many values as data dimensions and each value is either

  • "."
  • "<str>" (for example "difference")
  • "<int>" (for example "2")
  • <int> (for example 2)

@woutdenolf woutdenolf self-requested a review October 19, 2023 16:37
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Oct 19, 2023

Ok thanks for the clarification. So basically default_slice has as many values as data dimensions and each value is either

  • "."
  • "<str>" (for example "difference")
  • "<int>" (for example "2")
  • <int> (for example 2)

Yup, that's my understanding. Thanks!

Fix conflicts, adjusting phrase "Most dimensions scales will be sequences of numbers" to "Most AXISNAME fields will be sequences of numbers"
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Nov 21, 2023

Conflicts fixed, mostly trivial, with one re-word. @woutdenolf can you re-check this? Thanks!

@phyy-nx phyy-nx added this to the NXDL 2023.10 milestone Nov 21, 2023
@phyy-nx phyy-nx removed the milestone label Nov 21, 2023
soph-dec added a commit to soph-dec/documentation that referenced this pull request Dec 5, 2023
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Dec 6, 2023

This PR is ready for a NIAC vote. Please vote with an emoji on this comment. Options include thumbs up for yes, thumbs down for no, and anything else to abstain. Thank you. Voting will close in two weeks.

@paulmillar
Copy link

Sorry, at the risk of exposing my ignorance, I have a few questions.

Should be an array of length equal to the number of dimensions in the data

This statement seems somewhat ambiguous to me.

Does the "should" refer to an expectation that @default_slice is an array, or does "should" refer to an expectation that if @default_slice is an array then that array has a length equal to the number of dimensions in the data?

What does the word "should" actually mean? Are we following RFC 2119 semantics, or is this meant as a more emphatic statement?

  • Is it valid for @default_slice to be something other than an array? (e.g., a simple value, if data has only one dimension).
  • Is it valid for @default_slice to have length different from the number of dimensions in the data?

@default_slice is defined as having type NX_CHAR_OR_NUMBER. Does this mean it is valid for @default_slice to be an array of integers (rather than an array of strings)?

Given @default_slice as an array of strings, how are numerical value distinguished from an axis names? For example,

  • is "01" a slice index or an axis name?
  • is "-1" a slice index or an axis name?
  • is "010" a slice index with value 10 (decimal), a slice index with value 8 (octal), or is it an axis name?
  • is "0xf" a slice index with value 16 (hex) or an axis name?

What are the processing expectations if AXISNAME is an array of NX_CHAR that contains an element "." and the @default_slice array contains the element "."?

What are the processing expectations if AXISNAME is an array of NX_CHAR, the first element is "2" and the @default_slice array contains the element "2"?

What are the processing expectations if AXISNAME is an array of NX_CHAR with two elements having the same value and that value is specified in @default_slice ?

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Dec 12, 2023

Thank you @paulmillar! Note that at this point in we are in the middle of a vote so it is too late to make changes to this PR, otherwise it would be unfair to those that have already voted. That has been the protocol at live NIAC meetings, where once a vote has been called for and seconded it has to finish. I don't see why it wouldn't be the case for an online vote as well. Indeed this PR has already been discussed and reviewed so I think that's fair.

THAT SAID, your questions are good, and I think you've identified one change I would have made in this PR. However, I think the change is clarification, and not substance, so I think it would be fine as a separate PR that doesn't need a vote. More below.

What does the word "should" actually mean? Are we following RFC 2119 semantics, or is this meant as a more emphatic statement?

Should is a requirement. I had never seen RFC 2119 before, thanks for pointing it out! Just to check, I reviewed NXmx to find usages of the word "should" and found 10. Each could equally be considered ambiguous based on RFC 2119. I'd prefer a separate review of the usage of imperatives outside of this PR.

Is it valid for @default_slice to be something other than an array? (e.g., a simple value, if data has only one dimension).

Yes to your specific example, a single value if the data has only one dimension. My understanding of NeXus is that this doesn't generally need to be specified, but I could be wrong here.

Is it valid for @default_slice to have length different from the number of dimensions in the data?

No. See "should" above.

@default_slice is defined as having type NX_CHAR_OR_NUMBER. Does this mean it is valid for @default_slice to be an array of integers (rather than an array of strings)?

NX_CHAR_OR_NUMBER is a new term defined here as "Any valid character string or NeXus number representation". So a list of integers is inherently included. Technically this should be NXCHAR_OR_INT I suppose...

Given @default_slice as an array of strings, how are numerical value distinguished from an axis names?

This is a good question. The second example lists @default_slice = [".", "2", ".", "."] as a possibility. I had presumed someone writing software to support this would attempt a lookup for "2" in the AXISNAME channel, not find it, and fall back to see if "2" can be parsed as a number. But this could be more clear in the documentation. Note you cannot mix integers and strings in an HDF5 array, so using "." in an array with integers for @default_slice has to be done this way.

What are the processing expectations if AXISNAME is an array of NX_CHAR that contains an element "." and the @default_slice array contains the element "."?

"." in @default_slice has a reserved meaning so the elements of AXISNAMEdon't matter. If you have "." as a value in AXISNAME, you won't be able specify it in @default_slice.

What are the processing expectations if AXISNAME is an array of NX_CHAR, the first element is "2" and the @default_slice array contains the element "2"?

Order of operations states that "2" will be searched for in AXISNAME and the index of "2" in AXISNAME will then be the slice used for @default_slice

What are the processing expectations if AXISNAME is an array of NX_CHAR with two elements having the same value and that value is specified in @default_slice?

This is undefined.

I really appreciate these comments. Of these, I'd like to add "order of operations" to @default_slice, to clarify that AXISNAME should be searched before a CHAR array is attempted to be parsed. However, I'd like to do that in a separate PR, as I think it's clarification, and not really a change that needs a vote. Do you agree?

For the rest, I think we have to be careful to avoid overdefining terms. I don't think we need to add these clarifications to the NXdata spec unless someone asks for them. Do you agree?

@prjemian
Copy link
Contributor

Since "2" is not a valid name for a group or field, that logic could be included and the search for such an HDF5 address avoided.

@prjemian
Copy link
Contributor

Same for ".".

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Dec 12, 2023

Ah good stuff @prjemian. Thanks.

@phyy-nx
Copy link
Contributor Author

phyy-nx commented Dec 20, 2023

Vote has passed, thanks all.

@phyy-nx phyy-nx merged commit 41b754c into main Dec 20, 2023
4 checks passed
@phyy-nx phyy-nx deleted the dimension_as_char branch December 20, 2023 17:38
@phyy-nx
Copy link
Contributor Author

phyy-nx commented Dec 20, 2023

@paulmillar I believe @prjemian answered your questions best when he pointed out that "2" and "." are not legal AXISNAMEs. Is there anything in your comment you think should be made into an issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

Allow NXdata dimension variables to contain a list of strings
5 participants