Fix: 1381 change to the axes attribute meaning #1396

Open
wants to merge 19 commits into
base: main

Contributor

@woutdenolf woutdenolf commented Jun 30, 2024

Closes #1381

HTML RENDERING OF NXDATA FIX <-> current nxdata for comparison

When reviewing this PR, please keep in mind that the purpose is to rectify NXdata, not improve it. Suggestions for improvements can go in #1381.

For reference: multi-dimensional axes were introduced here https://www.nexusformat.org/2014_axes_and_uncertainties.html

Context

The NXdata definition got a makeover in PR #1213 to make it more understandable. In this effort, I assumed the @axes attribute was supposed to say what all the axes are of the NXdata group. It didn't occur to me that this attribute is not about the axes, it is about the signal: it defines what the default axis is for each signal dimension. The unintended change does not make existing files invalid, but it does introduce more flexibility, which changes things for existing readers, as pointed out in issue #1381 by @jacobfilik.

Purpose of this PR

The sole purpose of this PR is to rectify PR #1213 and NOT introduce anything new. The alternative PR #1392 by @PeterC-DLS fixes the situation by carefully modifying some sentences here and there. I would argue however that the entire "axes" section in the introduction has been structured with the less restrictive @axes in mind. In this PR I refactor the entire "axes" section to better reflect restrictive @axes.

Details on @axes rectification

This is the current, less restrictive @axes attribute definition, which I'm rectifying:

The `@axes` attribute provides the names of all the `AXISNAME` fields in the NXdata group.
The corresponding `@AXISNAME_indices` attributes provide the signal dimensions they span
and when missing the position(s) of `AXISNAME` in `@axes` are taken as the signal dimensions
spanned by `AXISNAME`. The shape of an `AXISNAME` field is required to be equal to the shape
of the spanned signal dimensions.

This definition is much simpler and more concise than the definition in the "axes" section in this PR. However, it removes the restriction that length(axes) == rank(signal), so we need to put it back in, with @axes being the default axes, not all axes.
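
As a minimal sketch of the restriction length(axes) == rank(signal), the check below compares the number of entries in @axes with the rank of the signal. The function name `check_axes_length` is hypothetical and not part of the NeXus definition or of this PR; plain Python lists and tuples stand in for the HDF5 attribute and dataset shape.

```python
def check_axes_length(axes, signal_shape):
    """Return True if @axes has exactly one entry per signal dimension."""
    return len(axes) == len(signal_shape)


# A 2D signal needs exactly two entries in @axes.
# "." marks a signal dimension without a default axis.
print(check_axes_length(["polar_angle", "time_of_flight"], (148, 750)))
print(check_axes_length(["polar_angle", "."], (148, 750)))
# Listing every axis field in the group would break the restriction.
print(check_axes_length(["polar_angle", "time_of_flight", "extra"], (148, 750)))
```

Under the less restrictive definition quoted above, the third call would have been acceptable; under the restrictive one it is not.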

@PeterC-DLS
Contributor

Also add to @axes the following from my #1392

    <dimensions rank="1">
        <dim index="1" value="dataRank"/>
    </dimensions>

@rayosborn
Contributor

@woutdenolf requested some evidence that NIAC had previously allowed the dimension size of an axis array to be one greater than the associated data dimension. It turns out that this was approved at the very first NIAC meeting in Pasadena in September, 2003. According to the minutes

"Histograms can be specified either by having an extra element in the axis array or with an attribute histogram_offset in the axis."

If you look at the current PDF version of the manual, you will find examples of axes implementing this rule on page 16/17. An example NeXus file has the following two entries:

NX Data  : data[148,750] (NX_INT32)
NX Data  : time_of_flight[751] (NX_FLOAT32)
NX Data  : polar_angle[148] (NX_FLOAT32)

Finally, it is also implied in the current NXdetector definition. Here are two of the field definitions:

time_of_flight: (optional) [NX_FLOAT](https://manual.nexusformat.org/nxdl-types.html#nx-float) (Rank: 1, Dimensions: [tof+1]) {units=[NX_TIME_OF_FLIGHT](https://manual.nexusformat.org/nxdl-types.html#nx-time-of-flight)}
raw_time_of_flight: (optional) [NX_INT](https://manual.nexusformat.org/nxdl-types.html#nx-int) (Rank: 1, Dimensions: [tof+1]) {units=[NX_PULSES](https://manual.nexusformat.org/nxdl-types.html#nx-pulses)}

The tof symbol refers to the number of time-of-flight bins, so the field contains tof+1 bin boundaries. The time_of_flight field even includes a (now deprecated) axis attribute, implying that this could be linked to a NXdata group. Time-of-flight bin boundaries are stored in the raw data files of virtually every spallation neutron source instrument as well as time-of-flight spectrometers at reactor sources.
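
To illustrate the "one extra element" rule, here is a small sketch that classifies a 1D axis against the signal dimension it spans and converts bin boundaries to bin centers. The function names and the labels "points" and "bin_boundaries" are assumptions made for this illustration only; they do not come from the NeXus manual.

```python
def axis_kind(axis_length, signal_dim_length):
    """Classify a 1D axis against the length of the signal dimension it spans."""
    if axis_length == signal_dim_length:
        return "points"
    if axis_length == signal_dim_length + 1:
        return "bin_boundaries"
    return "invalid"


def bin_centers(boundaries):
    """Midpoints of consecutive bin boundaries (n+1 values give n centers)."""
    return [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]


# Shapes from the example file: data[148,750], time_of_flight[751], polar_angle[148]
print(axis_kind(751, 750))
print(axis_kind(148, 148))
print(bin_centers([0.0, 1.0, 3.0]))
```

A reader that only supports point axes could use something like `bin_centers` to plot histogram data, at the cost of losing the bin widths.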

@woutdenolf
Contributor Author

woutdenolf commented Aug 15, 2024

Refactor again to take NIAC comments into account:

  • start with simple example at the very top (@phyy-nx)
  • histogram edges (axis with 1 extra value, @rayosborn )
  • when AXISNAME_indices is the same as the indices of "AXISNAME" in axes there is no need to provide them (@benajamin )
  • indices of "AXISNAME" in axes must be a subset of AXISNAME_indices (comment by @rayosborn to prohibit contradiction between AXISNAME_indices and axes)
  • add <dimensions rank="1"> to @axes (@PeterC-DLS )

The structure of the NXdata doc section at the top is now

  • Usage: canonical example + list supported cases
  • Signals:
    • Defined by ...
    • Definition
    • Example
  • Axes:
    • Defined by ...
    • Definition
    • Example
  • Uncertainties
    • Defined by ...
    • Definition
    • Example

One example per section is enough. The example in the axes section covers all axes use cases supported by NXdata (histogram edges +1, alternative axes, multi-dimensional axes, "." in axes, missing indices). I made this example very concrete by referring to a specific scientific case and included an image.

Edit: I keep discovering new things about the axes. I added this statement which did not seem to be in the current definition but I think is implied?

The number of values in AXISNAME_indices must be equal to the rank of the corresponding AXISNAME field.
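
The rules listed in this comment (indices may be omitted, positions in axes must be a subset of AXISNAME_indices, and the number of indices must equal the axis rank) can be sketched as a small validator. This is only an illustration under stated assumptions: `positions_in_axes` and `check_axis` are hypothetical names, and indices are always given as a list even though a file may store a single integer.

```python
def positions_in_axes(axes, name):
    """Positions at which an axis name appears in @axes."""
    return [i for i, a in enumerate(axes) if a == name]


def check_axis(axes, name, axis_rank, indices=None):
    """Return a list of rule violations for one AXISNAME field (empty if none)."""
    positions = positions_in_axes(axes, name)
    if indices is None:
        # Omitted attribute: fall back to the positions of the name in @axes.
        indices = positions
    problems = []
    if not set(positions) <= set(indices):
        problems.append("positions in @axes are not a subset of indices")
    if len(indices) != axis_rank:
        problems.append("number of indices differs from the axis rank")
    return problems


axes = ["x", "y"]
print(check_axis(axes, "x", axis_rank=1))                      # indices omitted
print(check_axis(axes, "x", axis_rank=1, indices=[1]))         # contradicts @axes
print(check_axis(axes, "x_set", axis_rank=2, indices=[0, 1]))  # alternative 2D axis
```

An axis that does not appear in @axes, such as the hypothetical `x_set` above, has no default positions, so it always needs explicit indices.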

@woutdenolf woutdenolf force-pushed the 1381-change-to-the-axes-attribute-meaning branch 10 times, most recently from edba426 to a518e52 Compare August 16, 2024 03:37
…s +1, indices can be omitted, indices cannot contradict @axes
@woutdenolf woutdenolf force-pushed the 1381-change-to-the-axes-attribute-meaning branch from a518e52 to e37a676 Compare August 16, 2024 03:47
phyy-nx added a commit that referenced this pull request Aug 19, 2024
- NX_FLOAT -> NX_NUMBER for scaling_factor and offset
- Sync language with number #1396 for the FIELDNAME convention
.. code-block::

  data:NXdata
    @signal = "data"
Contributor

@signal is optional and, if missing, we use data, so maybe the simplest example should omit it.

Contributor Author

I did not know this. Not specified in the definition. I'll add it.

Contributor Author

I guess when @axes is missing, we use [".", ".", ...] (as many dots as rank)?

Contributor

If @signal is missing, there is no rule that says "data" is required as a field name. In this case, when there are multiple fields, selection of the signal field for the default plot is not guaranteed.

Contributor

I thought NXdata evolved from a fixed signal field (usually called data, see e.g. `<field name="data" type="NX_NUMBER">`) with an (optional?) attribute @signal=1, so to be backward compatible, a modern parser could assume that the group attribute signal, if missing, is "data".

Contributor

I think that @prjemian is correct. I just scanned through the minutes of all the NIAC meetings, and I can't find any reference to a default name for the signal. I believe that the current (non-deprecated) method of defining signals and axes requires both signal and axes attributes to be defined at group level.

Contributor

It was found that assigning these links to fields would break when the fields were linked (local or external) into other groups. Such field attributes could produce a group with multiple field @signal attributes.

Contributor

Perhaps we need to make these attributes required then.

Contributor

@prjemian prjemian Aug 20, 2024

Could bring that up again for a NIAC vote. Should not make it required until then, since NIAC voted the other way.

Consider a simple use case (for not having this attribute). User wants to save a 2-D image, only the 2-D image. NeXus file consists of /entry:NXentry/data:NXdata/image where image is the 2-D image. We've gotten strong pushback on requiring @signal in this use case.
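
The positions in this thread can be summarized with a sketch of how a reader might pick the signal field. This is an assumption about reader behavior, not a rule voted by NIAC: `resolve_signal` is a hypothetical helper, and the single-field fallback is only one possible way to handle the image-only use case.

```python
def resolve_signal(group_attrs, field_names):
    """Pick the signal field name for the default plot, or None if ambiguous."""
    if "signal" in group_attrs:
        return group_attrs["signal"]
    if len(field_names) == 1:
        # Assumption: a lone field is the only possible candidate.
        return field_names[0]
    # Several fields and no @signal: no guaranteed choice.
    return None


print(resolve_signal({"signal": "data"}, ["data", "x"]))
print(resolve_signal({}, ["image"]))
print(resolve_signal({}, ["data", "x"]))
```

Note that the last call returns None even though a field named "data" exists, reflecting the point that no default signal name is defined.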

`Dimensions`
4. When :ref:`AXISNAME_indices </NXdata@AXISNAME_indices-attribute>` is the same as the indices of "AXISNAME" in the
:ref:`axes </NXdata@axes-attribute>` attribute, there is no need to provide
:ref:`AXISNAME_indices </NXdata@AXISNAME_indices-attribute>`.
Contributor

Only for 1D axis fields?

Contributor Author

@woutdenolf woutdenolf Aug 20, 2024

If this applies in general then @axes=["x", "x"] would imply @x_indices=[0,1]. That would be ok when x is 2D.

When x is 1D it would not work. It would violate two requirements

  • The number of values in AXISNAME_indices must be equal to the rank of the corresponding AXISNAME field.
  • The indices of "AXISNAME" in the axes attribute must be a subset of AXISNAME_indices.

So if a 1D x needs to work for @axes=["x", "x"] we need to change those requirements. Edit: for the record I don't think this is needed and if it would be needed, you can use softlinks to have one dataset with two different names.

Contributor

Hmm, a pathological case could be @x_indices=[1,0]

Contributor Author

Ok I see. So omitting indices can only work unambiguously in 1D cases.

Contributor Author

Another pathological case would be @x_indices=[0,0]. I'm not sure what this means but technically the definition does not forbid it. So we should say indices cannot contain duplicates?

Contributor

In the case of @x_indices=[0,0], this says the same 1-D array applies to both independent axes.

Contributor Author

I mean @x_indices=[0,0] and x being 2D.

Contributor

Eeeeww! That's pathological, and likely a mistake.

Contributor Author

Ah ok, the first example. axes=["x", "x"] would be equivalent to @x_indices=[0,1] no?

Contributor Author

Eeeeww! That's pathological, and likely a mistake.

Agreed, but technically the definition does not prevent it as far as I can see, which we should probably fix, but I'm not sure how.

Stating that AXISNAME_indices cannot have duplicates would solve it. But it would prohibit the case where you have, for example, a single 1D x axis of, let's say, 100 points and a 2D signal of 100 x 100 points, and you want x to be used for both dimensions. As I said before, you could define x1 and x2 and both can be symlinks to x, so effectively x is used for both dimensions even though @axes=["x1", "x2"].
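
The cases discussed in this thread can be sketched in a few lines. The helper names `implied_indices` and `has_duplicates` are hypothetical, and a Python dict with two keys pointing at the same list stands in for HDF5 soft links; none of this is part of the NXdata definition.

```python
def implied_indices(axes, name):
    """Indices implied for an axis when AXISNAME_indices is omitted."""
    return [i for i, a in enumerate(axes) if a == name]


def has_duplicates(indices):
    """True if the same signal dimension is listed more than once."""
    return len(set(indices)) != len(indices)


# @axes=["x", "x"] would imply @x_indices=[0, 1], which needs a 2D x.
print(implied_indices(["x", "x"], "x"))

# The pathological cases: [0, 0] has duplicates, [1, 0] does not.
print(has_duplicates([0, 0]))
print(has_duplicates([1, 0]))

# Alternative for a 1D x on both dimensions: two names for one dataset.
fields = {"x": list(range(100))}
fields["x1"] = fields["x"]  # stand-in for a soft link to x
fields["x2"] = fields["x"]  # stand-in for a soft link to x
axes = ["x1", "x2"]
print(implied_indices(axes, "x1"), implied_indices(axes, "x2"))
```

With the two aliases, each name appears once in @axes, so each implied index list has one entry and matches a 1D field, while a "no duplicates" rule on AXISNAME_indices would still hold.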

@rayosborn
Contributor

rayosborn commented Aug 20, 2024

I would suggest changing the following text:

The AXISNAME_indices attributes describe the DATA dimensions spanned by the corresponding AXISNAME fields.

to:

The optional AXISNAME_indices attributes describe the DATA dimensions spanned by the corresponding AXISNAME fields. These attributes may be used to indicate alternative axis coordinates to the default.

My rationale is that the fact that these attributes are not necessarily needed may otherwise be missed by those who don't read the following points carefully.

@rayosborn
Contributor

I would suggest changing histogram bin edges to histogram bin boundaries, which is more common usage in the neutron time-of-flight community.

@rayosborn
Contributor

rayosborn commented Aug 20, 2024

Just a general comment. I think @woutdenolf has done a great job of clarifying the various scenarios for specifying axes, but I found the list following "Additional requirements for a complete definition" to be difficult to follow, even as someone who has participated in this discussion for years. I wonder if it should have a separate heading, such as "Multidimensional Axes" so that people know they only need to get their heads round this if they have specific needs for more complex axis specifications. To emphasize this, I think it might be better if the simple example at the top is two-dimensional, with both x and y axes, to prevent people from thinking that multidimensional data are intrinsically complicated to plot. The only drawback is that it opens a can of worms about what is the horizontal axis, x or y, when axes=["x", "y"].

@woutdenolf
Contributor Author

woutdenolf commented Aug 20, 2024

NIAC Aug 2024 comments: target audience are scientists on the one hand and software developers that need to write/read HDF5 files on the other hand. The rigorous definition should not alienate the first group. Split the description at the top accordingly.

Development

Successfully merging this pull request may close these issues.

Change to the axes attribute meaning
4 participants