HdfStorage.read: fix rowNumbers for all start/end values #430

sbesson · 2024-09-19T14:02:29Z

Under some conditions, the rowNumbers array returned as part of the data was inconsistent with the range of rows selected by the tables.Table.read() API. As the start/end parameters are expected to have the same meaning as built-in Python slices, this modifies the current implementation to slice the full list of row numbers using start/end.

Discovered as part of the work on ome/openmicroscopy#6412 to cover the different cases of the current table reading API. With this change included, all the test cases in testReadStartEnd introduced in ome/openmicroscopy#6412 should pass. Without it, several of the rowNumbers checks will fail.

This can also be tested manually, e.g. by loading an existing table and calling table.read() with start/end values outside the [0, getNumberOfRows() - 1] range:

>>> n = table.getNumberOfRows()
>>> table.read([0], -1, n)
>>> table.read([0], 0, n + 10)
>>> table.read([0], 0, -1)
>>> table.read([0], 0, -n + 1)

Without this change, the above should return incorrect values for rowNumbers. With this change, the rowNumbers should be consistent with the [start:end] slice and the column values.
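As a reference for what "consistent with the [start:end] slice" means, here is a minimal pure-Python sketch. The helper name `expected_row_numbers` is hypothetical and is not part of the OMERO API; it only models the slicing of the full list of row numbers described in this PR, using a 5-row table as an assumed example.

```python
def expected_row_numbers(nrows, start, end):
    """Model the row numbers a read(cols, start, end) call should report.

    Hypothetical helper: it slices the full list of row numbers with
    built-in Python slice semantics.
    """
    return list(range(nrows))[start:end]


n = 5  # assumed table size for illustration
print(expected_row_numbers(n, -1, n))       # [4]
print(expected_row_numbers(n, 0, n + 10))   # [0, 1, 2, 3, 4]
print(expected_row_numbers(n, 0, -1))       # [0, 1, 2, 3]
print(expected_row_numbers(n, 0, -n + 1))   # [0]
```

The four calls mirror the manual checks listed above, so the printed lists are the rowNumbers one would expect under slice semantics.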

chris-allan

Obviously the concept is what we need; the thing I worry about is performance when slicing a large table. If the table has 10M rows, asking for a few hundred with row numbers included is going to perform very poorly with this implementation.

For example:

In [2]: %timeit list(range(0, 10_000_000))
198 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
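One way to see that the cost comes from building the list rather than from the slice itself: a Python range object can be sliced directly, which yields another range without creating any intermediate list. The sketch below is only an illustration of that language feature; `lazy_row_numbers` is a hypothetical name and not code from this PR.

```python
def lazy_row_numbers(nrows, start, end):
    """Return the selected row numbers as a range (hypothetical helper).

    Slicing a range produces another range in constant time, so no list
    of nrows elements is ever built.
    """
    return range(nrows)[start:end]


rows = lazy_row_numbers(10_000_000, 1_000, 1_300)
print(rows)       # range(1000, 1300)
print(len(rows))  # 300
```

Negative and out-of-range values are clamped the same way as for list slicing, so the result stays consistent with `list(range(nrows))[start:end]`.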

sbesson · 2024-09-20T10:20:09Z

Completely agreed. This was the most generic way to solve all scenarios, but it has a real performance cost in the scenario you describe above.

It should be easy to add extra checks on whether start and end are within the [0, nrows-1] range. From there, I see two possible behaviors up for discussion:

only apply the slice to the full list as a fallback, i.e. if start and/or end are outside the [0, nrows-1] range.
decide that start/end values outside the [0, nrows-1] range were never declared as part of the API contract, that the implementation has been broken for 10+ years, and that such values should not be supported
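The two behaviors could be sketched roughly as follows. This is not the HdfStorage implementation: `row_numbers` and its `strict` flag are hypothetical, and the fast-path condition `0 <= start <= end <= nrows` is an assumption (it allows end == nrows because end is treated as exclusive).

```python
def row_numbers(nrows, start, end, strict=False):
    """Compute row numbers for a read over [start, end) (hypothetical helper).

    strict=False models the first behavior: full-list slicing is used only
    as a fallback. strict=True models the second: unusual values are rejected.
    """
    # Fast path: cost is proportional to the number of rows requested.
    if 0 <= start <= end <= nrows:
        return list(range(start, end))
    if strict:
        raise ValueError(
            "start/end outside the supported range: %r, %r" % (start, end))
    # Fallback: materialize every row number, then apply Python slicing.
    return list(range(nrows))[start:end]


print(row_numbers(10, 2, 5))    # [2, 3, 4]
print(row_numbers(10, -3, 10))  # [7, 8, 9]
```

With the fallback, only callers passing unusual start/end values pay the cost of building the full list.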

sbesson · 2024-09-25T11:15:54Z

Reading this more carefully, I am finding there are multiple issues:

the current return value of rowNumbers is incorrect for some values of start/end. I think this is a regression of the changes made in Fix HdfStorage.read() implementation to work with all column types #288, although it's unclear whether values outside the row ranges were supported in the first place.
calling read(colNumbers, 0, 0) returns all rows, which is at odds with the note in https://omero.readthedocs.io/en/stable/developers/Tables.html#omero.grid.Table.read and seems to have been introduced in eba3ef7
as noted above, the current implementation is possibly inefficient for large arrays and should at a minimum be replaced by numpy.arange() wherever possible
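As a rough illustration of the numpy.arange() idea, the bounds can be normalized with the built-in `slice.indices()` and passed to `numpy.arange()`, so that only the requested rows are allocated. The function name `row_numbers_array` and the int64 dtype are assumptions made for this sketch, not the actual implementation.

```python
import numpy as np


def row_numbers_array(nrows, start, end):
    """Build the row numbers for [start:end] without a full-size list.

    Hypothetical helper: slice.indices() clamps and resolves negative
    values exactly as Python slicing would for a sequence of nrows items.
    """
    lo, hi, _ = slice(start, end).indices(nrows)
    # numpy.arange returns an empty array when hi <= lo.
    return np.arange(lo, hi, dtype=np.int64)


print(row_numbers_array(10_000_000, -3, 10_000_000).tolist())
# [9999997, 9999998, 9999999]
```

This keeps the Python slice semantics for negative or oversized values while the cost stays proportional to the number of rows returned.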

As indicated in ome/openmicroscopy#6412 (comment), the typical scenario of calling table.read(colNumbers, start, end) with 0 <= start < end < nrows works as expected and the new tests added should cover this.

I'll close this and open a separate issue to discuss the API expectations depending on the values of start/end, and possible solutions.

sbesson requested review from chris-allan and jburel September 19, 2024 14:02

sbesson mentioned this pull request Sep 19, 2024

Extend test coverage for tables slice, read and readCoordinates API ome/openmicroscopy#6412

Merged

chris-allan reviewed Sep 20, 2024

sbesson closed this Sep 25, 2024

sbesson mentioned this pull request Sep 26, 2024

OMERO.tables read API: start/stop edge case support #432

Open
