Nested storing of chunks #19
-
Few evening additions:
-
I thought of another possibility that would allow the .zarray file to remain unchanged while still making it quick to find out whether the chunks were written nested or flat. For this, one thing has to be kept in mind from the beginning.

0-position pixel strategy

For this, a 0-position pixel with the fill value would always have to be written when creating a zarr array. Python initialisation example for a 4-dimensional array:
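A minimal sketch of such an initialisation, assuming a plain dict as the store, uint8 data without compression, and simplified metadata. The helper names `chunk_key` and `create_array` are hypothetical and do not come from zarr-python or jZarr:

```python
import json


def chunk_key(indices, nested):
    # Hypothetical helper: "0/0/0/0" for nested keys, "0.0.0.0" for flat keys.
    separator = "/" if nested else "."
    return separator.join(str(i) for i in indices)


def create_array(store, shape, chunks, fill_value=0, nested=False):
    # Simplified stand-in for the .zarray metadata; a real file has more fields.
    meta = {"shape": list(shape), "chunks": list(chunks), "fill_value": fill_value}
    store[".zarray"] = json.dumps(meta).encode()

    # Always write the chunk that contains the 0-position pixel,
    # filled with the fill value (assumed here to be a uint8 in 0..255).
    values_per_chunk = 1
    for c in chunks:
        values_per_chunk *= c
    origin = (0,) * len(shape)
    store[chunk_key(origin, nested)] = bytes([fill_value]) * values_per_chunk
    return store


store = create_array({}, shape=(10, 10, 10, 10), chunks=(5, 5, 5, 5), nested=True)
```

After this call the store holds the metadata plus exactly one chunk, whose key already reveals the layout that was chosen at creation time.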
When opening an array to read from it, an initialisation step could be to read the 0-position pixel. This step would only need to iterate once over the possible key types and then initialise the array with the correct chunk name creator. The zarr array "create" initialisation would be easy to implement in any zarr implementation language, as everything needed for it is already available. Only the zarr array "open for reading" initialisation would have to iterate through the possible chunk name types once, when reading the 0-position pixel, to set the correct chunk key generator in the array. I think this procedure should work for any storage implementation, because a chunk name is ultimately only a key string.
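The "open for reading" step could look roughly like this. It is only a sketch: the store is assumed to behave like a dict of key strings, and `detect_layout` is a hypothetical name, not an existing zarr function:

```python
def detect_layout(store, ndim):
    # Build the key of the chunk holding the 0-position pixel for each
    # candidate key type and return the first one present in the store.
    origin = ["0"] * ndim
    for name, separator in (("nested", "/"), ("flat", ".")):
        if separator.join(origin) in store:
            return name, separator
    # No 0-position chunk found: the array was not created with this strategy.
    return None


layout = detect_layout({".zarray": b"{}", "0.0.0.0": b"\x00"}, ndim=4)
```

For a 1-dimensional array both candidates produce the same key "0", so the result is reported as nested, but the distinction does not matter there.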
-
If chunks are stored flat (i.e. chunk 0.0.0 next to chunk 1.0.0 etc.), this can result in a large number of chunk files in a single directory. This can slow down a file system (e.g. local file system) considerably.
To prevent this, there should be a way to store the chunk files in a nested way.
Writing can be easily implemented by telling the array when it is created whether chunks are to be stored flat or nested.
When reading, it becomes more difficult.
If the .zarray file remains unchanged and contains no information about whether chunks were written flat or nested, the directory must be searched for chunk files in order to decide whether to read flat or nested.
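To illustrate what such a directory search involves, here is a rough sketch for a local file system. The function name `scan_layout` and the matching rules are assumptions made for illustration only:

```python
import os
import re


def scan_layout(array_dir, ndim):
    # Flat chunk files look like "1.4.3": ndim integers joined by dots.
    flat_pattern = re.compile(r"\d+(\.\d+){%d}" % (ndim - 1))
    with os.scandir(array_dir) as entries:
        for entry in entries:
            if entry.is_file() and flat_pattern.fullmatch(entry.name):
                return "flat"
            # Nested chunks live below directories named after the first index.
            if entry.is_dir() and entry.name.isdigit():
                return "nested"
    # Only metadata (or nothing) found: the layout cannot be decided.
    return None
```

The scan can stop at the first matching entry, but in the worst case it still has to list the array directory, which is exactly the operation that becomes slow with many flat chunk files.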
Even if the .zarray file written by jZarr records whether chunk files are written flat or nested, the problem persists as soon as one tries to read an array that was written not by jZarr but by Python zarr, which left no such information in the .zarray file although the chunks were written nested.
The problem could be solved by having the chunk name generator of the FileSystemStore always return an array of two chunk file names as long as the status is unresolved.
Array size = 2.
The first name is the nested chunk name ... e.g. "1/4/3".
The second name is the flat chunk name ... e.g. "1.4.3".
The FileSystemStore then always tries to open a FileInputStream with the nested name first.
If this succeeds, the status is reported to the array, the .zarray file is adjusted (if write access exists) and the chunk name generator is replaced by a nested name generator.
The same happens the other way round if a chunk file in flat name style is found first.
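The fallback described above can be sketched as follows. This is a simplified Python illustration, not jZarr code: the store is a dict instead of a file system, `ChunkKeyResolver` is a hypothetical name, and the adjustment of the .zarray file is left out:

```python
class ChunkKeyResolver:
    def __init__(self, store):
        self.store = store
        self.separator = None  # None means the status is still unresolved

    def candidates(self, indices):
        parts = [str(i) for i in indices]
        if self.separator is not None:
            # Resolved: only one name is generated from now on.
            return [self.separator.join(parts)]
        # Unresolved: nested name first, flat name second.
        return ["/".join(parts), ".".join(parts)]

    def read_chunk(self, indices):
        parts = [str(i) for i in indices]
        separators = [self.separator] if self.separator else ["/", "."]
        for separator in separators:
            key = separator.join(parts)
            if key in self.store:
                # First hit decides the layout for all later accesses.
                self.separator = separator
                return self.store[key]
        return None  # chunk missing; the status stays as it was


resolver = ChunkKeyResolver({"1.4.3": b"data"})
chunk = resolver.read_chunk((1, 4, 3))
```

A read of a chunk that does not exist leaves the status unresolved, so the double lookup continues until the first chunk is actually found.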
There is still another edge case.
What if we open a zarr array to add data to it that was initialized by someone else, but in which no chunks have been written yet?
If there is no information in the .zarray file whether chunks should be written flat or nested, it would be possible to calculate how many chunk files this would be and decide with a threshold value whether to write flat or nested. Such a threshold value could be specified in the form of a VM property.
Alternatively a default behavior for such cases can be defined.
E.g. the default is flat, as in the zarr v2 specification.
But this default behaviour could also be overridden by a VM parameter at Java application start.
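The threshold idea could be sketched like this. In Python an environment variable stands in for the VM property; the variable name `ZARR_NESTED_THRESHOLD`, the default of 10000 and both function names are assumptions chosen for illustration:

```python
import math
import os


def count_chunks(shape, chunks):
    # Number of chunk files a completely filled array would consist of.
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


def choose_layout(shape, chunks, default_threshold=10_000):
    # Hypothetical override, analogous to a VM property in Java.
    threshold = int(os.environ.get("ZARR_NESTED_THRESHOLD", default_threshold))
    return "nested" if count_chunks(shape, chunks) > threshold else "flat"


n_chunks = count_chunks((10, 10, 10, 10), (5, 5, 5, 5))  # 2 * 2 * 2 * 2 = 16
```

With the assumed default threshold, small arrays keep the flat layout, and only arrays that would produce very many chunk files switch to nested.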
What do you think about it?
Good or not good?
Have I missed any cases?
Other suggestions?