Update Zarr and add LINDI alternative backend discussion
oruebel committed Aug 22, 2024
1 parent 1a31a14 commit 6bc0c07
Showing 2 changed files with 17 additions and 5 deletions.
1 change: 1 addition & 0 deletions docs/source/conf_extlinks.py
@@ -55,6 +55,7 @@
'ibl-website': ('https://www.internationalbrainlab.com/%s', '%s'),
'mindscope-program': ('https://alleninstitute.org/what-we-do/brain-science/research/mindscope-program/%s', '%s'),
'jupyter-book': ('https://jupyterbook.org/en/stable/%s', '%s'),
'lindi-src': ('https://github.com/NeurodataWithoutBorders/lindi/%s', '%s'),
}

# Use this for mapping for links to commonly used documentation
21 changes: 16 additions & 5 deletions docs/source/faq_details/why_hdf5.rst
@@ -24,21 +24,32 @@ As recordings enter the TB scale, it is essential that we use a backend storage
These features have proven to be very important for archiving large datasets. For instance, in raw data from Neuropixels recordings, they have been found to reduce the file size by up to 60%. As datasets grow in volume and in number, it will become increasingly important to utilize good data engineering principles to manage them at scale.

Alternative backends
--------------------
Below, we briefly explain the pros and cons of alternative backends. Depending on the particular application and storage needs, different backends are often preferable. In particular as part of :hdmf-docs:`HDMF <>`, teams are exploring the use of alternate storage solutions with NWB. For the broader NWB community, we have found that HDF5 provides a good standard solution for most common use cases.
---------------------
Below, we briefly explain the pros and cons of alternative storage formats. Depending on the particular application and storage needs, different backends are often preferable. In particular, as part of :hdmf-docs:`HDMF <>`, teams are exploring the use of alternate storage solutions with NWB. For the broader NWB community, we have found that HDF5 provides a good standard solution for most common use cases.

Zarr
^^^^

:zarr:`Zarr <>` supports compression and chunking like HDF5. Zarr is the standard we have found that comes closest to HDF5’s level of support for complex hierarchical data structures. However, Zarr does not support Links natively and support for links is not on the Zarr development roadmap. Links are an important feature for NWB to facilitate linking of data and metadata across complex collections of neurophysiology data products. Furthermore, Zarr only supports Python and the neuroscience community requires APIs in MATLAB and other languages. Also, HDF5 is a much more mature standard with a track record for long-term accessibility.
:zarr:`Zarr <>` supports compression and chunking like HDF5. Zarr is the standard we have found that comes closest to HDF5’s level of support for complex hierarchical data structures. The :bdg-link-primary:`HDMF Zarr <https://hdmf-zarr.readthedocs.io>` library implements a Zarr backend for HDMF and provides convenience classes for integrating Zarr with the :bdg-link-primary:`PyNWB <https://pynwb.readthedocs.io/en/stable/>` Python API for NWB to support writing NWB files to Zarr. With Zarr, an NWB file is not stored as a single file but as a collection of many files organized into folders. This storage scheme has some key advantages when using object-based storage solutions, e.g., cloud-based storage in AWS. The main limitations of Zarr for NWB are: 1) Zarr only supports Python, whereas the neuroscience community requires APIs in MATLAB and other languages; 2) HDF5 is a much more mature standard with a track record for long-term accessibility, although the Zarr community is growing; 3) transferring Zarr files requires moving many small files; and 4) Zarr does not support links and references, so :hdmf-z-docs:`HDMF Zarr <>` must implement custom solutions to support this important feature for NWB. Whether HDF5 or Zarr is the right solution for you depends heavily on your use case.
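
To make the "many files organized into folders" trade-off concrete, here is a minimal sketch of a chunked, compressed directory store written with the Python standard library only. It is a toy illustration, not Zarr's actual on-disk format or API: the helper names ``write_chunked`` and ``read_slice``, the file names ``meta.json`` and ``chunk_N.bin``, and the metadata keys are all assumptions made for this example.

```python
import json
import struct
import tempfile
import zlib
from pathlib import Path


def write_chunked(root: Path, values: list[int], chunk_len: int) -> None:
    """Store int32 values as compressed chunk files plus one JSON metadata file."""
    root.mkdir(parents=True, exist_ok=True)
    # Hypothetical metadata layout; "<i4" means little-endian 4-byte integer.
    meta = {"dtype": "<i4", "length": len(values), "chunk_len": chunk_len}
    (root / "meta.json").write_text(json.dumps(meta))
    for i in range(0, len(values), chunk_len):
        chunk = values[i:i + chunk_len]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)
        (root / f"chunk_{i // chunk_len}.bin").write_bytes(zlib.compress(raw))


def read_slice(root: Path, start: int, stop: int) -> list[int]:
    """Return values[start:stop], opening only the chunk files that overlap it."""
    meta = json.loads((root / "meta.json").read_text())
    chunk_len = meta["chunk_len"]
    stop = min(stop, meta["length"])
    out = []
    for c in range(start // chunk_len, (stop - 1) // chunk_len + 1):
        raw = zlib.decompress((root / f"chunk_{c}.bin").read_bytes())
        chunk = struct.unpack(f"<{len(raw) // 4}i", raw)
        lo = max(start - c * chunk_len, 0)
        hi = min(stop - c * chunk_len, len(chunk))
        out.extend(chunk[lo:hi])
    return out


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "toy_store"
    data = list(range(1000))
    write_chunked(store, data, chunk_len=100)
    n_files = len(list(store.iterdir()))  # 10 chunk files + 1 metadata file
    window = read_slice(store, 250, 255)  # only chunk_2.bin is opened
```

Each chunk can be fetched independently, which suits object stores, but even this small array turns into eleven files, which is the transfer overhead noted in limitation 3.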

LINDI
^^^^^

The :lindi-src:`Linked Data Interface (LINDI) <>` provides a JSON representation of NWB data where the large data chunks are stored separately from the main metadata. This enables efficient storage, composition, and sharing of NWB files on cloud systems such as DANDI without duplicating the large data blobs. LINDI can be used to index existing NWB HDF5 files to help speed up remote access to HDF5 files stored in the cloud. LINDI provides a drop-in `LindiH5pyFile` feature so that LINDI files can be read via `PyNWB` using the standard `NWBHDF5IO` backend. LINDI is currently under development and should not yet be used in practice.


Other alternative storage formats
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following alternative formats are not currently supported by NWB.

Binary files (.dat)
^^^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~~

Binary files do not allow for complex hierarchical data including Groups, Attributes, and Links. They also do not allow for chunking and compression, which makes them poorly suited for efficient handling of large data files. Furthermore, the metadata needed to interpret binary files, including shape, data type, and endianness, can be missing. Zarr is an approach that uses binary files and deals with these limitations, using folders and JSON files to create a hierarchical structure that can manage data chunks and specify the essential parameters of binary files. See the Zarr section above.
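
The following sketch illustrates why missing metadata is a problem for raw binary files: the same six bytes decode to different values depending on the assumed data type and endianness. It uses only the Python standard library, and the JSON sidecar layout (keys ``dtype``, ``byteorder``, ``shape``) is a hypothetical convention chosen for this example, not a format defined by NWB or Zarr.

```python
import json
import struct

samples = [1, 256, -2]
raw = struct.pack("<3h", *samples)  # 6 bytes: three little-endian int16 values

# Without metadata, a reader has to guess how to interpret the bytes.
wrong_endian = struct.unpack(">3h", raw)  # same dtype, wrong byte order
wrong_dtype = struct.unpack("<6B", raw)   # right byte order, wrong dtype

# A small JSON sidecar (hypothetical layout) removes the ambiguity.
sidecar = json.dumps({"dtype": "int16", "byteorder": "little", "shape": [3]})
meta = json.loads(sidecar)
fmt = ("<" if meta["byteorder"] == "little" else ">") + f"{meta['shape'][0]}h"
recovered = list(struct.unpack(fmt, raw))

print(wrong_endian)  # (256, 1, -257)
print(wrong_dtype)   # (1, 0, 0, 1, 254, 255)
print(recovered)     # [1, 256, -2]
```

Neither of the wrong guesses raises an error, so a misread binary file can silently produce plausible-looking but incorrect data.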

Relational database (e.g. SQL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :hdmf-specification-language:`HDMF specification language <>` is inherently hierarchical, not tabular, and we
need a storage layer that can express the hierarchical nature of the data as well. There are some approaches for