Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Add memory-mapping example to storage guide #2737

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bendichter
Copy link

@bendichter bendichter commented Jan 20, 2025

Fixes #1245

Add documentation and tests for memory-mapped store, focusing on efficient access to small slices from large uncompressed chunks.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented in docs/release-notes.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Fixes zarr-developers#1245

Add documentation and tests for memory-mapped store, focusing on efficient access to small slices from large uncompressed chunks.
Comment on lines +18 to +20
def _fromfile(self, fn: str) -> memoryview:
with open(fn, "rb") as fh:
return memoryview(mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think _fromfile will ever be invoked, since it's not part of the LocalStore API

@d-v-b
Copy link
Contributor

d-v-b commented Jan 20, 2025

thanks for this addition, however I do think it's worth thinking about whether a special store subclass is the right approach here. Presumably any store the supports range reads also supports reading slices from uncompressed chunks, so just subclassing LocalStore is leaving some performance on the table.

@bendichter
Copy link
Author

That's true. Would it be better to implement this as a codec?

@d-v-b
Copy link
Contributor

d-v-b commented Jan 20, 2025

I think this logic should probably live in the AsyncArray class, but I feel like it's not simple.

  • For reads, the array needs to statically check that individual slices of the array can be resolved to byte ranges in the file. Let's say for now that no array-array codecs or bytes-bytes codecs are allowed in this case, but sharding is allowed here. In the sharding case, the array would need to do a prefetch to figure out which byte range contained which chunks.
  • For writes, we would need to add something to the store classes to indicate whether they support setting byte ranges (local storage does, cloud storage AFAIK do not)
  • There's some point where making a large amount of small, random reads is less efficient than making a single large read, caching the result, and indexing in memory. I don't know where that point is.

@bendichter
Copy link
Author

For writes, we would need to add something to the store classes to indicate whether they support setting byte ranges (local storage does, cloud storage AFAIK do not)

True, but let's focus on reading for now.

There's some point where making a large amount of small, random reads is less efficient than making a single large read, caching the result, and indexing in memory. I don't know where that point is.

This will be very dependent on the use case. We can add a flag for this and have it set to false by default for backwards compatibility. Then the user can switch the flag as needed.

@bendichter
Copy link
Author

So it sounds like you want a more integrated solution than an ad hoc override? How would you feel about adding this here until we have a more robust and integrated solution?

@d-v-b
Copy link
Contributor

d-v-b commented Jan 21, 2025

I don't think the proposed changes here actually implement anything, since the _fromfile method you are overriding does not exist on the current LocalStore class. The store API has been completely refactored since #1245, and so the implementation there no longer works. For example, the basic store methods now use async / await, etc. You can check this by inserting a breakpoint or exception in the _fromfile method -- I bet it never gets hit. Thus, the tests are passing simply because none of the core methods of LocalStore (namely, set and get) are altered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add example to docs of using memory-mapping
2 participants