caching: centralize caching #688

Draft · JoepVanlier wants to merge 2 commits into main
Conversation

@JoepVanlier (Member) commented Aug 9, 2024

0. Why two caches?

There is a size disparity between channel data and processed data, which is why I think they should live in separate caches.

1. Allows us to monitor memory consumption of all the caches.

By centralizing the channel caching, we can monitor the total memory consumption of all the caches in one place.

2. Less memory use in some cases.

Loading the same slice twice no longer costs twice the memory, since data is stored by location (a minimal sketch of the idea is shown below).
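
As a rough illustration of the idea (the names below are made up for this sketch, not the actual implementation), a single cache keyed by the data's location means repeated reads resolve to the same stored array, and the total footprint becomes trivial to report:

_global_cache = {}  # location -> read-only numpy array

def read_cached(file_path, dset_path, read_fn):
    # Return the array stored for this location, reading it from disk only once.
    key = (file_path, dset_path)
    if key not in _global_cache:
        data = read_fn()
        data.flags.writeable = False  # enforce read-only slices (see point 3)
        _global_cache[key] = data
    return _global_cache[key]

def cache_size_bytes():
    # One place to monitor the memory consumption of everything cached (see point 1).
    return sum(array.nbytes for array in _global_cache.values())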

3. It fixes subtle sources of bugs

We can also enforce that the slices are read-only. This prevents the following source of "bugs":

ch1 = file["Force HF"]["Force 1x"]
ch2 = file["Force HF"]["Force 1x"]

ch1.data[0:100] = 0
print(ch2.data)  # Shows zeroes for the ones we overwrote in channel 1.

Instead, the user is prompted to explicitly make a copy if they want to change the data. We already do this for certain arrays coming from confocal objects.

ch1 = file["Force HF"]["Force 1x"]
ch2 = file["Force HF"]["Force 1x"]
ch1.data[0:100] = 0  # ValueError: assignment destination is read-only

copy = ch1.data.copy()  # Explicit copy!
copy[0:100] = 0
print(ch2.data)  # Data in ch1 was not modified

Note that in-place operations on slices are still fine though (which is nice, because those are often used in de/recalibrating). For example:

ch1 = file.force1x
ch2 = file.force1x

assert id(ch1.data) == id(ch2.data)
ch1 /= 5

assert id(ch1.data) != id(ch2.data)
np.testing.assert_allclose(ch1.data * 5, ch2.data)

4. It makes certain access patterns much more performant

Currently doing something like:

file = lk.File("file.h5")
for t in tracked_lines:
    track_force = t.sample_from_channel(file.force1x, include_dead_time=True)

Is horrendously slow, clocking in at 42 ± 5 seconds for a small 30-second kymograph with 50 tracks. This drops to 1.6 ± 0.1 seconds with the new cache. The following would also have fixed the problem without the cache:

file = lk.File("file.h5")
ch = file.force1x
for t in tracked_lines:
    track_force = t.sample_from_channel(ch, include_dead_time=True)

This would require additional awareness on the part of the user, however, and it doesn't let us monitor how much data is going into these caches going forward.

Open questions

1. Small versus big slices

While doing this, I noticed that for Continuous slices, we always read the whole slice even if we take only a small segment (despite h5py supporting lazy slicing). Note that this was also the case before the caching was introduced (bug?).

At first this seemed wasteful, so I set up the cache to slice first and then grab the data (by including the start and stop index in the location). However, when you take many small slices, the performance is frequently far worse than reading the whole slice in one go; as a result, the track case discussed above still took 23 seconds.

One possible improvement to the cache system would be to include a start and stop index in the location again for continuous channels, and to keep track of how often slices of a continuous channel with a particular path are accessed. If that count exceeds some threshold, we assume the user will be taking more chunks, dump the small chunks, and load the whole slice instead. In the cache grabber, we could then always check whether the parent location (the one without additional indexing) is already present, and slice from that if it is (see the sketch below).
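
A rough sketch of that promotion heuristic (purely illustrative; none of these names exist in the code, and the threshold is arbitrary):

_access_counts = {}   # path -> number of chunked reads seen so far
_PROMOTE_AFTER = 3    # arbitrary threshold for switching to a full read

def read_continuous(cache, path, start, stop, read_range, read_all):
    # Serve a chunk of a continuous channel, promoting to a full read once it gets "hot".
    whole = (path, None, None)
    if whole in cache:  # parent location already cached -> just slice from it
        return cache[whole][start:stop]

    _access_counts[path] = _access_counts.get(path, 0) + 1
    if _access_counts[path] > _PROMOTE_AFTER:
        # Assume more chunks will follow: dump the small chunks, load the whole slice.
        for key in [k for k in cache if k[0] == path and k != whole]:
            del cache[key]
        cache[whole] = read_all()
        return cache[whole][start:stop]

    chunk = (path, start, stop)
    if chunk not in cache:
        cache[chunk] = read_range(start, stop)
    return cache[chunk]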

From my testing so far, this additional complexity does not seem worth it though. Most h5 files contain one or two items, and generally you look at the whole thing before you start interacting with sub-slices. One iffy case would be a single slice that exceeds the available memory, but right now you also have no way of loading those without going through the raw h5 structure.

2. What is a good default cache size?

I think bounding it by a size in bytes makes more sense than by a number of items, especially considering how much the items can vary in size.

I currently hardcoded the maximum at 1 GB, but I suspect this should somehow depend on the user's system. Are there any guidelines for application memory usage?
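
For what it's worth, bounding by bytes is not much code; a minimal sketch assuming plain LRU eviction (the actual implementation may pick a different policy) looks like this:

from collections import OrderedDict

class ByteLimitedCache:
    # LRU cache bounded by total stored bytes rather than by item count.

    def __init__(self, max_bytes=1_000_000_000):  # 1 GB, the value currently hardcoded
        self._store = OrderedDict()               # key -> (value, nbytes)
        self._max_bytes = max_bytes
        self._total_bytes = 0

    def get(self, key):
        value, _ = self._store[key]
        self._store.move_to_end(key)              # mark as most recently used
        return value

    def put(self, key, value, nbytes):
        if key in self._store:
            self._total_bytes -= self._store[key][1]
        self._store[key] = (value, nbytes)
        self._store.move_to_end(key)
        self._total_bytes += nbytes
        while self._total_bytes > self._max_bytes and len(self._store) > 1:
            _, (_, evicted_bytes) = self._store.popitem(last=False)  # evict the oldest entry
            self._total_bytes -= evicted_bytes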

3. Open file handles

It might be worth considering going one step further: store only locations in the lk.File object and handle opening/closing the resource in the function that looks up and caches these resources. Then we also wouldn't need to keep the File open indefinitely. Handling file interactions in one place might also make it a lot easier to swap out file formats in the future.
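
A hedged sketch of what that could look like (the helper below is hypothetical; it only relies on the h5py calls already used elsewhere in this PR):

import h5py

def fetch(cache, file_path, dset_path, field=None):
    # Open the file only for the duration of the read; the cache stores arrays, not handles.
    key = (file_path, dset_path, field)
    if key not in cache:
        with h5py.File(file_path, "r") as h5:
            dset = h5[dset_path]
            data = dset.fields(field)[:] if field else dset[:]
        data.flags.writeable = False
        cache[key] = data
    return cache[key]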

4. sys.getsizeof

The property/method cache is set up such that it should only ever be used for immutable types whose size is reported correctly by sys.getsizeof. This holds for a numpy array (if we set writeable to False) and for basic types like integers and floats, but not for tuples. This requires some diligence on our part to make sure we enforce it.
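
A quick illustration of the difference (exact byte counts vary by platform and are only indicative):

import sys
import numpy as np

arr = np.zeros(1_000_000)          # owns its data buffer
arr.flags.writeable = False        # read-only, as the cache requires
print(sys.getsizeof(arr))          # ~8 MB: the buffer is counted for an owning array
print(sys.getsizeof((arr, arr)))   # ~56 bytes: a tuple only reports its own slots
print(sys.getsizeof(arr[:100]))    # a view does not report its base buffer either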

Some final notes

  1. When reading data from TimeTag channels, we currently read the data 3 times even when we don't want to access it at all, and 4 times by the time we actually read it. Note that we can fix this without having to resort to a global cache, however.

  2. Considering that disallowing data modification is a breaking change, I would propose we iterate on this a bit now, and roll this out when 2.0 rolls around.

Thoughts?

@tiagobonetti (Member)
How about starting by adding the caching as an opt-in global config? This would allow us to test the caching in Lakeview.

@JoepVanlier (Member Author)

> How about starting by adding the caching as an opt-in global config? This would allow us to test the caching in Lakeview.

Good idea. I'll do some exploratory work on that tomorrow.

Comment on lines 804 to 813
data = (
    LazyCache.from_h5py_dset(dset, field="Value")
    if caching.global_cache
    else dset.fields("Value")
)
timestamps = (
    LazyCache.from_h5py_dset(dset, field="Timestamp")
    if caching.global_cache
    else dset.fields("Timestamp")
)
Member

I think we should move the cache checking into functions in the caching package.
Those would be concerned with which caching to use, while this code stays concerned only with loading the channel data.

I added a basic example below, but I think we can add other ones that cover the other usages, like loading numpy arrays and such.

# in caching
def from_h5py(dset, field=None):
    global global_cache
    return (
        LazyCache.from_h5py_dset(dset, field=field)
        if global_cache
        else dset.fields(field)
    )

# here
    @staticmethod
    def from_dataset(dset, y_label="y", calibration=None) -> Slice:
        data = caching.from_h5py(dset, field="Value")
        timestamps = caching.from_h5py(dset, field="Timestamp")

@JoepVanlier (Member Author)

Yeah, that's a good point. That keeps it all in one place. Easier to change.

I've added the method you suggested and put the retrieval in caching as well. I decided to go with a mixin here, since that way we can keep the storage with the code that does the storing.
