caching: centralize caching #688

Draft · JoepVanlier wants to merge 2 commits into main
Conversation

@JoepVanlier (Member) commented Aug 9, 2024

0. Why two caches?

There is a size disparity between channel data and processed data, which is why I think they should live in separate caches.

1. Allows us to monitor memory consumption of all the caches.

By centralizing the channel caching, we can monitor the total memory consumption of all the caches in one place.

2. Less memory use in some cases.

Loading the same slice twice no longer costs twice the memory, since data is stored by location (a minimal sketch of the idea is shown below).
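
As a rough illustration of the idea (the names below are made up for this sketch, not the actual implementation), a single cache keyed by the data's location means repeated reads resolve to the same stored array, and the total footprint becomes trivial to report:

_global_cache = {}  # location -> read-only numpy array

def read_cached(file_path, dset_path, read_fn):
    # Return the array stored for this location, reading it from disk only once.
    key = (file_path, dset_path)
    if key not in _global_cache:
        data = read_fn()
        data.flags.writeable = False  # enforce read-only slices (see point 3)
        _global_cache[key] = data
    return _global_cache[key]

def cache_size_bytes():
    # One place to monitor the memory consumption of everything cached (see point 1).
    return sum(array.nbytes for array in _global_cache.values())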

3. It fixes subtle sources of bugs

We can also enforce that the slices are read-only. This prevents the following source of "bugs":

ch1 = file["Force HF"]["Force 1x"]
ch2 = file["Force HF"]["Force 1x"]

ch1.data[0:100] = 0
print(ch2.data)  # Shows zeroes for the ones we overwrote in channel 1.

Instead, the user is prompted to explicitly make a copy if they want to change the data. We already do this for certain arrays coming from confocal objects.

ch1 = file["Force HF"]["Force 1x"]
ch2 = file["Force HF"]["Force 1x"]
ch1.data[0:100] = 0  # ValueError: assignment destination is read-only

copy = ch1.data.copy()  # Explicit copy!
copy[0:100] = 0
print(ch2.data)  # Data in ch1 was not modified

Note that in-place operations on slices are still fine though (which is nice, because those are often used in de/recalibrating). For example:

ch1 = file.force1x
ch2 = file.force1x

assert id(ch1.data) == id(ch2.data)
ch1 /= 5

assert id(ch1.data) != id(ch2.data)
np.testing.assert_allclose(ch1.data * 5, ch2.data)

4. It makes certain access patterns much more performant

Currently doing something like:

file = lk.File("file.h5")
for t in tracked_lines:
    track_force = t.sample_from_channel(file.force1x, include_dead_time=True)

Is horrendously slow, clocking in at 42 ± 5 seconds for a small 30-second kymograph with 50 tracks. This drops to 1.6 ± 0.1 seconds with the new cache. The following would also have fixed the problem without the cache:

file = lk.File("file.h5")
ch = file.force1x
for t in tracked_lines:
    track_force = t.sample_from_channel(ch, include_dead_time=True)

This would require additional awareness on the part of the user, however, and it doesn't let us monitor how much data is going into these caches going forward.

Open questions

1. Small versus big slices

While doing this, I noticed that for Continuous slices, we always read the whole slice even if we take only a small segment (despite h5py supporting lazy slicing). Note that this was also the case before the caching was introduced (bug?).

At first this seemed wasteful, so I set up the cache to slice first and then grab the data (by including the start and stop index in the location). However, when you take many small slices, the performance is frequently far worse than reading the whole slice in one go; as a result, the track case discussed above still took 23 seconds.

One possible improvement to the cache system would be to include a start and stop index in the location again for continuous channels, and to keep track of how often slices of a continuous channel with a particular path are accessed. If that count exceeds some threshold, we assume the user will be taking more chunks, dump the small chunks, and load the whole slice instead. In the cache grabber, we could then always check whether the parent location (the one without additional indexing) is already present, and slice from that if it is (see the sketch below).
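
A rough sketch of that promotion heuristic (purely illustrative; none of these names exist in the code, and the threshold is arbitrary):

_access_counts = {}   # path -> number of chunked reads seen so far
_PROMOTE_AFTER = 3    # arbitrary threshold for switching to a full read

def read_continuous(cache, path, start, stop, read_range, read_all):
    # Serve a chunk of a continuous channel, promoting to a full read once it gets "hot".
    whole = (path, None, None)
    if whole in cache:  # parent location already cached -> just slice from it
        return cache[whole][start:stop]

    _access_counts[path] = _access_counts.get(path, 0) + 1
    if _access_counts[path] > _PROMOTE_AFTER:
        # Assume more chunks will follow: dump the small chunks, load the whole slice.
        for key in [k for k in cache if k[0] == path and k != whole]:
            del cache[key]
        cache[whole] = read_all()
        return cache[whole][start:stop]

    chunk = (path, start, stop)
    if chunk not in cache:
        cache[chunk] = read_range(start, stop)
    return cache[chunk]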

From my testing so far, this additional complexity does not seem worth it though. Most h5 files contain one or two items, and generally you look at the whole thing before you start interacting with sub-slices. One iffy case would be a single slice that exceeds the available memory, but right now you also have no way of loading those without going through the raw h5 structure.

2. What is a good default cache size?

I think bounding it by a size in bytes makes more sense than by a number of items, especially considering how much the items can vary in size.

I currently hardcoded the maximum at 1 GB, but I suspect this should somehow depend on the user's system. Are there any guidelines for application memory usage?
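
For what it's worth, bounding by bytes is not much code; a minimal sketch assuming plain LRU eviction (the actual implementation may pick a different policy) looks like this:

from collections import OrderedDict

class ByteLimitedCache:
    # LRU cache bounded by total stored bytes rather than by item count.

    def __init__(self, max_bytes=1_000_000_000):  # 1 GB, the value currently hardcoded
        self._store = OrderedDict()               # key -> (value, nbytes)
        self._max_bytes = max_bytes
        self._total_bytes = 0

    def get(self, key):
        value, _ = self._store[key]
        self._store.move_to_end(key)              # mark as most recently used
        return value

    def put(self, key, value, nbytes):
        if key in self._store:
            self._total_bytes -= self._store[key][1]
        self._store[key] = (value, nbytes)
        self._store.move_to_end(key)
        self._total_bytes += nbytes
        while self._total_bytes > self._max_bytes and len(self._store) > 1:
            _, (_, evicted_bytes) = self._store.popitem(last=False)  # evict the oldest entry
            self._total_bytes -= evicted_bytes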

3. Open file handles

It might be worth considering going one step further: store only locations in the lk.File object and handle opening/closing the resource in the function that looks up and caches these resources. Then we also wouldn't need to keep the File open indefinitely. Handling file interactions in one place might also make it a lot easier to swap out file formats in the future.
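
A hedged sketch of what that could look like (the helper below is hypothetical; it only relies on the h5py calls already used elsewhere in this PR):

import h5py

def fetch(cache, file_path, dset_path, field=None):
    # Open the file only for the duration of the read; the cache stores arrays, not handles.
    key = (file_path, dset_path, field)
    if key not in cache:
        with h5py.File(file_path, "r") as h5:
            dset = h5[dset_path]
            data = dset.fields(field)[:] if field else dset[:]
        data.flags.writeable = False
        cache[key] = data
    return cache[key]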

4. sys.getsizeof

The property/method cache is set up such that it should only ever be used for immutable types whose size is reported correctly by sys.getsizeof. This holds for a numpy array (if we set writeable to False) and for basic types like integers and floats, but not for tuples. This requires some diligence on our part to make sure we enforce it.
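
A quick illustration of the difference (exact byte counts vary by platform and are only indicative):

import sys
import numpy as np

arr = np.zeros(1_000_000)          # owns its data buffer
arr.flags.writeable = False        # read-only, as the cache requires
print(sys.getsizeof(arr))          # ~8 MB: the buffer is counted for an owning array
print(sys.getsizeof((arr, arr)))   # ~56 bytes: a tuple only reports its own slots
print(sys.getsizeof(arr[:100]))    # a view does not report its base buffer either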

Some final notes

  1. When reading data from TimeTag channels, we currently read the data 3 times even when we don't want to access it at all, and 4 times by the time we actually read it. Note that we can fix this without having to resort to a global cache, however.

  2. Considering that disallowing data modification is a breaking change, I would propose we iterate on this a bit now, and roll this out when 2.0 rolls around.

Thoughts?

@tiagobonetti (Member)
How about starting by adding the caching as an opt-in global config? This would allow us to test the caching in Lakeview.

@JoepVanlier (Member Author)

> How about starting by adding the caching as an opt-in global config? This would allow us to test the caching in Lakeview.

Good idea. I'll do some exploratory work on that tomorrow.

Comment on lines 804 to 813
data = (
    LazyCache.from_h5py_dset(dset, field="Value")
    if caching.global_cache
    else dset.fields("Value")
)
timestamps = (
    LazyCache.from_h5py_dset(dset, field="Timestamp")
    if caching.global_cache
    else dset.fields("Timestamp")
)
Member

I think we should move the cache checking into functions in the caching package.
Those would be concerned with which caching to use, while this code stays concerned only with loading the channel data.

I added a basic example below, but I think we can add other ones that cover the other usages, like loading numpy arrays and such.

# in caching
def from_h5py(dset, field=None):
    global global_cache
    return (
        LazyCache.from_h5py_dset(dset, field=field)
        if global_cache
        else dset.fields(field)
    )

# here
    @staticmethod
    def from_dataset(dset, y_label="y", calibration=None) -> Slice:
        data = caching.from_h5py(dset, field="Value")
        timestamps = caching.from_h5py(dset, field="Timestamp")

@JoepVanlier (Member Author)

Yeah, that's a good point. That keeps it all in one place. Easier to change.

I've added the method you suggested and put the retrieval in caching as well. I decided to go with a mixin here, since that way we can keep the storage with the code that does the storing.
