caching: centralize caching #688
base: main
Conversation
How about starting by adding the caching as an opt-in global config? This would allow us to test the caching in Lakeview.

Good idea. I'll do some exploratory work on that tomorrow.
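Not the API from this PR, just a sketch of what an opt-in global switch could look like (the setter name is an assumption; a `caching.global_cache` flag does appear in the diff further down):

```python
# caching.py -- hypothetical sketch of an opt-in switch
global_cache = False  # off by default, so existing behaviour is unchanged


def set_global_cache(enabled):
    """Opt in to (or out of) the centralized cache at runtime."""
    global global_cache
    global_cache = enabled
```

Lakeview could then flip the switch at start-up to exercise the cached code paths without changing the default for everyone else.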
lumicks/pylake/channel.py
Outdated
```python
data = (
    LazyCache.from_h5py_dset(dset, field="Value")
    if caching.global_cache
    else dset.fields("Value")
)
timestamps = (
    LazyCache.from_h5py_dset(dset, field="Timestamp")
    if caching.global_cache
    else dset.fields("Timestamp")
)
```
I think we should move the cache checking into functions in the `caching` package. Those would be concerned with which caching to use, while this code is only about loading the channel data.
I added a basic example below, but I think we can add others that cover the other usages, such as loading numpy arrays.
```python
# in caching
def from_h5py(dset, field=None):
    global global_cache
    return (
        LazyCache.from_h5py_dset(dset, field=field)
        if global_cache
        else dset.fields(field)
    )
```

```python
# here
@staticmethod
def from_dataset(dset, y_label="y", calibration=None) -> Slice:
    data = caching.from_h5py(dset, field="Value")
    timestamps = caching.from_h5py(dset, field="Timestamp")
```
Yeah, that's a good point. That keeps it all in one place and makes it easier to change.
I've added the method you suggested and put the retrieval in `caching` as well. I decided to go with a mixin here, since then we can keep the storage with the code that stores it.
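A rough sketch of the mixin shape being described (all names here are assumptions, not the PR's actual code): the class that stores the data supplies the cache key, while the module-level dict stands in for the central cache in `caching`.

```python
_store = {}          # stands in for the central cache in the caching module
global_cache = True  # stands in for caching.global_cache


class LocationCachedMixin:
    """Mixin for objects that know where their data lives (file + dataset path)."""

    def _location(self, field):
        return (self._file_path, self._dset_path, field)

    def _read_cached(self, field, read):
        if not global_cache:
            return read()          # caching disabled: fall back to a direct read
        key = self._location(field)
        if key not in _store:
            _store[key] = read()   # first access: load and keep by location
        return _store[key]
```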
0. Why two caches?
There is a size disparity between channel data and processed data, which is why I think they should live in separate caches.
1. Allows us to monitor memory consumption of all the caches.
By centralizing the channel caching, we can monitor the memory consumption of all the caches.
2. Less memory use in some cases.
Loading the same slice twice doesn't cost you twice the memory anymore (since things are stored by location).
3. It fixes subtle sources of bugs
We can also enforce that the slices are read-only. This prevents the following source of "bugs":
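(A minimal, self-contained illustration of the kind of bug meant, using a plain read-only `numpy` array to stand in for a cached slice.)

```python
import numpy as np

cached = np.arange(5.0)
cached.flags.writeable = False   # what a read-only cached slice would hand out

try:
    cached[0] = 42               # previously this would silently change shared data
except ValueError as err:
    print(err)                   # "assignment destination is read-only"

copy = cached.copy()             # the explicit opt-in: work on your own copy
copy[0] = 42
```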
And instead prompts the user to explicitly make a copy if they want to change the data. We already do this for certain arrays coming from confocal objects.
Note that in-place operations on slices are still fine though (which is nice, because those are often used in de/recalibrating). For example:
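(A hedged guess at the kind of example meant, assuming slice arithmetic such as `*=` builds a new slice and rebinds the name instead of writing into the cached array; the stand-in class below just demonstrates that mechanism.)

```python
import numpy as np


class FakeSlice:
    """Stand-in for a channel slice whose data is read-only and cached."""

    def __init__(self, data):
        self.data = data

    def __mul__(self, factor):
        # arithmetic returns a brand new slice; the cached array is left alone
        return FakeSlice(self.data * factor)


cached = np.arange(5.0)
cached.flags.writeable = False

force = FakeSlice(cached)
force *= 1.05        # no __imul__, so this falls back to __mul__ and rebinds `force`
print(force.data)    # scaled copy
print(cached)        # original cached array, untouched
```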
4. It makes certain access patterns much more performant
Currently doing something like:
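(A sketch of the kind of pattern meant; `file`, `file.force1x`, `tracks`, and `track.seconds` are illustrative names, not necessarily the code from the original benchmark.)

```python
track_forces = []
for track in tracks:
    # one small slice per track; without a cache, every slice goes back to the
    # HDF5 file and re-reads the channel data
    start, stop = track.seconds[0], track.seconds[-1]
    track_forces.append(file.force1x[f"{start}s":f"{stop}s"])
```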
This is horrendously slow, clocking in at `42 +- 5` seconds for a small 30 second kymograph with 50 tracks. This reduces to `1.6 +- 0.1` seconds with the new cache. The following would have also worked to fix the problem without the cache (see the sketch after this paragraph), but it would require additional awareness on the part of the user and doesn't have the benefit of allowing us to monitor how much data is going into these caches going forward.
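A sketch of one such cache-free fix, under the same illustrative names (whether this matches the snippet from the original discussion is an assumption): read the channel once up front and take the per-track segments from the in-memory arrays.

```python
force = file.force1x
force_data = force.data        # force a single read of the full channel
force_seconds = force.seconds  # time axis, also read just once

track_forces = []
for track in tracks:
    in_track = (force_seconds >= track.seconds[0]) & (force_seconds <= track.seconds[-1])
    track_forces.append(force_data[in_track])
```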
Open questions
1. Small versus big slices
While doing this, I noticed that for `Continuous` slices, we always read the whole slice even if we take only a small segment (despite `h5py` supporting lazy slicing). Note that this was also the case before the caching was introduced (bug?).

At first, this seemed wasteful, so I set up the cache to allow slicing first and then grab the data (including start index and stop index in the location), but when you take many small slices, the performance is frequently far worse than reading the big slice as a whole at once. As a result, the case discussed above with the tracks still took 23 seconds.
One possible improvement to the cache system could be to include a start and stop index in the location again for continuous channels. We could then keep track of how frequently a slice of a continuous channel with a particular path is accessed, and if that exceeds some threshold, assume that the user will be taking more chunks: dump the small chunks and load the whole slice instead. In the cache grabber, we could then always check whether the parent location (the one without additional indexing) is already present, and slice from that if it is.
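A simplified sketch of that idea (purely hypothetical, not part of this PR; it skips caching the small chunks and only shows the promotion and the parent-location check):

```python
_access_counts = {}
PROMOTE_AFTER = 3  # arbitrary threshold for this sketch


def get_chunk(cache, location, start, stop, read_range, read_all):
    if location in cache:                 # whole channel cached: slice from the parent
        return cache[location][start:stop]
    _access_counts[location] = _access_counts.get(location, 0) + 1
    if _access_counts[location] >= PROMOTE_AFTER:
        cache[location] = read_all()      # load and cache the full channel once
        return cache[location][start:stop]
    return read_range(start, stop)        # infrequent access: small, uncached read
```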
Right now, from my testing so far, this additional complexity does not seem worth it though. Most `h5` files contain one or two items and generally, you look at the whole thing before you start interacting with sub-slices. One iffy part would be if a single slice would exceed the memory capacity. But right now, you also have no way of loading those without going through the raw `h5` structure.

2. What is a good default cache size?
I think making it proportional to a size in bytes makes more sense than a number of items, especially considering how much the items can vary in size.
I currently hardcoded the maximum at 1 GB, but I suspect this should depend on the user's system somehow? Are there any guidelines for application memory usage?
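For reference, a byte-based bound is easy to express with something like `cachetools` (whether this PR uses `cachetools` is an assumption); the 1 GB cap mirrors the hardcoded value mentioned above:

```python
import sys

from cachetools import LRUCache

# Evict by total reported size in bytes rather than by entry count; 1 GB cap.
global_cache = LRUCache(maxsize=1_000_000_000, getsizeof=sys.getsizeof)
```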
3. Open file handles
It might be worth considering going one step further and only storing locations in the `lk.File` object, handling opening/closing the resource in the function that looks up and caches these resources. Then we also don't need to keep the `File` open indefinitely. By handling file interactions in one place, it might also make it a lot easier to swap out file formats in the future.
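A minimal sketch of what that could look like with `h5py` (the function name and the open-per-lookup shape are assumptions, not code from this PR):

```python
import h5py


def load_field(file_path, dset_path, field):
    """Open the HDF5 file only for the duration of the read; the File object
    itself would just hold (file_path, dset_path) locations."""
    with h5py.File(file_path, "r") as h5:
        return h5[dset_path].fields(field)[:]
```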
4. `sys.getsizeof`
The way the property/method cache is set up is that it should only ever be used for immutable types whose size is reported correctly by `sys.getsizeof`. This is true for a `numpy` array (if we set writable to false) and for basic types like integers and floats, but not for tuples. This requires some diligence on our part to make sure that we enforce this.
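The asymmetry is easy to demonstrate: `sys.getsizeof` includes the data buffer for a `numpy` array that owns its data, but only counts the outer object for a tuple.

```python
import sys

import numpy as np

arr = np.zeros(1_000_000)    # owns its 8 MB data buffer
print(sys.getsizeof(arr))    # roughly 8 MB: the buffer is included

pair = (arr, arr)
print(sys.getsizeof(pair))   # ~56 bytes: just the tuple, its contents are not counted
```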
Some final notes
When reading data from `TimeTag` channels, we actually read the data 3 times already, even when we don't want to access it at all, and 4 times by the time we actually read it. Note that we can fix this without having to resort to a global cache, however.

Considering that disallowing data modification is a breaking change, I would propose we iterate on this a bit now, and roll this out when `2.0` rolls around.

Thoughts?