
consider caching expensive flywire id lookups #67

Open
jefferis opened this issue Jan 14, 2021 · 8 comments
@jefferis
Collaborator

jefferis commented Jan 14, 2021

rootid->supervoxels

This can be cached essentially permanently: whenever the mapping changes, the root id itself mutates, so a cached entry for a given root id never becomes stale.

However, the supervoxel id sets are quite large, so the object store could quickly start taking a lot of space. For example, this fairly large neuron:

> system.time(fl <- flywire_leaves('720575940619073968'), gcFirst = F)
   user  system elapsed 
  0.278   0.074   4.431 

generates ~270,000 svids occupying ~20 MB as a character vector. This could be reduced to ~0.7 MB by compressing the binary (int64) representation.
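Each id fits in a 64-bit integer, so a character vector of svids can be packed into 8 bytes per id and then compressed. A minimal sketch of that round trip, with toy ids and assuming the bit64 package:

```r
library(bit64)
svids <- c("720575940619073968", "720575940619073969")  # toy example
ids <- as.integer64(svids)
# integer64 is stored as doubles underneath, so write 8 bytes per id
raw8 <- memCompress(writeBin(unclass(ids), raw(), size = 8), type = "gzip")

# restore: decompress, read back 8-byte values, reinstate the class
ids2 <- readBin(memDecompress(raw8, type = "gzip"), "double",
                n = length(svids), size = 8)
class(ids2) <- "integer64"
stopifnot(identical(as.character(ids2), svids))
```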

@jefferis
Collaborator Author

Have taken a look at this. It makes sense to compress the data that is cached: switching from character to compressed int64 reduces space by ~30x (10x for int64, 3x for compression). gzip is the best general-purpose compression; it adds about a 5% penalty when there is a cache miss and is very fast to decompress. I found that brotli was actually a small improvement, but only with a non-default quality=2; this took very slightly longer to decompress but compressed faster and smaller, and on balance I think that makes it the winner. The brotli package is maintained by Jeroen Ooms, so although it is another dependency, I eventually opted to add it to Suggests and make it the default when available.

For the data compression step specifically, within flywire_leaves('720575940619073968', cache = TRUE), a call which typically takes about 4 s in total:

  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
1 brotli     18.4ms 19.4ms      50.0  693.54KB    0.251   199     1      3.98s <NULL> <Rpro… <bch…
2 gz         94.3ms 96.8ms      10.2    3.06MB    0.525    39     2      3.81s <NULL> <Rpro… <bch…

For brotli (quality = 2), benchmarked with:

bench::mark(brotli=flywire_leaves('720575940619073968', integer64 = TRUE, cache = TRUE, compression="brotli"), min_time = 5)

the cache read time was ~10 ms for gzip vs ~15 ms for brotli, which was in any case dwarfed by the character conversion time (~75 ms) when character output was needed.
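For reference, the two compression calls being compared can be reproduced roughly as follows (toy payload; assumes the suggested brotli package is installed):

```r
library(bit64)
ids <- as.integer64("720575940619073968") + 0:269999  # ~270k toy svids
raw8 <- writeBin(unclass(ids), raw(), size = 8)

gz <- memCompress(raw8, type = "gzip")
br <- brotli::brotli_compress(raw8, quality = 2)  # non-default quality
c(raw = length(raw8), gzip = length(gz), brotli = length(br))

# both round-trip losslessly
stopifnot(identical(memDecompress(gz, type = "gzip"), raw8))
stopifnot(identical(brotli::brotli_decompress(br), raw8))
```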

@jefferis
Collaborator Author

The current default is 5000 items in the LRU cache. I wonder if it would be worth making it smaller (1000?), while adding an option that could make it larger for expert use when there is plenty of memory available.
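One way to wire that up would be an R option read at cache-creation time; a minimal sketch using cachem, where the option name fafbseg.cachesize is hypothetical:

```r
library(cachem)

# smaller default (1000 items), overridable by expert users via
# options(fafbseg.cachesize = 50000) when memory is plentiful
max_items <- getOption("fafbseg.cachesize", default = 1000)
idcache <- cache_mem(max_n = max_items, evict = "lru")

idcache$set("720575940619073968", raw(10))  # illustrative payload
```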

@jefferis
Collaborator Author

Caching in action:

> kcsel=c("720575940623755722", "720575940609992371", "720575940625494549",
+ "720575940619442047", "720575940620517656", "720575940609793429",
+ "720575940617265029", "720575940631869024", "720575940637441955",
+ "720575940638892789")
> kcpreds=flywire_ntpred(kcsel)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=32s  
> kcpreds2=flywire_ntpred(kcsel)
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  

@jefferis
Collaborator Author

Should maybe consider caching to disk. This would likely still be very fast with a good backend and probably quite useful, since these mappings are permanent; it also avoids the risk of filling memory. cachem may be a good option now that it is the default backend for memoise (although I'm not 100% certain it goes back to R 3.5).

@jefferis
Collaborator Author

cachem installs OK on Linux under R 3.5, and the cache_layered option seems like a good fit: when keys are set, they are set in all layers (i.e. on disk as well as in memory), so this should give a persistent store on disk as well as a fast store in memory.
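A sketch of what that layered setup might look like; the cache directory name is illustrative:

```r
library(cachem)

m <- cache_mem(max_n = 1000)                    # fast in-memory LRU layer
d <- cache_disk(dir = "~/.cache/fafbseg-ids")   # persistent layer
lc <- cache_layered(m, d)

# a set() writes to every layer, so the value survives restarts;
# a get() searches the layers in order, hitting memory first
lc$set("720575940619073968", raw(10))
```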

@jefferis jefferis changed the title consider cacheing expensive flywire id lookups consider caching expensive flywire id lookups Feb 1, 2021
@jefferis
Collaborator Author

jefferis commented Feb 1, 2021

So far we have not been caching supervoxel id to root id lookups since they can change. However we could do this as follows:

  1. store the svid -> rootid mapping using svid as the key
  2. in future, check if the svid exists in the key list
  3. IF YES,
    3a. fetch the stored rootid
    3b. then check if that root id has been edited using flywire_islatest
    3c. if it has been edited, drop the entry from the key-value store and add the svid to the list of ids that need to be queried remotely
    3d. note that in fact all keys having that value should be dropped.
  4. IF NO,
    4a. run the remote query
    4b. store the result in the key-value store

@jefferis
Collaborator Author

jefferis commented Feb 1, 2021

Performance considerations:

  1. we want to persist the mapping on disk
  2. many (hundreds of thousands of) keys need to be mapped to a short value, so we don't want to use a single object per cached value.
  3. we need to be able to drop keys that have a given value, so we actually need fast bidirectional lookup
  4. we must store keys/values as bit64
  5. it would be better if we could use the

It feels like the standard memoise approach isn't going to work and we are going to need a standard database or a specialised key-value backend. Looking through the R ecosystem, these are the options I can find:

@jefferis
Collaborator Author

jefferis commented Feb 2, 2021

It might make more sense to update invalidated rootids with a signalling value of 0 rather than drop them, since dropping might cause a lot of churn, and there is a strong chance that we will want to update all the supervoxel ids for an invalidated root id.

  • filehash DB1 is poorly suited because it cannot efficiently drop keys, and inserting on an existing key makes a new entry.
  • filehash SQLite could work, but it doesn't allow searches by value as well as by key
  • storr looks interesting, but probably not for this use case, as the content hashing feels like a poor match. OTOH it abstracts away quite a lot of management issues.
  • RcppRedis is a bit bare-bones

That leaves redux/redis, which looks somewhat involved, and thor, which is potentially simpler, although it is unclear whether LMDB supports the kind of reciprocal key/value lookups that we need.

All told, I wonder if a doubly-indexed SQLite table might be the simplest option for this use case. The svids should be unique, so svid could be the primary key of the table.
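A sketch of that table via DBI/RSQLite, with svid as the primary key and a secondary index on rootid so that value -> keys lookups (and the signalling-value-0 invalidation above) are also fast. SQLite INTEGER columns are 64-bit, which suits bit64 ids; the small literal ids and filename here are purely illustrative:

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "svid-cache.sqlite")
dbExecute(con, "CREATE TABLE IF NOT EXISTS svid2root (
                  svid   INTEGER PRIMARY KEY,
                  rootid INTEGER NOT NULL)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_rootid ON svid2root(rootid)")

# forward lookup: svid -> rootid
dbGetQuery(con, "SELECT rootid FROM svid2root WHERE svid = :s",
           params = list(s = 77618512))

# reverse operation: mark every svid of an edited root with signalling 0
dbExecute(con, "UPDATE svid2root SET rootid = 0 WHERE rootid = :r",
          params = list(r = 12345))
dbDisconnect(con)
```

Note that bit64 ids exceed R's exact double range (2^53), so in practice they would need to be bound and read back via bit64/character conversion rather than plain numerics.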
