0.10.0 Jun 16, 2017: New Memory, new HLL, new weighted sampling

@AlexanderSaydakov AlexanderSaydakov released this 16 Jun 19:16
· 2237 commits to master since this release
  • The Memory package, which is used extensively throughout the DataSketches library, has been completely rewritten and moved to its own repository.
    • The new Memory package now leverages Closeable. When used in try-with-resources blocks, it eliminates the need to explicitly close() resources external to the JVM (e.g., memory-mapped files and off-heap memory allocations). This completely replaces the freeMemory() requirement of the prior Memory implementation.
    • The API has been streamlined to allow simpler creation of regions (like ByteBuffer slices), which are views of the same underlying resource.
    • The internal architecture has been redesigned to eliminate redundancy and to separate more cleanly the management of resources (off-heap memory, memory-mapped files, wrapped ByteBuffers and wrapped primitive arrays) from the specifics of the API implementation.
    • Currently there are two API implementations: Memory, which provides direct-addressed, primitive (and primitive array) access, and Buffer, which provides a relative positional interface for primitive (and primitive array) access.
    • This has required some API changes when using the Memory package. For example, instead of new NativeMemory(bytes), use Memory.wrap(bytes) or WritableMemory.wrap(bytes). Watch the distinction between the read-only wrap methods, which take Memory, and the updatable wrap methods, which take WritableMemory. Attempts to modify read-only objects will throw SketchesReadOnlyException.
  • Completely rewritten HLL sketches with improved speed and accuracy.
    • The prior version of HLL had performance, usability and design issues. In addition, our science team has developed more advanced estimators that dramatically improve the accuracy of the HLL sketches, especially in the low range. We decided that the best route was to redesign the HLL sketches from scratch.
  • Added weighted sampling sketch
    • VarOptItemsSketch creates a random sample of weighted items from a stream, with the inclusion probability approximately a function of the item's weight. The sketch can additionally apply a predicate to the sampled items to compute sums of weights over the subset, along with error bounds.
  • Added support for subset sums with error bounds to Reservoir sampling
    • Mirrors the (new) functionality for weighted sampling, back-ported to unweighted sampling.
  • Some API changes in the Builder.build() methods:
    • Builder.build() methods no longer accept a sketch size; the only argument they accept is an optional Memory object. This change removes an easy-to-create bug that can be difficult to find. The role of initMemory(Memory) has moved to build(Memory), and the role of build(int k) has moved to builder.setK(int k).
  • To improve consistency and clarity across the library, we have changed factory method names from the generic getInstance() to newInstance() when a new, empty instance is being created, and to heapify() or wrap() when the resulting instance already contains data.
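
The builder and factory naming conventions described in the last two items can be illustrated with a small, self-contained toy class. This is only a sketch under stated assumptions: ToySketch, its fields, its default k, and its 12-byte layout are hypothetical and not part of the DataSketches API, and java.nio.ByteBuffer stands in for a Memory object. The assumption that heapify() copies state onto the heap while wrap() returns a view over the existing data is also ours; the notes above only say that both return an instance that already contains data.

```java
import java.nio.ByteBuffer;

// Hypothetical toy class for illustration only; NOT a DataSketches class.
// Assumed layout of the backing buffer: int k at offset 0, long count at offset 4.
final class ToySketch {
    private final int k;
    private final ByteBuffer backing; // null when the state lives only on the heap
    private long count;

    private ToySketch(int k, ByteBuffer backing, long count) {
        this.k = k;
        this.backing = backing;
        this.count = count;
    }

    // newInstance(): a new, empty instance is created.
    static ToySketch newInstance(int k) {
        return new ToySketch(k, null, 0L);
    }

    // heapify(): the result already contains data, copied onto the heap (assumed semantics).
    static ToySketch heapify(ByteBuffer src) {
        return new ToySketch(src.getInt(0), null, src.getLong(4));
    }

    // wrap(): the result already contains data and is a view over src (assumed semantics).
    static ToySketch wrap(ByteBuffer src) {
        return new ToySketch(src.getInt(0), src, src.getLong(4));
    }

    // Counts one update; writes through to the backing buffer when there is one.
    void update() {
        count++;
        if (backing != null) {
            backing.putLong(4, count);
        }
    }

    int getK() { return k; }

    long getCount() { return count; }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private int k = 16; // hypothetical default

        // The size is configured on the builder, not passed to build().
        Builder setK(int k) {
            this.k = k;
            return this;
        }

        // build(): heap-based instance.
        ToySketch build() {
            return newInstance(k);
        }

        // build(dst): initializes dst and returns an instance backed by it.
        ToySketch build(ByteBuffer dst) {
            dst.putInt(0, k);
            dst.putLong(4, 0L);
            return new ToySketch(k, dst, 0L);
        }
    }
}
```

With this toy class, ToySketch.builder().setK(64).build(buf) plays the role that build(Memory) plays in the notes above. Because the size is set separately through setK, an integer argument can no longer be confused with a sketch size at the build() call site, which is one plausible reading of the bug the change is meant to prevent.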