Some of the items in the list are taken from: https://github.com/analytics/analytics/blob/master/notes/papers.md, others I have added along the way.
All of these relate to consistency reasoning within the context of a given store, but the concepts map upwards hoefully...
- Replicated Data Types: Specification, Verification, Optimality
- Consistency Analysis in Bloom: a CALM and Collected Approach
- CRDTs: Consistency without concurrency control
- A comprehensive study of Convergent and Commutative Replicated Data Types
- PIQL – Scale Independent Query Processing
- Lingual research here...
- Materialization Optimizations for Feature Selection Workloads
- LINVIEW: Incremental View Maintenance for Complex Analytical Queries
- Opportunistic Physical Design for Big Data Analytics
- Semantic Integration of Heterogeneous Information Sources
- RICES: Reasoning about Information Consistency across Enterprise Solutions
- TODO: add more here
- Semantic-Integration Research in the Database Community, Alternate link
- TODO: Add tope paper and related here
- The Universal B-Tree for multidimensional indexing: General Concepts
- The BUB-Tree (dealing with dead space) by Robert Fenk
- Processing Relational OLAP Queries with UB-Trees and Multidimensional Hierarchical Clustering (2000) by Markl and Bayer
- UB-Tree Indexing for Semantic Query Optimization of Range Queries by S. Housseno, A. Simonet and M. Simonet
- Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm
- Skjellum provides a nice way to make a Morton order that works even for non-integral numbers of bits
- Adaptive query processing in point-transformation schemes by B. Yu covers a way we can incrementally refine compution of overlapping ranges in Morton order.
- Efficient Range Query Using Multiple Hilbert Curves shows that we can answer queries in fewer ranges if we store multiple copies of the data using different space filling curves or with rotations and shifts. This may point to a replication strategy. As with storing multiple copies for a column store using different orderings to get both fault tolerance, the same thing apply to space filling storage of the data.
- Cache-Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves
- Performance of Multi-Dimensional Space-Filling Curves by Mokbel, Aref and Kamel gives a number of general invariants over all space filling curves.
- The identity transformation
- Shor's explanation of the Hu-Tucker algorithm
- Golomb coding
- Arithmetic coding
- Range coding
- Order Preserving Key Compression describes ALM coding and references the ZIL coding. ALM coding violates the prefix property of Huffman and Hu-Tucker locally, but only needs the next symbol worth of lookahead to resolve the ambiguity.
- Dictionary-based Order-preserving String Compression for Main Memory Column Stores
- Query Optimization In Compressed Database Systems
- Sometimes we don't care about ordering. If we have categorical information, we could use Huffman directly on that dimension.
- How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations
- Binary Ordered Compression for Unicode provides an lexicographically order preserving compression scheme for unicode text, so long as you do not use the reset code! BOCU-1 is covered by US Patent #6,737,994, but is royalty-free. It may serve as a default unicode text column format in the absence of more coding information.
- An Alternative to Arithmetic Coding with Local Decodability
- Order Preserving Encryption for Numeric Data by Agrawal, Kiernan Srikant, and Xu covers how to make an order-preserving encryption scheme that permits standard database indexing, robust against inspection, but not against prior domain information about ranges of values or the ability to encrypt or decrypt known data.
- An Ideal-Security Protocol for Order-Preserving Encoding by Popa, Li, and Zeldovich covers a way to improve on the previous paper to leak nothing other than order by using "mutable cipher-text". This is currently realized in their cryptdb project.
- Fast string sorting using order-preserving compression
- Order Preserving Minimal Perfect Hash Functions and Information Retrieval by Fox, Chen, Daoud, and Heath max present a way to create still-order preserving hashing functions for categorical data with known categories.
- Fast Prefix Search in Little Space with Application (slides) describes a nice trade-off that permits 'weak' rank over prefix queries provided at least one key matches the prefix in the dataset and provides a nice way to balance returning all of the records that match a given query when mixed in with the other storage tools at our disposal.
- F. C. Botelho and N. Ziviani. External perfect hashing for very large key sets. In CIKM, pages 653–662, 2007. provides a suite of techniques that are used to good effect in
- Theory and Practise of Monotone Minimal Perfect Hashing, which provides a good template for how to put monotone minimal perfect hashing to use.
- The Wikipedia Article is okay, it includes a link to this
- FCA Homepage
- Extending and Implementing the Stable Model Semantics
- Safe Database Queries with Arithmetic Operations
- OLD Resolution with Tabulation
- A Thread In Time Saves Tabling Time
- More Efficient Datalog Queries: Subsumptive Tabling Beats Magic Sets
[21:27] nwf: Paper dump:
[21:27] nwf: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.7650 is a decent overview, though a little dated
[21:28] nwf: http://dl.acm.org/citation.cfm?id=973230 ties a lot of this in to NLP parsing, if you're curious
[21:28] nwf: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.704 I have just skimmed and has some nice examples
[21:30] nwf: http://dl.acm.org/citation.cfm?doid=322261.322272 has some more graph-based examples
- Fully Persistent B-Trees
- Confluently Persistent Tries for Efficient Version Control by Demaine, Langerman and Price is particularly interesting to me, because they provide O(log log n) movement time fully persistent hash table.
- Stratified B-trees and Versioned Dictionaries describes a technique for versioning that is consistent with the rest of the machinery in
analytics
.
- Cache-Oblivious Algorithms was Prokop's M.S. thesis and more or less started this whole field.
- Cache-Oblivious Sorting
- Ordered Files and Cache-Oblivious Priority Queues video
- A density control algorithm for doing insertions and deletions in a sequentially ordered file in a good worst-case time. This provides the O((log^2 N)/B) worst case bound for ordered file maintanence. Note that to do so it uses a "calibrator tree" which corresponds directly with the total number of children under a given branch. This could be-expressed like (or as part of) a Haar sketch.
- Cache-Oblivious Parallel Algorithms gives a decent overview of cache-oblivious parallel architecture and Blelloch et al.'s parallel cache complexity bounds.
- Exponential Structures for Efficient Cache-Oblivious Algorithms by Bender, Cole and Raman
- Insertion Sort is /O(n log n)/ by Bender, Farach-Colton and Mostiero.
- An Adaptive Packed Memory Array by Bender and Hu provides a PMA that for many access patterns offers faster /O(log n)/ amortized inserts.
- Partially De-amortized Packed Memory Array (alt) goes half-way to the Willard bounds with a lot less implementation effort.
- Predecessor search with distance-sensitive query time covers Z-Fast Trees (and unlike the original isn't behind a Springer pay-wall).
- Boldi's slides from BISS2012 give a good motivation for the fat binary search used in z-fast tries.
- Bloomier Filters: A second look
- Cache-Oblivious Databases: Limitations and Opportunities by He and Luo provides lots of hard numbers comparing COB+ trees with other structures, etc.
- Cache-Oblivious Streaming B-Trees
- Streaming B-Trees for Filesystem Grand Challenges
- Cache-Oblivious Query Processing by He and Luo
- Cache-Oblivious String Dictionaries (slides) (paper) by Brodal and Fagerberg use giraffe trees for fixed dictionaries.
- Fast Compressed Tries through Path Decompositions by Grossi and Ottaviano talks about centroid trees, which are used in many dictionary structures including
- Cache-Oblivious String B-Trees by Bender, Farach-Colton and Kuszmaul used centroid trees as part of their COSB-tree
- Compressed String Dictionaries by Brisaboa, Cánovas, Martínez-Prieto and Navarro uses a BWT-based transformation to get an efficient compressed string dictionary.
- Offline Dictionary-Based Compression by Larsson and Moffat covers RE-PAIR, which has very slow encoding, but reasonable decoding performance.
- On Searching Compressed String Collections Cache-Obliviously by Ferragina, Grossi, Gupta, Shah and Vitter talks about "distribution aware" search in section 5 and uses the centroid path decomposition of a tree. They also use the exact form of rear-coding we reinvented.
- The log-structured merge tree by O'Neil, Cheng, Gawlick and O'Neil provides great throughput for inserts in exchange for high read latency for write-mostly workloads.
We could store full-text-indices on some columns as a BWT transformed version of the leave chunk, then perform compressed-search techniques on top to permit efficient operations on them without paying for full decompression.
At ~4 meg leaves this actually fits pretty well with bzip2 window sizes.
- Searching BWT compressed text with the Boyer-Moore algorithm and binary search by Bell, Powell, Mukherjee and Adjeroh. Note: since our text fragments end with a given block, the concern about chunking BWT searches from the paper do not apply.
- A Comparison of BWT Approaches to String Pattern Matching by Firth, Bell, Mukherjee and Adjeroh
- The SBC-Tree: An Index for Run-Length Compressed Sequences
- Compressed Text Indexing and Range Searching
- Opportunistic Data Structures with Applications by Ferragina and Manzini describes how to work with 'compressed suffix arrays' for sublinear space overhead on compressible texts.
- The Myriad Virtues of Wavelet Trees by Ferragina, Giancarlo, and Manzini
- Wavelet Trees for All by Navarro is an amazingly good overview of the uses of rank/select.
- Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing builds on wavelet trees and lets us use the algorithms from
- Space-efficient Suffix Trees by Munro, Raman and Rao (2001).
- Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees talks a bit about rank-select structures.
- Broadword Implementation of Rank/Select Queries by Vigna
- Succincter by Mihai Pătraşcu
- Dynamic Entropy Compressed Sequences and Full-Text Indices by Mäkinen and Navarro
- Succinct Dictionary Matching With No Slowdown
- The Pizza & Chili Corpus provides implementations for many succinct structures.
- X86-optimized Rank-Select Dictionaries for Bit Sequences
- Taiju Succint Trie c++ lib and references
- The Buffer Tree: A Technique for Designing Batched External Data Structures
- MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing uses smarter hashing to get a nice write-focused memcached variant.
- High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP) by Jaleel, Theobald, Steely and Emer provides a nice replacement for LRU policies for an explicit buffer cache with low overhead.
- Decoupled Dynamic Cache Segmentation by Khan, Wang and Jiminéz is resistant to both scanning and thrashing, but is also "decoupled" from the choice of replacement policy.
- Fully-Functional Succinct Trees provide a nice leg-up from DFUDS and LOUDS in terms of additional functionality.
- The data locality of work stealing by Acar and Blelloch introduced the affinity scheduler now used by Intel's TBB.
- Dynamic circular work-stealing deque by Chase and Lev described a novel circular work-stealing deque. However, my attempts to get it to work in Haskell have been fraught with difficulty.
- A Dynamic-Sized Nonblocking Work Stealing Deque by Hendler et al.
- Flights Data Set might be a good benchmark
- Don’t Thrash: How to Cache your Hash on Flash provides the notion of a Quotient Filter