Reading

Some of the items in the list are taken from: https://github.com/analytics/analytics/blob/master/notes/papers.md, others I have added along the way.

Reasonning about consistency

All of these relate to consistency reasoning within the context of a given store, but the concepts map upwards hoefully...

Query processing

PIQL – Scale Independent Query Processing
Lingual research here...

Machine Learning - Storage and computational foundations

Hetrogeneous Data Store Integration

Data cleaning, data management

Semantic-Integration Research in the Database Community, Alternate link
TODO: Add tope paper and related here

Time, clocks and ordering

Time, clocks and odering of the events in a distributed system

Functional Pearls / Other programming

UB-Trees

The Universal B-Tree for multidimensional indexing: General Concepts
The BUB-Tree (dealing with dead space) by Robert Fenk
Processing Relational OLAP Queries with UB-Trees and Multidimensional Hierarchical Clustering (2000) by Markl and Bayer
UB-Tree Indexing for Semantic Query Optimization of Range Queries by S. Housseno, A. Simonet and M. Simonet
Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm

Space-Filling

Skjellum provides a nice way to make a Morton order that works even for non-integral numbers of bits
Adaptive query processing in point-transformation schemes by B. Yu covers a way we can incrementally refine compution of overlapping ranges in Morton order.
Efficient Range Query Using Multiple Hilbert Curves shows that we can answer queries in fewer ranges if we store multiple copies of the data using different space filling curves or with rotations and shifts. This may point to a replication strategy. As with storing multiple copies for a column store using different orderings to get both fault tolerance, the same thing apply to space filling storage of the data.
Cache-Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves
Performance of Multi-Dimensional Space-Filling Curves by Mokbel, Aref and Kamel gives a number of general invariants over all space filling curves.

Order Preserving Codes

The identity transformation
Shor's explanation of the Hu-Tucker algorithm
Golomb coding
Arithmetic coding
Range coding
Order Preserving Key Compression describes ALM coding and references the ZIL coding. ALM coding violates the prefix property of Huffman and Hu-Tucker locally, but only needs the next symbol worth of lookahead to resolve the ambiguity.
Dictionary-based Order-preserving String Compression for Main Memory Column Stores
Query Optimization In Compressed Database Systems
Sometimes we don't care about ordering. If we have categorical information, we could use Huffman directly on that dimension.
How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations
Binary Ordered Compression for Unicode provides an lexicographically order preserving compression scheme for unicode text, so long as you do not use the reset code! BOCU-1 is covered by US Patent #6,737,994, but is royalty-free. It may serve as a default unicode text column format in the absence of more coding information.
An Alternative to Arithmetic Coding with Local Decodability

Other Order-Preserving Structures:

Order Preserving Encryption for Numeric Data by Agrawal, Kiernan Srikant, and Xu covers how to make an order-preserving encryption scheme that permits standard database indexing, robust against inspection, but not against prior domain information about ranges of values or the ability to encrypt or decrypt known data.
An Ideal-Security Protocol for Order-Preserving Encoding by Popa, Li, and Zeldovich covers a way to improve on the previous paper to leak nothing other than order by using "mutable cipher-text". This is currently realized in their cryptdb project.
Fast string sorting using order-preserving compression

Fat binary search and monotone minimal perfect hashing

Order Preserving Minimal Perfect Hash Functions and Information Retrieval by Fox, Chen, Daoud, and Heath max present a way to create still-order preserving hashing functions for categorical data with known categories.
Fast Prefix Search in Little Space with Application (slides) describes a nice trade-off that permits 'weak' rank over prefix queries provided at least one key matches the prefix in the dataset and provides a nice way to balance returning all of the records that match a given query when mixed in with the other storage tools at our disposal.
F. C. Botelho and N. Ziviani. External perfect hashing for very large key sets. In CIKM, pages 653–662, 2007. provides a suite of techniques that are used to good effect in
Theory and Practise of Monotone Minimal Perfect Hashing, which provides a good template for how to put monotone minimal perfect hashing to use.

Formal Concept Analysis

The Wikipedia Article is okay, it includes a link to this
FCA Homepage

Incremental Computation

Weighted Datalog and Generalized Annotated Programs

Stable Models

Top-Down Evaluation

Efficient Top-Down Computation of Queries Under the Well-Founded Semantics

Tabling

*-Semirings and C-Semirings/ω-continuous semirings

[21:27] nwf:	 Paper dump:
[21:27] nwf:	 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.7650 is a decent overview, though a little dated
[21:28] nwf:	 http://dl.acm.org/citation.cfm?id=973230 ties a lot of this in to NLP parsing, if you're curious
[21:28] nwf:	 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.704 I have just skimmed and has some nice examples
[21:30] nwf:	 http://dl.acm.org/citation.cfm?doid=322261.322272 has some more graph-based examples

Weighted Datalog and Levels of Trust

Provenance semirings and x-fast trees

Consistency and distribution

Distributed Algorithms in NoSQL Databases

Latch-free architecture

High-Performance Concurrency Control Mechanisms for Main-Memory Databases

Persistence and Versioning

Fully Persistent B-Trees
Confluently Persistent Tries for Efficient Version Control by Demaine, Langerman and Price is particularly interesting to me, because they provide O(log log n) movement time fully persistent hash table.
Stratiﬁed B-trees and Versioned Dictionaries describes a technique for versioning that is consistent with the rest of the machinery in analytics.

Cache-Obliviousness

Cache-Oblivious Algorithms was Prokop's M.S. thesis and more or less started this whole field.
Cache-Oblivious Sorting
Ordered Files and Cache-Oblivious Priority Queues video
A density control algorithm for doing insertions and deletions in a sequentially ordered file in a good worst-case time. This provides the O((log^2 N)/B) worst case bound for ordered file maintanence. Note that to do so it uses a "calibrator tree" which corresponds directly with the total number of children under a given branch. This could be-expressed like (or as part of) a Haar sketch.
Cache-Oblivious Parallel Algorithms gives a decent overview of cache-oblivious parallel architecture and Blelloch et al.'s parallel cache complexity bounds.
Exponential Structures for Efﬁcient Cache-Oblivious Algorithms by Bender, Cole and Raman

Packed Memory Arrays

Insertion Sort is /O(n log n)/ by Bender, Farach-Colton and Mostiero.
An Adaptive Packed Memory Array by Bender and Hu provides a PMA that for many access patterns offers faster /O(log n)/ amortized inserts.
Partially De-amortized Packed Memory Array (alt) goes half-way to the Willard bounds with a lot less implementation effort.

Encoding functions with static support, Monotone Minimal Perfect Hashing and Bloomier Filters

Predecessor search with distance-sensitive query time covers Z-Fast Trees (and unlike the original isn't behind a Springer pay-wall).
Boldi's slides from BISS2012 give a good motivation for the fat binary search used in z-fast tries.
Bloomier Filters: A second look

Cache-Oblivious Databases

Cache-Oblivious Databases: Limitations and Opportunities by He and Luo provides lots of hard numbers comparing COB+ trees with other structures, etc.
Cache-Oblivious Streaming B-Trees
Streaming B-Trees for Filesystem Grand Challenges
Cache-Oblivious Query Processing by He and Luo

String Dictionaries

Cache-Oblivious String Dictionaries (slides) (paper) by Brodal and Fagerberg use giraffe trees for fixed dictionaries.
Fast Compressed Tries through Path Decompositions by Grossi and Ottaviano talks about centroid trees, which are used in many dictionary structures including
Cache-Oblivious String B-Trees by Bender, Farach-Colton and Kuszmaul used centroid trees as part of their COSB-tree
Compressed String Dictionaries by Brisaboa, Cánovas, Martínez-Prieto and Navarro uses a BWT-based transformation to get an efficient compressed string dictionary.
Offline Dictionary-Based Compression by Larsson and Moffat covers RE-PAIR, which has very slow encoding, but reasonable decoding performance.
On Searching Compressed String Collections Cache-Obliviously by Ferragina, Grossi, Gupta, Shah and Vitter talks about "distribution aware" search in section 5 and uses the centroid path decomposition of a tree. They also use the exact form of rear-coding we reinvented.

Logging and Metrics

The log-structured merge tree by O'Neil, Cheng, Gawlick and O'Neil provides great throughput for inserts in exchange for high read latency for write-mostly workloads.

Compressed Search

We could store full-text-indices on some columns as a BWT transformed version of the leave chunk, then perform compressed-search techniques on top to permit efficient operations on them without paying for full decompression.

At ~4 meg leaves this actually fits pretty well with bzip2 window sizes.

Searching BWT compressed text with the Boyer-Moore algorithm and binary search by Bell, Powell, Mukherjee and Adjeroh. Note: since our text fragments end with a given block, the concern about chunking BWT searches from the paper do not apply.
A Comparison of BWT Approaches to String Pattern Matching by Firth, Bell, Mukherjee and Adjeroh
The SBC-Tree: An Index for Run-Length Compressed Sequences
Compressed Text Indexing and Range Searching
Opportunistic Data Structures with Applications by Ferragina and Manzini describes how to work with 'compressed suffix arrays' for sublinear space overhead on compressible texts.

Wavelet Trees and Succinct Data Structures

The Myriad Virtues of Wavelet Trees by Ferragina, Giancarlo, and Manzini
Wavelet Trees for All by Navarro is an amazingly good overview of the uses of rank/select.
Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing builds on wavelet trees and lets us use the algorithms from
Space-efficient Suffix Trees by Munro, Raman and Rao (2001).
Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees talks a bit about rank-select structures.
Broadword Implementation of Rank/Select Queries by Vigna
Succincter by Mihai Pătraşcu
Dynamic Entropy Compressed Sequences and Full-Text Indices by Mäkinen and Navarro
Succinct Dictionary Matching With No Slowdown
The Pizza & Chili Corpus provides implementations for many succinct structures.
X86-optimized Rank-Select Dictionaries for Bit Sequences
Taiju Succint Trie c++ lib and references

Compressed Computation

Computable Compressed Matrices

Batched buffer management

The Buffer Tree: A Technique for Designing Batched External Data Structures
MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing uses smarter hashing to get a nice write-focused memcached variant.
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP) by Jaleel, Theobald, Steely and Emer provides a nice replacement for LRU policies for an explicit buffer cache with low overhead.
Decoupled Dynamic Cache Segmentation by Khan, Wang and Jiminéz is resistant to both scanning and thrashing, but is also "decoupled" from the choice of replacement policy.

Succinct Tree Representations

Fully-Functional Succinct Trees provide a nice leg-up from DFUDS and LOUDS in terms of additional functionality.

Scheduling and Work-Stealing

The data locality of work stealing by Acar and Blelloch introduced the affinity scheduler now used by Intel's TBB.
Dynamic circular work-stealing deque by Chase and Lev described a novel circular work-stealing deque. However, my attempts to get it to work in Haskell have been fraught with difficulty.
A Dynamic-Sized Nonblocking Work Stealing Deque by Hendler et al.

Data Sets (for future demo / benchmarking)

Flights Data Set might be a good benchmark

Bloom Filters

Don’t Thrash: How to Cache your Hash on Flash provides the notion of a Quotient Filter

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
papers		papers
project		project
src/main/scala/learn		src/main/scala/learn
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
articles.bib		articles.bib
base.tex		base.tex
build.sbt		build.sbt
notes.md		notes.md
notes.tex		notes.tex
sandbox.tex		sandbox.tex
sbt		sbt
test.bib		test.bib
todo.org		todo.org
version.sbt		version.sbt

License

quintona/reading

Folders and files

Latest commit

History

Repository files navigation