Skip to content

Bock sort based Indexing vs Block sort based single pass in memory indexing

Le Thu Nguyen edited this page Apr 26, 2018 · 1 revision

Block sort based indexing

Describe

Use termID instead of term. There are two approaches, fist, the vocabulary is compiled in the first pass. Second, the inverted index is constructed in the second pass. Running time is O(TlogT)

Pros and Cons:

  • Reduce storage requirements
  • Dictionary commonly kept in memory while posting list keep on disk
  • Main memory is insufficient to collect termID-docID pair, need an external sorting algorithm that uses a disk.

Block sort based single-pass in-memory indexing

Describe

Using terms instead of termIDs, it writes each block’s dictionary to disk, then starts a new dictionary for the next block. Finally, Index collection of size as long as the disk space available

Pros and Cons:

  • No need to maintain term ID mapping across each block. Keeping track of the terms a posting list belongs to → save memory
  • No sorting required → faster, and accumulate posting list as they occur.

Reference:

Manning, C.D., Raghaven, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html