Bock sort based Indexing vs Block sort based single pass in memory indexing

Block sort based indexing

Describe

Use termID instead of term. There are two approaches, fist, the vocabulary is compiled in the first pass. Second, the inverted index is constructed in the second pass. Running time is O(TlogT)

Pros and Cons:

Reduce storage requirements
Dictionary commonly kept in memory while posting list keep on disk
Main memory is insufficient to collect termID-docID pair, need an external sorting algorithm that uses a disk.

Block sort based single-pass in-memory indexing

Describe

Using terms instead of termIDs, it writes each block’s dictionary to disk, then starts a new dictionary for the next block. Finally, Index collection of size as long as the disk space available

Pros and Cons:

No need to maintain term ID mapping across each block. Keeping track of the terms a posting list belongs to → save memory
No sorting required → faster, and accumulate posting list as they occur.

Reference:

Manning, C.D., Raghaven, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, MA: Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bock sort based Indexing vs Block sort based single pass in memory indexing

Block sort based indexing

Describe

Pros and Cons:

Block sort based single-pass in-memory indexing

Describe

Pros and Cons:

Reference:

Clone this wiki locally