Pyserini: Usage of the Collection API

The collection classes provide interfaces for iterating over a collection and processing its documents. The demonstration here uses the CACM collection, which first needs to be downloaded and unpacked:

wget -O cacm.tar.gz 'https://github.com/castorini/anserini/blob/master/src/main/resources/cacm/cacm.tar.gz?raw=true'
mkdir -p collections/cacm
tar xvfz cacm.tar.gz -C collections/cacm
rm cacm.tar.gz
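
If you would rather do the unpacking step from Python, a minimal sketch using only the standard library might look like the following. The helper name `unpack_collection` is hypothetical (it is not part of Pyserini), and the sketch assumes the archive has already been downloaded to a local path.

```python
import tarfile
from pathlib import Path


def unpack_collection(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and list what landed there.

    Hypothetical helper for illustration; not part of Pyserini.
    """
    target = Path(target_dir)
    # Equivalent of `mkdir -p`: create parent directories, tolerate existing ones.
    target.mkdir(parents=True, exist_ok=True)
    # Equivalent of `tar xzf archive -C target_dir`.
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(target)
    # Return the top-level entry names so the caller can sanity-check the result.
    return sorted(p.name for p in target.iterdir())


# Example call, assuming cacm.tar.gz is in the current directory:
# unpack_collection('cacm.tar.gz', 'collections/cacm')
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.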

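Let's iterate through all documents in the collection. Iterating over the collection yields file segments, and iterating over each file segment yields source documents, which the generator converts into Lucene documents: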

from pyserini import collection, index

# Use a distinct variable name so the `collection` module is not shadowed.
cacm = collection.Collection('HtmlCollection', 'collections/cacm/')
generator = index.Generator('DefaultLuceneDocumentGenerator')

# Outer loop: file segments in the collection; inner loop: documents in each segment.
for (i, fs) in enumerate(cacm):
    for (j, doc) in enumerate(fs):
        parsed = generator.create_document(doc)
        docid = parsed.get('id')            # FIELD_ID
        raw = parsed.get('raw')             # FIELD_RAW
        contents = parsed.get('contents')   # FIELD_BODY
        print('{} {} -> {} {}...'.format(i, j, docid, contents.strip().replace('\n', ' ')[:50]))
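
To make the output format of that loop easier to see without a Java or Lucene setup, here is a small sketch that reproduces the same nested iteration and one-line preview over plain Python data. The names `preview` and `summarize`, the `DOC-A`-style ids, and the list-of-lists input are all stand-ins invented for illustration; they are not Pyserini classes or real CACM documents.

```python
def preview(contents, width=50):
    """Trim surrounding whitespace, flatten newlines, and truncate to `width` characters."""
    return contents.strip().replace('\n', ' ')[:width]


def summarize(segments, width=50):
    """Build one summary line per document, mirroring the two-level loop.

    `segments` is a stand-in for a collection: a list of file segments,
    each a list of (docid, contents) pairs.
    """
    lines = []
    for i, fs in enumerate(segments):
        for j, (docid, contents) in enumerate(fs):
            lines.append('{} {} -> {} {}...'.format(i, j, docid, preview(contents, width)))
    return lines


# Hypothetical data: two segments holding three documents in total.
segments = [
    [('DOC-A', ' alpha\nbeta ')],
    [('DOC-B', 'gamma'), ('DOC-C', 'x' * 60)],
]
for line in summarize(segments):
    print(line)
```

The first index in each printed line is the segment number and the second is the document's position within that segment, just as with `i` and `j` in the loop over the real collection.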