authors:
- Matteo Romanello, matteo.romanello@gmail.com
- Eric Rebillard
The main purpose of this corpus is to support the extraction of named entities of interest to classical scholars from secondary sources such as commentaries and journal papers.
catalog.csv
: CSV file with the following columns:
- ID
- COLLECTION : (legacy information)
- TOKEN_COUNT : number of tokens
- LANG : language of the abstract
- BiBLIO : bibliographic information about the publication the abstract is about
iob/
: contains the corpus, one record per file, stored in IOB format (3 columns: token, POS tag, NE label)
- the name of each file, excluding the file extension, has a corresponding record in the catalog.csv file
txt/
: contains the corpus as plain text, one record per file
- the name of each file, excluding the file extension, has a corresponding record in the catalog.csv file
ann/
extra/
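The link between file names and catalog records described above can be checked with a few lines of Python. This is only a minimal sketch: it assumes that catalog.csv is comma-separated, UTF-8 encoded, and starts with a header row containing a column named ID. The function names `load_catalog` and `files_without_record` are hypothetical helpers, not part of the corpus.

```python
import csv
from pathlib import Path


def load_catalog(catalog_path):
    """Return a dict mapping each record ID to its catalog row.

    Assumption: the file has a header row with a column named "ID".
    """
    with open(catalog_path, newline="", encoding="utf-8") as handle:
        return {row["ID"]: row for row in csv.DictReader(handle)}


def files_without_record(directory, catalog):
    """List file names in `directory` whose stem is not a catalog ID."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.stem not in catalog
    )
```

Run from the corpus root, `files_without_record("iob", load_catalog("catalog.csv"))` should return an empty list if every IOB file has a matching record; the same check applies to the txt directory.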
To parse the IOB files using NLTK's conll reader:
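from nltk.corpus.reader import ConllCorpusReader
corpus = ConllCorpusReader('./iob/', r'.*\.txt', ('words', 'pos', 'chunk'))
corpus.sents()  # sentences as lists of tokens
corpus.chunked_sents()  # sentences as trees, with the NE labels read as chunks
len(corpus.chunked_sents())  # number of sentences in the corpus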
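For quick inspection without NLTK, the three-column IOB layout can also be read with plain Python. The sketch below is an illustration under stated assumptions, not the method used to build the corpus: it assumes whitespace-separated columns, blank lines between sentences, and labels of the form B-TYPE, I-TYPE and O. The helper names `read_iob` and `entities` and the label types in the usage note are hypothetical.

```python
def read_iob(text):
    """Split IOB text into sentences of (token, pos, label) triples.

    Assumption: columns are whitespace-separated and sentences are
    separated by blank lines.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, label = line.split()
        current.append((token, pos, label))
    if current:
        sentences.append(current)
    return sentences


def entities(sentence):
    """Collect (entity_type, text) pairs from one sentence of IOB triples."""
    found, tokens, kind = [], [], None
    for token, _pos, label in sentence:
        prefix, _, label_type = label.partition("-")
        continues = prefix == "I" and label_type == kind
        if not continues and tokens:
            # The current entity ends here; store it before moving on.
            found.append((kind, " ".join(tokens)))
            tokens, kind = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            # A stray I- tag (no preceding B-) is treated as a new entity.
            tokens, kind = [token], label_type
        elif continues:
            tokens.append(token)
    if tokens:
        found.append((kind, " ".join(tokens)))
    return found
```

Given a sentence whose tokens are tagged with the made-up labels B-WORK and I-WORK, `entities` returns one ("WORK", ...) pair holding the joined tokens; the actual label set should be taken from the files in the iob directory.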
- manual correction of POS tags
- improve quality and readability of the biblio field