Part of the ongoing Ryerson Capstone project EB02.
This repo contains a Flask application that lets a user obtain the
entities of a set of documents in our ClueWeb09 corpus.
The user only needs to know the name of each document they wish
to obtain entities for. Tagme-doc-web caches each requested document ID and
its entity set in a SQLite database for faster document entity
retrieval.
More information about the API is below.
The application is structured as a Flask application, so it provides both Flask command-line commands and remote endpoints.
Initializes the SQLite database with the default models. These models are User, Entity, and Document.
Initializes the SQLite database with dummy data.
Drops all the tables created, if they exist.
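The three database commands above can be sketched with Python's built-in sqlite3 module. This is only an illustration of the create, seed, and drop steps, not the project's actual models: the table names, column names, and the helper functions init_db, seed_db, and drop_db are all assumptions made for this example.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration,
# loosely mirroring the User, Entity, and Document models.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS entities (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    title TEXT NOT NULL,
    occurrences INTEGER NOT NULL
);
"""


def init_db(conn):
    """Create the tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def seed_db(conn):
    """Insert one dummy document with one dummy entity."""
    cur = conn.execute(
        "INSERT INTO documents (name) VALUES (?)",
        ("clueweb09-en0000-00-15766",),
    )
    conn.execute(
        "INSERT INTO entities (document_id, title, occurrences) VALUES (?, ?, ?)",
        (cur.lastrowid, "Frank Langella", 4),
    )
    conn.commit()


def drop_db(conn):
    """Drop every table created by init_db, if it exists."""
    for table in ("entities", "documents", "users"):
        conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.commit()
```

Calling init_db, then seed_db, then drop_db on a connection opened with sqlite3.connect(":memory:") exercises all three steps without touching a file on disk.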
The bread and butter of the API. When called with any number of document names, the server checks whether each
of them exists in the cached database. For any document not in the local database, a separate call to TAGME is
made. Once all documents' entities are available, the server returns the requested entities.
If top is specified, the API returns only the top n entities per
document.
Hint: the TAGME server itself is somewhat slow, so for requests covering hundreds of documents it is advisable to
configure a high timeout. As more documents are cached in the database, this delay is expected to decrease.
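The lookup behavior described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function get_entities, the in-memory dict standing in for the SQLite cache, and the fetch_from_tagme callback are all hypothetical, and ranking the top n entities by occurrence count is an assumption.

```python
from collections import Counter


def get_entities(doc_names, cache, fetch_from_tagme, top=None):
    """Return {document name: {entity: occurrences}} for each requested document.

    cache is a plain dict standing in for the SQLite cache (assumption);
    fetch_from_tagme is a hypothetical callback wrapping the slow remote call.
    """
    result = {}
    for name in doc_names:
        if name not in cache:
            # Cache miss: annotate the document remotely and remember the result.
            cache[name] = fetch_from_tagme(name)
        entities = cache[name]
        if top is not None:
            # Assumption: "top" ranks entities by occurrence count.
            entities = dict(Counter(entities).most_common(top))
        result[name] = entities
    return result
```

Because every miss is written back to the cache, a second request for the same document never reaches the remote callback, which is why the delay is expected to shrink as the cache fills.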
The API returns documents as a nested dictionary of the form
{document name: {entity: number of occurrences}}, as below:
{
    "clueweb09-en0000-00-15766": {
        "Frost/Nixon (film)": 9,
        "Nixon, Pennsylvania": 1,
        "Frank Langella": 4,
        "Michael Sheen": 4,
        ...
    },
    "clueweb09-en0000-00-18760": {
        "Frost/Nixon (film)": 3,
        "Nixon, Pennsylvania": 2,
        "Frank Langella": 89,
        "Michael Sheen": 4,
        ...
    },
    ...
}
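A client might consume a response of this shape roughly as follows. The sketch parses a JSON string built from the sample values above (with the elided entries left out) and sums each entity's occurrences across documents; the helper name total_occurrences is hypothetical and not part of the API.

```python
import json
from collections import Counter

# Sample payload copied from the example above, without the elided entries.
response_text = """
{
    "clueweb09-en0000-00-15766": {
        "Frost/Nixon (film)": 9,
        "Nixon, Pennsylvania": 1,
        "Frank Langella": 4,
        "Michael Sheen": 4
    },
    "clueweb09-en0000-00-18760": {
        "Frost/Nixon (film)": 3,
        "Nixon, Pennsylvania": 2,
        "Frank Langella": 89,
        "Michael Sheen": 4
    }
}
"""


def total_occurrences(payload):
    """Sum each entity's occurrence count across all documents."""
    totals = Counter()
    for entities in payload.values():
        totals.update(entities)  # adds the counts entity by entity
    return totals


payload = json.loads(response_text)
totals = total_occurrences(payload)
print(totals.most_common(2))
# [('Frank Langella', 93), ('Frost/Nixon (film)', 12)]
```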
Returns a list of all the cached document entities found in the corpus.
Returns a list of all cached document titles found in the corpus.
To get started with the API, first clone this GitHub repo to an appropriate location on your computer.
Then install the dependencies listed in requirements.txt by running
pip install -r requirements.txt
Make sure you're in a virtual environment!
To run the project, go to the base directory where the app.py script is located and, on a command line, type
flask run
The project is tested with Python 3.7.