This repository presents a tool for developing applications that extract information from text documents. The back end, including the models for the information extraction task, is written entirely in Python. The front end uses Vue and a bit of Java; its main tagging component is based on the work published in the Doccano GitHub repository (https://github.com/doccano/doccano).
The following image gives a short overview of the tool's flow, from the user annotating a document to the tag information being stored in the database.
- Download the repository and build the images:
docker-compose -f dockerfiles/prod/docker-compose.yml build
- Run the application:
docker-compose -f dockerfiles/prod/docker-compose.yml up
If you want to insert the dataset directly into the database:
DATASET={filename.csv} docker-compose -f dockerfiles/prod/docker-compose.yml up
Or
export DATASET={filename.csv}
docker-compose -f dockerfiles/prod/docker-compose.yml up
Container | PORT |
---|---|
webapi | http://localhost:8001/webapi |
webui | http://localhost:8082 |
mongo | 0.0.0.0:27018 |
mongo express | http://localhost:8083/ |
If you start from an empty MongoDB database, you need to populate it:
- In the Python devcontainer
- Load cases into local MongoDB
python src/extraction/scripts/documents_to_mongo.py
Currently, the tags shown in the front end are defined in src/webui/src/config/config.js.
Using the same structure, you can define any tag, for example 'conference', with its associated 'id', 'suffixKey', 'backgroundColor', and 'textColor':
{
text: 'Conference',
backText: 'conference',
id: 1000,
suffixKey: 'c',
backgroundColor: '#B7C859',
textColor: '#000000',
},
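When adding several tags by hand, a quick consistency check can catch copy-paste mistakes. The following is a minimal sketch, not part of the repository: `validateTags`, `REQUIRED_FIELDS`, and `HEX_COLOR` are hypothetical names, and the rules it enforces (unique `id`, unique `suffixKey`, six-digit hex colors) are assumptions about what the tool expects rather than documented requirements.

```javascript
// Hypothetical helper (not part of the repository): checks a list of tag
// definitions shaped like the 'Conference' entry above.
const REQUIRED_FIELDS = [
  'text', 'backText', 'id', 'suffixKey', 'backgroundColor', 'textColor',
];
// Assumption: colors are written as six-digit hex values such as '#B7C859'.
const HEX_COLOR = /^#[0-9A-Fa-f]{6}$/;

function validateTags(tags) {
  const errors = [];
  const seenIds = new Set();
  const seenKeys = new Set();

  tags.forEach((tag, index) => {
    // Every field used in the example entry must be present.
    for (const field of REQUIRED_FIELDS) {
      if (tag[field] === undefined) {
        errors.push(`tag ${index}: missing field '${field}'`);
      }
    }
    // Assumption: 'id' must be unique across tags.
    if (tag.id !== undefined) {
      if (seenIds.has(tag.id)) {
        errors.push(`tag ${index}: duplicate id ${tag.id}`);
      }
      seenIds.add(tag.id);
    }
    // Assumption: 'suffixKey' must be unique across tags.
    if (tag.suffixKey !== undefined) {
      if (seenKeys.has(tag.suffixKey)) {
        errors.push(`tag ${index}: duplicate suffixKey '${tag.suffixKey}'`);
      }
      seenKeys.add(tag.suffixKey);
    }
    // Colors that are present must match the assumed hex format.
    for (const colorField of ['backgroundColor', 'textColor']) {
      const value = tag[colorField];
      if (value !== undefined && !HEX_COLOR.test(value)) {
        errors.push(`tag ${index}: invalid ${colorField} '${value}'`);
      }
    }
  });

  return errors;
}

const exampleTags = [
  {
    text: 'Conference',
    backText: 'conference',
    id: 1000,
    suffixKey: 'c',
    backgroundColor: '#B7C859',
    textColor: '#000000',
  },
];

// Prints the list of problems found; an empty list means the entries passed.
console.log(validateTags(exampleTags));
```

The function returns a list of messages instead of throwing, so all problems in a configuration can be reported in one pass.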
- Add functionality to the next and previous document buttons.
- Avoid predefined code elements such as the tags in config.js. Instead, store all of the tool's configuration in the database.
- Automate tagging using models to assist the user. These annotations are stored in a separate annotation space instead of the user's annotation space.
- Add components to the tool to approach other tasks.
- Make it possible to run multiple tasks and datasets at the same time.