Sumo is a tool for the semantic analysis of web articles. It extracts the content from an article web page, analyzes it, and returns word frequencies, recognized entities, and an automatic summary. It also returns related articles that were previously analyzed, ranked by term-vector distance.
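To make the idea of term-vector distance concrete, here is a minimal sketch that turns two texts into word-count vectors and compares them. The specific measure (cosine distance), the tokenizer, and the function names `term_vector` and `cosine_distance` are assumptions made for illustration; this README does not state which distance Sumo actually computes.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector from lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two term-frequency vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

With this measure, documents sharing many terms score close to 0 and documents with no terms in common score 1.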
MongoDB >= 2.6.5
Python >= 2.7.5
For Debian and Ubuntu:
apt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc
We provide a Dockerfile to run a dockerized Sumo server.
docker build -t sumoserver .
docker run -p 5000:5000 sumoserver
git clone https://github.com/gdamdam/sumo.git
cd sumo
virtualenv ./venv
source venv/bin/activate
pip install -r requirements.txt
python requirements_nltk.py
Just launch the server:
sudo service mongodb start
python ./sumo_server.py -s IP
For help and a list of all the available options:
python ./sumo_server.py --help
The server provides a REST resource to analyze web documents and store the resulting analysis data.
The following command returns the list of all stored documents:
curl http://host:5000/sumo
The stored documents are labeled with an ID_DOC, in which every / character of the URL is replaced with __ (double underscore).
e.g.:
TARGET_URL: www.google.com/test
ID_DOC: www.google.com__test
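The labeling rule can be sketched in one line of Python. The helper name `url_to_doc_id` is hypothetical and is not taken from the Sumo code base; it only mirrors the substitution described above.

```python
def url_to_doc_id(url):
    """Replace every '/' in the URL with a double underscore, as described above."""
    return url.replace("/", "__")


print(url_to_doc_id("www.google.com/test"))
```

Note that the mapping is not always reversible: a URL that already contains `__` yields an ID_DOC from which the original slashes cannot be recovered unambiguously.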
To analyze a document and store it in the database:
curl http://host:5000/sumo -X POST -d 'url=TARGET_URL'
HTTP Status returned:
201: Created - the document at TARGET_URL was successfully analyzed and stored
409: Conflict - the TARGET_URL already exists in the storage
415: Unsupported - the TARGET_URL is malformed
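For scripted use, the same POST can be prepared from Python. The sketch below uses only the Python 3 standard library and assumes the server accepts a form-encoded `url` field, as the curl `-d` call implies. The names `build_analyze_request` and `POST_STATUS` are illustrative and not part of Sumo.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_analyze_request(base_url, target_url):
    """Build (but do not send) the POST request that submits target_url for analysis."""
    body = urlencode({"url": target_url}).encode("ascii")
    return Request(base_url.rstrip("/") + "/sumo", data=body, method="POST")


# Meaning of the status codes listed above for POST /sumo.
POST_STATUS = {
    201: "created: document analyzed and stored",
    409: "conflict: TARGET_URL already in storage",
    415: "unsupported: TARGET_URL is malformed",
}
```

The request can be sent with `urllib.request.urlopen(req)`. For the 409 and 415 cases `urlopen` raises `urllib.error.HTTPError`, whose `code` attribute can be looked up in `POST_STATUS`.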
To retrieve a stored document analysis:
curl http://host:5000/sumo/ID_DOC
HTTP Status returned:
200: OK
404: Not Found - the document does not exist
To delete a stored document:
curl http://host:5000/sumo/ID_DOC -X DELETE
HTTP Status returned:
204: No Content - document deleted
404: Not Found - the document does not exist
It is possible to retrieve the cluster of similar documents using the cluster resource:
curl http://host:5000/sumo/cluster/ID_DOC
HTTP Status returned:
200: OK
404: Not Found - the document does not exist
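As a compact summary of the resources above, the following sketch maps each operation to its HTTP method and URL. The function name `sumo_endpoints` and the operation labels are hypothetical; the paths and methods are the ones shown in the curl examples, assuming the server listens on port 5000.

```python
def sumo_endpoints(base_url, doc_id):
    """Return a {operation: (HTTP method, URL)} map for the resources described above."""
    base = base_url.rstrip("/") + "/sumo"
    return {
        "list": ("GET", base),
        "analyze": ("POST", base),
        "retrieve": ("GET", base + "/" + doc_id),
        "delete": ("DELETE", base + "/" + doc_id),
        "cluster": ("GET", base + "/cluster/" + doc_id),
    }


for name, (method, url) in sumo_endpoints("http://host:5000", "www.google.com__test").items():
    print(name, method, url)
```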
The running server also provides a very minimal JavaScript web interface to interact with the API. The interface is reachable at:
http://host:5000
Tips:
- Single-click an ID_DOC in the index to fill the form, then click analyze to retrieve the analysis.
- Double-click an ID_DOC in the index to delete it.