Sumo 0.1

Sumo is a tool for the semantic analysis of web articles. It extracts the content from an article web page, analyzes it, and returns word frequencies, recognized entities, and an automatic summary. It also returns related articles that were previously analyzed, ranked by term vector distance.
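As a rough illustration of what "term vector distance" can mean, the sketch below builds word-frequency vectors and compares them with cosine distance. The choice of cosine distance, the tokenizer, and the function names `term_vector` and `cosine_distance` are assumptions made for this example; the metric Sumo actually uses is not stated here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (word -> count) from raw text."""
    # Assumption: lowercase alphanumeric tokens; Sumo's own tokenizer may differ.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


def cosine_distance(vec_a, vec_b):
    """Return 1 - cosine similarity of two term-frequency vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

With this measure, two articles that share no terms are at distance 1.0, and an article compared with itself is at distance 0 (up to floating-point rounding).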

Main requirements

MongoDB >= 2.6.5
Python >= 2.7.5

For Debian and Ubuntu:

apt-get install mongodb python python-dev python-virtualenv libxml2-dev libxslt-dev zlib1g-dev libjpeg-dev gcc

Using Docker

We provide a Dockerfile to run a dockerized Sumo server.

docker build -t sumoserver .
docker run -p 5000:5000 sumoserver

Basic Installation

git clone https://github.com/gdamdam/sumo.git
cd sumo
virtualenv ./venv
source venv/bin/activate
pip install -r requirements.txt
python requirements_nltk.py

Start

Start MongoDB, then launch the server:

sudo service mongodb start
python ./sumo_server.py -s IP

For help and a list of all available options:

python ./sumo_server.py --help

The server provides a REST resource to analyze a web document and store the analysis data.

API Usage

The following command returns the list of all stored documents:

curl http://host:5000/sumo

Each stored document is labeled with an ID_DOC, which is the document URL with every / character replaced by __ (double underscore).

e.g.:

 TARGET_URL: www.google.com/test
     ID_DOC: www.google.com__test
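The mapping above can be sketched in a few lines of Python. The function name `url_to_doc_id` is hypothetical, and how the server treats a URL scheme such as `http://` is not covered by the example above, so the sketch only performs the documented replacement.

```python
def url_to_doc_id(target_url):
    """Derive the ID_DOC label by replacing every '/' with a double underscore."""
    return target_url.replace("/", "__")


# The example from this README:
doc_id = url_to_doc_id("www.google.com/test")
```

Here `doc_id` is `www.google.com__test`, matching the example above.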

To analyze a document and store the result in the database:

curl http://host:5000/sumo -X POST -d 'url=TARGET_URL'

HTTP Status returned:

	201:	Created		- the document at TARGET_URL was successfully analyzed and stored
	409:	Conflict	- the TARGET_URL already exists in the storage
	415:	Unsupported	- the TARGET_URL is malformed
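The same POST can be issued from a Python 3 client. The sketch below only builds the request and maps the documented status codes to messages; the helper names `build_analyze_request` and `describe_post_status` are hypothetical, and the form body is URL-encoded by `urlencode`, which a standard form parser on the server side is assumed to decode back to the plain TARGET_URL.

```python
import urllib.parse
import urllib.request

# Status codes documented for POST /sumo.
POST_STATUS = {
    201: "Created: document analyzed and stored",
    409: "Conflict: TARGET_URL already exists in the storage",
    415: "Unsupported: TARGET_URL is malformed",
}


def build_analyze_request(base_url, target_url):
    """Build (but do not send) the POST request that submits target_url."""
    body = urllib.parse.urlencode({"url": target_url}).encode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/sumo", data=body, method="POST"
    )


def describe_post_status(code):
    """Translate a status code returned by POST /sumo into a short message."""
    return POST_STATUS.get(code, "Unexpected status %d" % code)
```

To send the request, pass it to `urllib.request.urlopen`. Note that `urlopen` raises `urllib.error.HTTPError` for the 409 and 415 responses; the status is then available as the `code` attribute of the exception.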

To retrieve a stored document analysis:

curl http://host:5000/sumo/ID_DOC

HTTP Status returned:

	200:	OK			
	404:	Not Found 	- the document does not exist

To delete a stored document:

curl http://host:5000/sumo/ID_DOC -X DELETE

HTTP Status returned:

	204:	No Content	- document deleted 
	404:	Not Found 	- the document does not exist

To retrieve the cluster of documents similar to a stored document, use the cluster resource:

curl http://host:5000/sumo/cluster/ID_DOC

HTTP Status returned:

	200:	OK
	404:	Not Found 	- the document does not exist
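For scripting against the per-document resources, a client can build the URLs from the server base URL and an ID_DOC. This is a minimal sketch: the helper names `document_url` and `cluster_url` are hypothetical, the ID_DOC is inserted into the path unchanged, and the format of the response bodies is not assumed.

```python
# Status codes documented in this README for the per-document resources.
EXPECTED_STATUS = {
    "GET": {200, 404},      # retrieve an analysis or a cluster
    "DELETE": {204, 404},   # delete a stored document
}


def document_url(base_url, doc_id):
    """URL of a stored document analysis; used with GET or DELETE."""
    return "%s/sumo/%s" % (base_url.rstrip("/"), doc_id)


def cluster_url(base_url, doc_id):
    """URL of the cluster of documents similar to doc_id; used with GET."""
    return "%s/sumo/cluster/%s" % (base_url.rstrip("/"), doc_id)
```

For example, `cluster_url("http://host:5000", "www.google.com__test")` gives `http://host:5000/sumo/cluster/www.google.com__test`.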

Web Interface

The running server also provides a minimal JavaScript web interface for interacting with the API. The interface is reachable at:

http://host:5000

Tips:

Single-click an ID_DOC in the index to fill in the form, then click analyze to retrieve the analysis.
Double-click an ID_DOC in the index to delete it.
