Squirrel - Crawler of linked data.

Introduction

Squirrel is a crawler for the linked web. It provides several tools to search and collect data from the heterogeneous content of the linked web.

Build notes

You can build the project with mvn clean install and then use the Makefile and docker-compose:

  $ make build dockerize
  $ docker-compose build
  $ docker-compose up

Run

You can run Squirrel using a docker-compose file:

  $ docker-compose -f docker-compose-sparql.yml up

Squirrel uses Spring context configuration to define the implementation of its components at runtime. The default implementation file is spring-config/sparqlStoreBased.xml; you can define your own beans in it.

You can also define a different context for each worker. Check the docker-compose file and change the implementation file in each worker's environment variable.

These are the components of Squirrel that can be customized:

Fetcher

HTTPFetcher - Fetches data from HTML sources.
FTPFetcher - Fetches data from FTP sources.
SparqlBasedFetcher - Fetches data from SPARQL endpoints.
Note: The fetchers are not managed as Spring beans yet, since only three are available.

Analyzer

Analyses the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the SimpleAnalyzerManager. Any implementation should be passed to the constructor of this class, as in the example below:

<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>

If you implement your own analyzer, you must also implement the isEligible() method, which checks whether the analyzer meets the conditions for its analyze method to be called.

RDFAnalyzer - Analyses RDF formats.
HTMLScraperAnalyzer - Analyses and scrapes HTML data based on the Jsoup selector syntax (see: https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to)
HDTAnalyzer - Analyses HDT binary RDF format.
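
The eligibility check described above can be sketched as follows. This is a minimal illustration, not Squirrel's actual API: the SketchAnalyzer interface, its method signatures, the NTriplesLineAnalyzer class, and the runEligible helper are all hypothetical names invented for this example. The real analyzer interface in the squirrel.api module may take different parameters.

```java
import java.util.ArrayList;
import java.util.List;

class AnalyzerSketch {

    /** Hypothetical, simplified stand-in for an analyzer contract (not Squirrel's real interface). */
    interface SketchAnalyzer {
        /** Returns true if this analyzer can handle the given resource. */
        boolean isEligible(String uri, String contentType);

        /** Extracts triples (here: raw N-Triples lines) from the fetched content. */
        List<String> analyze(String uri, String content);
    }

    /** Example analyzer that only accepts N-Triples content. */
    static class NTriplesLineAnalyzer implements SketchAnalyzer {
        @Override
        public boolean isEligible(String uri, String contentType) {
            // Accept by media type or by file extension; both rules are assumptions.
            return "application/n-triples".equals(contentType) || uri.endsWith(".nt");
        }

        @Override
        public List<String> analyze(String uri, String content) {
            List<String> triples = new ArrayList<>();
            for (String line : content.split("\n")) {
                String trimmed = line.trim();
                // Skip blank lines and comment lines.
                if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                    triples.add(trimmed);
                }
            }
            return triples;
        }
    }

    /** Calls analyze() only on analyzers whose isEligible() returns true. */
    static List<String> runEligible(List<SketchAnalyzer> analyzers, String uri,
                                    String contentType, String content) {
        List<String> triples = new ArrayList<>();
        for (SketchAnalyzer analyzer : analyzers) {
            if (analyzer.isEligible(uri, contentType)) {
                triples.addAll(analyzer.analyze(uri, content));
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        List<SketchAnalyzer> analyzers = new ArrayList<>();
        analyzers.add(new NTriplesLineAnalyzer());
        String content = "# comment\n<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n";
        System.out.println(runEligible(analyzers, "http://example.org/data.nt", "text/plain", content));
    }
}
```

Keeping the eligibility test separate from analyze lets a manager skip analyzers cheaply before any parsing work is done.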

Collectors

Collects new URIs found during the analysis process and serializes them before they are sent to the Frontier.

SimpleUriCollector - Serializes URIs and stores them in memory (mainly used for testing purposes).
SqlBasedUriCollector - Serializes URIs and stores them in an HSQLDB database.
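
To illustrate the collect-then-serialize idea, here is a rough in-memory sketch. It is not the SimpleUriCollector implementation: the class name InMemoryUriCollectorSketch, the methods addNewUri and drain, and the choice of UTF-8 bytes as the serialized form are all assumptions made for this example.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InMemoryUriCollectorSketch {

    // Serialized URIs found so far, grouped by the URI of the crawled resource.
    private final Map<String, List<byte[]>> found = new HashMap<>();

    /** Serializes a newly discovered URI and keeps it in memory. */
    void addNewUri(String crawledUri, URI newUri) {
        byte[] serialized = newUri.toString().getBytes(StandardCharsets.UTF_8);
        found.computeIfAbsent(crawledUri, key -> new ArrayList<>()).add(serialized);
    }

    /** Deserializes and removes all URIs collected for one crawled resource. */
    List<URI> drain(String crawledUri) {
        List<byte[]> stored = found.remove(crawledUri);
        List<URI> result = new ArrayList<>();
        if (stored != null) {
            for (byte[] bytes : stored) {
                result.add(URI.create(new String(bytes, StandardCharsets.UTF_8)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InMemoryUriCollectorSketch collector = new InMemoryUriCollectorSketch();
        collector.addNewUri("http://example.org/page", URI.create("http://example.org/a"));
        collector.addNewUri("http://example.org/page", URI.create("http://example.org/b"));
        // In a real crawler, the drained URIs would be handed to the Frontier here.
        System.out.println(collector.drain("http://example.org/page"));
    }
}
```

A database-backed variant would replace the map with table inserts and reads, leaving the two methods unchanged from the caller's point of view.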

Sink

Responsible for persisting the collected RDF data.

FileBasedSink - Persists the triples in NT files.
InMemorySink - Persists the triples only in memory, not on disk (mainly used for testing purposes).
HdtBasedSink - Persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
SparqlBasedSink - Persists the triples in a SPARQL endpoint.
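
As a rough picture of what persisting triples in an NT file involves, the sketch below writes one N-Triples line per triple. It is an assumption-laden illustration and not the FileBasedSink code: the class NtFileSinkSketch and its methods addTriple and toNtLine are hypothetical, and only IRI objects are handled (no literals or blank nodes).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class NtFileSinkSketch implements AutoCloseable {

    private final BufferedWriter writer;

    NtFileSinkSketch(Path file) throws IOException {
        this.writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    /** Formats a triple of three IRIs as a single N-Triples line. */
    static String toNtLine(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    /** Appends one triple to the output file. */
    void addTriple(String subject, String predicate, String object) throws IOException {
        writer.write(toNtLine(subject, predicate, object));
        writer.newLine();
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file keeps the example free of side effects.
        Path out = Files.createTempFile("squirrel-sink-sketch", ".nt");
        try (NtFileSinkSketch sink = new NtFileSinkSketch(out)) {
            sink.addTriple("http://example.org/s", "http://example.org/p", "http://example.org/o");
        }
        List<String> lines = Files.readAllLines(out, StandardCharsets.UTF_8);
        System.out.println(lines);
        Files.delete(out);
    }
}
```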

