We all use Wikipedia in our everyday life to look up some information, but the problem
arises when your questions become more complex, i.e., how many cities in Germany have
the population more than 5 million or what cities in Europe were visited the most for the
last year. In this case, Wikipedia can't give you the answer, because for that it
needs to look up thousands of articles and analyze the information to find the answer.
Although most of the factual information is present in so-called info-boxes, they are useful
only when a question is related to a single article. So Wikipedia presents by itself the unstructured
(in the form of plain text) and semi-structured (info-boxes, tables, graphs) free source of encyclopedic data.
And as it transpired the questions, or queries if you will, like the above mentioned are
pretty standard, and so the DBpedia project was created to make all information available from
Wikipedia more structured and to put in such a form that even more complex queries can
be answered quickly by the machine.
The DBvoyage project tries to do the same what DBpedia did, but this time with another source of data, travel data taken from Wikivoyage. Wikivoyage data format already has some structure, so we would be happy that our task is almost done, but that is only on the surface. The data Wikivoyage has semi-structured; it is formed in accord with what they call it, JSON listing, and as it was discovered, not all contributors strictly follow it. So here we propose our solution -- DBvoyage.
With DBvoyage user interacts,
making SPARQL queries or directly referencing URL node in DBvoyage graph
to view all triples it has.
Also all Wikivoyage information about activities and attractions is present in the DBvoyage graph and can be referenced with URL found in the response. Example for activity in Aarhus: Bicycle Tour Example for attraction in Aarhus: ARoS
- See section
- Do section
- Article abstract
- Cities section
- Region section
- Article title
- Wikivoyage article link
They process all data from Wikivoyage XML dump. Next step is transitive closure of graph from created triples, with at most 12 inner vertices in path. The final step is serialization of triples in N-Triples format.
All triples are loaded to the graph over which SPARQL queries can be executed.
Basic nodejs server providing interface with project information, SPARQL queries examples and link to editor for writing SPARQL queries over dbvoyage graph.
In order to view all attractions DBvoyage knows are located in Ukraine: Go to SPARQL Editor and write the following query
select
?attractions
where
{
<http://ec2-13-59-194-238.us-east-2.compute.amazonaws.com/ontology/article/Ukraine>
<http://ec2-13-59-194-238.us-east-2.compute.amazonaws.com/ontology/property/hasAttraction>
?attractions
}
- It needs a significant amount of time to compute the partial transitive closure of DBvoyage. In essence, it tries to create the relation between node A and node C, if node C is reachable from A by going through no more than n intermediate nodes. For example, France has Paris, Paris has Eiffel Tower; therefore we need to create the connection between node France and node Eiffel Tower, but this task becomes more compute-heavy when the number of intermediate nodes goes from one to five or ten, for such task no better than O(n^3) algorithm exists. Optimization of this task would significantly improve the time needed to update existing nodes and connections between them in case of the publication of the article about a new country/city/place.
- For some section of JSON listings the article writers made it very hard to develop the extractor that generalizes for all articles: unknowable usage of the tabs, different ways of formatting the tabular information and URLs, and other cases of deviating from the Wikivoyage JSON listing format were making Virtuoso complain about the triples not conforming to any of the known RDF serialization standards.
The current version of DBvoyage managed to extract 1 688 488 semantic triples. In compressed form, all the data has a size of 38.5 MB. The link to download DBvoyage compressed dump: dbvoyage.zip
- UI/UX of the DBvoyage can be improved: design of homepage, SPARQL editor and SPARQL response page
- Improvement of the quality of existing extractors
- Addition of the extractors for other sections of Wikivoyage JSON listings
- Improvement of the relevance of the connections between the graph vertices created by running transitive closure
- Optimization of the time needed for running transitive closure on the DBvoyage graph
Feel free to create pull requests with any tasks accomplished related to the steps described above.