The main objective of this work is to preprocess and analyze the Wikidata database in order to assess its feasibility as a data source for an entity geolocation project. In addition, once the data source has been obtained, the aim is to analyze the performance of generating a world map with georeferenced instances.
To preprocess the Wikidata database, the truthy dump is first downloaded and read to obtain all the triples that contain the property P625 (coordinate location) as a predicate. In this way, all the Wikidata entities that are potentially georeferenceable on a world map are obtained; these are saved in a TSV file for later analysis.
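A minimal sketch of this extraction pass is shown below, assuming the dump is the gzipped N-Triples file latest-truthy.nt.gz; the file names and the coordinates.tsv output path are illustrative, not taken from the original pipeline. Wikidata stores coordinates as WKT literals of the form "Point(longitude latitude)".

```python
import gzip
import re

DUMP_PATH = "latest-truthy.nt.gz"  # illustrative path to the truthy dump
OUT_PATH = "coordinates.tsv"       # illustrative output file

P625 = "<http://www.wikidata.org/prop/direct/P625>"

# Matches the WKT literal Wikidata uses, e.g. "Point(-70.66 -33.45)"
POINT_RE = re.compile(r'"Point\(([-+0-9.eE]+) ([-+0-9.eE]+)\)"')

with gzip.open(DUMP_PATH, "rt", encoding="utf-8") as dump, \
        open(OUT_PATH, "w", encoding="utf-8") as out:
    for line in dump:
        # Each N-Triples line is "<subject> <predicate> <object> ."
        parts = line.split(" ", 2)
        if len(parts) < 3 or parts[1] != P625:
            continue
        match = POINT_RE.search(parts[2])
        if match is None:
            continue
        # Reduce the subject IRI to its Q-identifier, e.g. Q42
        entity = parts[0].rsplit("/", 1)[-1].rstrip(">")
        lon, lat = match.group(1), match.group(2)
        out.write(f"{entity}\t{lat}\t{lon}\n")
```

Streaming the compressed file line by line keeps memory usage flat, which matters because the truthy dump weighs tens of gigabytes even compressed.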
Once the georeferenced entities were obtained, the types to which these entities correspond were analyzed. For this, the same dump-reading procedure was repeated, and for each triple with the property P31 (instance of) as a predicate, the type was stored as a dictionary key with its occurrence count as the value. It should be noted that an entity may have more than one P31 statement; for example, the University of Chile is declared to be a public university, an open access publisher, and a research institute.
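This counting pass might look like the following sketch, reusing the illustrative file names from above and restricting the count to the georeferenced entities obtained in the previous step; collections.Counter plays the role of the dictionary described above.

```python
import gzip
from collections import Counter

DUMP_PATH = "latest-truthy.nt.gz"  # illustrative path, as above
COORDS_TSV = "coordinates.tsv"     # TSV produced in the previous step
P31 = "<http://www.wikidata.org/prop/direct/P31>"

# Q-identifiers of the entities that have a coordinate location
with open(COORDS_TSV, encoding="utf-8") as f:
    georeferenced = {line.split("\t", 1)[0] for line in f}

type_counts = Counter()

with gzip.open(DUMP_PATH, "rt", encoding="utf-8") as dump:
    for line in dump:
        parts = line.split(" ", 2)
        if len(parts) < 3 or parts[1] != P31:
            continue
        subject = parts[0].rsplit("/", 1)[-1].rstrip(">")
        if subject not in georeferenced:
            continue
        # The object IRI is the type entity; keep only its Q-identifier
        type_qid = parts[2].split(" ", 1)[0].rsplit("/", 1)[-1].rstrip(">")
        type_counts[type_qid] += 1

# The 25 most frequent types feed the table shown later
for qid, count in type_counts.most_common(25):
    print(f"{qid}\t{count}")
```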
To perform the visualizations, the D3.js and Folium tools were used.
The following images show the geolocation of 500,000 Wikidata entities using D3.js and Folium, respectively.
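As a rough illustration of the Folium side, the sketch below plots the extracted coordinates on an OpenStreetMap base layer; the tile choice, marker styling, and sample size are assumptions rather than the exact settings used for the figures.

```python
import csv
import folium

COORDS_TSV = "coordinates.tsv"  # TSV from the extraction step
SAMPLE_SIZE = 500_000           # number of entities plotted in the figures

world_map = folium.Map(location=[0, 0], zoom_start=2, tiles="OpenStreetMap")

with open(COORDS_TSV, encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for i, (entity, lat, lon) in enumerate(reader):
        if i >= SAMPLE_SIZE:
            break
        # Small circle markers keep the rendered output relatively light
        folium.CircleMarker(
            location=[float(lat), float(lon)],
            radius=1,
            popup=entity,
        ).add_to(world_map)

world_map.save("wikidata_map.html")
```

Writing this many individual markers into a single HTML file is heavy, so a clustering plugin such as folium.plugins.FastMarkerCluster may be preferable in practice.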
As explained above, the analysis of the data yielded the distribution of types among the georeferenced entities. The following table shows the 25 entity types that occur most frequently in Wikidata: