MinoanER

Minoan ER is an Entity Resolution (ER) framework, built by researchers in Crete (the land of the ancient Minoan civilization). Entity resolution aims to identify descriptions that refer to the same entity within or across knowledge bases.

The website of the project is http://www.csd.uoc.gr/~vefthym/minoanER/

The functionality of this framework is described in detail in the following PhD thesis (mostly in Chapter 4):
http://csd.uoc.gr/~vefthym/DissertationEfthymiou.pdf

MinoanER is implemented in Java 8+, using Apache Spark. We assume that a Spark cluster is available. Our code has been tested in a Spark cluster with HDFS and Mesos.

The steps followed by MinoanER are Blocking, Meta-blocking and Matching. Currently, the (token) blocking step is taken from https://github.com/vefthym/ERframework/blob/master/src/NewApproaches/ExportDatasets.java, but it can easily be incorporated in this repository as a Spark task as well.

Reference

To cite this work, please use the following reference:
"Vasilis Efthymiou, George Papadakis, Kostas Stefanidis, Vassilis Christophides: MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities. EDBT 2019: 373-384"
Pdf available here: https://openproceedings.org/2019/conf/edbt/EDBT19_paper_44.pdf

Running MinoanER

The main file is https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/workflow/Main.java. As documented in this file, it assumes 5 input paths and 1 output path, taken as runtime arguments:

inputBlocking:
The resulting blocks from token blocking. You can generate such a file from https://github.com/vefthym/ERframework/blob/master/src/NewApproaches/ExportDatasets.java. Each line corresponds to a block and its contents. The formatting should be:
blockId TAB entityIdFromD1#entityIdFromD1# ... ;entityIdFromD2#entityIdFromD2# ...
All those Ids should be positive integers.
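
To make the block format concrete, here is a minimal sketch of a parser for one line of the inputBlocking file. `BlockLine` is a hypothetical helper and is not part of MinoanER. It assumes the separators exactly as written in the format line above (TAB after the block id, `#` between entity ids, `;` between the two datasets) and tolerates a trailing `#`. It checks only the entity ids for positivity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of MinoanER): parses one line of the inputBlocking file.
final class BlockLine {
    final int blockId;
    final List<Integer> d1Entities;
    final List<Integer> d2Entities;

    private BlockLine(int blockId, List<Integer> d1Entities, List<Integer> d2Entities) {
        this.blockId = blockId;
        this.d1Entities = d1Entities;
        this.d2Entities = d2Entities;
    }

    // Expected shape: blockId TAB id#id#...;id#id#...
    static BlockLine parse(String line) {
        String[] idAndContents = line.split("\t", 2);
        if (idAndContents.length != 2) {
            throw new IllegalArgumentException("Expected 'blockId TAB contents': " + line);
        }
        int blockId = Integer.parseInt(idAndContents[0].trim());
        // Keep empty sides (limit -1) so a missing dataset is detected, not silently dropped.
        String[] sides = idAndContents[1].split(";", -1);
        if (sides.length != 2) {
            throw new IllegalArgumentException("Expected exactly one ';' between the two datasets: " + line);
        }
        return new BlockLine(blockId, parseIds(sides[0]), parseIds(sides[1]));
    }

    private static List<Integer> parseIds(String side) {
        List<Integer> ids = new ArrayList<>();
        for (String token : side.split("#")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray or trailing '#'
            }
            int id = Integer.parseInt(trimmed);
            if (id <= 0) {
                throw new IllegalArgumentException("Entity ids must be positive integers: " + id);
            }
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        BlockLine block = parse("7\t1#2#;15#16#");
        // prints: 7 [1, 2] [15, 16]
        System.out.println(block.blockId + " " + block.d1Entities + " " + block.d2Entities);
    }
}
```

Running such a check over the whole file before submitting the Spark job is a cheap way to catch malformed lines early.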

inputTriples1/2:
The raw RDF triples of the first/second KB in N-triples format (without the trailing " ." part).
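
If your N-Triples files still carry the statement terminator, it has to be stripped first. The following is a minimal sketch of that pre-processing step, assuming one triple per line. `NTriplesTrimmer` is a hypothetical name and is not part of MinoanER.

```java
// Hypothetical helper (not part of MinoanER): removes the trailing " ." of an N-Triples line.
final class NTriplesTrimmer {

    static String stripTerminator(String line) {
        String trimmed = line.trim();
        // In N-Triples, a well-formed subject/predicate/object sequence ends with '>', '"',
        // a language tag or a blank node label, so a final '.' is the statement terminator.
        if (trimmed.endsWith(".")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String raw = "<http://example.org/a> <http://example.org/name> \"Knossos\" .";
        // prints: <http://example.org/a> <http://example.org/name> "Knossos"
        System.out.println(stripTerminator(raw));
    }
}
```

The method leaves lines that have already been stripped unchanged, so it is safe to apply twice.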

entityIds1/2:
To save some space, we replace all entity URLs with numeric (positive integer) ids. You should provide this mapping in the entityIds file of each KB. Each line corresponds to one mapping and should be in the form:
entityURL TAB numericId
The same numericId should not be assigned to two different entityURLs and the entityURLs should be the ones appearing in the raw RDF input (inputTriples1/2).
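
One way to build such a mapping for a single KB is sketched below. `EntityIdMapping` is a hypothetical helper and is not part of MinoanER. Sequential ids starting at 1 are an assumption, since any unique positive integers satisfy the constraints above. The example URLs are placeholders and should be replaced with the entity URLs written exactly as they appear in your triples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of MinoanER): builds the entityURL -> numericId mapping of one KB.
final class EntityIdMapping {

    // Assigns 1, 2, 3, ... in order of first appearance; repeated URLs keep their first id.
    static Map<String, Integer> assignIds(Iterable<String> entityUrls) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String url : entityUrls) {
            ids.putIfAbsent(url, ids.size() + 1);
        }
        return ids;
    }

    // Renders the mapping as "entityURL TAB numericId" lines.
    static List<String> toLines(Map<String, Integer> ids) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : ids.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.org/a", "http://example.org/b", "http://example.org/a");
        for (String line : toLines(assignIds(urls))) {
            System.out.println(line);
        }
    }
}
```

Because every URL receives its id from the current map size, no two URLs can share an id, which is the uniqueness requirement stated above.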

outputPath:
The (HDFS) path in which the output mappings will be stored. The format of the generated output is:
entityIdFromD1 TAB entityIdFromD2
for each pair of entities that have been found to match.
WARNING: the outputPath directory is deleted on each run.

Example datasets:
You can find examples of datasets used in MinoanER on our project's website: http://csd.uoc.gr/~vefthym/minoanER/datasets.html. If you use those datasets, here are some helpful tips for pre-processing the data:

You can convert RDF files into objects of the class EntityProfile using this DataReader. There is another reader for the ground-truth files.

Both readers have a storeSerializedObject method to store the result on disk.

Setup and Tuning

You can tune the Spark session parameters (number of workers, executors, memory, parallelism, etc.) through the setUpSpark method in https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/utils/Utils.java. Adjust the body of this method to reflect the resources of your Spark cluster.

In the main method, you will find some hardcoded attributes that act as entity names (labels) for the datasets that we have tested. Those attributes have been generated automatically by getting the top attributes of each KB based on the harmonic mean of support and discriminability (see related publications). You can hardcode the corresponding attributes for your KBs, or find them automatically by calling the methods found in the class https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/relationsWeighting/RelationsRank.java.
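
The ranking idea can be illustrated with a small sketch. It is not the code of RelationsRank. `AttributeRanking`, the attribute names and the scores are made up for illustration. It assumes that support and discriminability have already been computed per attribute as numbers in [0, 1] (their definitions are given in the related publications) and only shows the harmonic mean and the top-k selection.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only (not the RelationsRank implementation).
final class AttributeRanking {

    // Harmonic mean of two non-negative scores; defined as 0 when both are 0.
    static double harmonicMean(double support, double discriminability) {
        double sum = support + discriminability;
        return sum == 0.0 ? 0.0 : 2.0 * support * discriminability / sum;
    }

    // scores maps an attribute name to {support, discriminability}.
    static List<String> topAttributes(Map<String, double[]> scores, int k) {
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort(Comparator.comparingDouble(
                (String name) -> harmonicMean(scores.get(name)[0], scores.get(name)[1])).reversed());
        return new ArrayList<>(names.subList(0, Math.min(k, names.size())));
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("type", new double[] {1.0, 0.01});  // frequent but not discriminative
        scores.put("label", new double[] {0.9, 0.8});  // frequent and discriminative
        scores.put("name", new double[] {0.5, 1.0});   // less frequent, fully discriminative
        System.out.println(topAttributes(scores, 2));
    }
}
```

The harmonic mean rewards attributes that score well on both measures: an attribute present in every entity but with almost no distinct values (like `type` in the example) ends up near the bottom of the ranking.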
