This repository contains the code and resources for implementing a contextual pre-filtering technique in data discovery tasks, using a modified MECCH model and data profiling strategies.
- MECCH: Contains the modified MECCH code adapted from the original MECCH repository. Modifications were made to support our custom graph with entities and attributes.
- EntityDiscovery: Includes the code for generating the ground truth, building graph-based schemas, and profiling datasets. This code is essential for understanding how the data was processed and prepared for training the MECCH model.
The training and testing datasets, along with their graph-based schemas and metadata, are provided for replicating our experiments and training the MECCH model.
The training ground truth files can be downloaded from the following links:
- **Datasets Zip**: Contains `datasetInfo.csv`, `entityAffinityLinks.csv`, and the datasets.
  - `datasetInfo.csv`: Lists each dataset with its corresponding ID and name. The ID is crucial, as the URI resources for the graph-based schemas are constructed using the format `http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/<datasetID>/<resource name>`.
  - `entityAffinityLinks.csv`: Contains `src_node_iri` and `dst_node_iri` columns, indicating affinity links between entities (see the parsing sketch after this list).
    - Each row in this file specifies the IRIs of two entity nodes (`src_node_iri` and `dst_node_iri`) in a graph-based schema.
    - For example, with the header `src_node_iri, dst_node_iri`, a row might read `http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/1/movements, http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/3/movements`.
    - The number after `Schema` corresponds to the dataset ID listed in `datasetInfo.csv`. For instance, `http://.../Schema/1/...` refers to the dataset with ID `1`, and `http://.../Schema/3/...` refers to the dataset with ID `3`.
    - These IRIs belong to resources in the graph-based schema. If you check the graph-based schema for dataset `1`, you will find a resource with the IRI `http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/1/movements`.
  - Download Link: Training Datasets Zip
- **Graph-Based Schemas**: Schemas generated from the datasets in TTL format.
  - Download Link: Graph-Based Schemas
- **Graph-Based Schemas with Extra Metadata**: Includes additional metadata, such as empty attributes.
  - Download Link: Graph-Based Schemas with Metadata
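To make the link between affinity rows and datasets concrete, here is a minimal sketch that extracts the dataset ID from each IRI (the path segment after `Schema`) and joins it with `datasetInfo.csv`. The column names assumed for `datasetInfo.csv` (`id`, `name`) are illustrative and should be checked against the downloaded file.

```python
import pandas as pd

# Affinity links: src_node_iri / dst_node_iri are the documented columns.
links = pd.read_csv("entityAffinityLinks.csv", skipinitialspace=True)

def dataset_id(iri: str) -> int:
    # IRIs follow .../DataSource/Schema/<datasetID>/<resource name>,
    # so the dataset ID is the path segment right after "Schema".
    parts = iri.strip().split("/")
    return int(parts[parts.index("Schema") + 1])

links["src_dataset_id"] = links["src_node_iri"].map(dataset_id)
links["dst_dataset_id"] = links["dst_node_iri"].map(dataset_id)

# datasetInfo.csv lists each dataset's ID and name; the column names used
# here ("id", "name") are assumptions -- adjust them to the actual header.
info = pd.read_csv("datasetInfo.csv")
enriched = links.merge(
    info.rename(columns={"id": "src_dataset_id", "name": "src_dataset_name"}),
    on="src_dataset_id", how="left",
)
print(enriched.head())
```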
The data used for validating our experiments is organized as follows:
- **Testing Datasets**: Includes `datasetInfo.csv`, `entityAffinityLinks.csv`, and the dataset files.
  - Download Link: Testing Datasets
- **Entity Tables**: Datasets provided as individual entity tables.
  - Download Link: Entity Tables
- **Graph-Based Schemas for Testing**: Graph-based schemas in TTL format for the testing datasets.
  - Download Link: Graph-Based Schemas for Testing
- **Graph-Based Schemas with Extra Metadata**: Contains metadata highlighting empty attributes.
  - Download Link: Testing Graph-Based Schemas with Metadata
For details on using this data to train and test the MECCH model, refer to the README inside the MECCH folder in this repository.
The MECCH folder is a copy of the original MECCH codebase with modifications to handle our type of graph structure. This includes:
- Extending the model to recognize entities and attributes
- Modifying internal code to align with our graph schema requirements
To prepare the training data for MECCH, use the script at `MECCH/entity_prefiltering_code/entity_with_nums_strs_inv.py`.
This script processes the raw data and generates the necessary input format for MECCH. Place the generated data files in `MECCH/data/nextia_entity_context`.
The raw data required for generating the MECCH input structure is located in `MECCH/data/nextia_entity_context/raw` and contains the following files (a quick inspection sketch is shown below):
- `entityNodes.csv`: Includes node ID, name, alias, IRI, dataset ID, dataset name, and node attributes.
- `strNodes.csv`: String nodes with attributes.
- `numNodes.csv`: Numerical nodes with attributes.
- `g_attr_links.csv`, `g_num_links.csv`, `g_rel_links.csv`: Link files for attributes, numbers, and relationships.
- `g_train_alignments_links.csv`, `g_val_alignments_links.csv`, `g_test_alignments_links.csv`: Alignment files for training, validation, and testing.
- `helpers.csv`: Refer to the MECCH documentation for further details.
Alternatively, you can download the pre-processed data from this link and unzip it into `MECCH/data/nextia_entity_context`.
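As a quick sanity check of the raw input (whether generated locally or unzipped from the download), the following sketch simply loads each raw CSV and prints its size and columns. It assumes the files are plain comma-separated CSVs with a header row; adjust `RAW_DIR` to match your checkout.

```python
import os
import pandas as pd

RAW_DIR = "MECCH/data/nextia_entity_context/raw"

# File names as listed above.
FILES = [
    "entityNodes.csv", "strNodes.csv", "numNodes.csv",
    "g_attr_links.csv", "g_num_links.csv", "g_rel_links.csv",
    "g_train_alignments_links.csv", "g_val_alignments_links.csv",
    "g_test_alignments_links.csv", "helpers.csv",
]

for name in FILES:
    path = os.path.join(RAW_DIR, name)
    if not os.path.exists(path):
        print(f"missing: {name}")
        continue
    df = pd.read_csv(path)  # assumes a standard comma-separated file with a header
    print(f"{name}: {len(df)} rows, columns = {list(df.columns)}")
```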
To train the model, follow the MECCH guidelines and use the command below with the specified parameters:
`python main.py -m MECCH -t link_prediction -d nextia -g 0`
- `-m MECCH`: Specifies the MECCH model.
- `-t link_prediction`: Defines the task as link prediction.
- `-d nextia`: Uses the nextia dataset, representing our custom graph schema.
- `-g 0`: Specifies GPU usage (if available); 0 is the first GPU.
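Before launching training with `-g 0`, it can be useful to confirm that a CUDA device is actually visible to PyTorch; this small check is independent of the MECCH code itself.

```python
import torch

# Pre-flight check for the GPU index passed via -g.
if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} CUDA device(s) available;",
          "-g 0 will use:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; training would have to run on CPU.")
```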
The generated model can be found in `MECCH/entity_prefiltering_code/entity_model` or downloaded from this link.
Making Predictions
To make predictions with the trained MECCH model, follow these steps:
- **Data Transformation**: Transform your experiment testing data into the MECCH-compatible format using `MECCH/entity_prefiltering_code/entity_with_nums_strs_inv_experiment_Test.py`. This will output the necessary structures in `MECCH/data/nextia_entity_context_experiment_test`, or you can download them from this link.
- **Prediction Execution**: Run `MECCH/entity_prefiltering_code/model_test_experiment_execution.py`, which executes predictions based on the trained model and the transformed test data.
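If you prefer to run both steps from a single script, a minimal sketch is shown below; it assumes both scripts can be invoked without additional command-line arguments, which should be verified against the scripts themselves.

```python
import subprocess

# Step 1 transforms the testing data into the MECCH-compatible structure;
# step 2 executes predictions with the trained model.
# Assumption: neither script needs extra command-line arguments.
for script in (
    "MECCH/entity_prefiltering_code/entity_with_nums_strs_inv_experiment_Test.py",
    "MECCH/entity_prefiltering_code/model_test_experiment_execution.py",
):
    subprocess.run(["python", script], check=True)
```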
For more detailed information on parameter configurations and MECCH requirements, refer to the MECCH documentation.