This repository contains the code and resources for implementing a contextual pre-filtering technique in data discovery tasks, using a modified MECCH model and data profiling strategies.
- MECCH: Contains the modified MECCH code adapted from the original MECCH repository. Modifications were made to support our custom graph with entities and attributes.
- EntityDiscovery: Includes the code for generating the ground truth, building graph-based schemas, and profiling datasets. This code is essential for understanding how the data was processed and prepared for training the MECCH model.
The training and testing datasets, along with their graph-based schemas and metadata, are provided for replicating our experiments and training the MECCH model.
The training ground truth files can be downloaded from the following links:
- **Datasets Zip**: Contains `datasetInfo.csv`, `entityAffinityLinks.csv`, and the datasets.
  - `datasetInfo.csv`: Lists each dataset with its corresponding ID and name. The ID is crucial, as the URI resources for the graph-based schemas are constructed using the format `http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/<datasetID>/<resource name>`.
  - `entityAffinityLinks.csv`: Contains `src_node_iri` and `dst_node_iri` columns, indicating affinity links between entities (see the parsing sketch after this list).
    - Each row in this file specifies the IRIs of two entity nodes (`src_node_iri` and `dst_node_iri`) in a graph-based schema.
    - For example, with the header `src_node_iri, dst_node_iri`, a row might read `http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/1/movements, http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/3/movements`.
    - The number after `Schema` corresponds to the dataset ID listed in `datasetInfo.csv`. For instance, `http://.../Schema/1/...` refers to the dataset with ID `1`, and `http://.../Schema/3/...` refers to the dataset with ID `3`.
    - These IRIs belong to resources in the graph-based schema. If you check the graph-based schema for dataset `1`, you will find a resource with the IRI `http://www.essi.upc.edu/DTIM/NextiaDI/DataSource/Schema/1/movements`.
  - Download Link: Training Datasets Zip
- **Graph-Based Schemas**: Schemas generated from the datasets in TTL format.
  - Download Link: Graph-Based Schemas
- **Graph-Based Schemas with Extra Metadata**: Includes additional metadata, such as empty attributes.
  - Download Link: Graph-Based Schemas with Metadata
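To make the link between affinity rows and datasets concrete, here is a minimal sketch that extracts the dataset ID from each IRI (the path segment after `Schema`) and joins it with `datasetInfo.csv`. The column names assumed for `datasetInfo.csv` (`id`, `name`) are illustrative and should be checked against the downloaded file.

```python
import pandas as pd

# Affinity links: src_node_iri / dst_node_iri are the documented columns.
links = pd.read_csv("entityAffinityLinks.csv", skipinitialspace=True)

def dataset_id(iri: str) -> int:
    # IRIs follow .../DataSource/Schema/<datasetID>/<resource name>,
    # so the dataset ID is the path segment right after "Schema".
    parts = iri.strip().split("/")
    return int(parts[parts.index("Schema") + 1])

links["src_dataset_id"] = links["src_node_iri"].map(dataset_id)
links["dst_dataset_id"] = links["dst_node_iri"].map(dataset_id)

# datasetInfo.csv lists each dataset's ID and name; the column names used
# here ("id", "name") are assumptions -- adjust them to the actual header.
info = pd.read_csv("datasetInfo.csv")
enriched = links.merge(
    info.rename(columns={"id": "src_dataset_id", "name": "src_dataset_name"}),
    on="src_dataset_id", how="left",
)
print(enriched.head())
```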
The data used for validating our experiments is organized as follows:
- **Testing Datasets**: Includes `datasetInfo.csv`, `entityAffinityLinks.csv`, and the dataset files.
  - Download Link: Testing Datasets
- **Entity Tables**: Datasets provided as individual entity tables.
  - Download Link: Entity Tables
- **Graph-Based Schemas for Testing**: Graph-based schemas in TTL format for the testing datasets.
  - Download Link: Graph-Based Schemas for Testing
- **Graph-Based Schemas with Extra Metadata**: Contains metadata highlighting empty attributes.
  - Download Link: Testing Graph-Based Schemas with Metadata
For details on using this data to train and test the MECCH model, refer to the README inside the MECCH folder in this repository.
The MECCH folder is a copy of the original MECCH codebase with modifications to handle our type of graph structure. This includes:
- Extending the model to recognize entities and attributes
- Modifying internal code to align with our graph schema requirements
To prepare the training data for MECCH, use the script at `MECCH/entity_prefiltering_code/entity_with_nums_strs_inv.py`.
This script processes the raw data and generates the necessary input format for MECCH. Place the generated data files in `MECCH/data/nextia_entity_context`.
The raw data required for generating the MECCH input structure is located in `MECCH/data/nextia_entity_context/raw` and contains the following files (a quick inspection sketch is shown below):
- `entityNodes.csv`: Includes node ID, name, alias, IRI, dataset ID, dataset name, and node attributes.
- `strNodes.csv`: String nodes with attributes.
- `numNodes.csv`: Numerical nodes with attributes.
- `g_attr_links.csv`, `g_num_links.csv`, `g_rel_links.csv`: Link files for attributes, numbers, and relationships.
- `g_train_alignments_links.csv`, `g_val_alignments_links.csv`, `g_test_alignments_links.csv`: Alignment files for training, validation, and testing.
- `helpers.csv`: Refer to the MECCH documentation for further details.
Alternatively, you can download the pre-processed data from this link and unzip it into `MECCH/data/nextia_entity_context`.
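As a quick sanity check of the raw input (whether generated locally or unzipped from the download), the following sketch simply loads each raw CSV and prints its size and columns. It assumes the files are plain comma-separated CSVs with a header row; adjust `RAW_DIR` to match your checkout.

```python
import os
import pandas as pd

RAW_DIR = "MECCH/data/nextia_entity_context/raw"

# File names as listed above.
FILES = [
    "entityNodes.csv", "strNodes.csv", "numNodes.csv",
    "g_attr_links.csv", "g_num_links.csv", "g_rel_links.csv",
    "g_train_alignments_links.csv", "g_val_alignments_links.csv",
    "g_test_alignments_links.csv", "helpers.csv",
]

for name in FILES:
    path = os.path.join(RAW_DIR, name)
    if not os.path.exists(path):
        print(f"missing: {name}")
        continue
    df = pd.read_csv(path)  # assumes a standard comma-separated file with a header
    print(f"{name}: {len(df)} rows, columns = {list(df.columns)}")
```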
To train the model, follow the MECCH guidelines and use the command below with the specified parameters:
`python main.py -m MECCH -t link_prediction -d nextia -g 0`
- `-m MECCH`: Specifies the MECCH model.
- `-t link_prediction`: Defines the task as link prediction.
- `-d nextia`: Uses the nextia dataset, representing our custom graph schema.
- `-g 0`: Specifies GPU usage (if available); 0 is the first GPU.
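Before launching training with `-g 0`, it can be useful to confirm that a CUDA device is actually visible to PyTorch; this small check is independent of the MECCH code itself.

```python
import torch

# Pre-flight check for the GPU index passed via -g.
if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} CUDA device(s) available;",
          "-g 0 will use:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; training would have to run on CPU.")
```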
The generated model can be found in `MECCH/entity_prefiltering_code/entity_model` or downloaded from this link.
Making Predictions
To make predictions with the trained MECCH model, follow these steps:
- **Data Transformation**: Transform your experiment testing data into the MECCH-compatible format using `MECCH/entity_prefiltering_code/entity_with_nums_strs_inv_experiment_Test.py`. This will output the necessary structures in `MECCH/data/nextia_entity_context_experiment_test`, or you can download them from this link.
- **Prediction Execution**: Run `MECCH/entity_prefiltering_code/model_test_experiment_execution.py`, which executes predictions based on the trained model and the transformed test data.
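If you prefer to run both steps from a single script, a minimal sketch is shown below; it assumes both scripts can be invoked without additional command-line arguments, which should be verified against the scripts themselves.

```python
import subprocess

# Step 1 transforms the testing data into the MECCH-compatible structure;
# step 2 executes predictions with the trained model.
# Assumption: neither script needs extra command-line arguments.
for script in (
    "MECCH/entity_prefiltering_code/entity_with_nums_strs_inv_experiment_Test.py",
    "MECCH/entity_prefiltering_code/model_test_experiment_execution.py",
):
    subprocess.run(["python", script], check=True)
```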
For more detailed information on parameter configurations and MECCH requirements, refer to the MECCH documentation.