Home
DPDS (Data Provenance for Data Science) is a library that captures fine-grained provenance in a preprocessing pipeline. Built on top of pandas, DPDS provides a clean interface for capturing provenance without the need to invoke additional functions. The data scientist simply implements the pipeline and, after executing it, can analyze the resulting provenance graph using Neo4j.
Currently, the types of functions captured are as follows:
Category | Function | Description | Examples |
---|---|---|---|
Data Reduction | Feature Selection | One or more features are removed. | |
Data Reduction | Instance Drop | One or more records are removed. | |
Data Augmentation | Feature Augmentation | One or more features are added. | |
Data Augmentation | Instance Generation | One or more records are added. | |
Space Transformation | Dimensionality Reduction | Features and records are added/removed. The overall number of removed features and records is greater than those added. | |
Space Transformation | Space Augmentation | Features and records are added/removed. The overall number of added features and records is greater than those removed. | |
Space Transformation | Space Transformation | Features and records are added/removed. In this case, there can be a reduction in dimensionality for one axis and a space augmentation for the other. | |
Data Transformation | Value Transformation | The values of one or more features are transformed. | Examples |
Data Transformation | Imputation | Missing values in one or more features are filled with estimated values. | |
Feature Manipulation | Feature Rename | One or more features are renamed. | |
Data Combination | Join | Two or more datasets are combined based on a common attribute or key. | |
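To make the shape-based categories in the table more concrete, here is a minimal sketch that labels a transformation by comparing feature (column) and record (row) identifiers before and after a step. It is not DPDS's implementation or API: the function name `classify_shape_change` and the tie-breaking rules for the Space Transformation rows are assumptions, representing one possible reading of the descriptions above. Categories that do not change the shape (Value Transformation, Imputation), as well as Feature Rename and Join, cannot be told apart from shape alone and are outside this sketch.

```python
def classify_shape_change(cols_before, rows_before, cols_after, rows_after):
    """Label a step by the features (columns) and records (rows) it adds or removes.

    Hypothetical helper for illustration only; not part of DPDS.
    """
    cols_before, cols_after = set(cols_before), set(cols_after)
    rows_before, rows_after = set(rows_before), set(rows_after)

    cols_added = len(cols_after - cols_before)
    cols_removed = len(cols_before - cols_after)
    rows_added = len(rows_after - rows_before)
    rows_removed = len(rows_before - rows_after)

    cols_changed = cols_added + cols_removed > 0
    rows_changed = rows_added + rows_removed > 0

    if not cols_changed and not rows_changed:
        # Value-level operations (e.g. imputation) leave the shape untouched.
        return "No shape change"

    # Simple cases: exactly one axis changes, in exactly one direction.
    if cols_changed and not rows_changed:
        if cols_added == 0:
            return "Feature Selection"
        if cols_removed == 0:
            return "Feature Augmentation"
    if rows_changed and not cols_changed:
        if rows_added == 0:
            return "Instance Drop"
        if rows_removed == 0:
            return "Instance Generation"

    # Mixed cases (assumed reading of the three Space Transformation rows).
    net_cols = cols_added - cols_removed
    net_rows = rows_added - rows_removed
    if net_cols * net_rows < 0:
        # One axis shrinks while the other grows.
        return "Space Transformation"
    net_total = net_cols + net_rows
    if net_total < 0:
        return "Dimensionality Reduction"
    if net_total > 0:
        return "Space Augmentation"
    return "Space Transformation"
```

With pandas, the arguments could be taken from two snapshots of a DataFrame, for example `classify_shape_change(before.columns, before.index, after.columns, after.index)`.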
- neo4j >= 5.7.x
- pandas 1.5.0
For further details, refer to the requirements.txt file.
To create a new virtual environment (venv), use the guide at the following link.
source venv/bin/activate
pip install -r requirements.txt
It is recommended to install Docker using the official guide at the following link.
To change the options related to the Neo4j Docker image, modify the file neo4j/docker-compose.yml.
Start Neo4j in the background:
cd neo4j
docker compose up -d
Stop Neo4j:
cd neo4j
docker compose down
Default credentials:
- User: neo4j
- Password: admin
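Once the container is running, a quick connectivity check from Python can confirm that these credentials work. The sketch below assumes the official `neo4j` Python driver is installed and that the Compose file exposes the default Bolt port at `bolt://localhost:7687`; both are assumptions to verify against `neo4j/docker-compose.yml`. The environment variable names (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`) and the helper functions are hypothetical and are not part of DPDS.

```python
import os


def neo4j_settings(env=None):
    """Build connection settings, falling back to the defaults listed above.

    The environment variable names are illustrative assumptions.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),  # assumed Bolt address
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "admin"),
    }


def check_connection(settings):
    """Raise an exception if the server is unreachable or the credentials are rejected."""
    # Imported here so the settings helper works without the driver installed.
    from neo4j import GraphDatabase

    auth = (settings["user"], settings["password"])
    with GraphDatabase.driver(settings["uri"], auth=auth) as driver:
        driver.verify_connectivity()
```

Calling `check_connection(neo4j_settings())` after `docker compose up -d` returns silently when the database is reachable and raises otherwise.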
To access the Neo4j web interface: