Vectograph

Vectograph is an open-source software library for automatically creating graph-structured data from given tabular data.

Creating Structured Data from Tabular Data
Installation
Examples

Creating Structured Data from Tabular Data

Let X be an m by n matrix representing the input tabular data. The structured data is created by following these steps:

Apply the QCUT algorithm to each column that has at least min_unique_val_per_column unique values.
Consider
1. the i-th row as the i-th concise bounded description (CBD) of the i-th event.
2. the j-th column as the j-th relation/predicate/edge.
3. a triple as event_i -> relation_j -> X_ij.
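The steps above can be sketched in a few lines of pandas. This is a minimal illustration rather than Vectograph's implementation: `tabular_to_triples` is a hypothetical helper, `pandas.qcut` stands in for the library's QCUT class, and the `Event_i`, `Feature_Category_j`, and `j_quantile_k` names are copied from the CBD example in this README.

```python
import pandas as pd


def tabular_to_triples(df, min_unique_val_per_column=6, num_quantile=5):
    """Hypothetical helper: discretize columns, then emit (subject, predicate, object) triples."""
    binned = {}
    for j, col in enumerate(df.columns):
        values = df[col]
        if values.nunique() >= min_unique_val_per_column:
            # Quantile-based binning; labels=False returns integer bin indices.
            codes = pd.qcut(values, q=num_quantile, labels=False, duplicates="drop")
            binned[j] = [f"{j}_quantile_{int(c)}" for c in codes]
        else:
            # Columns with too few unique values are left as they are.
            binned[j] = [str(v) for v in values]

    triples = []
    for i in range(len(df)):           # i-th row -> i-th event
        for j in range(df.shape[1]):   # j-th column -> j-th relation
            triples.append((f"Event_{i}", f"Feature_Category_{j}", binned[j][i]))
    return triples


# Toy input (assumed data): column "a" is discretized, column "b" is not.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
triples = tabular_to_triples(toy)
for s, p, o in triples[:4]:
    print(s, p, o)
```

Each row yields one triple per column, so an m by n table produces m * n triples. The sketch assumes the table has no missing values.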

Assume that the first row of fetch_california_housing is

[   8.3252       41.            6.98412698    1.02380952  322. 2.55555556   37.88       -122.23      ]

Applying the QCUT algorithm with the default parameters min_unique_val_per_column=6 and num_quantile=5 generates the 0-th CBD

<Event_0> <Feature_Category_0> <0_quantile_4> .
<Event_0> <Feature_Category_1> <1_quantile_4> .
<Event_0> <Feature_Category_2> <2_quantile_4> .
<Event_0> <Feature_Category_3> <3_quantile_1> .
<Event_0> <Feature_Category_4> <4_quantile_0> .
<Event_0> <Feature_Category_5> <5_quantile_1> .
<Event_0> <Feature_Category_6> <6_quantile_4> .
<Event_0> <Feature_Category_7> <7_quantile_0> .

which consists of n triples. <Feature_Category_0> represents the 0-th relation, i.e., the 0-th column, whereas <0_quantile_4> represents a tail entity, i.e., the 4-th bin of the 0-th column of the tabular data. After the data conversion, the values of each bin are stored. For instance, running examples/sklearn_example.py generates Feature_Category_0_Mapping.csv, which indicates that 0_quantile_4 corresponds to a bin that covers all values greater than or equal to 5.10972.
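To make the bin-to-value mapping concrete, the sketch below recovers the bin edges of one column with `pandas.qcut(..., retbins=True)`. The column layout of the table built here (`bin`, `lower`, `upper`) is an assumption for illustration; the actual layout of Feature_Category_0_Mapping.csv may differ.

```python
import pandas as pd

# Toy column (assumed data); in practice this would be one column of the input table.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# retbins=True additionally returns the quantile edges used for binning.
codes, edges = pd.qcut(values, q=5, labels=False, retbins=True)

# One row per bin: the entity name and the value range it covers.
mapping = pd.DataFrame({
    "bin": [f"0_quantile_{k}" for k in range(len(edges) - 1)],
    "lower": edges[:-1],
    "upper": edges[1:],
})
print(mapping)
```

Looking up a tail entity such as `0_quantile_4` in this table gives the numeric range it stands for, which is the kind of information the generated mapping file is described as holding.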

Installation

git clone https://github.com/dice-group/Vectograph.git
conda create -n vectograph python=3.6 # Or make sure that you have Python >= 3.6.
conda activate vectograph
pip install -e . 
python -c "import vectograph"
python -m pytest tests

Examples

API Example

from vectograph.transformers import GraphGenerator
from vectograph.quantizer import QCUT
import pandas as pd
from sklearn import datasets

X, y = datasets.fetch_california_housing(return_X_y=True)
X_transformed = QCUT(min_unique_val_per_column=6, num_quantile=5).transform(pd.DataFrame(X))
# Add prefix
X_transformed.index = 'Event_' + X_transformed.index.astype(str)
kg = GraphGenerator().transform(X_transformed)

for s, p, o in kg:
    print(s, p, o)
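If the triples are needed as text, they can be serialized in the same shape as the CBD excerpt shown earlier. The sketch below is an assumption-laden illustration: `to_ntriples` is a hypothetical helper, and the hard-coded `sample_triples` stand in for the output of `GraphGenerator`, whose exact item format is not documented here.

```python
def to_ntriples(triples):
    """Hypothetical helper: format (s, p, o) tuples as '<s> <p> <o> .' lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Stand-in data (assumed); replace with the triples produced by the API example.
sample_triples = [
    ("Event_0", "Feature_Category_0", "0_quantile_4"),
    ("Event_0", "Feature_Category_1", "1_quantile_4"),
]

serialized = to_ntriples(sample_triples)
print(serialized)
```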

Scripting Example

Create a toy dataset via sklearn. Available datasets: boston, iris, diabetes, digits, wine, and breast_cancer.

python create_toy_data.py --toy_dataset_name "boston"
# Discretize each column having at least 12 unique values into 10 quantiles, otherwise do nothing
python main.py --tabularpath "boston.csv" --kg_name "boston.nt" --num_quantile=10 --min_unique_val_per_column=12
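A generated knowledge graph can be inspected with a few lines of Python. The sketch below assumes that the output file uses the `<s> <p> <o> .` shape of the CBD excerpt in this README; `parse_triples` is a hypothetical helper, and the inline `sample` text (including its third line) is made up for illustration instead of being read from boston.nt.

```python
import re
from collections import Counter

# Matches one line of the simplified triple shape shown in this README: <s> <p> <o> .
TRIPLE_PATTERN = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_triples(text):
    """Hypothetical helper: return (s, p, o) tuples for every well-formed line."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE_PATTERN.match(line.strip())
        if match:
            triples.append(match.groups())
    return triples


# Inline stand-in for the contents of a generated .nt file (assumed data).
sample = """<Event_0> <Feature_Category_0> <0_quantile_4> .
<Event_0> <Feature_Category_1> <1_quantile_4> .
<Event_1> <Feature_Category_0> <0_quantile_4> ."""

parsed = parse_triples(sample)
# Count how many events fall into each bin (tail entity).
tail_counts = Counter(o for _, _, o in parsed)
print(tail_counts.most_common())
```

Counting tail entities this way gives a quick check that the quantile bins are populated as expected before the graph is passed on to an embedding model.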

Integration

From tabular data to knowledge graph embeddings: Scripting Vectograph & Knowledge Graph Embeddings at Scale

# (1) Clone the repositories.
git clone https://github.com/dice-group/dice-embeddings
git clone https://github.com/dice-group/vectograph.git
# (2) Create a virtual environment and install the dependencies of both frameworks.
conda create -n dice python=3.9.12 && conda activate dice
pip3 install -r dice-embeddings/requirements.txt
pip3 install -e vectograph/.
# (3) Create a knowledge graph from an example dataset in sklearn.datasets (e.g., wine or fetch_california_housing; boston is used here)
python vectograph/create_toy_data.py --toy_dataset_name "boston"
python vectograph/main.py --tabularpath "boston.csv" --kg_name "boston.nt" --num_quantile=10 --min_unique_val_per_column=12
# (4) Preparation for KGE
# (4.1) Create an experiment folder, move the RDF knowledge graph into it, and rename it
mkdir Example && mv boston.nt Example/train.txt
# (5) Generate embeddings
python dice-embeddings/main.py --path_dataset_folder "Example" --model "ConEx"
# A folder named after the current time is created; it contains the following files
# ConEx_entity_embeddings.npz    entity_to_idx.gzip  relation_to_idx.gzip
# ConEx_relation_embeddings.csv  idx_train_df.gzip   report.json
# configuration.json             model.pt            train_df.gzip

How to cite

For any further questions, please contact: caglar.demir@upb.de
