GitHub - northeastern-datalab/expand-t: Repository for Expand-T: Demonstrating Table Reclamation and Expansion

This repository contains code for Expand-T: Demonstrating Table Reclamation and Expansion.

Currently, we have a Jupyter Notebook (sample_demo.ipynb) that runs Expand-T end-to-end. Upon acceptance, we will create a web application for users to thoroughly explore Expand-T.

We also have a demonstration video in which we step through the Jupyter Notebook and discuss Expand-T's functionalities: https://tinyurl.com/expand-t-demo

Expand-T consists of three steps: Table Reclamation, Table Expansion, and Table Exploration. For Table Reclamation, we adopt code from Gen-T: Table Reclamation in Data Lakes, which is publicly available here.

Given a Source Table and a data lake, we first discover relevant tables (termed candidate tables) from the data lake and prune this set to only include tables needed to fully reclaim the Source Table (termed originating tables). This code can be found in the discovery/ folder:

1. We first retrieve an initial set of relevant tables from the data lake using an existing table discovery method. By default, we use Starmie (https://github.com/megagonlabs/starmie).
2. In discovery/discover_candidates.py, we get a set of candidate tables, either from the set of tables returned in step (1) or directly from the data lake. To do so, we find tables containing columns that have high set overlap with columns in the Source Table.
3. In discovery/prune_candidates.py, Gen-T prunes the set of candidate tables to a set of originating tables, keeping only the tables that are needed to reproduce the Source Table when integrated.
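
The set-overlap idea behind candidate discovery can be sketched as follows. This is a minimal illustration only, not the code in discovery/discover_candidates.py: the function names (`column_overlap`, `best_overlap`, `select_candidates`), the 0.5 threshold, and the toy tables are all assumptions made for this example, and tables are modeled as plain dictionaries that map a column name to its values.

```python
from typing import Dict, List, Set

# Hypothetical table model for this sketch: column name -> column values.
Table = Dict[str, List[object]]


def column_overlap(source_col: List[object], candidate_col: List[object]) -> float:
    """Fraction of distinct source values that also appear in the candidate column."""
    source_values: Set[object] = set(source_col)
    if not source_values:
        return 0.0
    return len(source_values & set(candidate_col)) / len(source_values)


def best_overlap(source: Table, candidate: Table) -> float:
    """Highest overlap between any source column and any candidate column."""
    return max(
        (
            column_overlap(s_vals, c_vals)
            for s_vals in source.values()
            for c_vals in candidate.values()
        ),
        default=0.0,
    )


def select_candidates(
    source: Table, lake: Dict[str, Table], threshold: float = 0.5
) -> List[str]:
    """Names of data lake tables with at least one column at or above the threshold."""
    return sorted(
        name for name, table in lake.items() if best_overlap(source, table) >= threshold
    )


# Toy data, invented for illustration.
source = {"city": ["Boston", "Austin", "Denver"], "state": ["MA", "TX", "CO"]}
lake = {
    "cities.csv": {"name": ["Boston", "Austin", "Miami"], "pop": [650, 960, 440]},
    "fruits.csv": {"fruit": ["apple", "pear"], "color": ["red", "green"]},
}

candidates = select_candidates(source, lake)
print(candidates)  # ['cities.csv']
```

Here the overlap is measured relative to the Source Table column, so a candidate column that covers most of the source values scores high even if it contains many extra values. Whether the repository normalizes overlap this way is not stated in this README.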

With a set of originating tables, we now integrate them with the goal of reproducing the Source Table (Table Reclamation) and expanding the Source Table (Table Expansion). We also include operations that users can apply to explore the expanded Source Table (Table Exploration). This code can be found in the integration/ folder:

1. In targeted_integration.py, we use an adaptation of Gen-T to integrate the set of originating tables. Specifically, to integrate the tables and reclaim the Source Table, we run integrate_tables().
2. In the same file, we expand the Source Table using expand_tables().
3. Also in the same file, we can rank tuples by a specified column (rank_tuples()) and filter tuples by column values (filter_tuples()).
4. To see the result of directly applying Outer Join to the set of candidate tables (thus replacing Gen-T and Table Expansion), we run integration_utils.outerjoin().
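
The ranking and filtering operations used for Table Exploration can be illustrated with a small sketch. It does not reproduce rank_tuples() or filter_tuples() from targeted_integration.py, whose signatures are not shown in this README. The helpers `rank_rows` and `filter_rows`, the choice to place missing values last, and the toy rows are assumptions made for this example.

```python
from typing import Any, Dict, List, Set

# Hypothetical row model for this sketch: one dict per tuple.
Row = Dict[str, Any]


def rank_rows(rows: List[Row], column: str, descending: bool = True) -> List[Row]:
    """Sort rows by one column; rows with a missing value (None) go last."""
    present = [r for r in rows if r.get(column) is not None]
    missing = [r for r in rows if r.get(column) is None]
    return sorted(present, key=lambda r: r[column], reverse=descending) + missing


def filter_rows(rows: List[Row], column: str, allowed: Set[Any]) -> List[Row]:
    """Keep rows whose value in `column` is one of the allowed values."""
    return [r for r in rows if r.get(column) in allowed]


# Toy expanded table, invented for illustration.
expanded = [
    {"city": "Boston", "state": "MA", "population": 650},
    {"city": "Austin", "state": "TX", "population": 960},
    {"city": "Denver", "state": "CO", "population": None},
    {"city": "Dallas", "state": "TX", "population": 1300},
]

ranked = rank_rows(expanded, "population")
texas = filter_rows(expanded, "state", {"TX"})

print([r["city"] for r in ranked])  # ['Dallas', 'Austin', 'Boston', 'Denver']
print([r["city"] for r in texas])   # ['Austin', 'Dallas']
```

Missing values are kept rather than dropped in this sketch because an expanded table may contain nulls in columns that only some originating tables provide.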

