SANTOS: Relationship-based Semantic Table Union Search

This repository contains the implementation of our paper SANTOS: Relationship-based Semantic Table Union Search, which appeared at SIGMOD 2023.

Authors: Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns: the first uses an existing knowledge base (KB), the second (which we call a “synthesized KB”) uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm called SANTOS outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically in all benchmarks that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.

Repository Organization

benchmark folder contains subfolders for SANTOS Small Benchmark (santos_benchmark), SANTOS Large Benchmark (real_data_lake_benchmark) and TUS Benchmark (tus_benchmark).
codes folder contains the SANTOS source code for preprocessing YAGO, creating the synthesized knowledge base, preprocessing data lake tables using YAGO, and querying the top-k SANTOS unionable tables.
groundtruth folder contains the groundtruth files used to evaluate precision and recall.
hashmap folder contains indexes built during the preprocessing phase.
images folder contains supplementary images submitted with the paper.
stats folder contains SANTOS output files related to top-k search results and efficiency.
yago folder contains the original and indexed yago files.
README.md file explains the repository.
requirements.txt file contains necessary packages to run the project.
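
To illustrate how the files in groundtruth and stats relate to each other, here is a minimal sketch that computes mean precision@k and recall@k. It assumes, hypothetically, that the top-k search results have been loaded into a dictionary from query table name to a ranked list of table names, and that the ground truth has been loaded into a dictionary from query table name to the set of unionable tables. The actual file formats in this repository may differ, and the function name and table names below are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(
    results: Dict[str, List[str]],
    groundtruth: Dict[str, Set[str]],
    k: int,
) -> Tuple[float, float]:
    """Return mean precision@k and mean recall@k over queries that have ground truth."""
    precisions, recalls = [], []
    for query, ranked in results.items():
        relevant = groundtruth.get(query)
        if not relevant:
            continue  # skip queries without labeled unionable tables
        top_k = ranked[:k]
        hits = sum(1 for table in top_k if table in relevant)
        precisions.append(hits / k)  # fewer than k results still divides by k
        recalls.append(hits / len(relevant))
    if not precisions:
        return 0.0, 0.0
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


# Hypothetical toy data: these table names do not come from the benchmarks.
toy_results = {
    "q1.csv": ["a.csv", "b.csv", "c.csv"],
    "q2.csv": ["d.csv", "e.csv", "f.csv"],
}
toy_groundtruth = {
    "q1.csv": {"a.csv", "c.csv", "x.csv", "y.csv"},
    "q2.csv": {"d.csv", "e.csv"},
}
precision, recall = precision_recall_at_k(toy_results, toy_groundtruth, k=2)
print(f"precision@2 = {precision:.3f}, recall@2 = {recall:.3f}")
```

Averaging per query, as done here, is one common convention; other evaluations divide recall by min(k, number of relevant tables), so check the evaluation scripts before comparing numbers.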

Benchmark

Please visit this link to download Real Data Lake Benchmark (aka SANTOS Large), SANTOS Benchmark (aka SANTOS Small) and relabeled TUS Benchmark. The original TUS benchmark is available at https://github.com/RJMillerLab/table-union-search-benchmark.

Setup

Clone the repo.
cd to the repo directory, then create and activate a virtual environment for this project.

On macOS or Linux:

python3 -m venv env
source env/bin/activate
which python

On Windows:

python -m venv env
.\env\Scripts\activate.bat
where.exe python

Install the necessary packages. We recommend using Python version 3.7 or higher.
```
pip install -r requirements.txt
```
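
If you want to confirm the setup before continuing, the following small sketch checks that the interpreter meets the recommended version and that a virtual environment is active. It is not part of the repository; the helper names are hypothetical, and it relies only on the standard library (`sys.version_info`, `sys.prefix` and `sys.base_prefix`).

```python
import sys

MIN_VERSION = (3, 7)  # recommended minimum stated above


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version meets the recommended minimum."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv() -> bool:
    """Return True if running inside a venv (prefix differs from base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


version_status = "ok" if python_is_supported() else "older than recommended"
venv_status = "active" if in_virtualenv() else "not active"
print(f"Python {sys.version.split()[0]}: {version_status}; virtual environment: {venv_status}")
```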

Reproducibility

If you want to run SANTOS interactively on the SANTOS benchmark, you can check our demo, DIALITE, which is available as a web API. DIALITE is a table discovery and integration pipeline that uses SANTOS for discovering unionable tables from data lakes. To reproduce SANTOS on your machine, follow the steps below.

Download the benchmark tables and place them in their respective subfolders inside the benchmark folder. You can download both SANTOS benchmarks manually from Zenodo. For convenience, you can instead run the following command in your terminal, which relies on the zenodo_get package. It downloads the SANTOS Large and SANTOS Small benchmarks, uncompresses them, and replaces the placeholder folders with the folders containing the tables. Because the command first changes into the benchmark folder, make sure you start from the root of the repo.
```
cd benchmark && zenodo_get 7758091 && rm -r santos_benchmark && unzip santos_benchmark && cd santos_benchmark && rm *.csv && cd .. && rm -r real_tables_benchmark && unzip real_data_lake_benchmark && cd real_data_lake_benchmark && rm *.csv && cd .. && mv real_data_lake_benchmark real_tables_benchmark && rm *.zip && cd ..
```
For the TUS benchmark, download the tables from this page and place them in their respective subfolders.
Download and unzip the YAGO knowledge base, then place it in the yago/yago_original folder.
Run preprocess_yago.py to create the entity dictionary, type dictionary, inheritance dictionary and relationship dictionary. Then run Yago_type_counter.py, Yago_subclass_extractor.py and Yago_subclass_score.py one after another to generate the type penalization scores. The created dictionaries are stored in yago/yago_pickle. You may delete the downloaded YAGO files after this step, as the original YAGO in yago/yago_original is no longer needed.
Run data_lake_processing_yago.py to create the YAGO inverted index.
Run data_lake_processing_synthesized_kb.py to create the synthesized type dictionary, relationship dictionary and synthesized inverted index.
Run query_santos.py to get the top-k SANTOS unionable table search results.
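
The script order in the steps above can be captured in a small driver. This is a minimal sketch, not part of the repository: it assumes that the scripts live in the codes folder, are launched from inside that folder, and take no command-line arguments. The names `PIPELINE`, `build_commands` and `run_pipeline` are hypothetical. By default it only prints the commands (dry run).

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Script names and their order are taken from the steps above.
PIPELINE = [
    "preprocess_yago.py",
    "Yago_type_counter.py",
    "Yago_subclass_extractor.py",
    "Yago_subclass_score.py",
    "data_lake_processing_yago.py",
    "data_lake_processing_synthesized_kb.py",
    "query_santos.py",
]


def build_commands() -> List[List[str]]:
    """Return one command per step, using the currently active Python interpreter."""
    return [[sys.executable, script] for script in PIPELINE]


def run_pipeline(codes_dir: Path = Path("codes"), dry_run: bool = True) -> None:
    """Print each step; unless dry_run is True, run it and stop at the first failure."""
    for command in build_commands():
        print("step:", " ".join(command))
        if not dry_run:
            # Assumption: each script expects to be started from inside codes_dir.
            # check=True raises CalledProcessError if a step exits with an error.
            subprocess.run(command, cwd=codes_dir, check=True)


run_pipeline(dry_run=True)
```

To actually execute the steps, call `run_pipeline(dry_run=False)` from the root of the repo with the virtual environment active. Standard input is inherited by the child processes, so if a script asks a question interactively, the prompt appears in your terminal.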

Citation

@article{DBLP:journals/pacmmod/KhatiwadaFSCGMR23,
  author       = {Aamod Khatiwada and
                  Grace Fan and
                  Roee Shraga and
                  Zixuan Chen and
                  Wolfgang Gatterbauer and
                  Ren{\'{e}}e J. Miller and
                  Mirek Riedewald},
  title        = {{SANTOS:} Relationship-based Semantic Table Union Search},
  journal      = {Proc. {ACM} Manag. Data},
  volume       = {1},
  number       = {1},
  pages        = {9:1--9:25},
  year         = {2023},
  url          = {https://doi.org/10.1145/3588689},
  doi          = {10.1145/3588689},
  timestamp    = {Thu, 15 Jun 2023 21:57:48 +0200},
  biburl       = {https://dblp.org/rec/journals/pacmmod/KhatiwadaFSCGMR23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
