This repository contains the implementation of our paper SANTOS: Relationship-based Semantic Table Union Search, which appeared at SIGMOD 2023.
Authors: Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns: the first uses an existing knowledge base (KB), and the second (which we call a “synthesized KB”) uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically in all benchmarks that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
- benchmark folder contains subfolders for SANTOS Small Benchmark (santos_benchmark), SANTOS Large Benchmark (real_data_lake_benchmark) and TUS Benchmark (tus_benchmark).
- codes folder contains the SANTOS source code for preprocessing YAGO, creating the synthesized knowledge base, preprocessing data lake tables using YAGO and querying top-k SANTOS unionable tables.
- groundtruth folder contains the groundtruth files used to evaluate precision and recall.
- hashmap folder contains indexes built during the preprocessing phase.
- images folder contains supplementary images submitted with the paper.
- stats folder contains SANTOS output files related to top-k search results and efficiency.
- yago folder contains the original and indexed YAGO files.
- README.md file explains the repository.
- requirements.txt file contains necessary packages to run the project.
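The groundtruth files above are used to evaluate precision and recall of the top-k results. As a minimal sketch of that kind of evaluation, the snippet below computes precision@k and recall@k for ranked result lists. The function names, the in-memory dictionary layout and the choice of dividing precision by k are assumptions made for illustration; the actual groundtruth file format and the exact metric definitions used in the paper are not described here.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(results: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one query table.

    results  -- ranked list of returned table names (hypothetical layout)
    relevant -- set of table names marked unionable in the ground truth
    """
    top_k = results[:k]
    hits = sum(1 for table in top_k if table in relevant)
    # Assumption: precision is divided by k even if fewer than k results exist.
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall_at_k(
    all_results: Dict[str, List[str]],
    groundtruth: Dict[str, Set[str]],
    k: int,
) -> Tuple[float, float]:
    """Average precision@k and recall@k over every query in the ground truth."""
    if not groundtruth:
        return 0.0, 0.0
    precisions, recalls = [], []
    for query, relevant in groundtruth.items():
        # A query with no returned results contributes zero hits.
        p, r = precision_recall_at_k(all_results.get(query, []), relevant, k)
        precisions.append(p)
        recalls.append(r)
    n = len(groundtruth)
    return sum(precisions) / n, sum(recalls) / n


if __name__ == "__main__":
    # Toy data, not taken from any benchmark.
    toy_groundtruth = {"q1": {"a", "c", "e"}, "q2": {"x"}}
    toy_results = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
    print(mean_precision_recall_at_k(toy_results, toy_groundtruth, k=2))
```

Keeping the per-query function separate from the averaging step makes it easy to inspect individual queries that pull the mean down.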
Please visit this link to download the Real Data Lake Benchmark (aka SANTOS Large), the SANTOS Benchmark (aka SANTOS Small) and the relabeled TUS Benchmark. The original TUS benchmark is available at https://github.com/RJMillerLab/table-union-search-benchmark.
- Clone the repo
- Change to the repo directory, then create and activate a virtual environment for this project.
- On macOS or Linux:
python3 -m venv env
source env/bin/activate
which python
- On Windows:
python -m venv env
.\env\Scripts\activate.bat
where.exe python
- Install the necessary packages. We recommend using Python version 3.7 or higher.
pip install -r requirements.txt
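As an optional sanity check for the setup steps above, the following sketch confirms that the interpreter meets the recommended minimum version and that a virtual environment is active. The helper names are hypothetical and not part of this repository.

```python
import sys

# Minimum version recommended in the installation steps.
MIN_VERSION = (3, 7)


def meets_minimum(version=None, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


def in_virtualenv() -> bool:
    """Return True when running inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version OK:", meets_minimum())
    print("Virtual environment active:", in_virtualenv())
```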
If you want to run SANTOS interactively on the SANTOS benchmark, you can check our demo, DIALITE, which is available as a web API. DIALITE is a table discovery and integration pipeline that uses SANTOS for discovering unionable tables from data lakes. To reproduce SANTOS on your machine, follow the steps below.
- Download the benchmark tables and place them in their respective subfolders inside the benchmark folder. You can download both SANTOS benchmarks manually from Zenodo. For convenience, you can also run the following commands in your terminal, which rely on the zenodo_get package. The commands automatically download the SANTOS Large and SANTOS Small benchmarks, uncompress them and replace the placeholder folders with the folders containing the tables. Because the first command changes to the benchmark folder before downloading, make sure you start from the root of the repo.
cd benchmark && zenodo_get 7758091 && rm -r santos_benchmark && unzip santos_benchmark && cd santos_benchmark && rm *.csv && cd .. && rm -r real_tables_benchmark && unzip real_data_lake_benchmark && cd real_data_lake_benchmark && rm *.csv && cd .. && mv real_data_lake_benchmark real_tables_benchmark && rm *.zip && cd ..
For the TUS benchmark, download the tables from this page and place them in their respective subfolders.
- Download and unzip the YAGO knowledge base, and place it in the yago/yago_original folder.
- Run preprocess_yago.py to create the entity dictionary, type dictionary, inheritance dictionary and relationship dictionary. Then run Yago_type_counter.py, Yago_subclass_extractor.py and Yago_subclass_score.py one after another to generate the type penalization scores. The created dictionaries are stored in yago/yago_pickle. You may delete the downloaded YAGO files after this step, as the original YAGO files in yago/yago_original are no longer needed.
- Run data_lake_processing_yago.py to create the YAGO inverted index.
- Run data_lake_processing_synthesized_kb.py to create the synthesized type dictionary, the synthesized relationship dictionary and the synthesized inverted index.
- Run query_santos.py to get the top-k SANTOS unionable table search results.
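The script order in the steps above can be captured in a small driver. The sketch below is illustrative only: the script names come from the steps, but their location in the codes folder, running them with codes as the working directory, and the absence of command-line arguments or interactive prompts are all assumptions. The helpers `build_commands` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from typing import List

# Script names as listed in the steps above, in the order they should run.
PIPELINE_SCRIPTS = [
    "preprocess_yago.py",
    "Yago_type_counter.py",
    "Yago_subclass_extractor.py",
    "Yago_subclass_score.py",
    "data_lake_processing_yago.py",
    "data_lake_processing_synthesized_kb.py",
    "query_santos.py",
]


def build_commands() -> List[List[str]]:
    """Build one command per script, using the current Python interpreter."""
    return [[sys.executable, name] for name in PIPELINE_SCRIPTS]


def run_pipeline(codes_dir: str = "codes", dry_run: bool = True) -> None:
    """Print each step and, unless dry_run is set, execute it in order.

    Assumption: the scripts live in codes_dir and are run from there.
    """
    for command in build_commands():
        print("Step:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step,
            # so later steps never run on incomplete indexes.
            subprocess.run(command, check=True, cwd=codes_dir)


if __name__ == "__main__":
    # Dry run by default: only lists the steps without executing them.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the root of the repo would execute the steps one after another; the dry-run default is a deliberate choice because the preprocessing steps can be long-running.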
@article{DBLP:journals/pacmmod/KhatiwadaFSCGMR23,
author = {Aamod Khatiwada and
Grace Fan and
Roee Shraga and
Zixuan Chen and
Wolfgang Gatterbauer and
Ren{\'{e}}e J. Miller and
Mirek Riedewald},
title = {{SANTOS:} Relationship-based Semantic Table Union Search},
journal = {Proc. {ACM} Manag. Data},
volume = {1},
number = {1},
pages = {9:1--9:25},
year = {2023},
url = {https://doi.org/10.1145/3588689},
doi = {10.1145/3588689},
timestamp = {Thu, 15 Jun 2023 21:57:48 +0200},
biburl = {https://dblp.org/rec/journals/pacmmod/KhatiwadaFSCGMR23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}