Add duckdb build/concepts and use SQLGlot to convert BigQuery SQL into other dialects #1689

Merged
merged 47 commits into from
Feb 20, 2024
463609d
First "working" version (all SQL runs except for a few pivot concepts…
Apr 26, 2023
b7fffcc
Stripped dead code
Apr 26, 2023
ecbfe4d
Simplify pathing, strip more dead code
Apr 26, 2023
0e02dad
Align with shell script version
Apr 26, 2023
10e7f4e
Add requirements.txt
Apr 26, 2023
17ac9f4
Pulled sql out of .sh plus minor related mods
Apr 27, 2023
f777f46
Add table creation/loading
Apr 27, 2023
dc53b7b
Schema support (option) to mirror psql version
Apr 27, 2023
c52d421
Another chunk of dead code
Apr 27, 2023
2d3ef83
Big bug in DatetimeDiff implementation!
Apr 28, 2023
a173fd0
Adding missing fluid_balance views
Apr 28, 2023
f10eeb4
missed deletion
Apr 28, 2023
3aefb39
Rename in line with Postgres
Apr 28, 2023
0645b6c
Adding indexes
Apr 28, 2023
06a7422
Added checks
Apr 28, 2023
212f8bd
Updating README.md
Apr 28, 2023
f7a9729
Missed file rename
Apr 29, 2023
d4b5426
Move fake CHARTEVENTS PK to indexes script -- this may fail on machin…
SphtKr May 1, 2023
6bdcf14
Fixed outright errors
May 2, 2023
71e8d54
Experimental option to use integer or fractional DATETIME_DIFF function
May 8, 2023
2854015
simplify parse to use DATETIME_TRUNC and use unique alias
alistairewj Nov 24, 2023
4166507
explicitly cast seconds to integer
alistairewj Nov 24, 2023
b918099
move duckdb concept file to new python package folder
alistairewj Nov 24, 2023
62f4f59
tidy up readme and update to v2.2
alistairewj Nov 24, 2023
c79355a
init python package to support converting SQL scripts across dialects
alistairewj Nov 24, 2023
ae7a4c0
add step to test mimic_utils
alistairewj Nov 24, 2023
24de255
remove unneeded duckdb concept files
alistairewj Nov 24, 2023
f2f9148
move sqlglot monkey patching to individual modules
alistairewj Nov 24, 2023
3b133f8
reorganize classes to top
alistairewj Nov 27, 2023
12bdf7e
init duckdb transforms
alistairewj Nov 27, 2023
0654e16
add mimic_utils module name to import
alistairewj Nov 27, 2023
084a15f
add duckdb to transpilation
alistairewj Nov 27, 2023
7d54f86
remove semi-colon
alistairewj Nov 27, 2023
dcb8cc0
explicitly cast upper limit of generate series as an integer
alistairewj Nov 27, 2023
eec813b
add derived schema name to scripts by default
alistairewj Nov 27, 2023
1bcceb0
rename subfolder as it contains sqlglot transformations
alistairewj Nov 27, 2023
1ff7f57
update import
alistairewj Nov 27, 2023
9d2d4dc
refactor mimic-iv SQL queries into subfolders of concepts
alistairewj Dec 18, 2023
24d177b
formatting fixes by sqlfluff
alistairewj Dec 18, 2023
7487bbe
update postgres concepts with new transpile method
alistairewj Dec 19, 2023
51c6c4f
add duckdb concepts for mimic-iv
alistairewj Dec 19, 2023
7bb12b8
update README for mimic-iv
alistairewj Dec 23, 2023
9b083be
add help text for the mimic_utils entry points
alistairewj Jan 6, 2024
f1f07b8
switch to hatchling backend and only include mimic_utils package
alistairewj Jan 6, 2024
ad08bae
readme for pypi package
alistairewj Jan 6, 2024
f826074
various typo fixes and clearer language
alistairewj Feb 20, 2024
1dfa41c
add setup python to workflow
alistairewj Feb 20, 2024
17 changes: 16 additions & 1 deletion .github/workflows/psql.yml
@@ -28,6 +28,11 @@ jobs:
- name: Check out repository code
uses: actions/checkout@v3

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: '3.10'

- name: Download demo data
uses: ./.github/actions/download-demo

@@ -60,7 +65,7 @@ jobs:
PGPASSWORD: postgres
BUILDCODE_PATH: mimic-iv/buildmimic/postgres

- name: Build mimic-iv concepts
- name: mimic-iv/concepts psql build
run: |
psql -h $POSTGRES_HOST -U postgres -f postgres-functions.sql
psql -h $POSTGRES_HOST -U postgres -f postgres-make-concepts.sql
@@ -69,6 +74,16 @@ jobs:
POSTGRES_HOST: postgres
PGPASSWORD: postgres

- name: mimic_utils - convert mimic-iv concepts to PostgreSQL and rebuild
run: |
pip install .
mimic_utils convert_folder mimic-iv/concepts mimic-iv/concepts_postgres --source_dialect bigquery --destination_dialect postgres
psql -h $POSTGRES_HOST -U postgres -f mimic-iv/concepts_postgres/postgres-make-concepts.sql
working-directory: ./
env:
POSTGRES_HOST: postgres
PGPASSWORD: postgres

- name: Load ed data into PostgreSQL
run: |
echo "Loading data into psql."
3 changes: 3 additions & 0 deletions README_mimic_utils.md
@@ -0,0 +1,3 @@
# mimic_utils package

This package contains utilities for working with the MIMIC datasets.
85 changes: 41 additions & 44 deletions mimic-iii/buildmimic/duckdb/README.md
@@ -1,20 +1,51 @@
# DuckDB
# MIMIC-III in DuckDB

The script in this folder creates the schema for MIMIC-IV and
The scripts in this folder create the schema for MIMIC-III and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).

DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and therefore optimized for analytical queries.
This will result in faster queries for researchers using MIMIC-IV
This will result in faster queries for researchers using MIMIC-III
with DuckDB compared to SQLite.
To learn more, please read their ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

The instructions to load MIMIC-III into a DuckDB
only require:
1. DuckDB to be installed and
## Download MIMIC-III files

[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.
(These scripts should also work with the much smaller
[demo version](https://physionet.org/content/mimiciii-demo/1.4/#files-panel)
of the dataset.)

The easiest way to download them is to open a terminal then run:

```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```

Replace `YOURUSERNAME` with your physionet username.

The rest of these instructions assume the CSV files are in the folder structure as follows:

```
mimic_data_dir/
ADMISSIONS.csv.gz
CALLOUT.csv.gz
...
```

By default, the above `wget` downloads the data into `mimiciii/1.4` (as we used `--cut-dirs=1` to remove the base folder). Thus, by default, `mimic_data_dir` is `mimiciii/1.4` (relative to the current folder). The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).
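
Before loading, it can help to confirm that the data folder matches the layout described above. The following is a minimal sketch and not part of this repository: the function name `count_mimic_csvs` is hypothetical, and the default path assumes the `wget` command above was run unchanged. It counts the files ending in `.csv` or `.csv.gz` directly inside the folder.

```shell
# Hypothetical helper (not part of this repository): count the CSV files
# directly inside a MIMIC-III data folder, compressed or uncompressed.
count_mimic_csvs() {
    dir="${1:-mimiciii/1.4}"   # assumed default, matching the wget command above
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    count=0
    for f in "$dir"/*.csv "$dir"/*.csv.gz; do
        # An unmatched glob stays literal, so only count paths that exist.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running `count_mimic_csvs mimiciii/1.4` prints the number of CSV files found. A result of `0` suggests that the path is wrong or that the download placed the files in a different folder.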


## Shell script method (`import_duckdb.sh`)

Using this script to load MIMIC-III into a DuckDB
only requires:
1. DuckDB to be installed (the `duckdb` executable must be in your PATH)
2. Your computer to have a POSIX-compliant terminal shell,
which is already found by default on any Mac OSX, Linux, or BSD installation.

@@ -24,14 +55,6 @@ which you can obtain by either installing
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimiciii/1.4/) the MIMIC-III files
3. Create DuckDB database and load data

### Install DuckDB

Follow instructions on their website to
@@ -41,37 +64,10 @@ the CLI version of DuckDB.
You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.
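
One way to confirm that the `duckdb` binary is reachable before running the import is to check it with the POSIX `command -v` builtin. This is a minimal sketch under stated assumptions: the helper name `require_on_path` is hypothetical and does not appear in the repository's scripts.

```shell
# Hypothetical helper (not part of this repository): report whether a
# command can be found on the PATH.
require_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1 is not on your PATH" >&2
        return 1
    fi
}

# Example: stop early if the DuckDB CLI is not installed.
# require_on_path duckdb || exit 1
```

If the check reports the binary as missing, move `duckdb` into a folder that is already on your PATH, or add its folder to PATH in your shell profile.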

### Download MIMIC-III files

[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.

The intructions assume the CSV files are in the folder structure as follows:

```
mimic_data_dir
ADMISSIONS.csv.gz
...
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).
### Create DuckDB database and load data

The easiest way to download them is to open a terminal then run:

```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```

Replace `YOURUSERNAME` with your physionet username.

This will make you `mimic_data_dir` be `mimiciii/1.4`.

# Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.

You can do all of this will one shell script, `import_duckdb.sh`,
You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.

See the help for it below:
@@ -102,6 +98,7 @@ The script will print out progress as it goes.
Be patient, this can take minutes to hours to load
depending on your computer's configuration.


# Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-iii/issues) to discuss other issues you may be having.
Please see the [issues page](https://github.com/MIT-LCP/mimic-code/issues) to discuss other issues you may be having.