Add duckdb build/concepts and use SQLGlot to convert BigQuery SQL into other dialects #1689

Merged on Feb 20, 2024 (47 commits). The diff shown here covers changes from 45 commits.

Commits
463609d
First "working" version (all SQL runs except for a few pivot concepts…
Apr 26, 2023
b7fffcc
Stripped dead code
Apr 26, 2023
ecbfe4d
Simplify pathing, strip more dead code
Apr 26, 2023
0e02dad
Align with shell script version
Apr 26, 2023
10e7f4e
Add requirements.txt
Apr 26, 2023
17ac9f4
Pulled sql out of .sh plus minor related mods
Apr 27, 2023
f777f46
Add table creation/loading
Apr 27, 2023
dc53b7b
Schema support (option) to mirror psql version
Apr 27, 2023
c52d421
Another chunk of dead code
Apr 27, 2023
2d3ef83
Big bug in DatetimeDiff implementation!
Apr 28, 2023
a173fd0
Adding missing fluid_balance views
Apr 28, 2023
f10eeb4
missed deletion
Apr 28, 2023
3aefb39
Rename in line with Postgres
Apr 28, 2023
0645b6c
Adding indexes
Apr 28, 2023
06a7422
Added checks
Apr 28, 2023
212f8bd
Updating README.md
Apr 28, 2023
f7a9729
Missed file rename
Apr 29, 2023
d4b5426
Move fake CHARTEVENTS PK to indexes script -- this may fail on machin…
SphtKr May 1, 2023
6bdcf14
Fixed outright errors
May 2, 2023
71e8d54
Experimental option to use integer or fractional DATETIME_DIFF function
May 8, 2023
2854015
simplify parse to use DATETIME_TRUNC and use unique alias
alistairewj Nov 24, 2023
4166507
explicitly cast seconds to integer
alistairewj Nov 24, 2023
b918099
move duckdb concept file to new python package folder
alistairewj Nov 24, 2023
62f4f59
tidy up readme and update to v2.2
alistairewj Nov 24, 2023
c79355a
init python package to support converting SQL scripts across dialects
alistairewj Nov 24, 2023
ae7a4c0
add step to test mimic_utils
alistairewj Nov 24, 2023
24de255
remove unneeded duckdb concept files
alistairewj Nov 24, 2023
f2f9148
move sqlglot monkey patching to individual modules
alistairewj Nov 24, 2023
3b133f8
reorganize classes to top
alistairewj Nov 27, 2023
12bdf7e
init duckdb transforms
alistairewj Nov 27, 2023
0654e16
add mimic_utils module name to import
alistairewj Nov 27, 2023
084a15f
add duckdb to transpilation
alistairewj Nov 27, 2023
7d54f86
remove semi-colon
alistairewj Nov 27, 2023
dcb8cc0
explicitly cast upper limit of generate series as an integer
alistairewj Nov 27, 2023
eec813b
add derived schema name to scripts by default
alistairewj Nov 27, 2023
1bcceb0
rename subfolder as it contains sqlglot transformations
alistairewj Nov 27, 2023
1ff7f57
update import
alistairewj Nov 27, 2023
9d2d4dc
refactor mimic-iv SQL queries into subfolders of concepts
alistairewj Dec 18, 2023
24d177b
formatting fixes by sqlfluff
alistairewj Dec 18, 2023
7487bbe
update postgres concepts with new transpile method
alistairewj Dec 19, 2023
51c6c4f
add duckdb concepts for mimic-iv
alistairewj Dec 19, 2023
7bb12b8
update README for mimic-iv
alistairewj Dec 23, 2023
9b083be
add help text for the mimic_utils entry points
alistairewj Jan 6, 2024
f1f07b8
switch to hatchling backend and only include mimic_utils package
alistairewj Jan 6, 2024
ad08bae
readme for pypi package
alistairewj Jan 6, 2024
f826074
various typo fixes and clearer language
alistairewj Feb 20, 2024
1dfa41c
add setup python to workflow
alistairewj Feb 20, 2024
12 changes: 11 additions & 1 deletion .github/workflows/psql.yml
@@ -60,7 +60,7 @@ jobs:
PGPASSWORD: postgres
BUILDCODE_PATH: mimic-iv/buildmimic/postgres

- name: Build mimic-iv concepts
- name: mimic-iv/concepts psql build
run: |
psql -h $POSTGRES_HOST -U postgres -f postgres-functions.sql
psql -h $POSTGRES_HOST -U postgres -f postgres-make-concepts.sql
@@ -69,6 +69,16 @@ jobs:
POSTGRES_HOST: postgres
PGPASSWORD: postgres

- name: mimic_utils - convert mimic-iv concepts to PostgreSQL and rebuild
run: |
pip install .
mimic_utils convert_folder mimic-iv/concepts mimic-iv/concepts_postgres --source_dialect bigquery --destination_dialect postgres
psql -h $POSTGRES_HOST -U postgres -f mimic-iv/concepts_postgres/postgres-make-concepts.sql
working-directory: ./
env:
POSTGRES_HOST: postgres
PGPASSWORD: postgres

- name: Load ed data into PostgreSQL
run: |
echo "Loading data into psql."
3 changes: 3 additions & 0 deletions README_mimic_utils.md
@@ -0,0 +1,3 @@
# mimic_utils package

This package contains utilities for working with the MIMIC datasets.
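
As a rough illustration of what a folder conversion involves, the sketch below walks a tree of `.sql` files and writes a converted copy of each one to a mirrored location. It is not the `mimic_utils` implementation: `convert_tree` and `sqlglot_converter` are hypothetical names, and the only SQLGlot call assumed is `sqlglot.transpile(sql, read=..., write=...)`, imported lazily so the tree walk itself needs nothing beyond the standard library.

```python
from pathlib import Path
from typing import Callable, List


def sqlglot_converter(source: str, destination: str) -> Callable[[str], str]:
    """Return a function that transpiles SQL text between two dialects.

    Hypothetical helper: SQLGlot is imported only when this is called,
    so it must be installed separately (pip install sqlglot).
    """
    import sqlglot

    def convert(sql: str) -> str:
        # transpile() returns one output string per input statement.
        statements = sqlglot.transpile(sql, read=source, write=destination)
        return ";\n".join(statements) + ";\n"

    return convert


def convert_tree(src: Path, dst: Path, convert: Callable[[str], str]) -> List[Path]:
    """Apply `convert` to every .sql file under `src`, mirroring the layout in `dst`."""
    written = []
    for sql_file in sorted(src.rglob("*.sql")):
        target = dst / sql_file.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(sql_file.read_text()))
        written.append(target)
    return written
```

With SQLGlot installed, something like `convert_tree(Path("mimic-iv/concepts"), Path("mimic-iv/concepts_postgres"), sqlglot_converter("bigquery", "postgres"))` would approximate the conversion step run in the workflow, though the real package may apply additional transformations.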
147 changes: 105 additions & 42 deletions mimic-iii/buildmimic/duckdb/README.md
@@ -1,75 +1,78 @@
# DuckDB
# MIMIC-III in DuckDB

The script in this folder creates the schema for MIMIC-IV and
The scripts in this folder create the schema for MIMIC-III and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).

The Python script (`import_duckdb.py`) also includes the option to
add the [concepts views](../../concepts/README.md) to the database.
This makes it much easier to use the concepts views as you do not
have to install and setup PostgreSQL or use BigQuery.

DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and therefore optimized for analytical queries.
This will result in faster queries for researchers using MIMIC-IV
This will result in faster queries for researchers using MIMIC-III
with DuckDB compared to SQLite.
To learn more, please read their ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

The instructions to load MIMIC-III into a DuckDB
only require:
1. DuckDB to be installed and
2. Your computer to have a POSIX-compliant terminal shell,
which is already found by default on any Mac OSX, Linux, or BSD installation.

To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by either installing
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimiciii/1.4/) the MIMIC-III files
3. Create DuckDB database and load data
## Download MIMIC-III files

### Install DuckDB
[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.
(These scripts should also work with the much smaller
[demo version](https://physionet.org/content/mimiciii-demo/1.4/#files-panel)
of the dataset.)

Follow instructions on their website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.
The easiest way to download them is to open a terminal then run:

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.
```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```

### Download MIMIC-III files
Replace `YOURUSERNAME` with your physionet username.

[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.
This will make your `mimic_data_dir` be `mimiciii/1.4`.

The instructions assume the CSV files are in the folder structure as follows:
The rest of these instructions assume the CSV files are in the folder structure as follows:

```
mimic_data_dir
mimic_data_dir/
ADMISSIONS.csv.gz
CALLOUT.csv.gz
...
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).
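
Before running either import method, it can help to confirm which data files are actually present. The following is a minimal sketch, assuming the flat layout shown above; `find_table_files` is a hypothetical helper and is not part of the import scripts. It accepts both `.csv` and `.csv.gz` files and derives a table name from each file name.

```python
from pathlib import Path
from typing import Dict


def find_table_files(mimic_data_dir: str) -> Dict[str, Path]:
    """Map table names (e.g. "ADMISSIONS") to their CSV file in the data directory."""
    tables: Dict[str, Path] = {}
    for path in sorted(Path(mimic_data_dir).iterdir()):
        name = path.name
        if name.endswith(".csv.gz"):
            table = name[: -len(".csv.gz")]
        elif name.endswith(".csv"):
            table = name[: -len(".csv")]
        else:
            continue  # not a data file
        # If both forms exist, keep the first one seen
        # (sorted order puts NAME.csv before NAME.csv.gz).
        tables.setdefault(table, path)
    return tables
```

Calling `find_table_files("mimiciii/1.4")` and printing the keys gives a quick list of the tables the directory would provide.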

The easiest way to download them is to open a terminal then run:

```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```
## Shell script method (`import_duckdb.sh`)

Replace `YOURUSERNAME` with your physionet username.
Using this script to load MIMIC-III into a DuckDB
only requires:
1. DuckDB to be installed (the `duckdb` executable must be in your PATH)
2. Your computer to have a POSIX-compliant terminal shell,
which is already found by default on any Mac OSX, Linux, or BSD installation.

This will make your `mimic_data_dir` be `mimiciii/1.4`.
To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by either installing
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

### Install DuckDB

Follow instructions on their website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.

# Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.
### Create DuckDB database and load data

You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.
@@ -102,6 +105,66 @@ The script will print out progress as it goes.
Be patient, this can take minutes to hours to load
depending on your computer's configuration.

## Python script method (`import_duckdb.py`)

This method does not require the DuckDB executable; it only requires the DuckDB Python
module and the [SQLGlot](https://github.com/tobymao/sqlglot) Python module, both of which can be
easily installed with `pip`.

### Install dependencies

Install the dependencies by using the included `requirements.txt` file:

```sh
python3 -m pip install -r ./requirements.txt
```

### Create DuckDB database and load data

Create the MIMIC-III database with `import_duckdb.py` like so:

```sh
python ./import_duckdb.py /path/to/mimic_data_dir ./mimic3.db
```

...where `/path/to/mimic_data_dir` is the path containing the .csv or .csv.gz
data files downloaded above.

This command will create the `mimic3.db` file in the current directory. Be aware that
for the full MIMIC-III v1.4 dataset the resulting file will be about 34GB in size.
This process will take some time, as with the shell script version.
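
Given the size quoted above, a quick free-space check before starting a long import may save time. This is only a sketch: `has_room_for_database` is a hypothetical helper, and the 34 GB default is taken directly from the figure above, not measured. It also ignores any temporary space the import itself might need.

```python
import shutil

# Approximate size of the database file for full MIMIC-III v1.4, as quoted above.
EXPECTED_DB_SIZE_GB = 34


def has_room_for_database(target_dir: str, required_gb: float = EXPECTED_DB_SIZE_GB) -> bool:
    """Return True if the filesystem holding `target_dir` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= required_gb * 1024**3
```

For example, `has_room_for_database(".")` checks the current directory, which is where the command above writes `mimic3.db`.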

The default options will create only the tables and load the data, and assume
that you are running the script from the same directory where this README.md
is located. See the full options below if the defaults are insufficient.

### Create the concepts views

In most cases you will want to create the concepts views at the same time as
the database. To do this, add the `--make-concepts` option:

```sh
python ./import_duckdb.py /path/to/mimic_data_dir ./mimic3.db --make-concepts
```

If you want to add the concepts to a database already created without this
option (or created with the shell script version), you can add the
`--skip-tables` option as well:

```sh
python ./import_duckdb.py /path/to/mimic_data_dir ./mimic3.db --make-concepts --skip-tables
```
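
If you drive the import from another Python program, the command line can be assembled programmatically. The sketch below is a hypothetical wrapper (`build_import_command` is not part of this repository) that uses only the options documented in this README; it builds the argument list and leaves running it to the caller.

```python
import sys
from typing import List, Optional


def build_import_command(
    mimic_data_dir: str,
    db_path: str,
    make_concepts: bool = False,
    skip_tables: bool = False,
    skip_indexes: bool = False,
    mimic_code_root: Optional[str] = None,
    schema_name: Optional[str] = None,
) -> List[str]:
    """Build the argument list for import_duckdb.py from the documented options."""
    cmd = [sys.executable, "./import_duckdb.py", mimic_data_dir, db_path]
    if make_concepts:
        cmd.append("--make-concepts")
    if skip_tables:
        cmd.append("--skip-tables")
    if skip_indexes:
        cmd.append("--skip-indexes")
    if mimic_code_root is not None:
        cmd += ["--mimic-code-root", mimic_code_root]
    if schema_name is not None:
        cmd += ["--schema-name", schema_name]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory containing `import_duckdb.py`. Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.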

### Additional options

There are a few additional options for special situations:

| Option | Description
| - | -
| `--skip-indexes` | Don't create additional indexes when creating tables and loading data. This may be useful in memory-constrained systems or to save a little time.
| `--mimic-code-root [path]` | This argument specifies the location of the mimic-code repository files. This is needed to find the concepts SQL files. This is useful if you are running the script from a different directory than the one where this README.md file is located (the default is `../../../`)
| `--schema-name [name]` | This puts the tables and concepts views into a named schema in the database. This is mainly useful to mirror the behavior of the PostgreSQL version of the database, which places objects in a schema named `mimiciii` by default--if you have existing code designed for the PostgreSQL version, this may make migration easier. Note that--like the PostgreSQL version--the `ccs_dx` view is *not* placed in the specified schema, but in the default schema (which is `main` in DuckDB, not `public` as in PostgreSQL).

# Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-iii/issues) to discuss other issues you may be having.