From 96c74e2f60890876db86a7feaa99d2728805f683 Mon Sep 17 00:00:00 2001
From: Adam Marcus
Date: Wed, 13 Apr 2022 18:16:17 -0400
Subject: [PATCH] Grouping sets docs and 0.1.5 release notes (#45)

* Grouping sets docs and 0.1.5 release notes

* Not really unfortunate :)
---
 HISTORY.md       |   4 ++
 README.md        |   2 +-
 docs/examples.md | 100 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/HISTORY.md b/HISTORY.md
index e66ed3f..b4c47a5 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -1,5 +1,9 @@
 # History
 
+## 0.1.5 (2022-04-13)
+* Support for PostgreSQL! The test suite now runs against PostgreSQL, and `datools.explanations.diff` now allows you to ask "why" about data stored in Postgres. Get excited!
+* `datools.sqlalchemy_utils.grouping_sets_query` will now generate a GROUPING SETs query for databases that support grouping sets (e.g., Postgres, DuckDB) or the equivalent UNION ALL version for databases without grouping sets support (e.g., SQLite). For more, check out the [example in the docs](https://datools.readthedocs.io/en/latest/examples.html#grouping-sets-query).
+
 ## 0.1.4 (2022-02-27)
 * Python 3.10 support.
 * Updated test suite to run tests against multiple databases, in particular expanding from SQLite only to DuckDB and SQLite.
diff --git a/README.md b/README.md
index 67e6674..7643e64 100644
--- a/README.md
+++ b/README.md
@@ -18,5 +18,5 @@ following databases:
 | ----------- | ----------- |
 | SQLite | Since v0.1.2 |
 | DuckDB | Since v0.1.4 |
-| PostgreSQL | *Planned for next release* |
+| PostgreSQL | Since v0.1.5 |
 | Redshift, Snowflake | *You provide an instance, I'll make the tests pass* |
diff --git a/docs/examples.md b/docs/examples.md
index ee5fbdf..000df6e 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -1,5 +1,105 @@
 # Examples
 
+## `diff`
 We'll add more examples, but the best places to look for now are:
 * [The blog post that introduces data diffing](https://blog.marcua.net/2022/02/20/data-diffs-algorithms-for-explaining-what-changed-in-a-dataset.html), and
 * [A Jupyter Notebook showing an end-to-end example](https://github.com/marcua/datools/blob/main/examples/diff/intel-sensor.ipynb).
+
+## `grouping_sets_query`
+
+[Grouping
+sets](https://www.geeksforgeeks.org/postgresql-grouping-sets/) are a
+neat feature of some databases that allows you to GROUP BY multiple
+combinations of columns in a single pass over your data. Some
+databases, like PostgreSQL and DuckDB, support them natively, whereas
+others, like SQLite, don't. `datools.sqlalchemy_utils.grouping_sets_query`
+will generate a GROUPING SETs query if your database allows it or
+create a synthetic equivalent using a UNION ALL of several GROUP BY
+queries.
+
+This is a concept best explained by example, and we'll use the [test
+suite](https://github.com/marcua/datools/blob/14752f0e841a89a9c991bc9893e58d3b708cac7d/tests/test_sqlalchemy_utils.py#L15)
+for our example. Say you have an underlying relation like `SELECT *
+FROM sensor_readings`, and you want to `COUNT(*)` across multiple
+combinations of `created_at` and `sensor_id`. In datools, you'd write:
+
+```python
+from datools.sqlalchemy_utils import grouping_sets_query
+query, set_index = grouping_sets_query(
+    db_engine,
+    'SELECT * FROM sensor_readings',
+    (
+        (Column('created_at'), Column('sensor_id')),
+        (Column('created_at'),),
+        (Column('sensor_id'),),
+        (),
+    ),
+    (Aggregate(AggregateFunction.COUNT, Column('*'), Column('num_rows')), )
+)
+
+print('Query:', query)
+print('Set index:', set_index)
+```
+
+
+On PostgreSQL (which supports GROUPING SETS), this would result in:
+```sql
+Query:
+WITH query AS (SELECT * FROM sensor_readings)
+SELECT
+    GROUPING(created_at, sensor_id) AS grouping_id,
+    created_at, sensor_id,
+    COUNT(*) AS num_rows
+FROM query
+GROUP BY GROUPING SETS ((created_at, sensor_id), (created_at), (sensor_id), ())
+```
+
+```python
+Set index: {7: (Column(name='created_at'), Column(name='sensor_id')), 11: (Column(name='created_at'),), 13: (Column(name='sensor_id'),), 14: ()}
+```
+
+
+
+
+On SQLite (which doesn't support GROUPING SETS), this would result in:
+```sql
+Query:
+WITH query AS (SELECT * FROM sensor_readings)
+
+SELECT
+    0 AS grouping_id,
+    created_at, sensor_id,
+    COUNT(*) AS num_rows
+FROM query
+GROUP BY created_at, sensor_id
+
+UNION ALL
+
+SELECT
+    1 AS grouping_id,
+    created_at, NULL AS sensor_id,
+    COUNT(*) AS num_rows
+FROM query
+GROUP BY created_at
+
+UNION ALL
+
+SELECT
+    2 AS grouping_id,
+    NULL AS created_at, sensor_id,
+    COUNT(*) AS num_rows
+FROM query
+GROUP BY sensor_id
+
+UNION ALL
+
+SELECT
+    3 AS grouping_id,
+    NULL AS created_at, NULL AS sensor_id,
+    COUNT(*) AS num_rows
+FROM query
+```
+
+```python
+Set index: {0: (Column(name='created_at'), Column(name='sensor_id')), 1: (Column(name='created_at'),), 2: (Column(name='sensor_id'),), 3: ()}
+```
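
Reviewer note: the UNION ALL fallback that the new docs describe for SQLite can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the datools implementation: `union_all_grouping_sets` and its parameters are hypothetical names, columns are plain strings rather than `Column` objects, and the aggregate is passed in as a raw SQL fragment. Each grouping set becomes one `SELECT` branch, numbered in order, with the columns outside that set projected as `NULL`.

```python
import sqlite3


def union_all_grouping_sets(base_query, columns, grouping_sets, aggregate_sql):
    """Emulate GROUP BY GROUPING SETS with one SELECT per set (hypothetical helper)."""
    branches = []
    set_index = {}
    for grouping_id, grouping_set in enumerate(grouping_sets):
        set_index[grouping_id] = tuple(grouping_set)
        # Columns outside this set become NULL so all branches share one shape.
        projected = [
            col if col in grouping_set else f'NULL AS {col}'
            for col in columns
        ]
        select_list = ', '.join(
            [f'{grouping_id} AS grouping_id', *projected, aggregate_sql])
        sql = f'SELECT {select_list} FROM query'
        if grouping_set:
            # The empty set aggregates over all rows, so it gets no GROUP BY.
            sql += ' GROUP BY ' + ', '.join(grouping_set)
        branches.append(sql)
    query = f'WITH query AS ({base_query}) ' + ' UNION ALL '.join(branches)
    return query, set_index


# Try it against an in-memory SQLite database with made-up sample rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sensor_readings (created_at TEXT, sensor_id TEXT)')
conn.executemany(
    'INSERT INTO sensor_readings VALUES (?, ?)',
    [('2022-01-01', 'a'), ('2022-01-01', 'b'), ('2022-01-02', 'a')],
)

query, set_index = union_all_grouping_sets(
    'SELECT * FROM sensor_readings',
    ('created_at', 'sensor_id'),
    (('created_at', 'sensor_id'), ('created_at',), ('sensor_id',), ()),
    'COUNT(*) AS num_rows',
)
rows = conn.execute(query).fetchall()
print('Query:', query)
print('Set index:', set_index)
for row in rows:
    print(row)
```

Numbering the branches by position mirrors the `0` to `3` grouping IDs in the SQLite output shown in the patch; the mapping returned alongside the query is what lets a caller tell which grouping set produced a given row.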