From 35489ad9afc6efb32c7961066a916ba9ec0c5174 Mon Sep 17 00:00:00 2001
From: James Lamb <jaylamb20@gmail.com>
Date: Wed, 6 Mar 2024 22:58:37 -0600
Subject: [PATCH] add Data Mishaps night

---
 README.md                      |  2 ++
 that-table-was-empty/README.md |  9 +++++
 that-table-was-empty/SETUP.md  | 64 ++++++++++++++++++++++------------
 3 files changed, 53 insertions(+), 22 deletions(-)
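
Note on the `glue create-partition` step added to SETUP.md: that command registers a single day by hand. If more days are needed, the `--partition-input` payload could be generated rather than typed. The following is only a sketch under stated assumptions, not part of the patch: `partition_input` is a hypothetical helper name, and it assumes every day follows the same `YYYY/MM/YYYY-MM-DD` prefix layout as the one partition shown in SETUP.md. It only prints JSON and never calls AWS.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): print the --partition-input
# JSON for one day, given a date in YYYY-MM-DD form.
# Assumes the S3 layout <base>/<year>/<month>/<date>, as in SETUP.md.
partition_input() {
    day="$1"                      # e.g. 2017-01-02
    year="${day%%-*}"             # text before the first "-"
    month_day="${day#*-}"         # text after the first "-"
    month="${month_day%%-*}"      # month, keeping any leading zero
    base='s3://anaconda-package-data/conda/hourly'
    hive='org.apache.hadoop.hive.ql.io.parquet'
    cat <<EOF
{
    "Values": ["${year}", "${month}", "${day}"],
    "StorageDescriptor": {
        "Location": "${base}/${year}/${month}/${day}",
        "InputFormat": "${hive}.MapredParquetInputFormat",
        "OutputFormat": "${hive}.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "${hive}.serde.ParquetHiveSerDe"
        }
    }
}
EOF
}

# Print the payload for review before using it anywhere.
partition_input '2017-01-02'
```

The printed JSON could then be passed as `--partition-input "$(partition_input 2017-01-02)"` to the same `aws glue create-partition` call, with the region set to wherever the Glue catalog actually lives.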
diff --git a/README.md b/README.md
index 81939c5..d8e7156 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ Click the "title" links below for links to slides, code, and other background in
 | ["Road to a Data Science Career"][3]                                                      |                  [iRisk Lab Hack Night (Aug 2020)][4]                  |
 | ["Scaling LightGBM with Python and Dask"][5]                                              |    [PyData Montreal (Jan 2021)][21]<br>[Chicago ML (Jan 2021)][22]     |
 | ["Scaling Machine Learning with Python and Dask"][5]                                      | [Chicago Cloud Conference (Sep 2020)][6]<br>[ODSC East (Apr 2021)][25] |
+| ["That table was empty!"][47]                                                             |                                  ---                                   |
 | ["Using Retry Logic in R HTTP Clients"][17]                                               |                                  ---                                   |
 | ["You can do Open Source"][1]                                                             |                    [satRdays Chicago (Apr 2019)][2]                    |
 | ["You can and should do Open Source"][32]                                                 |                     [CatBoost: от 0 до 1.0.0][33]                      |
@@ -94,3 +95,4 @@ Click the "title" links below for links to slides, code, and other background in
 [44]: https://opendatascience.com/scaling-lightgbm-with-dask/
 [45]: https://mlops.community/james-lamb-machine-learning-engineer/
 [46]: https://onceamaintainer.substack.com/p/once-a-maintainer-james-lamb
+[47]: ./that-table-was-empty
diff --git a/that-table-was-empty/README.md b/that-table-was-empty/README.md
index e69de29..5675d24 100644
--- a/that-table-was-empty/README.md
+++ b/that-table-was-empty/README.md
@@ -0,0 +1,9 @@
+# Those tables were empty! ... for some definition of empty
+
+## Description
+
+> This talk describes a real experience I had on the Data Engineering team at SpotHero in early 2023. I deleted some tables that were "empty" and it ended up leading to a bunch of on-call alerts and pain and misery.
+
+## Where this talk has been given:
+
+* (virtual) [Data Mishaps Night, March 2024](https://datamishapsnight.com/) ([slides](https://docs.google.com/presentation/d/1zTUQk_4t-deOdYWtmqwa2Vwsdio_CeDIB9sixdFu_DM/edit?usp=sharing))
diff --git a/that-table-was-empty/SETUP.md b/that-table-was-empty/SETUP.md
index 8ffd520..4de4a0d 100644
--- a/that-table-was-empty/SETUP.md
+++ b/that-table-was-empty/SETUP.md
@@ -4,15 +4,9 @@ To support this talk, I tried to replicate this situation with a little Redshift
 
 * nodes: 1
 * instance type: `dc2.large`
-* 
+* region: `us-east-1`
 
-Created a database (https://docs.aws.amazon.com/redshift/latest/gsg/t_creating_database.html):
-
-```sql
-CREATE DATABASE prod;
-```
-
-* created a role per https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html
+Created an IAM role per https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html and attached it to the cluster.
 
 ```json
 {
@@ -63,14 +57,22 @@ CREATE DATABASE prod;
 }
 ```
 
+I went into the Redshift query editor in the AWS console to run SQL.
+
+Created a database (https://docs.aws.amazon.com/redshift/latest/gsg/t_creating_database.html):
+
+```sql
+CREATE DATABASE prod;
+```
+
 Then ran this to create both a Redshift schema and a database in Glue data catalog
 
 ```sql
 CREATE EXTERNAL SCHEMA
-    conda
+    search_tracking
 FROM DATA CATALOG
 DATABASE
-    'conda' 
+    'search_tracking'
 IAM_ROLE
     'arn:aws:iam::${ACCOUNT}:role/redshift-glue-access'
 CREATE EXTERNAL DATABASE IF NOT EXISTS;
@@ -80,26 +82,44 @@ Then created a table for the `conda` download stats.
 
 ```sql
 CREATE EXTERNAL TABLE
-    conda.download_stats
-(data_source VARCHAR, time TIMESTAMP, pkg_name VARCHAR, pkg_version VARCHAR, pkg_platform VARCHAR, pkg_python VARCHAR, counts BIGINT)
+    search_tracking.download_stats
+(data_source VARCHAR, time TIMESTAMP, pkg_name VARCHAR,
+ pkg_version VARCHAR, pkg_platform VARCHAR, pkg_python VARCHAR,
+ counts BIGINT)
 PARTITIONED BY (
-    date_year 
+    date_year int,
+    date_month int,
+    date_day TIMESTAMP
 )
 STORED AS PARQUET
-LOCATION { 's3://anaconda-package-data/conda/hourly/' }
+LOCATION 's3://anaconda-package-data/conda/hourly/'
 ```
 
-data_source: anaconda for Anaconda distribution, conda-forge for the conda-forge channel on Anaconda.org, and bioconda for the bioconda channel on Anaconda.org.
-time: UTC time, binned by hour
-pkg_name: Package name (Ex: pandas)
-pkg_version: Package version (Ex: 0.23.0)
-pkg_platform: One of linux-32, linux-64, osx-64, win-32, win-64, linux-armv7, linux-ppcle64, linux-aarch64, or noarch
-pkg_python: Python version required by the package, if any (Ex: 3.7)
-counts: Number of downloads for this combination of attributs
+Then registered a partition:
+
+```shell
+aws --region us-east-1 \
+    glue create-partition \
+        --database-name 'search_tracking' \
+        --table-name 'download_stats' \
+        --partition-input '
+            {
+                "Values": ["2017", "01", "2017-01-01"],
+                "StorageDescriptor": {
+                    "Location": "s3://anaconda-package-data/conda/hourly/2017/01/2017-01-01",
+                    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
+                    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
+                    "SerdeInfo": {
+                        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+                    }
+                }
+            }'
+```
 
-Registered this data in Glue:
+References:
 
 * https://www.anaconda.com/blog/announcing-public-anaconda-package-download-data
 * https://anaconda-package-data.s3.amazonaws.com/
 * https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html
 * https://github.com/ContinuumIO/anaconda-package-data
+* https://docs.aws.amazon.com/cli/latest/reference/glue/create-partition.html