diff --git a/README.md b/README.md
index 81939c5..d8e7156 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ Click the "title" links below for links to slides, code, and other background in
 | ["Road to a Data Science Career"][3] | [iRisk Lab Hack Night (Aug 2020)][4] |
 | ["Scaling LightGBM with Python and Dask"][5] | [PyData Montreal (Jan 2021)][21]<br>[Chicago ML (Jan 2021)][22] |
 | ["Scaling Machine Learning with Python and Dask"][5] | [Chicago Cloud Conference (Sep 2020)][6]<br>[ODSC East (Apr 2021)][25] |
+| ["That table was empty!"][47] | --- |
 | ["Using Retry Logic in R HTTP Clients"][17] | --- |
 | ["You can do Open Source"][1] | [satRdays Chicago (Apr 2019)][2] |
 | ["You can and should do Open Source"][32] | [CatBoost: от 0 до 1.0.0][33] |
@@ -94,3 +95,4 @@ Click the "title" links below for links to slides, code, and other background in
 [44]: https://opendatascience.com/scaling-lightgbm-with-dask/
 [45]: https://mlops.community/james-lamb-machine-learning-engineer/
 [46]: https://onceamaintainer.substack.com/p/once-a-maintainer-james-lamb
+[47]: ./that-table-was-empty
diff --git a/that-table-was-empty/README.md b/that-table-was-empty/README.md
index e69de29..5675d24 100644
--- a/that-table-was-empty/README.md
+++ b/that-table-was-empty/README.md
@@ -0,0 +1,9 @@
+# Those tables were empty! ... for some definition of empty
+
+## Description
+
+> This talk describes a real experience I had on the Data Engineering team at SpotHero in early 2023. I deleted some tables that were "empty", and it ended up leading to a bunch of on-call alerts and pain and misery.
+
+## Where this talk has been given:
+
+* (virtual) [Data Mishaps Night, March 2024](https://datamishapsnight.com/) ([slides](https://docs.google.com/presentation/d/1zTUQk_4t-deOdYWtmqwa2Vwsdio_CeDIB9sixdFu_DM/edit?usp=sharing))
diff --git a/that-table-was-empty/SETUP.md b/that-table-was-empty/SETUP.md
index 8ffd520..4de4a0d 100644
--- a/that-table-was-empty/SETUP.md
+++ b/that-table-was-empty/SETUP.md
@@ -4,15 +4,9 @@ To support this talk, I tried to replicate this situation with a little Redshift
 
 * nodes: 1
 * instance type: `dc2.large`
-*
+* region: `us-east-1`
 
-Created a database (https://docs.aws.amazon.com/redshift/latest/gsg/t_creating_database.html):
-
-```sql
-CREATE DATABASE prod;
-```
-
-* created a role per https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html
+Created an IAM role per https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html and attached it to the cluster.
 
 ```json
 {
@@ -63,14 +57,22 @@ CREATE DATABASE prod;
 }
 ```
 
+I went into the Redshift query editor in the AWS console to run SQL.
+
+Created a database (https://docs.aws.amazon.com/redshift/latest/gsg/t_creating_database.html):
+
+```sql
+CREATE DATABASE prod;
+```
+
 Then ran this to create both a Redshift schema and a database in the Glue Data Catalog:
 
 ```sql
 CREATE EXTERNAL SCHEMA
-    conda
+    search_tracking
 FROM DATA CATALOG
 DATABASE
-    'conda'
+    'search_tracking'
 IAM_ROLE
     'arn:aws:iam::${ACCOUNT}:role/redshift-glue-access'
 CREATE EXTERNAL DATABASE IF NOT EXISTS;
@@ -80,26 +82,44 @@ Then created a table for the `conda` download stats.
 
 ```sql
 CREATE EXTERNAL TABLE
-    conda.download_stats
-(data_source VARCHAR, time TIMESTAMP, pkg_name VARCHAR, pkg_version VARCHAR, pkg_platform VARCHAR, pkg_python VARCHAR, counts BIGINT)
+    search_tracking.download_stats
+(data_source VARCHAR, time TIMESTAMP, pkg_name VARCHAR,
+ pkg_version VARCHAR, pkg_platform VARCHAR, pkg_python VARCHAR,
+ counts BIGINT)
 PARTITIONED BY (
-    date_year
+    date_year int,
+    date_month int,
+    date_day TIMESTAMP
 )
 STORED AS PARQUET
-LOCATION { 's3://anaconda-package-data/conda/hourly/' }
+LOCATION 's3://anaconda-package-data/conda/hourly/'
 ```
 
-data_source: anaconda for Anaconda distribution, conda-forge for the conda-forge channel on Anaconda.org, and bioconda for the bioconda channel on Anaconda.org.
-time: UTC time, binned by hour
-pkg_name: Package name (Ex: pandas)
-pkg_version: Package version (Ex: 0.23.0)
-pkg_platform: One of linux-32, linux-64, osx-64, win-32, win-64, linux-armv7, linux-ppcle64, linux-aarch64, or noarch
-pkg_python: Python version required by the package, if any (Ex: 3.7)
-counts: Number of downloads for this combination of attributs
+Then registered a partition (the Glue catalog lives in the same region as the cluster, `us-east-1`, so the CLI call targets that region):
+
+```shell
+aws --region us-east-1 \
+    glue create-partition \
+        --database-name 'search_tracking' \
+        --table-name 'download_stats' \
+        --partition-input '
+        {
+            "Values": ["2017", "01", "2017-01-01"],
+            "StorageDescriptor": {
+                "Location": "s3://anaconda-package-data/conda/hourly/2017/01/2017-01-01",
+                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
+                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
+                "SerdeInfo": {
+                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+                }
+            }
+        }'
+```
 
-Registered this data in Glue:
+References:
 
 * https://www.anaconda.com/blog/announcing-public-anaconda-package-download-data
 * https://anaconda-package-data.s3.amazonaws.com/
 * https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html
 * https://github.com/ContinuumIO/anaconda-package-data
+* https://docs.aws.amazon.com/cli/latest/reference/glue/create-partition.html
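+
+To sanity-check all of this, something like the following should work from the query editor (a minimal sketch, assuming the `search_tracking.download_stats` table above; `SVV_EXTERNAL_PARTITIONS` is Redshift's system view of registered Spectrum partitions):
+
+```sql
+-- sketch: assumes the external schema and table created above
+-- list the partitions Redshift Spectrum knows about
+SELECT *
+FROM svv_external_partitions
+WHERE schemaname = 'search_tracking'
+  AND tablename = 'download_stats';
+
+-- then read a few rows through the external table
+SELECT pkg_name, SUM(counts) AS downloads
+FROM search_tracking.download_stats
+WHERE date_year = 2017
+GROUP BY pkg_name
+ORDER BY downloads DESC
+LIMIT 10;
+```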