diff --git a/README.md b/README.md
index 81939c5..f2b3818 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ Click the "title" links below for links to slides, code, and other background in
 | ["Road to a Data Science Career"][3] | [iRisk Lab Hack Night (Aug 2020)][4] |
 | ["Scaling LightGBM with Python and Dask"][5] | [PyData Montreal (Jan 2021)][21] <br> [Chicago ML (Jan 2021)][22] |
 | ["Scaling Machine Learning with Python and Dask"][5] | [Chicago Cloud Conference (Sep 2020)][6] <br> [ODSC East (Apr 2021)][25] |
+| ["Those tables were empty!"][47] | --- |
 | ["Using Retry Logic in R HTTP Clients"][17] | --- |
 | ["You can do Open Source"][1] | [satRdays Chicago (Apr 2019)][2] |
 | ["You can and should do Open Source"][32] | [CatBoost: от 0 до 1.0.0][33] |
@@ -94,3 +95,4 @@ Click the "title" links below for links to slides, code, and other background in
 [44]: https://opendatascience.com/scaling-lightgbm-with-dask/
 [45]: https://mlops.community/james-lamb-machine-learning-engineer/
 [46]: https://onceamaintainer.substack.com/p/once-a-maintainer-james-lamb
+[47]: ./those-tables-were-empty
diff --git a/those-tables-were-empty/README.md b/those-tables-were-empty/README.md
new file mode 100644
index 0000000..5675d24
--- /dev/null
+++ b/those-tables-were-empty/README.md
@@ -0,0 +1,9 @@
+# Those tables were empty! ... for some definition of empty
+
+## Description
+
+> This talk describes a real experience I had on the Data Engineering team at SpotHero in early 2023. I deleted some tables that were "empty", and that ended up leading to a bunch of on-call alerts, pain, and misery.
+
+## Where this talk has been given:
+
+* (virtual) [Data Mishaps Night, March 2024](https://datamishapsnight.com/) ([slides](https://docs.google.com/presentation/d/1zTUQk_4t-deOdYWtmqwa2Vwsdio_CeDIB9sixdFu_DM/edit?usp=sharing))
diff --git a/those-tables-were-empty/SETUP.md b/those-tables-were-empty/SETUP.md
new file mode 100644
index 0000000..4de4a0d
--- /dev/null
+++ b/those-tables-were-empty/SETUP.md
@@ -0,0 +1,153 @@
+# setup
+
+To support this talk, I tried to replicate the situation with a small Redshift cluster in my personal AWS account.
+
+* nodes: 1
+* instance type: `dc2.large`
+* region: `us-east-1`
+
+Created an IAM role per https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html and attached it to the cluster.
+
+```json
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Effect": "Allow",
+            "Action": [
+                "s3:GetBucketLocation",
+                "s3:GetObject",
+                "s3:ListMultipartUploadParts",
+                "s3:ListBucket",
+                "s3:ListBucketMultipartUploads"
+            ],
+            "Resource": [
+                "arn:aws:s3:::anaconda-package-data",
+                "arn:aws:s3:::anaconda-package-data/*"
+            ]
+        },
+        {
+            "Effect": "Allow",
+            "Action": [
+                "glue:CreateDatabase",
+                "glue:DeleteDatabase",
+                "glue:GetDatabase",
+                "glue:GetDatabases",
+                "glue:UpdateDatabase",
+                "glue:CreateTable",
+                "glue:DeleteTable",
+                "glue:BatchDeleteTable",
+                "glue:UpdateTable",
+                "glue:GetTable",
+                "glue:GetTables",
+                "glue:BatchCreatePartition",
+                "glue:CreatePartition",
+                "glue:DeletePartition",
+                "glue:BatchDeletePartition",
+                "glue:UpdatePartition",
+                "glue:GetPartition",
+                "glue:GetPartitions",
+                "glue:BatchGetPartition"
+            ],
+            "Resource": [
+                "*"
+            ]
+        }
+    ]
+}
+```
+
+I ran all of the SQL below in the Redshift query editor in the AWS console.
+
+Created a database (https://docs.aws.amazon.com/redshift/latest/gsg/t_creating_database.html):
+
+```sql
+CREATE DATABASE prod;
+```
+
+Then ran this to create both a Redshift schema and a database in the Glue Data Catalog:
+
+```sql
+CREATE EXTERNAL SCHEMA
+    search_tracking
+FROM DATA CATALOG
+DATABASE
+    'search_tracking'
+IAM_ROLE
+    'arn:aws:iam::${ACCOUNT}:role/redshift-glue-access'
+CREATE EXTERNAL DATABASE IF NOT EXISTS;
+```
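+
+One way to sanity-check that this worked: the new schema and the Glue database backing it should both show up in Redshift's `SVV_EXTERNAL_SCHEMAS` system view.
+
+```sql
+-- the external schema and the Glue database backing it should both appear here
+SELECT schemaname, databasename
+FROM svv_external_schemas;
+```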
+
+Then created a table for the `conda` download stats.
+
+```sql
+CREATE EXTERNAL TABLE
+    search_tracking.download_stats (
+    data_source VARCHAR,
+    time TIMESTAMP,
+    pkg_name VARCHAR,
+    pkg_version VARCHAR,
+    pkg_platform VARCHAR,
+    pkg_python VARCHAR,
+    counts BIGINT
+)
+PARTITIONED BY (
+    date_year INT,
+    date_month INT,
+    date_day TIMESTAMP
+)
+STORED AS PARQUET
+LOCATION 's3://anaconda-package-data/conda/hourly/';
+```
+
+Then registered a partition, targeting `us-east-1` since that's the region the cluster (and so the Glue Data Catalog database) lives in. The queries at the end of this document are one way to check that this worked.
+
+```shell
+aws --region us-east-1 \
+    glue create-partition \
+    --database-name 'search_tracking' \
+    --table-name 'download_stats' \
+    --partition-input '
+    {
+        "Values": ["2017", "01", "2017-01-01"],
+        "StorageDescriptor": {
+            "Location": "s3://anaconda-package-data/conda/hourly/2017/01/2017-01-01",
+            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
+            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
+            "SerdeInfo": {
+                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+            }
+        }
+    }'
+```
+
+References:
+
+* https://www.anaconda.com/blog/announcing-public-anaconda-package-download-data
+* https://anaconda-package-data.s3.amazonaws.com/
+* https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html
+* https://github.com/ContinuumIO/anaconda-package-data
+* https://docs.aws.amazon.com/cli/latest/reference/glue/create-partition.html
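+
+Finally, one way to sanity-check the whole setup: if the partition registration worked, both of these queries should return at least one row.
+
+```sql
+-- partitions registered in the Glue catalog, as seen from Redshift
+SELECT schemaname, tablename, location
+FROM svv_external_partitions
+WHERE tablename = 'download_stats';
+
+-- if Redshift Spectrum can actually read the partition's Parquet files,
+-- this count should be non-zero
+SELECT COUNT(*)
+FROM search_tracking.download_stats
+WHERE date_year = 2017;
+```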