Add Data Mishaps Night talk (#53)
jameslamb authored Mar 7, 2024
1 parent 98c3015 commit 92340a9
Showing 3 changed files with 136 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -33,6 +33,7 @@ Click the "title" links below for links to slides, code, and other background in
| ["Road to a Data Science Career"][3] | [iRisk Lab Hack Night (Aug 2020)][4] |
| ["Scaling LightGBM with Python and Dask"][5] | [PyData Montreal (Jan 2021)][21]<br>[Chicago ML (Jan 2021)][22] |
| ["Scaling Machine Learning with Python and Dask"][5] | [Chicago Cloud Conference (Sep 2020)][6]<br>[ODSC East (Apr 2021)][25] |
| ["Those tables were empty!"][47] | --- |
| ["Using Retry Logic in R HTTP Clients"][17] | --- |
| ["You can do Open Source"][1] | [satRdays Chicago (Apr 2019)][2] |
| ["You can and should do Open Source"][32] | [CatBoost: от 0 до 1.0.0][33] |
@@ -94,3 +95,4 @@ Click the "title" links below for links to slides, code, and other background in
[44]: https://opendatascience.com/scaling-lightgbm-with-dask/
[45]: https://mlops.community/james-lamb-machine-learning-engineer/
[46]: https://onceamaintainer.substack.com/p/once-a-maintainer-james-lamb
[47]: ./those-tables-were-empty
9 changes: 9 additions & 0 deletions those-tables-were-empty/README.md
@@ -0,0 +1,9 @@
# Those tables were empty! ... for some definition of empty

## Description

> This talk describes a real experience I had on the Data Engineering team at SpotHero in early 2023. I deleted some tables that were "empty", and it ended up leading to a bunch of on-call alerts, pain, and misery.

## Where this talk has been given:

* (virtual) [Data Mishaps Night, March 2024](https://datamishapsnight.com/) ([slides](https://docs.google.com/presentation/d/1zTUQk_4t-deOdYWtmqwa2Vwsdio_CeDIB9sixdFu_DM/edit?usp=sharing))
125 changes: 125 additions & 0 deletions those-tables-were-empty/SETUP.md
@@ -0,0 +1,125 @@
# setup

To support this talk, I tried to replicate the situation with a small Redshift cluster in my personal AWS account:

* nodes: 1
* instance type: `dc2.large`
* region: `us-east-1`
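
Something like this CLI call should create an equivalent cluster (a sketch; the cluster identifier, username, and password are placeholders I've made up here, not values from the actual setup):

```shell
# hedged sketch: a 1-node dc2.large cluster matching the settings above
# ('data-mishaps-demo', 'awsuser', and the password are hypothetical placeholders)
aws --region us-east-1 \
    redshift create-cluster \
    --cluster-identifier 'data-mishaps-demo' \
    --cluster-type 'single-node' \
    --node-type 'dc2.large' \
    --master-username 'awsuser' \
    --master-user-password '<choose-a-strong-password>'
```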

Created an IAM role per https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html and attached it to the cluster. The role's permissions policy:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                "arn:aws:s3:::anaconda-package-data",
                "arn:aws:s3:::anaconda-package-data/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```
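
Roughly, creating and attaching that role from the CLI looks like this (a sketch: it assumes the permissions policy above is saved as `spectrum-policy.json`, and the role and cluster names are the placeholders used elsewhere in this doc):

```shell
# trust policy allowing Redshift to assume the role
cat > trust-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF

aws iam create-role \
    --role-name 'redshift-glue-access' \
    --assume-role-policy-document 'file://trust-policy.json'

# attach the permissions policy shown above (assumed saved as spectrum-policy.json)
aws iam put-role-policy \
    --role-name 'redshift-glue-access' \
    --policy-name 'redshift-spectrum-access' \
    --policy-document 'file://spectrum-policy.json'

# make the role available to the cluster
# ('data-mishaps-demo' is a hypothetical cluster identifier)
aws --region us-east-1 \
    redshift modify-cluster-iam-roles \
    --cluster-identifier 'data-mishaps-demo' \
    --add-iam-roles "arn:aws:iam::${ACCOUNT}:role/redshift-glue-access"
```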

I went into the Redshift query editor in the AWS console to run SQL.

Created a database (https://docs.aws.amazon.com/redshift/latest/gsg/t_creating_database.html):

```sql
CREATE DATABASE prod;
```

Then ran this to create both a Redshift external schema and a database in the Glue Data Catalog:

```sql
CREATE EXTERNAL SCHEMA search_tracking
FROM DATA CATALOG
DATABASE 'search_tracking'
IAM_ROLE 'arn:aws:iam::${ACCOUNT}:role/redshift-glue-access'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```
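
To check that worked (my addition, not part of the original setup), the Glue side can be inspected from the CLI:

```shell
# the database created by CREATE EXTERNAL DATABASE IF NOT EXISTS should exist
aws --region us-east-1 \
    glue get-database \
    --name 'search_tracking'
```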

Then created a table for the `conda` download stats.

```sql
CREATE EXTERNAL TABLE search_tracking.download_stats (
    data_source VARCHAR,
    -- quoted in case 'time' collides with a reserved word
    "time" TIMESTAMP,
    pkg_name VARCHAR,
    pkg_version VARCHAR,
    pkg_platform VARCHAR,
    pkg_python VARCHAR,
    counts BIGINT
)
PARTITIONED BY (
    date_year INT,
    date_month INT,
    date_day TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://anaconda-package-data/conda/hourly/';
```
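
As a sanity check (again my addition), the table definition Redshift wrote into the Glue catalog can be pulled back out:

```shell
# shows the columns, partition keys, and S3 location registered above
aws --region us-east-1 \
    glue get-table \
    --database-name 'search_tracking' \
    --name 'download_stats'
```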

Then registered a partition:

```shell
# note: the region here has to match the cluster's region (us-east-1 above),
# since Redshift Spectrum reads the Glue catalog in the cluster's region
aws --region us-east-1 \
    glue create-partition \
    --database-name 'search_tracking' \
    --table-name 'download_stats' \
    --partition-input '
{
    "Values": ["2017", "01", "2017-01-01"],
    "StorageDescriptor": {
        "Location": "s3://anaconda-package-data/conda/hourly/2017/01/2017-01-01",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        }
    }
}'
```
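
Before querying from Redshift, it's worth confirming the partition landed in the catalog (a check I'm adding here, not from the original notes):

```shell
# the ["2017", "01", "2017-01-01"] partition registered above should be listed
aws --region us-east-1 \
    glue get-partitions \
    --database-name 'search_tracking' \
    --table-name 'download_stats'
```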

References:

* https://www.anaconda.com/blog/announcing-public-anaconda-package-download-data
* https://anaconda-package-data.s3.amazonaws.com/
* https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html
* https://github.com/ContinuumIO/anaconda-package-data
* https://docs.aws.amazon.com/cli/latest/reference/glue/create-partition.html
