-
Notifications
You must be signed in to change notification settings - Fork 0
/
sql_workshop_slides.Rmd
645 lines (443 loc) · 11.5 KB
/
sql_workshop_slides.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
---
title: "SQL Workshop"
author: "Oscar O'Brien"
date: "2023-05-14"
output:
revealjs::revealjs_presentation:
theme: night
highlight: zenburn
transition: slide
increment: true
---
# SQL
- SQL == Structured Query Language
- Made by IBM in 1974 to handle large datasets
- Works with data in 3 ways:
- Query
- Maintenance
- Definition
. . .
| **Category** | **Statement** | **Purpose** |
|---------------|---------------|--------------------------|
| _Query_ | Select | Display rows of table(s) |
| _Maintenance_ | Insert | Add rows to a table |
| | Update | Change rows in a table |
| | Delete | Remove rows from a table |
| _Definition_ | Create | Add tables |
| | Drop | Remove tables |
# Definitions
Here is a very basic query structure:
```{r eval = F}
SELECT columns
FROM table
WHERE comparisons
```
- Keywords: reserved special words
- SELECT, FROM, WHERE, etc...
- SELECT == select == SeLEcT (but usually uppercase)
- Clauses: are name after their keywords
- SELECT clause, FROM clause, WHERE clause
- Typically each clause gets its own line for readability, but not necessary
- Parameters: variable part of clauses
- Ex: specific column names, values, etc...
# Table Structure
- Data is stored similar to the tidy format from R
- Rows are individual observations
- Columns are variables
- Cells are values
- Normalization in SQL is the process of reducing data redundancy
- Typically, data is stored across multiple tables
- The relationship of the data dictates table structure
- Keys are columns that are used to join tables
- Primary keys uniquely identify rows
- Composite keys occur when multiple columns identify a row
- Foreign keys are primary keys stored outside of their primary table
## Basic Table Example
![](sql_images/basic_sql_table.png)
## Basic Relationships
- One:Many
- Add a table to store repeated information by referencing its key
- Shown in the example above
- Many:Many
- Create a "transaction" table to store keys from two tables
. . .
![](sql_images/many-to-many.png)
# Getting Started with SQL in R
```{r}
# install.packages("DBI")
library(DBI)
```
- DBI is an interface with these database backends:
- RPostgres
- RMariaDB
- RSQLite
- odbc
- bigrquery
- and more
# Creating a temporary table in memory
```{r}
# opening a connection
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# creating a new table from a dataframe
dbWriteTable(con, "mtcars", mtcars)
```
- Here we open a connection to a SQLite database in memory
- Usually you provide username, password, host, port to connect to online server
- con is initially empty, so we write the built-in mtcars table to a table called "mtcars"
## A More Realistic Connection
![](sql_images/sql_connection(1).png)
- This is a SQL endpoint from AWS
## Connection Continued
![](sql_images/sql_connection.png)
- This information is used in MySQL to connect to data not stored locally
# Basic Functionality
```{r}
dbListTables(con) # list tables
dbListFields(con, "mtcars") # list columns of mtcars table
dbReadTable(con, "mtcars") # show data from mtcars table
```
# Writing Basic Queries
```{r}
sql <- "
SELECT *
FROM mtcars
LIMIT 5
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
- The `*` is a special symbol for all columns
## Similarly, ...
```{r}
dbClearResult(query)
sql <- "
SELECT *
FROM mtcars
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- You can specify how many rows to retrieve in `dbFetch()`
- The default behavior is `n=-1`
## Selecting Specific Columns
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- Separate columns with a comma
- Column output can be renamed with `AS`
## Selecting Unique Values
```{r}
dbClearResult(query)
sql <- "
SELECT DISTINCT cyl AS cylinders
FROM mtcars
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
- Use the keyword `DISTINCT` to find the unique values in a column
# Filtering Rows
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
WHERE mpg > 20 AND cyl = 6 AND hp <> 90
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- The results display only rows where the WHERE clause evaluates to TRUE (removes missing values)
- Not equals works with `<>` and `!=` (in this SQL version)
- Test for equivalence with a single `=`
## More Filtering Details
- Place non-numeric date in quotes `''`
- `WHERE country = 'United States'`
- `WHERE date = '2023-05-14'`
- `OR` can be used instead of `AND`
- To check for missing values use `IS NULL` or `IS NOT NULL`
- For negation, use the keyword `NOT`
## More Filtering
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
WHERE cyl IN (4, 6)
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- `IN` is more concise than `cyl = 4 OR cyl = 6`
## Even More Filtering
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
WHERE mpg BETWEEN 10 AND 20
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- `BETWEEN` is more concise than `MPG > 10 AND MPG < 20`
# Ordering Results
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
ORDER BY mpg
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- The default ordering is ascending
## Descending Order
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
ORDER BY mpg DESC
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- Use the `DESC` keyword
## Ordering by Multiple Columns
```{r}
dbClearResult(query)
sql <- "
SELECT mpg, cyl AS cylinders, hp AS horsepower
FROM mtcars
ORDER BY mpg DESC, cyl
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 10)
```
- The results with be sorted by mpg descending
- For equal mpg values, the cyl values with by sorted in ascending order
# Basic Arithmetic Expressions
```{r}
dbClearResult(query)
sql <- "
SELECT mpg + cyl, cyl - drat, (disp * carb) / qsec
FROM mtcars
ORDER BY mpg + cyl
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- `+-*/()` all work as expected
- Calculates values per row by default
- Creates new columns for the results
- Columns can be renamed
# Aggregate Functions
```{r}
dbClearResult(query)
sql <- "
SELECT COUNT(*), SUM(cyl), MIN(mpg), MAX(mpg), AVG(hp) AS avg_hp
FROM mtcars
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- `SUM()`, `MIN()`, `MAX()`, `AVG()` work as expected
- `COUNT()` is tricky
- `COUNT(*)` counts all rows
- `COUNT(mpg)` counts all non empty rows in `mpg`
- `COUNT(DISTINCT mpg)` counts all distinct values in `mpg`
# Grouping Results
```{r}
dbClearResult(query)
sql <- "
SELECT cyl, AVG(hp) AS avg_hp
FROM mtcars
GROUP BY cyl
"
query <- dbSendQuery(con, sql)
dbFetch(query, n = 5)
```
- Non grouped columns must be in aggregate functions
## Grouping by Multiple Columns
```{r}
dbClearResult(query)
sql <- "
SELECT cyl, vs, AVG(hp) AS avg_hp
FROM mtcars
GROUP BY cyl, vs
ORDER BY avg_hp DESC
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
- Groups by cyl first, then vs
- Similar to `ORDER BY` clause from earlier
- Ordering happens at the end, so can use calculated columns
## Filtering Grouped Results
```{r}
dbClearResult(query)
sql <- "
SELECT cyl, vs, AVG(hp) AS avg_hp
FROM mtcars
GROUP BY cyl, vs
HAVING avg_hp > 100
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
- `HAVING` clause is like `WHERE` clause, but happens after grouping
# Summary of Clause Order
```{r eval = F}
"SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT"
```
- The order cannot be swapped
## Full Example
```{r}
dbClearResult(query)
sql <- "
SELECT cyl, vs, AVG(hp) AS avg_hp
FROM mtcars
WHERE am = 1 AND gear >= 3
GROUP BY cyl, vs
HAVING avg_hp > 100
ORDER BY cyl
LIMIT 5
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
# Joining Tables
Consider a different dataset with multiple tables about flights in and out of NYC
```{r}
dbDisconnect(con)
# install.packages(nycflights13)
library(nycflights13)
flights_subset <- nycflights13::flights[1:2000, ]
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "flights", flights_subset)
dbWriteTable(con, "airports", nycflights13::airports)
```
- We will consider the flights and airports tables
## Flight Data Structure
![](sql_images/nycflights.png)
## Types of Joins
![](sql_images/join_venn_diagram.svg)
## Join Example
```{r}
sql <- "
SELECT month, day, dep_time, origin, dest
FROM flights AS f
INNER JOIN airports AS a ON a.faa = f.dest
WHERE tzone = 'America/Chicago'
LIMIT 10
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
- Joins need `ON` to know how to join
- `ON` criteria can be more than 1 column
- AUS is Austin, TX
# Common Table Expressions
```{r}
dbClearResult(query)
sql <- "
WITH highest_avg_delay_dest AS (
SELECT dest, AVG(dep_delay)
FROM flights
WHERE dep_delay > 0
GROUP BY dest
LIMIT 1
)
SELECT f.dest, day, COUNT(*) as num_flights, SUM(dep_delay) as total_delay, AVG(dep_delay) as avg_delay
FROM flights f
INNER JOIN highest_avg_delay_dest hadd ON hadd.dest = f.dest
WHERE dep_delay > 0
GROUP BY f.dest, day
ORDER BY day ASC
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
- Common Table Expressions allow for queries to happen in sequence for more complex queries
- Remember the data here is a small subset, so we only get two days of data
- There are other types of sub-queries, but I like CTEs for their readability
# Exercises
Consider a smaller dataset about flowers
```{r}
dbDisconnect(con)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "iris", iris)
dbReadTable(con, "iris")
```
## Exercise 1
How many flowers are there for each species and what is their average sepal length?
-Note: \`column.name\` forces SQL to consider `column.name` is a column, not as table `column` with column `name`
## Exercise 1 Solution
```{r}
sql <- "
SELECT Species, COUNT(*), AVG(`Sepal.Length`) as avg_length
FROM iris
GROUP BY Species
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
## Exercise 2
Which rows have the sum of sepal length, sepal width, petal length, petal width larger than 10 order by petal length descending and limit to 10 results?
## Exercise 2 Solution
```{r}
dbClearResult(query)
sql <- "
SELECT *
FROM iris
WHERE (`Sepal.Length` + `Sepal.Width` + `Petal.Length` + `Petal.Width`) > 10
ORDER BY `Petal.Length` DESC
LIMIT 10
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
## Exercise 3
What two species of have highest sepal length to sepal width ratio?
## Exercise 3 Solution
```{r}
dbClearResult(query)
sql <- "
SELECT Species, `Sepal.Length`/`Sepal.Width` AS ratio
FROM iris
GROUP BY Species
ORDER BY ratio DESC
LIMIT 2
"
query <- dbSendQuery(con, sql)
dbFetch(query)
```
# Conclusion
- We learned:
- the purpose of SQL
- SQL's basic functions
- how data is stored
- how to query databases with `SELECT`
- Not covered:
- maintaining tables
- defining tables
# Bonus
SQL Murder Mystery
https://mystery.knightlab.com/
# References
- https://cdncontribute.geeksforgeeks.org/wp-content/uploads/SQL-Manual.pdf
- https://dbi.r-dbi.org/