Jorge Castro DAPT NOV2021
In this lab, you will be using the Sakila database of movie rentals. You have been using this database for a couple labs already, but if you need to get the data again, refer to the official installation link.
- Instructions:
- 1. Create a table rentals_may
- 2. Insert values in the table rentals_may
- 3. Create a table rentals_june
- 4. Insert values in the table rentals_june
- 5. Check the number of rentals for each customer for May
- 6. Check the number of rentals for each customer for June
- 7. Create a Python connection with your SQL database
- 8. Write a function that checks if customer borrowed more or less films in the month of June as compared to May
- 9. Connecting to a Microsoft Azure SQL Database with pyodbc
The database is structured as follows:
In this lab we will find the customers who were active in consecutive months of May and June. Follow the steps to complete the analysis.
to store the data from rental table with information for the month of May
Answer:
CREATE TABLE rentals_may AS
SELECT
*
FROM
rental
LIMIT
0;
to verify the table:
DESCRIBE rental;
using the table rental, filtering values only for the month of May
Answer:
INSERT INTO rentals_may
SELECT
*
FROM
rental
WHERE
monthname(rental_date) = 'May';
to verify that the data was inserted into the table:
SELECT
*
FROM
rentals_may
LIMIT
3;
💡
|
the last two questions can be done in only one query as follow: |
Answer:
CREATE TABLE rentals_may AS
SELECT
*
FROM
rental
WHERE
monthname(rental_date) = 'May';
To verify that the table was created along with the data:
SELECT
*
FROM
rentals_may
LIMIT
5;
to store the data from rental table with information for the month of June
Answer:
CREATE TABLE rentals_june AS
SELECT
*
FROM
rental
WHERE
monthname(rental_date) = 'June';
to verify the table and data:
SELECT
*
FROM
rentals_june
LIMIT
3;
using the table rental, filtering values only for the month of June
Answer:
Already done in the previous question
Answer:
SELECT
concat((b.last_name), ' ', (b.first_name)) AS customer_name,
count(*) AS number_of_films_rented_may
FROM
rentals_may a
INNER JOIN customer b ON a.customer_id = b.customer_id
GROUP BY
1
ORDER BY
1
LIMIT
5;
Answer:
SELECT
concat((b.last_name), ' ', (b.first_name)) AS customer_name,
count(*) AS number_of_films_rented_june
FROM
rentals_june a
INNER JOIN customer b ON a.customer_id = b.customer_id
GROUP BY
1
ORDER BY
1
LIMIT
5;
and retrieve the results of the last two queries
(also mentioned below) as dataframes:
Answer:
from sqlalchemy import create_engine
import pandas as pd
import getpass
password = getpass.getpass()
DATABASE_LOCATION = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(DATABASE_LOCATION, echo=True)
Answer:
query_1 = """
SELECT
concat((b.last_name), ' ', (b.first_name)) AS customer_name,
count(*) AS number_of_films_rented_may
FROM
rentals_may a
INNER JOIN customer b ON a.customer_id = b.customer_id
GROUP BY
1
ORDER BY
1;
"""
data_1 = pd.read_sql_query(query_1, engine)
data_1.head()
query_2 = """
SELECT
concat((b.last_name), ' ', (b.first_name)) AS customer_name,
count(*) AS number_of_films_rented_june
FROM
rentals_june a
INNER JOIN customer b ON a.customer_id = b.customer_id
GROUP BY
1
ORDER BY
1;
"""
data_2 = pd.read_sql_query(query_2, engine)
data_2.head()
8. Write a function that checks if customer borrowed more or less films in the month of June as compared to May
Answer:
Declaring the two SQL queries into separate variables:
query_1 = """
SELECT concat((b.last_name), ' ', (b.first_name)) AS customer_name,
count(*) AS number_of_films_rented_may
FROM rentals_may a
INNER JOIN customer b ON a.customer_id = b.customer_id
GROUP BY 1
ORDER BY 1;
"""
query_2 = """
SELECT concat((b.last_name), ' ', (b.first_name)) AS customer_name,
count(*) AS number_of_films_rented_june
FROM rentals_june a
INNER JOIN customer b ON a.customer_id = b.customer_id
GROUP BY 1
ORDER BY 1;
"""
Function:
def rents(customer_name):
from sqlalchemy import create_engine
import pandas as pd
import getpass
import numpy as np
# Connecting to the database
password = getpass.getpass()
DATABASE_LOCATION = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(DATABASE_LOCATION)
# Fetching rentals for May with the SQL query
data_05 = pd.read_sql_query(query_1, engine)
# Fetching rentals for June with the SQL query
data_06 = pd.read_sql_query(query_2, engine)
# Merging the two DataFrames and replacing NULL values with (0)
rentals_05_06 = pd.merge(data_05, data_06)
rentals_05_06.fillna(value=0, inplace=True)
# Setting the index by customer name
rentals_05_06.set_index('customer_name', inplace = True)
if customer_name in rentals_05_06.index:
if rentals_05_06.at[customer_name, 'number_of_films_rented_may'] > rentals_05_06.at[customer_name, 'number_of_films_rented_june']:
print('Customer', customer_name, 'decreace of rents in June')
elif rentals_05_06.at[customer_name, 'number_of_films_rented_may'] < rentals_05_06.at[customer_name, 'number_of_films_rented_june']:
print('Customer', customer_name, 'increace of rents in June')
else:
print('Customer', customer_name, 'had same rents as May')
else:
print('Customer', customer_name, ' had no rents in May and June')
Testing the function by passing as argument the name of some customers):
Answer:
from sqlalchemy import create_engine
import textwrap
import pyodbc
import getpass
import pandas as pd
import numpy as np
-
Setting up the query variables and connection string (database location)
sql_query_1 = """
SELECT last_name, first_name,
count(rental_id) AS number_of_films_rented_may
FROM rental a
INNER JOIN customer b ON a.customer_id = b.customer_id
WHERE MONTH(rental_date) = 5
GROUP BY last_name, first_name
ORDER BY 1;
"""
-
Driver: find the DRIVERS that are available using the pyodbc.drivers() method
pyodbc.drivers()
# Creating the server URL:
server='{server_name}.database.windows.net,1433'.format(server_name=server_name)
# Driver:
driver='{SQL Server Native Client 11.0}'
# Server name and DataBase name:
server_name='tcp:serv-sakila-2'
database_name='sakila'
# User name and password
username='a100jcd'
token=getpass.getpass()
# Setting up the Azure SQL database connection
DATABASE_LOCATION = textwrap.dedent('''
Driver={driver};
Server={server};
Database={database};
Uid={username};
Pwd={token};
Encrypt=yes;
TrustServerCertificate=no;
Connection Timeout=30;
'''.format(driver=driver,server=server,
database=database_name, username=username, token=token))
Here I used textwrap
. This python module provides formatting of text by adjusting the line breaks in the input paragraph. The dedent
function in this case is particularly useful to keep the code tidy without adding the actual line breaks and also to be able to use .format
. If I need to connect to another Azure SQL Server database, I only replace the variables:
-
server_name
-
database_name
-
username
-
token
A connection string must be indicated in the same line. If I arbitrarily just format the string with triple quotes, I would be adding line breaks (\n
) which would break the string.
cnx = pyodbc.connect(DATABASE_LOCATION)
data_1 = pd.read_sql_query(sql_query_1, cnx)
data_1.head()
-
Server Driver:
|
Error message:
|
InterfaceError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19372/2813816242.py in <module>
1 cnx = pyodbc.connect(DATABASE_LOCATION)
InterfaceError: ('IM002', [IM002] [Microsoft][ODBC Driver Manager] Data source name
not found and no default driver specified (0) (SQLDriverConnect))
At the Azure portal, Azure SQL databases provide a ready to use connection string for ODBC[1] specifying the ODBC Driver 13 for SQL Server
.
However SQLAlchemy has several dialects/DBAPI options available[2] such as PyODBC, mxODBC[3] and pymssql. So if we run into error messages of this sort, this is when the .driver
method comes handy to check the available drivers so we can test the connection.
For example in my case, I could not connect to the database using the ODBC Driver 13 for SQL Server
as specified by Azure. I could connect successfully using the ODBC Driver 17 for SQL Server
. Also we have to make sure that this driver is actually install in our pc.
-
Azure SQL server firewall settings:
|
Error message:
|
---------------------------------------------------------------------------
ProgrammingError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19372/2813816242.py in <module>
----> 1 cnx = pyodbc.connect(DATABASE_LOCATION)
"""ProgrammingError: (42000, [42000] [Microsoft][ODBC Driver 17 for SQL Server]
[SQL Server]Cannot open server serv-sakila-2 requested by the login. Client with IP address
2.206.133.211 is not allowed to access the server. To enable access, use the Windows Azure
Management Portal or run sp_set_firewall_rule on the master database to create a firewall rule
for this IP address or address range. It may take up to five minutes for this change to
take effect. (40615) (SQLDriverConnect); [42000] [Microsoft][ODBC Driver 17 for
SQL Server]Invalid connection string attribute (0); [42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server] Cannot open server serv-sakila-2 requested by the login. Client with IP address
2.206.133.211 is not allowed to access the server. To enable access, use the Windows
Azure Management Portal or run sp_set_firewall_rule on the master database to create
a firewall rule for this IP address or address range. It may take up to five minutes
for this change to take effect. (40615); [42000] [Microsoft][ODBC Driver 17
for SQL Server]Invalid connection string attribute (0))"""
When a Azure SQL Database
or Azure Synapse Analytics are first created, by default, the firewall blocks all access to the public endpoint for the server. It is necessary to add a firewall rule at the server-level to be able to access the resource. We can configure server-level IP firewall rules by using the Azure portal, PowerShell, or Transact-SQL statements.