In this lab, we will explore one of the latest deployment offerings of Azure Database for PostgreSQL: Flexible Server, which is currently in public preview. Flexible Server is architected to meet the requirements of modern applications. With Flexible Server you get improved network latency, a simplified provisioning experience, a built-in connection pooler, burstable compute, and, above all, the ability to stop and start the server on demand to reduce cost.
In this module you will learn how to import data into an Azure Database for PostgreSQL - Flexible Server instance using a Python script and the `psycopg2` module. Once the data is loaded, you will explore a dataset meant to simulate an IoT (Internet of Things) use case. This dataset contains simulated weather data from weather sensors in cities across the world.

To store this dataset, we have tables containing geospatial information stored using Postgres's PostGIS extension, as well as JSON content stored in Postgres's `jsonb` data type. We access the data using functions built into a Python script provided as part of this lab. The script relies on `psycopg2` to connect to Postgres and load or fetch data.
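To give a feel for that access pattern before we begin, here is a minimal sketch of connecting with `psycopg2` and reading a `jsonb` column. The table name `raw_data` appears later in this lab, but the `payload` column and the query itself are illustrative assumptions, not the lab's actual schema:

```python
# Minimal sketch of the psycopg2 access pattern used throughout this lab.
# Assumption: the 'payload' jsonb column is illustrative; the lab script defines the real schema.
import psycopg2

conn = psycopg2.connect(
    "host=<my-server-name>.postgres.database.azure.com port=5432 "
    "dbname=postgres user=myadmin password=... sslmode=require"
)
with conn.cursor() as cur:
    # jsonb fields can be queried directly in SQL; ->> extracts a text value.
    cur.execute("SELECT payload ->> 'humidity' FROM raw_data LIMIT 5")
    for (humidity,) in cur.fetchall():
        print(humidity)
conn.close()
```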
- Azure Subscription (e.g. Free or Student)
- An Azure Database for PostgreSQL - Flexible Server (preview) (create via the Portal or Azure CLI)
- Python 3.4+
- Latest pip package installer
The `psycopg2` module enables connecting to and querying a PostgreSQL database, and is available as a Linux, macOS, or Windows wheel package. Install the binary version of the module, including all of its dependencies. For more information about `psycopg2` installation and requirements, see Installation.

To install `psycopg2`, open a terminal or command prompt and run the command `pip install psycopg2`.
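To verify the installation succeeded, you can check the module's version from Python:

```python
# Quick smoke test: confirm psycopg2 imports cleanly and report its version.
import psycopg2
print(psycopg2.__version__)
```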
We will also install `fire`, required by our CLI tool, `pg-lab.py`, by running the command `pip install fire`.
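In case you're curious how `fire` turns `pg-lab.py`'s functions into CLI commands, here is a minimal, self-contained sketch of the pattern; the `greet` function and `demo.py` filename are made up for illustration and are not part of the lab script:

```python
# demo.py -- minimal illustration of the python-fire CLI pattern (not the lab script itself).
import fire

def greet(name="world"):
    """Return a greeting; fire turns this function into a CLI subcommand."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    # Exposes each function passed to fire.Fire as a subcommand,
    # e.g. `python3 demo.py greet --name=Postgres`
    fire.Fire({"greet": greet})
```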
Connecting to an Azure Database for PostgreSQL database requires the fully qualified server name and login credentials. You can get this information from the Azure portal.
- In the Azure portal, search for and select your Azure Database for PostgreSQL - Flexible Server (preview) server name.
- On the server's Overview page, copy the fully qualified Server name and the Admin username. The fully qualified Server name is always of the form `<my-server-name>.postgres.database.azure.com`.
- You will also need your Admin password, which you chose when you created the server; otherwise you can reset it using the Reset password button on the Overview page.
Note: Make sure you have created a server-level firewall rule to allow traffic from the IP address of the machine you will be using to connect to the database. If you are connected to a remote machine via SSH, you can find your current IP address from the terminal using `dig +short myip.opendns.com @resolver1.opendns.com`.
- First, we need to get the script `pg-lab.py`, as well as `data.csv`, onto our local machine. You may download them manually, or `git clone` this repository and `cd` into the correct `01-postgres/` directory as follows:

  ```sh
  git clone https://github.com/Azure-Samples/azure-python-labs.git
  cd azure-python-labs/01-postgres/
  ```
- Then, we need to set up the Postgres connection we're going to use for the rest of this lab. The following command writes the connection string to a `.config` file in the current directory via the `writeConfig` function. To run it, first set the environment variables from which we build the connection string (a sketch of what `writeConfig` might do follows the command):

  ```sh
  export SERVER_NAME='pg200700.postgres.database.azure.com'
  export ADMIN_USERNAME='myadmin'
  export ADMIN_PASSWORD='...'
  python3 pg-lab.py writeConfig "host=${SERVER_NAME} port=5432 dbname=postgres user=${ADMIN_USERNAME} password=${ADMIN_PASSWORD} sslmode=require"
  ```
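  For reference, a helper like `writeConfig` plausibly just persists the string to disk so later commands can reuse it; this is a sketch of that idea, not the lab script's actual code:

  ```python
  # Sketch of a writeConfig-style helper (assumption: the real pg-lab.py may differ).
  def writeConfig(connection_string):
      """Persist the Postgres connection string so later commands can reuse it."""
      with open(".config", "w") as f:
          f.write(connection_string)
      print("Connection string saved to .config")
  ```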
- Next, let's create a table and load some data from a local CSV file, `data.csv`. The `loadData` function of the lab script will automatically connect to the database, create our tables if they don't exist, and then use a `COPY` command to load our data into the `raw_data` table from `data.csv`. All we need to do is invoke that function with the name of the data file (a sketch of the underlying `COPY` pattern follows the command):

  ```sh
  python3 pg-lab.py loadData data.csv
  ```
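  Under the hood, a `COPY`-based load with `psycopg2` might look something like this sketch; the `COPY` options and the assumption that the CSV's columns match `raw_data` are illustrative, and the actual lab script may differ:

  ```python
  # Sketch: bulk-loading a CSV with psycopg2's COPY support (assumptions noted above).
  import psycopg2

  with open(".config") as f:
      conn = psycopg2.connect(f.read().strip())

  with conn, conn.cursor() as cur, open("data.csv") as data:
      # COPY ... FROM STDIN streams the file to the server in one round trip,
      # which is far faster than row-by-row INSERTs.
      cur.copy_expert("COPY raw_data FROM STDIN WITH (FORMAT csv, HEADER true)", data)
  conn.close()
  ```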
- Now that the data is loaded, let's look at a sample of the data to see what we're working with. `pg-lab.py` has a few functions built in to process data and give us some results, so to keep things simple, let's start with `getAverageTemperatures`. This function automatically pulls data, loads it into a dict for processing, and gives us average temperatures per location. This is a very inefficient function, so you'll probably notice that it is slow (the sketch after the command shows why):

  ```sh
  python3 pg-lab.py getAverageTemperatures
  ```
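  The slow pattern likely resembles the following sketch: every raw row crosses the network and the averaging happens in Python. The column names here are assumptions for illustration:

  ```python
  # Sketch of the inefficient client-side aggregation anti-pattern (assumed schema).
  from collections import defaultdict

  def get_average_temperatures(cur):
      # Anti-pattern: fetch every raw row over the network...
      cur.execute("SELECT city, temperature FROM raw_data")
      sums, counts = defaultdict(float), defaultdict(int)
      for city, temp in cur.fetchall():  # ...then aggregate in Python.
          sums[city] += temp
          counts[city] += 1
      return {city: sums[city] / counts[city] for city in sums}
  ```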
- We're going to be using geospatial data for the next part of this lab, so to prepare for that, let's add a geospatial index to speed things up in advance. PostGIS can use GiST indexes to make spatial lookups much faster, so if we create an index and specify the GiST type, we'll get great results. To make this easier, `pg-lab.py` has a built-in `runSQL` function to run arbitrary SQL for this lab:

  ```sh
  python3 pg-lab.py runSQL "CREATE INDEX idx_raw_data_1 ON raw_data USING GIST (location);"
  ```
- There's one more thing we need to do before we can run all of the queries we want to. Right now, we've only got data in our `raw_data` table, and it has a record for every device and every minute. This means that if we just want to look up basic information, such as what city a device is in, we've got to query a big table, which can be slow. To fix this, `pg-lab.py` has a `populateDevices` function that will perform an `INSERT INTO ... SELECT` to automatically populate the `device_list` table with summary information on our devices. Creating aggregate or summary tables like this is an excellent way to speed up application performance! (A sketch of the pattern follows the command.)

  ```sh
  python3 pg-lab.py populateDevices
  ```
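  Hypothetically, the statement behind `populateDevices` might resemble this sketch; the `device_list` column list is an assumption, and the actual script may differ:

  ```python
  # Sketch: populating a summary table with a single server-side statement (assumed schema).
  import psycopg2

  SUMMARY_SQL = """
  INSERT INTO device_list (device_id, city, location)
  SELECT DISTINCT ON (device_id) device_id, city, location
  FROM raw_data
  ORDER BY device_id;
  """

  with open(".config") as f:
      conn = psycopg2.connect(f.read().strip())
  with conn, conn.cursor() as cur:
      cur.execute(SUMMARY_SQL)  # All the work happens inside Postgres; no rows cross the network.
  conn.close()
  ```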
- Now that we've got our database ready to go, let's start using our application. For this part of the lab, we're going to figure out what the average temperature was at the nearest sensor to a location of our choice. We'll start by picking any city that you like, anywhere in the world, and doing a Bing search for the city and the word "coordinates". For example, if you wanted to see temperature data for Lima, Peru, you'd get a result page with a latitude and longitude of -12.046374, -77.042793. Now that we've got some coordinates to test with, we'll find the nearest device so we can get suitable information. The `getNearestDevice` function will query our new `device_list` table using the `ST_Distance` PostGIS function to figure out which device is closest (a sketch of such a query follows the command):

  ```sh
  python3 pg-lab.py getNearestDevice -12.046374 -77.042793
  ```
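  A plausible shape for that nearest-device lookup, ordering candidates by `ST_Distance` from the supplied point, is sketched below; the schema details are assumptions:

  ```python
  # Sketch: finding the nearest device with PostGIS (assumed schema; real script may differ).
  NEAREST_SQL = """
  SELECT device_id, city
  FROM device_list
  ORDER BY ST_Distance(location, ST_MakePoint(%s, %s)::geography)
  LIMIT 1;
  """

  def get_nearest_device(cur, latitude, longitude):
      # Note: PostGIS points take (longitude, latitude) -- x before y.
      cur.execute(NEAREST_SQL, (longitude, latitude))
      return cur.fetchone()
  ```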
- Now that we've found the device, we can get the average temperature of that device from our `raw_data` table using the `getDeviceAverage` function. Unlike the inefficient `getAverageTemperatures` function above, we're having Postgres generate the average for us. This is a huge improvement in performance, as we need to move much less data over the network, and Postgres is very well optimized to run analytical queries:

  ```sh
  python3 pg-lab.py getDeviceAverage 5
  ```
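  The server-side version is likely little more than a one-line aggregate, along these lines (column names are assumptions):

  ```python
  # Sketch: letting Postgres compute the aggregate (assumed schema).
  def get_device_average(cur, device_id):
      # Only a single row crosses the network, instead of every reading for the device.
      cur.execute("SELECT AVG(temperature) FROM raw_data WHERE device_id = %s", (device_id,))
      return cur.fetchone()[0]
  ```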
Bonus objective: `getAverageTemperatures` pulls a lot of data it doesn't need to. Rewrite it to do the average calculation in pure SQL instead!
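As a starting hint, the rewrite can push the whole aggregation into one `GROUP BY` query, something like this sketch (column names are assumptions matching the earlier sketches):

```python
# Sketch: one possible pure-SQL rewrite of getAverageTemperatures (assumed schema).
def get_average_temperatures_sql(cur):
    cur.execute("SELECT city, AVG(temperature) FROM raw_data GROUP BY city ORDER BY city")
    return dict(cur.fetchall())  # {city: average_temperature}
```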
If you have created an Azure Database for PostgreSQL - Flexible Server (preview) for the purposes of this lab and you do not want to keep it and continue to be billed for it, you can delete it via the Azure Portal (see: delete resources). Or, if you want to return to it in the future, you can stop the server temporarily by using the Stop button on the Overview page. Please note that while the server is in the Stopped state, you will not be charged for compute resources; however, you will still be charged for storage.

To restart the server, just go to the Azure portal and click the Start button on the Overview page.
If you want to dig deeper and understand everything PostgreSQL Flexible Server on Azure has to offer, the Flexible Server documentation is a great place to start.