The City of New York would like to develop a Data Analytics platform on Azure Synapse Analytics to accomplish two primary objectives:
The main goals are to create high-quality data pipelines that are dynamic, can be automated, and monitored for efficient operation. The project team also includes the city’s quality assurance experts who will test the pipelines to find any errors and improve overall data quality.
The source data resides in Azure Data Lake and needs to be processed in a NYC data warehouse in Azure Synapse Analytics. The source datasets consist of CSV files with Employee master data and monthly payroll data entered by various City agencies.
For this project, I worked in the Azure Portal, using several Azure resources, including:
The project was divided into 6 steps to organize and allow the correct management.
The data infrastructure involves the creation of the following resources:
- Azure Data Lake Storage Gen2 (storage account) and associated storage container resource to upload the raw data.
- Azure Data Factory Resource.
- SQL Database and the table to store the current year data.
- Synapse Analytics workspace and the master data tables.
In the Azure Data Factory, three Linked Services were created:
- To Azure Data Lake
- To SQL Database
- To Synapse Analytics
In the Azure Data Factory, were created the Datasets to load the raw data and to save the transformed data.