microsoft/Data-and-AI-Platform

1. Deploy the Azure Infrastructure and Data Pipeline Related Artifacts

  1. Create a Service Principal

  2. Assign the service principal rights on the subscription. There are two options:

    • Assign the service principal Owner RBAC rights at the subscription(s)
    • Pre-create all resource groups and assign the service principal Owner RBAC rights at each resource group
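    • For example, a minimal Azure CLI sketch; the subscription ID, resource group, and service principal names below are placeholders, not values from this repo:
# Create the service principal (the returned appId is the client ID)
az ad sp create-for-rbac --name "sp-data-ai-platform"

# Option 1: Owner rights at the subscription scope
az role assignment create --assignee "<appId>" --role "Owner" \
  --scope "/subscriptions/<subscription-id>"

# Option 2: pre-create each resource group and grant Owner on it
az group create --name "rg-dataai-dev" --location "eastus2"
az role assignment create --assignee "<appId>" --role "Owner" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/rg-dataai-dev"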
  3. Create a federated credential for the service principal

    • Please use an entity type of environment
    • You will need to create a new federated credential for each environment you're deploying.
      • The IP kit deploys up to 3 environments: development, test, and production
      • Your federated credential environment names must exactly match the environment names listed above (development, test, production)
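    • A minimal Azure CLI sketch for creating one of the federated credentials; <appId> and <ORG>/<REPO> are placeholders, and you would repeat this for test and production:
# One federated credential per GitHub environment (development shown)
az ad app federated-credential create --id "<appId>" --parameters '{
  "name": "github-development",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:<ORG>/<REPO>:environment:development",
  "audiences": ["api://AzureADTokenExchange"]
}'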
  4. Create an Azure Active Directory (AAD) group and add all project team members (or just yourself, if you will be the only one interacting with the deployed resources)
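    • A minimal sketch with the Azure CLI; the group name is a placeholder:
# Create the group, then add members by object ID
az ad group create --display-name "DataAIPlatformTeam" --mail-nickname "DataAIPlatformTeam"
az ad group member add --group "DataAIPlatformTeam" --member-id "<user-object-id>"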

  5. If you're using GitHub environments, create the environments below in your GitHub repo

    • development
    • test
    • production
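    • If you prefer the GitHub CLI over the web UI, a sketch like the following should create all three; <ORG>/<REPO> is a placeholder:
# PUT creates (or updates) a repository environment
for env in development test production; do
  gh api --method PUT "repos/<ORG>/<REPO>/environments/$env"
done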
  6. If you're using environments, add the secrets below to each environment you're deploying. If you're not using environments, add them as repository secrets with the same names

  7. If you're creating private endpoints, also create the following secret

    • DNS_ZONE_SUBSCRIPTION_ID
  8. If you're deploying VMs with Bastion, also create the following secrets

    • VM_USERNAME
    • VM_PASSWORD
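    • A sketch of setting these with the GitHub CLI; drop --env to create repository secrets instead of environment secrets:
gh secret set DNS_ZONE_SUBSCRIPTION_ID --env development --body "<subscription-id>"
gh secret set VM_USERNAME --env development --body "<vm-admin-username>"
gh secret set VM_PASSWORD --env development --body "<vm-admin-password>"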
  9. For each environment you're deploying, update the feature flag variable file to indicate which resources you're deploying and how they should behave

    • If you're deploying Role-Based Access Control (RBAC), please refer here for what RBAC is deployed
    • If you're deploying the pre-built data pipelines, you must enable Data Factory, Landing Storage, Data Lake, Azure SQL and either Synapse or Databricks
  10. For each environment you're deploying, update the general variable file with the resource names for the resources you indicated you are deploying based on the feature flag file. Also add required tags, Azure location, and resource group names.

    • All resources other than Logic App, Azure Machine Learning, and OpenAI resources will be deployed to the resource group specified in the PrimaryRg variable
    • The PrimaryRg variable is required. If you're only deploying Logic App/Azure Machine Learning/OpenAI resources, set PrimaryRg to the same name as one of the other resource groups
    • Note that most Azure resource names need to be globally unique, but keep the SQL Database name as "MetadataControl"
    • The following variable values can only contain letters and numbers and must be between 3 and 24 characters long
      • dataLakeName
      • landingStorageName
      • logicAppStorageName
      • mlStorageName
    • The following variable values must be between 3 and 24 characters long
      • keyVaultName
    • The following variable values can only contain letters and numbers
      • mlContainerRegistryName
      • fabricCapacityName
    • If Key Vault or Container Registry is deleted and needs to be redeployed, change the resource name
      • This is due to soft-delete policies
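      • Renaming is the approach documented here; if your organization instead allows purging, a sketch for Key Vault with the Azure CLI:
# List soft-deleted vaults still holding a name, then purge (irreversible)
az keyvault list-deleted --query "[].name" -o tsv
az keyvault purge --name "<key-vault-name>"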
  11. If you're deploying the resources securely with no public access and private endpoints, update the networking setup variable files and set the DeployWithCustomNetworking feature flag to true in the feature flag variable file

    • The best practice is to connect to an existing spoke virtual network (or networks) for private endpoints and VNet injection. Please refer here for an overview of the networking requirements
  12. Update the entra assignments variable files

    • Only the Entra_Group_Admin and Entra_Group_Shared_Service groups are required. If you created only one group in step 4 above, you can use the same information for both variables
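    • A sketch for looking up a group's object ID with the Azure CLI, in case the variable file expects IDs rather than display names (check the file's inline comments):
az ad group show --group "DataAIPlatformTeam" --query id -o tsv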
  13. Confirm the following resource providers are registered in your Azure Subscription. If not, register them

    • Microsoft.EventGrid
    • If you're deploying Purview: Microsoft.Purview, Microsoft.EventHub
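    • A sketch for checking and registering providers with the Azure CLI:
# Check registration state, then register any provider that is not yet registered
for rp in Microsoft.EventGrid Microsoft.Purview Microsoft.EventHub; do
  az provider show --namespace "$rp" --query registrationState -o tsv
done
az provider register --namespace Microsoft.EventGrid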
  14. Trigger the data-strategy-orchestrator GitHub Action. If you're unfamiliar with triggering a GitHub Action, follow these instructions.

    • Please do not use the "rerun" job functionality. Always execute the job using the method in the instructions above
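    • If you trigger workflows from the GitHub CLI, a sketch like the following should work; the exact workflow name is an assumption, so list workflows first to confirm:
gh workflow list
gh workflow run "data-strategy-orchestrator" --ref main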

2. Complete the Post Deployment Tasks

Azure SQL

  1. Execute the stored procedure below in the deployed Azure SQL Database(s)
    • Log in with AAD; SQL authentication is disabled.
EXEC [dbo].[AddManagedIdentitiesAsUsers]
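    • A minimal sqlcmd sketch; the server name is a placeholder, the database keeps the required "MetadataControl" name, and -G authenticates with AAD instead of SQL auth (the same approach works against the Synapse serverless endpoint in the next section):
sqlcmd -S "<server-name>.database.windows.net" -d "MetadataControl" -G \
  -Q "EXEC [dbo].[AddManagedIdentitiesAsUsers]"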

Synapse

  1. Execute the stored procedure below in the Synapse serverless database StoredProcDB
    • Log in with AAD; SQL authentication is disabled post-deployment.
EXEC [dbo].[AddManagedIdentitiesAsUsers]
  2. If you're deploying the logic app, run the precreated SQL script RunForLogicApp in the Synapse portal

Purview

  1. Add the ADF and Synapse managed identities as Data Curators in the Root Collection of Purview
    • This is required for lineage
  2. When lake databases are created, you will need to execute the commands below so Purview can scan them
CREATE LOGIN [PurviewAccountName] FROM EXTERNAL PROVIDER;
CREATE USER [PurviewAccountName] FOR LOGIN [PurviewAccountName];
ALTER ROLE db_datareader ADD MEMBER [PurviewAccountName]; 

If you're deploying all resources with no public access behind a virtual network and your service principal didn't have Owner RBAC rights on the subscription

  1. Have the subscription Owner grant the AAD group Contributor access to the Purview managed resource group

If you set the DeployPurviewIngestionPrivateEndpoints feature flag to true

  1. Within the Azure portal, navigate to Purview's managed Storage Account and Event Hub. For each resource, approve the pending private endpoint connections created by the GitHub Action.
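    • The pending connections can also be approved with the Azure CLI; resource names below are placeholders (the managed storage account is shown; repeat with --type Microsoft.EventHub/namespaces for the Event Hub):
# List pending connections, then approve each one by its resource ID
az network private-endpoint-connection list --name "<managed-storage-account>" \
  --resource-group "<purview-managed-rg>" --type Microsoft.Storage/storageAccounts
az network private-endpoint-connection approve --id "<connection-resource-id>" \
  --description "Approved for Purview ingestion"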

If you're deploying all resources with no public access behind a virtual network

  1. Set up a Managed VNET Integration Runtime to scan supported Azure data sources
  2. Set up a Self-Hosted Integration Runtime to scan data sources unsupported by the Managed VNET Integration Runtime

3. Start Ingesting Data

Process Overview

  1. Overview of Pre-Built Ingestion Patterns (diagram)
  2. Overview of Pre-Built Data Pipelines (diagram)
  3. Moving Data to Curated (diagram)

Create Control Table Records for Metadata Driven Ingestion

  1. Create control table records in the dbo.MetadataControl table in the Azure SQL DB, following the instructions here
    • Each time you ingest a new source entity (e.g. SQL table, CSV file, Excel tab), create three control table records: one for moving data from source to landing, one for landing to raw, and one for raw to staging.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.