Data Science
A Data Science Process, Documentation, and Project Template You Can Use in Your Solutions
In most of the Data Science and AI articles, blogs, and papers I read, the focus is on a particular algorithm or mathematical angle for solving a puzzle. And that's awesome - we need LOTS of those. However, even after you figure those out, you have to use them somewhere. You have to run them on some sort of cloud or local system, describe what you're doing, distribute an app, import some data, check a security angle here and there, communicate with a team... you know, DevOps. In this article, I'll show you a complete process, procedures, and free resources to manage your Data Science project from beginning to end.
Cross Subscription Database Restore for SQL Managed Instance Database with TDE enabled using ADF
Our customers require daily refreshes of their production database into the non-production environment. The database, approximately 600 GB in size, has Transparent Data Encryption (TDE) enabled in production. Disabling TDE before performing a copy-only backup is not an option, as it would take hours to disable and re-enable. To meet this requirement, we use a customer-managed key stored in Azure Key Vault, and Azure Data Factory is used to schedule and execute the end-to-end database restore process.
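The full walkthrough covers the ADF orchestration; purely as a rough illustration of the underlying restore pattern, here is a minimal PowerShell sketch that takes a copy-only backup of the TDE-protected database to a URL and restores it on the target managed instance. The server names, storage URL, and database names are hypothetical, and it assumes both instances already hold a credential for the storage container, can reach the customer-managed key in Key Vault, and accept an Azure AD token for the signed-in account.

# Rough sketch with hypothetical names: copy-only backup to URL on the source MI,
# then restore on the target MI in another subscription. Assumes a SAS credential
# for the container exists on both instances and both can use the CMK in Key Vault.

$token        = az account get-access-token --resource https://database.windows.net --query accessToken --output tsv
$sourceServer = "source-mi.public.abc123.database.windows.net,3342"                    # hypothetical
$targetServer = "target-mi.public.def456.database.windows.net,3342"                    # hypothetical
$backupUrl    = "https://mystorageaccount.blob.core.windows.net/backups/myproddb.bak"  # hypothetical

# Copy-only backup on the source instance; TDE stays enabled and the backup stays protected by the CMK.
# For a database this size, also raise -QueryTimeout accordingly.
Invoke-Sqlcmd -AccessToken $token -ServerInstance $sourceServer -Database master -Query @"
BACKUP DATABASE [MyProdDb] TO URL = N'$backupUrl' WITH COPY_ONLY, COMPRESSION;
"@

# Restore on the target instance; it must have access to the same Key Vault key before the restore.
Invoke-Sqlcmd -AccessToken $token -ServerInstance $targetServer -Database master -Query @"
RESTORE DATABASE [MyProdDb_Refresh] FROM URL = N'$backupUrl';
"@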
Create and Deploy Azure SQL Managed Instance Database Project integrated with Azure DevOps CICD
Integrating database development into continuous integration and continuous deployment (CI/CD) workflows is a best practice for Azure SQL Managed Instance database projects. Automating the process through a deployment pipeline is always recommended. This automation ensures that ongoing deployments seamlessly align with your continuous local development efforts, eliminating the need for additional manual intervention. This article guides you through the step-by-step process of creating a new Azure SQL Managed Instance database project, adding objects to it, and setting up a CI/CD deployment pipeline using Azure DevOps.

Prerequisites
- Visual Studio 2022 Community, Professional, or Enterprise
- Azure DevOps environment
- Contributor permission within Azure DevOps
- Sysadmin server role within the Azure SQL Managed Instance

Step 1
Open Visual Studio and click Create a new project. Search for SQL Server and select SQL Server Database Project. Provide the project name and the folder path where the project and its compiled .dacpac file will be stored, then click Create.

Step 2
Import the database schema from an existing database. Right-click the project and select Import. You will see three options: Data-tier Application (.dacpac), Database, and Script (.sql). In this case, I am using the Database option and importing from an Azure SQL Managed Instance.
To proceed, you will encounter a screen that allows you to provide a connection string. You can choose a database from local, network, or Azure sources, depending on your needs. Alternatively, you can directly enter the server name, authentication type, and credentials to connect to the database server. Once connected, select the desired database to import and include in your project.

Step 3
Configure the import settings. Several options are available, each designed to optimize the process and ensure seamless integration:
- Import application-scoped objects: imports tables, views, stored procedures, and similar objects.
- Import reference logins: imports login-related objects.
- Import permissions: imports related permissions.
- Import database settings: imports database settings.
- Folder structure: lets you choose how database objects are organized into folders within the project.
- Maximum files per folder: limits the number of files per folder.
Click Start, which will show the progress window. Click Finish to complete the step.

Step 4
To ensure a smooth deployment process, start by incorporating any necessary post-deployment scripts into your project. These scripts are crucial for executing tasks that must be completed after the database has been deployed, such as performing data migrations or applying additional configurations.
To compile the database project in Visual Studio, right-click the project and select Build. This compiles the project and produces a .dacpac file, the deployable artifact for your database project. When building the project, you might face warnings and errors that need careful debugging and resolution. Common issues include missing references, syntax errors, or configuration mismatches. After addressing all warnings and errors, rebuild the project to create the .dacpac file. This file contains the database schema and is essential for deployment. Ensure that any post-deployment scripts are included in the project; they will run after the database deployment, performing any additional necessary tasks.
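Before wiring the project into a pipeline, it can help to publish the compiled .dacpac to a test instance manually and confirm that the schema and post-deployment scripts behave as expected. The following PowerShell sketch is only an illustration: the .dacpac path, server name, database name, and credential environment variables are hypothetical, and it assumes SqlPackage is installed and available on the PATH.

# Minimal local validation sketch (hypothetical paths and names).
# Publishes the compiled .dacpac to a test database before committing the project.

$dacpac   = ".\bin\Debug\MyMiDatabaseProject.dacpac"               # hypothetical build output
$server   = "test-mi.public.abc123.database.windows.net,3342"      # hypothetical test instance
$database = "MyMiDatabase_Test"                                    # hypothetical target database

# SqlPackage publishes the schema and runs any post-deployment script baked into the .dacpac
SqlPackage /Action:Publish `
    "/SourceFile:$dacpac" `
    "/TargetServerName:$server" `
    "/TargetDatabaseName:$database" `
    "/TargetUser:$($env:SQL_ADMIN_USER)" `
    "/TargetPassword:$($env:SQL_ADMIN_PASSWORD)"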
To ensure all changes are tracked and can be deployed through your CI/CD pipeline, commit the entire codebase, including the .sqlproj project file and any post-deployment scripts, to your branch in Azure DevOps. This step guarantees that every modification is documented and ready for deployment.

Step 5
Create an Azure DevOps pipeline to deploy the database project.

Step 6
To build the SQL project and publish the .dacpac file to the pipeline's artifact folder, include the following stages in the pipeline YAML.

stages:
- stage: Build
  jobs:
  - job: BuildJob
    displayName: 'Build Stage'
    steps:
    - task: VSBuild@1
      displayName: 'Build SQL Server Database Project'
      inputs:
        solution: $(solution)
        platform: $(buildPlatform)
        configuration: $(buildConfiguration)
    - task: CopyFiles@2
      inputs:
        SourceFolder: '$(Build.SourcesDirectory)'
        Contents: '**\*.dacpac'
        TargetFolder: '$(Build.ArtifactStagingDirectory)'
        flattenFolders: true
    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: '$(Build.ArtifactStagingDirectory)'
        artifact: 'dacpac'
        publishLocation: 'pipeline'

- stage: Deploy
  jobs:
  - job: Deploy
    displayName: 'Deploy Stage'
    pool:
      name: 'Pool'
    steps:
    - task: DownloadPipelineArtifact@2
      inputs:
        buildType: current
        artifact: 'dacpac'
        path: '$(Build.ArtifactStagingDirectory)'
    - task: PowerShell@2
      displayName: 'upgrade sqlpackage'
      inputs:
        targetType: 'inline'
        script: |
          # use the evergreen or a specific DacFx MSI link below
          Invoke-WebRequest -Uri "https://aka.ms/dacfx-msi" -OutFile DacFramework.msi
          Start-Process msiexec.exe -Wait -ArgumentList '/i DacFramework.msi /qn'
    - task: SqlAzureDacpacDeployment@1
      inputs:
        azureSubscription: '$(ServiceConnection)'
        AuthenticationType: 'servicePrincipal'
        ServerName: '$(ServerName)'
        DatabaseName: '$(DatabaseName)'
        deployType: 'DacpacTask'
        DeploymentAction: 'Publish'
        DacpacFile: '$(Build.ArtifactStagingDirectory)/*.dacpac'
        IpDetectionMethod: 'AutoDetect'

Step 7
To execute any pre- or post-deployment SQL scripts during deployment, install the required dependencies on the build agent (the Azure CLI and the SqlServer PowerShell module), obtain an access token, and then run the scripts.

# install all necessary dependencies onto the build agent
- task: PowerShell@2
  name: install_dependencies
  inputs:
    targetType: inline
    script: |
      # Download and install the Azure CLI
      write-host "Installing AZ CLI..."
      Invoke-WebRequest -Uri https://aka.ms/installazurecliwindows -OutFile .\AzureCLI.msi
      Start-Process msiexec.exe -Wait -ArgumentList "/I AzureCLI.msi /quiet"
      Remove-Item .\AzureCLI.msi
      write-host "Done."

      # prepend the az cli path for future tasks in the pipeline
      write-host "Adding AZ CLI to PATH..."
      write-host "##vso[task.prependpath]C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin"
      $currentPath = (Get-Item -path "HKCU:\Environment").GetValue('Path', '', 'DoNotExpandEnvironmentNames')
      if (-not $currentPath.Contains("C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin")) {
          setx PATH ($currentPath + ";C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin")
      }
      if (-not $env:path.Contains("C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin")) {
          $env:path += ";C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin"
      }
      write-host "Done."

      # install necessary PowerShell modules
      write-host "Installing necessary PowerShell modules..."
      Get-PackageProvider -Name nuget -force
      if (-not (Get-Module -ListAvailable -Name Az.Resources)) { install-module Az.Resources -force }
      if (-not (Get-Module -ListAvailable -Name Az.Accounts))  { install-module Az.Accounts -force }
      if (-not (Get-Module -ListAvailable -Name SqlServer))    { install-module SqlServer -force }
      write-host "Done."
- task: AzureCLI@2
  name: run_sql_scripts
  inputs:
    azureSubscription: '$(ServiceConnection)'
    scriptType: ps
    scriptLocation: inlineScript
    inlineScript: |
      # get an access token for Azure SQL
      $token = az account get-access-token --resource https://database.windows.net --query accessToken --output tsv

      # configure the OELCore database
      Invoke-Sqlcmd -AccessToken $token -ServerInstance '$(ServerName)' -Database '$(DatabaseName)' -InputFile '.\pipelines\config-db.dev.sql'
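If you want to verify this token-based pattern outside the pipeline first, the same commands can be run interactively from any machine with the Azure CLI and the SqlServer module installed. This is only a sketch: the server name, database name, and script path are hypothetical, and it assumes your own Azure AD account has been granted access to the database.

# Interactive sketch of the same token-based pattern (hypothetical names).
az login    # sign in as a user that has access to the managed instance database

$token = az account get-access-token --resource https://database.windows.net --query accessToken --output tsv
Invoke-Sqlcmd -AccessToken $token `
              -ServerInstance 'test-mi.public.abc123.database.windows.net,3342' `
              -Database 'MyMiDatabase_Test' `
              -InputFile '.\pipelines\config-db.dev.sql'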
Automated Continuous integration and delivery – CICD in Azure Data Factory
In Azure Data Factory, continuous integration and delivery (CI/CD) involves moving Data Factory pipelines across different environments such as development, test, UAT, and production. This process leverages Azure Resource Manager (ARM) templates to store the configurations of the various ADF entities, including pipelines, datasets, and data flows. This article provides a detailed, step-by-step guide on how to automate deployments using the integration between Data Factory and Azure Pipelines.

Prerequisites
- Azure Data Factory, with multiple ADF environments set up for the different stages of development and deployment.
- Azure DevOps, the platform for managing code repositories, pipelines, and releases.
- Git integration: ADF connected to a Git repository (Azure Repos or GitHub).
- ADF Contributor and Azure DevOps Build Administrator permissions.

Step 1
Establish a dedicated Azure DevOps Git repository specifically for Azure Data Factory within the designated Azure DevOps project.

Step 2
Integrate Azure Data Factory (ADF) with the Azure DevOps Git repository created in the first step.

Step 3
Create a developer feature branch within the Azure DevOps Git repository created in the first step. Select the new feature branch from ADF to start development.

Step 4
Begin the development process. For this example, I create a test pipeline "pl_adf_cicd_deployment_testing" and save all.

Step 5
Submit a pull request from the developer feature branch to main.

Step 6
Once the pull request is merged from the developer's feature branch into the main branch, publish the changes from the main branch to the ADF publish branch. The ARM templates (JSON files) are updated and become available in the adf_publish branch within the Azure DevOps ADF repository.

Step 7
ARM templates can be customized to accommodate different configurations for the Development, Testing, and Production environments. This customization is typically done through the ARMTemplateParametersForFactory.json file, where you specify environment-specific values such as linked services, environment variables, managed private endpoints, and so on. For example, in a Testing environment the storage account might be named teststorageaccount, whereas in a Production environment it could be prodstorageaccount.
To create an environment-specific parameters file:
- In the Azure DevOps ADF Git repo, on the main branch, open the linkedTemplates folder and copy ARMTemplateParametersForFactory.json.
- Create a parameters_files folder under the root path.
- Paste ARMTemplateParametersForFactory.json into the parameters_files folder and rename it to indicate the environment, for example prod-adf-parameters.json.
- Update each environment-specific parameter value.

Step 8
To create the Azure DevOps CI/CD pipeline, use the following code, and make sure you update the variables to match your environment before running it. This will allow you to deploy from one ADF environment to another, such as from Test to Production.
name: Release-$(rev:r)

trigger:
  branches:
    include:
    - adf_publish

variables:
  azureSubscription: <Your subscription>
  SourceDataFactoryName: <Test ADF>
  DeployDataFactoryName: <PROD ADF>
  DeploymentResourceGroupName: <PROD ADF RG>

stages:
- stage: Release
  displayName: Release Stage
  jobs:
  - job: Release
    displayName: Release Job
    pool:
      vmImage: 'windows-2019'
    steps:
    - checkout: self

    # Stop ADF triggers
    - task: AzurePowerShell@5
      displayName: Stop Triggers
      inputs:
        azureSubscription: '$(azureSubscription)'
        ScriptType: 'InlineScript'
        Inline: |
          $triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName "$(DeployDataFactoryName)" -ResourceGroupName "$(DeploymentResourceGroupName)"
          if ($triggersADF.Count -gt 0) {
              $triggersADF | ForEach-Object {
                  Stop-AzDataFactoryV2Trigger -ResourceGroupName "$(DeploymentResourceGroupName)" -DataFactoryName "$(DeployDataFactoryName)" -Name $_.name -Force
              }
          }
        azurePowerShellVersion: 'LatestVersion'

    # Deploy ADF using the ARM template and the environment-specific JSON parameters
    - task: AzurePowerShell@5
      displayName: Deploy ADF
      inputs:
        azureSubscription: '$(azureSubscription)'
        ScriptType: 'InlineScript'
        Inline: |
          New-AzResourceGroupDeployment `
            -ResourceGroupName "$(DeploymentResourceGroupName)" `
            -TemplateFile "$(System.DefaultWorkingDirectory)/$(SourceDataFactoryName)/ARMTemplateForFactory.json" `
            -TemplateParameterFile "$(System.DefaultWorkingDirectory)/parameters_files/prod-adf-parameters.json" `
            -Mode "Incremental"
        azurePowerShellVersion: 'LatestVersion'

    # Restart ADF triggers
    - task: AzurePowerShell@5
      displayName: Restart Triggers
      inputs:
        azureSubscription: '$(azureSubscription)'
        ScriptType: 'InlineScript'
        Inline: |
          $triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName "$(DeployDataFactoryName)" -ResourceGroupName "$(DeploymentResourceGroupName)"
          if ($triggersADF.Count -gt 0) {
              $triggersADF | ForEach-Object {
                  Start-AzDataFactoryV2Trigger -ResourceGroupName "$(DeploymentResourceGroupName)" -DataFactoryName "$(DeployDataFactoryName)" -Name $_.name -Force
              }
          }
        azurePowerShellVersion: 'LatestVersion'

Triggering the Pipeline
The Azure DevOps CI/CD pipeline triggers automatically whenever updated ARM templates are pushed to the adf_publish branch, which happens when changes merged into the main branch are published. It can also be started manually or scheduled for periodic deployments, providing flexibility and ensuring that updates are deployed efficiently and consistently.

Monitoring and Rollback
To monitor pipeline execution, use the Azure DevOps pipeline dashboards. If a rollback is necessary, you can revert to previous versions of the ARM templates or pipelines in Azure DevOps and redeploy the changes.
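For the rollback path, one hedged option (not prescribed by the article itself) is to redeploy a previously known-good ARM template and parameter file directly with Azure PowerShell, using the same incremental deployment the pipeline performs. The resource group, file paths, and subscription name below are hypothetical placeholders, and it assumes you have checked out the earlier commit (or downloaded the old pipeline artifact) that contains those files.

# Hypothetical manual rollback sketch: redeploy a previously published ARM template
# and its parameter file (taken from an earlier commit or pipeline artifact) to the
# production Data Factory's resource group.

Connect-AzAccount                                   # sign in with rights on the target resource group
Set-AzContext -Subscription "<Your subscription>"   # placeholder subscription

$resourceGroup = "<PROD ADF RG>"                            # placeholder resource group
$template      = ".\rollback\ARMTemplateForFactory.json"    # template from the known-good version
$parameters    = ".\rollback\prod-adf-parameters.json"      # matching parameter file

New-AzResourceGroupDeployment `
    -ResourceGroupName $resourceGroup `
    -TemplateFile $template `
    -TemplateParameterFile $parameters `
    -Mode "Incremental"

In practice you would also stop the factory's triggers before the deployment and restart them afterwards, just as the pipeline does.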
Creating a Kubernetes Application for Azure SQL Database
Modern application development has multiple challenges: selecting a "stack" from several competing standards that spans the front end through data storage and processing, ensuring the highest levels of security and performance, and making sure the application scales, performs well, and is supportable on multiple platforms. For that last requirement, bundling the application into container technologies such as Docker and deploying those containers onto the Kubernetes platform is now de rigueur in application development. In this example, we'll explore using Python, Docker containers, and Kubernetes - all running on the Microsoft Azure platform. Using Kubernetes also gives you the flexibility of using local environments or even other clouds for a seamless and consistent deployment of your application, and allows for multi-cloud deployments for even higher resiliency. We'll also use Microsoft Azure SQL Database for a service-based, scalable, highly resilient, and secure environment for data storage and processing. In fact, in many cases other applications are already using Microsoft Azure SQL Database, and this sample application can be used to further leverage and enrich that data. This example is fairly comprehensive in scope but uses the simplest application, database, and deployment to illustrate the process. You can adapt this sample to be far more robust, even leveraging the latest technologies for the returned data. It's a useful learning tool to create a pattern for other applications.
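As a rough outline of the deployment flow the article walks through, the commands below build a container image, push it to a registry, and deploy it to an Azure Kubernetes Service cluster. This is only a sketch under assumed names: the resource group, registry, cluster, image, secret, and manifest file are hypothetical, and the Kubernetes manifest itself (which would reference the image and read the Azure SQL Database connection string from the secret) is not shown.

# Hypothetical end-to-end sketch: build the image in Azure Container Registry,
# connect kubectl to the AKS cluster, and apply the application manifest.

$resourceGroup = "rg-sqlapp-demo"        # hypothetical
$registry      = "sqlappdemoacr"         # hypothetical ACR name
$cluster       = "aks-sqlapp-demo"       # hypothetical AKS cluster

# Build and push the Python app's image from the local Dockerfile
az acr build --registry $registry --image sqlapp:v1 .

# Fetch credentials so kubectl talks to the AKS cluster
az aks get-credentials --resource-group $resourceGroup --name $cluster

# Store the Azure SQL connection string as a Kubernetes secret (value is a placeholder)
kubectl create secret generic sql-conn --from-literal=connectionString="<Azure SQL connection string>"

# Deploy the application (deployment.yaml references the image and the secret)
kubectl apply -f .\deployment.yaml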
The (Amateur) Data Science Body of Knowledge
Whether you're interested in becoming a Data Scientist, a Data Engineer, or just want to work with the techniques they use, in this article I'll help you find resources for whichever path you choose. At the very least, you'll gain valuable insight into the Data Science field and how you can use the technologies and knowledge to create a very compelling solution.
Bring Vision to Life with Three Horizons, Data Mesh, Data Lakehouse, and Azure Cloud Scale Analytics
Bring Vision to Life with Three Horizons, Data Mesh, Data Lakehouse, and Azure Cloud Scale Analytics – plus some bonus concepts! I have not posted in a while, so this post is loaded with ideas and concepts to think about. I hope you enjoy it! The structure of the post is a chronological perspective on four recent events in my life: 1) camping on the Olympic Peninsula in WA state, 2) installation of new windows and external doors in my house, 3) injuring my back (which includes a metaphor for how things change over time), and 4) camping at Kayak Point in Stanwood, WA (where I finished writing this). Along with this series of events bookended by camping trips, I also wanted to mention May 1st, which was International Workers' Day (celebrated as Labor Day in September in the US and Canada). To reach the vision of digital transformation through cloud scale analytics we need many more workers (Architects, Developers, DBAs, Data Engineers, Data Scientists, Data Analysts, Data Consumers) and the support of many managers and leaders. Leadership is required so analytical systems can become more distributed and properly staffed to scale, versus the centralized, small specialist teams that do not scale. Analytics could be a catalyst for employment through the accelerated building and operating of analytical systems. There is evidence that the structure of the teams working on these analytical systems will need to be more distributed to scale to the level of growth required. When focusing on data management, Data Mesh strives to be more distributed, and the Data Lakehouse supports distributed architectures better than the analytical systems of the past. I am optimistic that cloud-based analytical systems supported by these distributed concepts can scale and progress to meet the data management, data engineering, data science, data analysis, and data consumer needs and requirements of many organizations.
Using Microsoft R in your Solutions
R is a powerful data language with thousands of "packages" that allow you to extend its uses for Data Science, Advanced Analytics, Machine Learning, and much more. Microsoft has enhanced this language with a distribution called "Microsoft R Open". Read on to learn more about this powerful tool you can use stand-alone or embedded in several Microsoft products.
DevOps for Data Science – Part 10 - Automated Testing and Scale
The final level of the DevOps Maturity Model is Load Testing and Auto-Scale. Note that you want to follow this progression – there's no way to do proper load testing if you aren't automatically integrating the Infrastructure as Code, CI, CD, RM, and APM phases. The reason is that the automatic balancing you'll do depends on the automation that precedes it – there's no reason to scale something that you're about to change.
DevOps for Data Science – Part 9 - Application Performance Monitoring
In this series on DevOps for Data Science, I've explained the concept of a DevOps "Maturity Model" – a list of things you can do, in order, which will set you on the path to implementing DevOps in Data Science. The first thing you can do in your projects is to implement Infrastructure as Code (IaC), and the second thing to focus on is Continuous Integration (CI). However, to set up CI you need as much automated testing as you can get – and in the case of Data Science programs, that's difficult to do. From there, the next step in the DevOps Maturity Model is Continuous Delivery (CD). Once you have that maturity level down, you can focus on Release Management. And now we're off to the next maturity level: Application Performance Monitoring (APM).