azure databricks
Delivering Information with Azure Synapse and Data Vault 2.0
Data Vault has been designed to integrate data from multiple data sources, creatively destruct the data into its fundamental components, and store and organize it so that any target structure can be derived quickly. This article focuses on generating information models, often dimensional models, using virtual entities. They are used in the data architecture to deliver information. After all, dimensional models are easier for dashboarding solutions to consume, and business users know how to use dimensions and facts to aggregate their measures. However, PIT and bridge tables are usually needed to maintain the desired performance level. They also simplify the implementation of dimension and fact entities and, for those reasons, are frequently found in Data Vault-based data platforms. This article completes the discussion of information delivery. The following articles will focus on the automation aspects of Data Vault modeling and implementation.
Part 1: Power BI Service Connections to Azure Databricks with Private Networking

This blog was written in conjunction with Leo Furlong, Lead Solutions Architect at Databricks.

Enhancing Security and Connectivity: Azure Databricks SQL, Unity Catalog, and Power BI Integration

The combination of Azure Databricks SQL, Unity Catalog, and Power BI offers an unparalleled set of capabilities for modern data analytics. However, as organizations increasingly prioritize security, many Azure Databricks customers are deploying their workspaces with private networking, which requires additional configuration to allow connections from BI tools like Power BI. This blog post explores the options available for secure Azure Databricks deployments and how to maintain Power BI connectivity in these scenarios.

Private Networking Options for Azure Databricks

When deploying Azure Databricks with enhanced security, customers can choose from three main private networking configurations:

1. Public Endpoint with an IP Access List for the Workspace: This option exposes a public endpoint for the Azure Databricks workspace but restricts access to specific IP ranges.
2. Azure Databricks Private Link: Front-end Private Link provides fully private connectivity, routing all traffic through private endpoints.
3. Hybrid Deployment: Combines front-end Private Link with a public endpoint protected by a Workspace IP Access List, which is typically used for SaaS service connections.

Connecting Power BI to a Private Azure Databricks Workspace

While private networking enhances security, it can require additional connection configuration for SaaS services like Power BI. Power BI offers two primary methods for secure connections to data sources with private networking:

- On-premises data gateway: an application installed on a virtual machine that has a direct network connection to the data source. It allows Power BI to connect to data sources that don't allow public connections.
- Virtual Network data gateway: a managed (virtual/serverless) data gateway created and managed by the Power BI service. Connections work by allowing Power BI to delegate into a VNet for secure connectivity to the data source.

While Power BI offers these two options, many customers prefer not to manage the additional infrastructure or configuration required for these gateways. In such cases, Power BI can be allowed to access the private Azure Databricks workspace through the IP Access List.

Implementing Power BI Connectivity via IP Access List

To enable Power BI Service connectivity to a private Azure Databricks workspace using an IP Access List:

1. Obtain the Power BI public IPs: Download the latest Azure IP Ranges and Service Tags file from the Microsoft Download Center. This file is updated weekly and contains IP ranges for various Azure services, including Power BI.
2. Add the Power BI IPs to the Azure Databricks Workspace IP Access List: Extract the Power BI IP ranges from the downloaded file and add them to the Azure Databricks IP Access List using the API or SDK.
3. Regular updates: Since Power BI public IPs can change frequently, it's crucial to update the Workspace IP Access List regularly. This can be automated using a Databricks Job that periodically downloads the latest IP ranges and updates the Workspace IP Access List. The Job will need to be run by a Workspace Admin in order to set the configurations; you can run the Databricks Job as a Service Principal to make the updates. If you use the Databricks SDK from within a notebook in the Databricks Workspace, authentication is handled for you (see the sketch below).
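Purely as a starting point, here is a minimal Python sketch of these steps using the Databricks SDK. The service-tags download URL, the "PowerBI" tag-name filter, and the "power-bi" list label are assumptions for illustration; confirm them against the current file published on the Microsoft Download Center and your own naming conventions.

```python
# Minimal sketch, not production code. Assumptions: the service-tags URL below is a
# placeholder (grab the current week's link from the Microsoft Download Center),
# the "PowerBI" tag name and "power-bi" label are illustrative, and the caller is a
# workspace admin (inside a Databricks notebook the SDK authenticates automatically).
import requests
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import ListType

w = WorkspaceClient()

# One-time operation: turn on IP access list enforcement for the workspace.
w.workspace_conf.set_status({"enableIpAccessLists": "true"})

# Download the Azure IP Ranges and Service Tags file (placeholder URL).
SERVICE_TAGS_URL = "https://download.microsoft.com/.../ServiceTags_Public.json"
tags = requests.get(SERVICE_TAGS_URL, timeout=60).json()

# Collect the Power BI IPv4 CIDR ranges from the service-tag entries.
power_bi_cidrs = [
    prefix
    for entry in tags["values"]
    if entry["name"].startswith("PowerBI")
    for prefix in entry["properties"]["addressPrefixes"]
    if ":" not in prefix  # skip IPv6 ranges in this sketch
]

# Create the ALLOW list on the first run, replace its contents on later runs.
existing = next((l for l in w.ip_access_lists.list() if l.label == "power-bi"), None)
if existing is None:
    w.ip_access_lists.create(
        label="power-bi", list_type=ListType.ALLOW, ip_addresses=power_bi_cidrs
    )
else:
    w.ip_access_lists.replace(
        ip_access_list_id=existing.list_id,
        label="power-bi",
        list_type=ListType.ALLOW,
        enabled=True,
        ip_addresses=power_bi_cidrs,
    )
```

Scheduling this as a Databricks Job run by a Service Principal keeps the allow list in sync as the published ranges change.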
Sample code along the lines of the sketch above can be used to turn on your Workspace IP Access List, which is largely a one-time operation, and to refresh the Power BI IPs on a schedule from a Databricks Workflow.

Conclusion

By leveraging IP Access Lists, organizations can maintain the security benefits of private Azure Databricks deployments while ensuring seamless connections from Power BI. This approach offers a balance between security and functionality with low maintenance overhead.
Workarounds for Maven Json-smart 2.5.2 Release Breaking Azure Databricks Job Dependencies

A recent Maven release corrupted json-smart's Maven Central metadata and effectively removed all versions of the library other than 2.5.2. This causes Ivy resolution failures in Databricks Runtime (DBR) whenever a job has a transitive dependency on an earlier version of json-smart. For Azure Databricks customers, the impact is job failures and, in some cases, job clusters that fail to start. More details and updates on the issue can be found here: https://github.com/netplex/json-smart-v2/issues/240. The issue also affects Google's Maven mirror, which is used by DBR 11+ to resolve Maven libraries, with Maven Central acting as the backup for Google's mirror.

To mitigate the issue, you can work around it by installing your requested library and json-smart separately:

Step 1: Install the requested library (azure-eventhubs-spark is used as an example here) but exclude json-smart.
Step 2: Install net.minidev:json-smart:2.3 (or any other version that is needed).

As a runtime workaround, if you have a preferred Maven mirror that is not affected by this issue, or if you host a private Maven mirror, you can configure your Databricks environment to use it:

Step: Set the Spark configuration parameter spark.databricks.driver.preferredMavenCentralMirrorUrl to your preferred mirror repository URL.
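As a rough sketch of steps 1 and 2, the snippet below uses the Databricks SDK for Python to install a cluster library with json-smart excluded and then installs json-smart separately. The cluster ID, library coordinates, and pinned json-smart version are placeholders; this is one way to apply the workaround, and the same exclusion can also be configured through the cluster Libraries UI.

```python
# Minimal sketch (assumptions: cluster_id and the azure-eventhubs-spark
# coordinates/version below are placeholders - substitute your own).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library, MavenLibrary

w = WorkspaceClient()
cluster_id = "0123-456789-abcdefgh"  # hypothetical cluster ID

w.libraries.install(
    cluster_id=cluster_id,
    libraries=[
        # Step 1: the requested library, with the broken json-smart
        # transitive dependency excluded from resolution.
        Library(
            maven=MavenLibrary(
                coordinates="com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22",
                exclusions=["net.minidev:json-smart"],
            )
        ),
        # Step 2: json-smart pinned explicitly to the version you need.
        Library(maven=MavenLibrary(coordinates="net.minidev:json-smart:2.3")),
    ],
)
```

For the runtime workaround, spark.databricks.driver.preferredMavenCentralMirrorUrl is set on the cluster itself, for example under Advanced options > Spark config.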
Different pools for workers and driver - in ADF triggered ADB jobs

Hello All, Azure Databricks allows the use of separate compute pools for drivers and workers when you create a job via the native Databricks Workflows. For customers using ADF as an orchestrator for ADB jobs, is there a way to achieve the same when invoking notebooks/jobs via ADF? The linked service configuration in ADF seems to allow only one instance pool. Appreciate any pointers. Thanks!
Replicating Azure Cosmos DB into Azure Databricks using CDC

This blog was written in conjunction with David Poulet, Senior Solutions Architect at Databricks.

Microsoft's NoSQL database-as-a-service, Azure Cosmos DB, is a key platform in Azure for storing non-relational, transactional data and vectors for applications with high throughput and availability requirements. This data often holds valuable business insights, and the ability to analyze it at scale with Azure Databricks is a key requirement for many customers. Azure Cosmos DB is optimized for fast reads and writes of individual items. However, in common with other data stores of this type, it is not optimized for analytical workloads, and this can make it challenging to analyze stored data in a performant and cost-effective way.

Microsoft's solution to this problem is the Analytical Store, which stores a copy of the Azure Cosmos DB data in a columnar format and keeps it up to date. Until recently, however, this feature stored the data in a proprietary format and a hidden location that could not be accessed except via Azure Synapse, and it was subject to a number of restrictions around the types of data structures and query types it could handle.

There is now a flexible and open solution to this problem. Microsoft has a feature in Azure Data Factory that enables users to replicate the Azure Cosmos DB Analytical Store into their lakehouse in Delta format, automatically inserting, updating, and deleting records as the source transactional database changes. The incremental nature of this offers significant cost savings versus pulling data directly from the transactional store and dealing with complex incremental ingestion logic in code. In this article, I'll show how we can leverage this feature to create a simple process to continuously ingest operational data in Azure Cosmos DB into Azure Databricks' powerful analytics and AI platform.

Architecture Overview

The architecture we'll discuss in this article uses the CDC capability for Azure Cosmos DB within Azure Data Factory to process changes in an Azure Cosmos DB container and then merge them into a Delta Lake table in the lakehouse. See the diagram below.

Azure Data Factory (ADF) reads a container from Azure Cosmos DB (via the Analytical Store) and periodically replicates any changes from that container into a Delta Lake table in Azure Databricks. This incremental replication process operates on a schedule defined within ADF. There are a couple of possibilities for how we ingest these changes into Azure Databricks: we could move the data to a staging area and ingest into Bronze from there using a workflow or Delta Live Tables, but for simplicity we'll write directly to a table in the Bronze layer of our medallion architecture from ADF. Once the data is in our Bronze layer, standard Azure Databricks patterns can be used to cleanse and transform the data into the Silver/Gold layers.

The bulk of the activities happen in Azure Data Factory, but there are some prerequisites. Before we can create the CDC pipeline, it's assumed the following already exist:

- An Azure Cosmos DB for NoSQL container, with Analytical Store enabled.
- An Azure Data Factory instance in which to create a CDC pipeline.
- An ADLS storage container to act as our staging area.

With these in place, we can create the CDC pipeline from ADF.

Setting Up The Pipeline

The feature in ADF that consumes the Azure Cosmos DB changes is in the Data Flows area, so we start by launching the ADF studio and creating a new data flow. The new data flow needs a Source and a Sink.
The source will be our Azure Cosmos DB container and the Sink will be our Delta table in Bronze. First we'll create and configure the Source to consume from our Azure Cosmos DB container. Click Add Source in the new data flow. In the source settings we have to set the Source Type to Inline and the Inline Dataset Type to Azure Cosmos DB for NoSQL. The Store Type should be set to Analytical. The Linked Service should be set to a linked service for Azure Cosmos DB that has been set up to connect to our source container. For details on how to create an ADF Linked Service, see the getting started documentation for Azure Cosmos DB.

In the Source Options for the Data Flow, there are some settings that are important to control the behavior of the reads from the source feed:

- The Container name field is where we select the Azure Cosmos DB container we are interested in. In this example we have a container with some simple customer-related data in it.
- The Start from field allows us to synchronize ALL the data in the container from the start of its life, or you can select to only sync changes from now on (or from a given timestamp).
- You have the option to capture intermediate updates if you want to maintain a history of all the changes, but we are just going to capture the latest state, so this is unselected.
- Capture Deletes ensures that deleted items from the source are also deleted in our Bronze table.
- Capture Transactional store TTLs means that if items are expired from the Azure Cosmos DB transactional store by the Time-To-Live function, they will also be deleted from our copy of the data. This is enabled by default, but many people may not want this behavior, as TTL is often used to reduce the data size of the transactional store at the cost of losing historical data, and in the analytics world this historical data is often important. We'll leave it at the default for now, though.

Next we'll add a Sink to publish the change data to. Click the + button next to the source icon and search for the Sink option. We then need to configure the Sink to point to our Bronze table in the lakehouse:

- In the Sink settings we select our incoming source stream (there is only one in this case, the one we just created).
- We again select Inline for the Sink type, and the Inline dataset type is Delta.
- Once again the Linked service is an ADF linked service which points to a blob container/folder that will store our Bronze table. You can read the documentation for creating an ADF blob linked service (or ADLS, either will work) on this page.

Next, the Settings page for our Sink has some important options to control the behavior of the table we are creating. First we need to select the correct Folder path for the folder in the blob container that will store our Bronze table data. Here we have a simple folder called customer where ADF will put the Delta Lake files. We also need to think about the Update method field. In this case we will allow Insert (to put new rows in the table as they are added in the source), Delete (to remove rows in the table as they are deleted in the source) and Update (updating existing rows to match changes in the source). To do this, ADF needs a unique field in the source that it can match in the target table, so we select List of columns and put {_rid} in the column field. _rid is a system field in Azure Cosmos DB that uniquely identifies a data item.

At this point we are actually ready to run this Data Flow to start syncing Azure Cosmos DB changes to our Bronze table.
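ADF performs this merge for you, so nothing needs to be written by hand. Purely as a conceptual illustration of what the configured update methods amount to, the behavior is roughly a Delta Lake merge keyed on _rid; the Bronze path, the staged-changes DataFrame, and the _op marker column in the sketch below are hypothetical.

```python
# Conceptual sketch only - ADF performs the equivalent merge internally.
# Names below (the abfss:// path, staged_changes location, _op column) are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze = DeltaTable.forPath(
    spark, "abfss://bronze@mystorageaccount.dfs.core.windows.net/customer"
)
changes_df = spark.read.format("delta").load("/tmp/staged_changes")  # one batch of source changes

(
    bronze.alias("t")
    .merge(changes_df.alias("s"), "t._rid = s._rid")      # _rid uniquely identifies an item
    .whenMatchedDelete(condition="s._op = 'delete'")       # Capture Deletes behavior
    .whenMatchedUpdateAll()                                 # Update: refresh changed rows
    .whenNotMatchedInsertAll()                              # Insert: add new rows
    .execute()
)
```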
To run the Data Flow, we need to create a Pipeline in ADF. In the ADF studio resources section, under Pipelines, create a new pipeline, and in that pipeline drag a single action onto the pipeline edit canvas - a Data Flow action. Once we've created a pipeline with a Data Flow action, we will edit the Data Flow action settings to trigger the CDC Data Flow we created above. Here all we need to do is select our data flow in the Data Flow drop-down. Then, like all ADF pipelines, we need a trigger to start the pipeline and we're ready to start ingesting data. From the pipeline editor menu select Add Trigger and then New/Edit - this will bring up the trigger menu. We'll set our trigger to run on creation and then run every 5 minutes after that. This means that every 5 minutes the pipeline will get the latest changes from Azure Cosmos DB and push them into our Bronze table.

Using The Target Table

With the pipeline running, we should start to see data flowing into our target Delta Lake table. I have created a simple customer data set for this example, with three items in the container. After the pipeline has run, these items are pushed into a Delta Lake table in our target ADLS container. In a notebook in Azure Databricks, we can load that Delta Lake table and see its contents. We can already access the data in the target Delta table from Azure Databricks. Each time the pipeline in ADF runs, it will update this table with whatever inserts/updates/deletes have happened in the source container.

To really make the best use of this as a Bronze table in Azure Databricks, we're going to create an external table in Unity Catalog to integrate this data with the rest of our UC resources and, in this way, make it securely accessible to all our Azure Databricks users. First, in the Catalog view in Azure Databricks, we create a new external location. Then we configure the external location to point to our target ADLS folder. In the new external location dialog, we give the location a name, select the storage credential that we'll use to access the external container (in this case the managed identity that is assigned to my Azure Databricks workspace), and provide the URL of the actual storage container itself. Note that if you have not already done so, you will have to ensure that the managed identity for your Azure Databricks workspace has been assigned the relevant permissions to access the storage container. For more information on configuring external locations in Azure Databricks, see this documentation.

Finally we can create an external table over our target storage container location so that we can access the table in UC. Inside an Azure Databricks notebook we can do this very simply; a sketch of the statement follows at the end of this section. It creates the bronze table in the cdc_demo schema of my cdc_catalog catalog. Once this is done we can query this table like any other table in Unity Catalog, and view the data that's being replicated from Azure Cosmos DB by our ADF pipeline. We can then continue to enrich, clean and merge this data downstream using standard Azure Databricks processes, for example as shown in the documentation here.
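A minimal sketch of that notebook step, assuming the ADLS folder used by the ADF Sink; the storage account, container, and customer table name are placeholders, while the catalog and schema names follow the article.

```python
# Minimal sketch. Assumptions: the abfss:// path is a placeholder for the folder the
# ADF Sink writes to, and it sits under the external location registered above;
# the catalog and schema names follow the article, the table name is hypothetical.
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS cdc_catalog.cdc_demo.customer
    USING DELTA
    LOCATION 'abfss://bronze@mystorageaccount.dfs.core.windows.net/customer'
    """
)

# Once created, it can be queried like any other Unity Catalog table.
display(spark.table("cdc_catalog.cdc_demo.customer"))
```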
So we can see that with a simple pipeline in ADF, we have created a robust way of opening up our Azure Cosmos DB transactional data to whatever complex analytical processes we want to use in Azure Databricks without reading the transactional data store itself, thus reducing cost and "noisy neighbor" risks.
Part 2: Migrate Azure Analysis Services to Power BI Premium using Azure Databricks - How-To

Migrate your Analysis Services models to Power BI Premium and Databricks SQL Serverless using the migration wizard and Tabular Editor to obtain fantastic performance, scalability, and cost savings.
Implementing Data Vault 2.0 on Fabric Data Warehouse

In the previous articles of this series, we discussed how to model Data Vault on Microsoft Fabric. Our initial focus was on the basic entity types, including hubs, links, and satellites, and then on advanced entity types, such as non-historized links and multi-active satellites; the third article modeled a more complete model, including a typical modeling process, for Microsoft Dynamics CRM data.