Analytics on Azure Blog

Implementing Business Logic using Data Vault 2.0 on Azure Fabric

Naveed-Hussain
Nov 06, 2024

Business logic plays an important role in a data-driven data platform. Business users expect the delivered information to meet two characteristics: they expect the information in a certain structure (often a dimensional model), and they expect certain content, for example, all currency amounts in Euros. But there is a gap between these expectations and the actual data from the data sources. In the Data Vault 2.0 architecture, the Business Vault bridges this gap: it implements the business logic that transforms the data to meet the content expectations of the business users.

Introduction

The previous articles focused on modelling and implementing the Raw Data Vault, where the raw data is captured by integrating the data on shared business keys and their relationships and versioning all changes to descriptive data.

 

In the Raw Data Vault layer, only so-called hard rules are applied. Hard rules don’t change the meaning of the content; they only change the structure of the incoming data set. The creative destruction of the data into business keys (stored in hubs), relationships between business keys (stored in links), and descriptive data (stored in satellites) is a prime example. Another good example is data type alignment to match the data types of the (often relational) target model: casting a whole number from a CSV string into an integer does not change the content, only the structure of the data.
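As a minimal sketch of such a hard rule, assuming a hypothetical staging view over a CSV-based source where the columns still arrive as strings (all object names below are illustrative, not part of the example model in this article):

-- Hard rule: only the data type is aligned; the values themselves are not changed.
-- [stg].[sales_csv] and its columns are illustrative names.
CREATE VIEW [stg].[sales_typed] AS
SELECT
      order_id
    , TRY_CAST(quantity   AS int)  AS quantity     -- '42' becomes 42
    , TRY_CAST(order_date AS date) AS order_date   -- '2024-11-06' becomes a date value
FROM [stg].[sales_csv];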

 

Restricting the Raw Data Vault to hard rules is important to ensure auditability when the original data deliveries must be reproduced, and to create multiple business perspectives when information users cannot agree on shared business logic or on the definition of their concepts (“Who is a customer?”). In both cases, the unmodified raw data is required.

How to Model the Business Vault

The business logic to transform the raw data into useful information, for example by cleansing it, recomputing foreign currency amounts, or standardizing addresses, is implemented in the Business Vault. This sparsely modelled layer sits right between the Raw Data Vault with its unmodified raw data and the information mart where the final information in the expected structure and with the expected content is delivered to the presentation layer.

 

“Sparsely modelled” refers to the fact that, believe it or not, some of your data is good enough for reporting. There is no need to cleanse, standardize, or otherwise transform the data because it is exactly what the information user expects. In this case, the dimensional entities in the information mart are directly derived from the Raw Data Vault entities.

 

However, if business logic needs to be applied, it's done in the Business Vault. The entities are typically modelled in the same way as in the Raw Data Vault, so one can expect hubs, links, and many satellites in the Business Vault, including special entity types such as multi-active satellites, non-historized links, etc.

 

For example, the above diagram shows the invoice hub and its two satellites, which originate from the CRM and the ERP system. In addition to these Raw Data Vault entities, a computed (or business) satellite, invoice_lroc_bsat, with one calculated attribute for the invoice amount has been added.

 

In either case, with or without additional Business Vault entities, the final information is not complete yet, as the desired target structure (e.g., a dimensional model) has not been created. This model is derived in the information mart. To do so, the dimensional modeler can combine (and often has to combine) entities from the Raw Data Vault and the Business Vault.

Implementing Business Logic on Fabric

In many cases, the business logic is relatively simple and can be implemented in SQL. In such cases, an SQL view is the preferred choice and is used to implement the Business Vault entity. If the business logic becomes too complex or the performance of the view is not as desired, an external script might be used as an alternative. For example, a Python script could retrieve data from the Raw Data Vault (but also from the Business Vault) and write the results into a table in the Business Vault. This external Python script is considered to be part of the Business Vault as long as the data platform team has it under its own version control. Keep in mind that there are other options, such as PIT tables, to improve the performance of virtualized entities in the Data Vault architecture.
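If the materialized option is chosen, a hedged sketch could look like the following: the Business Vault entity is a physical table (here mirroring the invoice_lroc_bsat satellite mentioned above) that an external script or stored procedure loads from the Raw Data Vault. The table structure, data types, record source naming, and the currency conversion rule are illustrative assumptions:

-- Materialized Business Vault satellite (illustrative structure and data types).
CREATE TABLE [dv_core].[invoice_lroc_bsat]
(
      hk_invoice_hub       char(32)        -- hash key; actual type depends on your hashing conventions
    , load_datetime        datetime2(6)
    , record_source        varchar(200)
    , invoice_amount_eur   decimal(18, 2)
);

-- Load step, executed for example by an external Python script or a stored procedure.
INSERT INTO [dv_core].[invoice_lroc_bsat]
SELECT
      hk_invoice_hub
    , load_datetime
    , 'business_rule:invoice_amount_eur' AS record_source    -- record source naming is illustrative
    , invoice_amount * eur_exchange_rate AS invoice_amount_eur   -- illustrative business rule
FROM [dv_core].[invoice_erp_lroc_sat];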

 

Note that there are actually two options to implement business logic in the Data Vault architecture: besides the discussed option in the Business Vault, it is also possible to implement business rules directly in the information marts, for example in dimension or fact entities. However, when doing so, the business logic is not re-usable. If the same business logic should be used for dimensions in multiple information marts, the business logic must be replicated. If the business logic is implemented in the Business Vault instead, the entities can be reused by multiple information marts.

 

The Business Vault is often implemented in a cascading fashion: a Business Vault entity is not limited to a Raw Data Vault entity as its only data source. Instead, a Business Vault entity can source from multiple entities, both from the Raw Data Vault and from other Business Vault entities. By doing so, the overall implementation logic cascades across multiple Business Vault entities, which is a typical scenario. In some cases, developers try to avoid this, but end up with a single Business Vault entity containing complex implementation logic.
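As an illustrative sketch of such a cascade (the tax attribute and the simplified join condition are assumptions), a second business satellite can build on the computed invoice_lroc_bsat and on a Raw Data Vault satellite at the same time:

-- Cascading Business Vault entity: sources a Raw Data Vault satellite
-- and another Business Vault satellite (all names are illustrative).
CREATE VIEW [dv_core].[invoice_tax_lroc_bsat] AS
SELECT
      bsat.hk_invoice_hub
    , bsat.load_datetime
    , bsat.record_source
    , bsat.invoice_amount_eur * raw.tax_rate AS tax_amount_eur   -- builds on the previous rule's result
FROM [dv_core].[invoice_lroc_bsat] bsat
JOIN [dv_core].[invoice_erp_lroc_sat] raw
    ON  bsat.hk_invoice_hub = raw.hk_invoice_hub
    AND bsat.load_datetime  = raw.load_datetime;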


From an auditing perspective, there is one more requirement: it should be possible to truncate a materialized Business Vault entity and rebuild it by applying the same, unmodified business logic to the same, unmodified source data. The results in the Business Vault entity must be the same. If this is not the case, either the source data or the business logic has been modified.
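A minimal sketch of such a check, assuming the materialized satellite from the sketch above, a view that encapsulates the unchanged business logic, and a snapshot table taken before the rebuild (all three names are illustrative):

-- Rebuild the materialized Business Vault entity from the unchanged source data ...
TRUNCATE TABLE [dv_core].[invoice_lroc_bsat];

INSERT INTO [dv_core].[invoice_lroc_bsat]
SELECT * FROM [dv_core].[invoice_lroc_bsat_logic_v];   -- view containing the unchanged business logic

-- ... and compare with the snapshot taken before the rebuild; both queries must return zero rows.
SELECT * FROM [dv_core].[invoice_lroc_bsat]
EXCEPT
SELECT * FROM [audit].[invoice_lroc_bsat_snapshot];

SELECT * FROM [audit].[invoice_lroc_bsat_snapshot]
EXCEPT
SELECT * FROM [dv_core].[invoice_lroc_bsat];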

Cleansing Dirty Data using Computed Satellites

A typical entity type in the Business Vault is the computed satellite, also known as the business satellite. In short, it is simply a satellite in the Business Vault. The only difference from its counterpart in the Raw Data Vault is that it contains computed results, not just raw data.

 

This makes sense, as descriptive data is stored in satellites in the Raw Data Vault and is the typical target of business logic, for example to cleanse the data, standardize addresses and phone numbers, and otherwise increase the value of the data.

 

For example, if the Raw Data Vault satellite captures raw data from the data source, it might be erroneous (e.g. on the city name):

 

 

This data is captured as it is in the Raw Data Vault of the Data Vault architecture, completely unmodified. The goal of the Raw Data Vault is to capture the good, the bad, and the ugly data, and no judgment is made about these categories. As discussed at the beginning of this article, data cleansing is part of the Business Vault, not of the Raw Data Vault.

 

The next diagram shows the relationship between the computed satellite with the computed city attribute, the Raw Data Vault satellite, and their shared hub:

The computed satellite is attached to the same hub, as it still describes the same store, just with cleansed data. In this case, the business logic is simple: the raw data is cleansed by joining it to a mapping table for the city name. For each city in the Raw Data Vault satellite, the reference data maps the original city name to the cleansed city name:

 

CREATE VIEW [dv_core].[store_address_crm_lroc_bsat] AS (
SELECT   hk_store_hub
        ,load_datetime
        ,record_source
        ,hd_store_address_crm_lroc_sat
        ,address_street
        ,postal_code
        -- business rule: replace the raw city name with the cleansed name from the
        -- reference data, but only when a mapping exists for the postal code
        ,CASE
            WHEN store.city != cities.CityName
             AND cities.ZipCode IS NOT NULL THEN cities.CityName
            ELSE store.city
         END AS city
        ,country
FROM [dv_core].[store_address_crm_lroc0_sat] store
LEFT JOIN [MS_BLOG_DWH].[dv_core].[city_ref_sat] cities
    ON store.postal_code = cities.ZipCode
);

 

A lot of business logic can be implemented as simply as the code above: a materialized mapping table between dirty data and cleansed data is provided as a reference hub and satellite.
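The structure of that reference data is not shown here; as an assumption derived from the join in the view above, it could be as simple as the following sketch:

-- Illustrative structure and content of the reference satellite used for the mapping;
-- the actual reference hub/satellite design may contain more columns.
CREATE TABLE [dv_core].[city_ref_sat]
(
      ZipCode   varchar(10)
    , CityName  varchar(100)    -- cleansed city name for this postal code
);

INSERT INTO [dv_core].[city_ref_sat] (ZipCode, CityName)
VALUES ('20095', 'Hamburg');    -- every record with this postal code maps to the cleansed name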

 

Once the computed satellite is deployed, the downstream developer of the information mart can choose between the cleansed address data and the original address data for the dimension entity by joining the appropriate satellite.
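For example, a dimension view in an information mart that sources the cleansed data might look like the following sketch (the mart schema, the store hub columns, and the dimension name are assumptions):

-- Store dimension built on the computed satellite; names outside the
-- Business Vault entities shown above are illustrative.
CREATE VIEW [mart_sales].[dim_store] AS
SELECT
      hub.hk_store_hub   AS store_key
    , hub.store_id
    , bsat.address_street
    , bsat.postal_code
    , bsat.city            -- cleansed city from the computed satellite
    , bsat.country
FROM [dv_core].[store_hub] hub
JOIN [dv_core].[store_address_crm_lroc_bsat] bsat
    ON hub.hk_store_hub = bsat.hk_store_hub;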

Dealing with Duplicate Records

Another often-used entity in the Business Vault is the Same-As-Link (SAL). The name of the SAL stems from the sentence, “This business key identifies the same business object as the other business key.” So, the link relates two business keys in a duplicate-to-master relationship. One business key identifies the master key to be used for reporting, and the other identifies the duplicate key. If the data contains multiple duplicates, multiple duplicate business keys might be mapped to the same master key.

 

For example, the following table shows a source data set with duplicate records:

 

There are different variations of Michael Olschimke, and because the operational system did not recognize that all of them refer to the same actual customer, it assigned a separate business key to each record.

 

The business key customer_id is captured by the hub customer_hub, while the name is captured by a satellite, not shown in the following diagram:

 
 

 

Based on the descriptive data in the satellite, SQL Server’s SOUNDEX function (also supported in Fabric Warehouse) can be used to derive a phonetic code for each name, and the related DIFFERENCE function compares two such codes to calculate the similarity of two strings based on the pronunciation of the text. Matches where the similarity reaches a certain threshold are considered duplicates and added to the same-as-link (SAL). That way, duplicates are marked, and the mapping can be used later to retrieve a deduplicated dimension.
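To illustrate the two functions on the example names (the column aliases are illustrative):

-- SOUNDEX returns a four-character phonetic code for a string; DIFFERENCE compares
-- the codes of two strings and returns a score from 0 (no match) to 4 (strong match).
SELECT
      SOUNDEX('Michael Olschimke')                         AS code_1            -- phonetic code of the first spelling
    , SOUNDEX('Micheal Olschimke')                         AS code_2            -- same code despite the transposed letters
    , DIFFERENCE('Michael Olschimke', 'Micheal Olschimke') AS similarity_score; -- 4, i.e. treated as a duplicate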

 

The following code shows the virtual implementation of the same-as-link:

 

CREATE VIEW [dv_core].[customer_sal] AS
-- customer business keys and names from the CRM source, with their SOUNDEX phonetic code
WITH crm_data AS (
SELECT
      crm.hk_customer_hub
    , crm.load_datetime
    , hub.record_source
    , hub.customer_id
    , crm.name
    , SOUNDEX(crm.name) AS soundex_name
    , crm.email
FROM dv_core.customer_hub hub
LEFT JOIN dv_core.customer_crm_lroc_sat crm
    ON hub.hk_customer_hub = crm.hk_customer_hub AND crm.is_current = 0
)
-- customer business keys and names from the shop source, with their SOUNDEX phonetic code
, shop_data AS (
SELECT
      shop.hk_customer_hub
    , shop.load_datetime
    , hub.record_source
    , hub.customer_id
    , shop.name
    , SOUNDEX(shop.name) AS soundex_name
    , shop.email
FROM dv_core.customer_hub hub
LEFT JOIN dv_core.customer_shop_lroc_sat shop
    ON hub.hk_customer_hub = shop.hk_customer_hub AND shop.is_current = 0
)
SELECT
      hk_customer_hub
    , load_datetime
    , record_source
    , hk_master
    , hk_duplicate
FROM (
    SELECT
          crm.hk_customer_hub AS hk_customer_hub
        , LEAST(crm.load_datetime, shop.load_datetime) AS load_datetime
        , 'https://wiki.scalefree.com/business_rules/unique_customers' AS record_source
        , crm.hk_customer_hub AS hk_master
        , shop.hk_customer_hub AS hk_duplicate
        -- DIFFERENCE compares the SOUNDEX codes of both names and returns 0 (no match) to 4 (strong match)
        , DIFFERENCE(crm.name, shop.name) AS similarity_score
    FROM crm_data crm
    LEFT JOIN shop_data shop
        ON crm.soundex_name = shop.soundex_name
) level1
-- only the strongest matches are considered duplicates
WHERE level1.similarity_score = 4;

 

Once the same-as-link is created, it can be joined with the hub to reduce the duplicates to the master record. The actual business logic is implemented in the view - the users who query this model don’t need to know how to apply SOUNDEX; they just use the results by joining the link to the customer hub.
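A hedged sketch of such a downstream query: every customer key is resolved to its master key via the same-as-link before the dimension is built (the customer_hub columns are assumed from the view above):

-- Resolve each customer business key to its master key using the same-as-link.
-- COALESCE keeps the original key for customers without a duplicate mapping.
SELECT
      COALESCE(sal.hk_master, hub.hk_customer_hub) AS hk_customer_master
    , hub.customer_id
FROM [dv_core].[customer_hub] hub
LEFT JOIN [dv_core].[customer_sal] sal
    ON hub.hk_customer_hub = sal.hk_duplicate;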

Concluding the Value of the Business Vault

By implementing the business logic in Business Vault entities, it can be reused not only by multiple information marts but also by data scientists and other power users. The Business Vault model presents the result of the business logic, while in many cases the source code of a view implements it. In other cases, the business logic can be implemented in external scripts, for example in Python; the Business Vault entity is then a physical table that is loaded by the external script.

 

There are some cases where it makes more sense to implement the business logic in the dimensional entity, for example in the dimension view. However, in such cases, the code will not be reused by multiple information marts: to use the same logic in multiple marts, it must be replicated.

 

We will discuss the information marts and how we derive dimensional models from a Data Vault model in the next article of this series.

 

