Data Engineering
How to Query Spark Tables from Serverless SQL Pools in Azure Synapse
Introduction
Say goodbye to constantly running Spark clusters! With the shared metadata functionality, you can shut down your Spark pools and still query your Spark tables from a serverless SQL pool. In this blog we look at how the serverless SQL pool streamlines your data workflow by automatically synchronizing metadata from your Spark pools.

Shared metadata functionality
Azure Synapse Analytics allows the workspace computational engines to share databases and tables between its Apache Spark pools and the serverless SQL pool. When we create a table in an Apache Spark pool, whether managed or external, the serverless SQL pool automatically synchronizes its metadata. This synchronization creates a corresponding external table in a serverless SQL pool database, and after a short delay the table becomes visible in the serverless SQL pool.

Creating a managed table in Spark and querying it from the serverless SQL pool
Once the metadata is synchronized, we can shut down our Spark pools and still query the Spark tables from the serverless SQL pool.

NOTE: Azure Synapse currently shares only managed and external Spark tables that store their data in Parquet, DELTA, or CSV format. Tables backed by other formats are not automatically synced. You may be able to sync such tables explicitly yourself as an external table in your own SQL database if the SQL engine supports the table's underlying format. Also, external tables created in Spark are not available in dedicated SQL pool databases.

Why do we get an error if we use the dbo schema in a Spark pool, or if we don't use the dbo schema in the serverless SQL pool?
The dbo schema (short for "database owner") is the default schema in SQL Server and Azure Synapse SQL pools. A Spark pool only supports user-defined schemas, meaning it does not recognize dbo as a valid schema name. In the serverless SQL pool, on the other hand, all synchronized tables belong to the dbo schema, regardless of their original schema in the Spark pool or other sources.
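As a minimal end-to-end sketch of this flow (the database, table, and column names are illustrative, not from the original post): create a Parquet-backed table from a Spark pool notebook, wait briefly for synchronization, then query it from the serverless SQL pool under the dbo schema.

-- Run in a Spark pool notebook cell (%%sql) while the Spark pool is running
CREATE DATABASE IF NOT EXISTS salesdb;

CREATE TABLE IF NOT EXISTS salesdb.orders
USING PARQUET
AS SELECT 1 AS order_id, 'Contoso' AS customer;

-- Run from the serverless SQL pool after the metadata sync completes
-- (the Spark pool can be shut down at this point); the table surfaces under dbo
USE salesdb;
SELECT TOP 10 * FROM dbo.orders;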
Robust data protection features of Azure Synapse

Introduction
Data is the lifeblood of any organization. Whether you're dealing with sensitive customer information or financial records, safeguarding your data is non-negotiable. Many organizations face challenges such as: How do you protect data if you don't know where it is? What level of protection is needed? (Some datasets require more protection than others.) Azure Synapse Analytics offers powerful features to help you achieve this, ensuring confidentiality, integrity, and availability. In this blog, we'll explore the data encryption capabilities integrated into Azure Synapse Analytics, covering encryption for data at rest and in transit, as well as approaches for detecting and categorizing sensitive data in your Synapse workspace.

What is Data Discovery and Classification?
Imagine a company with massive amounts of information stored in its databases, where some columns need extra protection, such as Social Security numbers or financial records. Manually finding this sensitive data is a time-consuming nightmare. The good news: there's a better way. Azure Synapse offers a feature called Data Discovery that automates this process.

How does Data Discovery work?
Think of Data Discovery as a powerful scanner. It automatically goes through the rows and columns of your data lake or databases, looking for patterns that might indicate sensitive information. Like a smart assistant, it identifies potentially sensitive data and flags those columns for you. Once the discovery process is complete, it provides classification recommendations based on a predefined set of patterns, keywords, and rules. These recommendations can then be reviewed, and sensitivity-classification labels can be applied to the appropriate columns. This process is known as Classification.

What happens after applying sensitivity labels to columns?
Sensitivity-classification labels are metadata attributes added to the SQL Server database engine. After classifying columns, an organization can leverage these labels to:
- Implement fine-grained access controls, so that only authorized people with the necessary clearance can access sensitive data.
- Mask sensitive data when it is accessed by users who do not have the necessary permissions, so they see only anonymized versions of the data.
- Monitor access and modification activities on sensitive data (auditing access to sensitive data); any unusual or unauthorized activity can be flagged for investigation.

Steps for discovering, classifying, and labelling columns that contain sensitive data in your database
The classification includes two metadata attributes:
- Labels: the main classification attributes, used to define the sensitivity level of the data stored in the column.
- Information types: attributes that provide more granular information about the type of data stored in the column.

Step 1 -> Choose an Information Protection policy based on your requirements
The SQL Information Protection policy is a built-in set of sensitivity labels and information types with discovery logic, native to the SQL logical server. You can also customize the policy according to your organization's needs; for more information, see Customize the SQL information protection policy in Microsoft Defender for Cloud (Preview).
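If you prefer to label columns directly with T-SQL rather than (or in addition to) accepting the portal recommendations described in Step 2 below, a minimal sketch looks like the following. It assumes your dedicated SQL pool supports the ADD SENSITIVITY CLASSIFICATION statement (available in recent SQL engine versions), and the table, column, label, and information-type names are illustrative.

-- Label a column; label and information-type names should come from the active
-- Information Protection policy
ADD SENSITIVITY CLASSIFICATION TO dbo.Customers.SSN
WITH (LABEL = 'Highly Confidential', INFORMATION_TYPE = 'National ID', RANK = CRITICAL);

-- Review the labels that have been applied in this database
SELECT s.name AS schema_name, t.name AS table_name, c.name AS column_name,
       sc.label, sc.information_type, sc.rank_desc
FROM sys.sensitivity_classifications sc
JOIN sys.columns c ON sc.major_id = c.object_id AND sc.minor_id = c.column_id
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.schemas s ON t.schema_id = s.schema_id;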
Step 2 -> View and apply classification recommendations
The classification engine automatically scans your database for columns containing potentially sensitive data and provides a list of recommended column classifications. Accept recommendations by selecting the check box in the left column and then selecting Accept selected recommendations. You can also classify columns manually, as an alternative or in addition to the recommendation-based classification. To complete your classification, select Save on the Classification page.

Note: Another option for data discovery and classification is Microsoft Purview, a unified data governance solution that helps manage and govern on-premises, multi-cloud, and software-as-a-service (SaaS) data. It can automate data discovery, lineage identification, and data classification. By producing a unified map of data assets and their relationships, it makes data easily discoverable.

Data Encryption
Data encryption is a fundamental component of data security, ensuring that information is safeguarded both at rest and in transit. Azure Synapse takes care of this responsibility for us, leveraging robust encryption technologies to protect data.

Data at rest
Azure offers various methods of encryption across its different services.

Azure Storage Encryption
By default, Azure Storage encrypts all data at rest using server-side encryption (SSE). It is enabled for all storage types (including ADLS Gen2) and cannot be disabled. SSE uses AES-256, the 256-bit Advanced Encryption Standard, to encrypt and decrypt data transparently. AES-256 is one of the strongest block ciphers available and is FIPS 140-2 compliant. These may sound like intimidating cryptography terms, but the platform manages the encryption key itself, so we don't have to work with them directly. This forms the first layer of data encryption, and it applies to both user and system databases, including the master database.

Note: For additional security, Azure offers the option of double encryption. Infrastructure encryption uses a platform-managed key in conjunction with the SSE key, encrypting data twice with two different encryption algorithms and keys. This provides an extra layer of protection, ensuring that data at rest is highly secure.

Double the protection with Transparent Data Encryption (TDE)
TDE is an industry-standard approach that encrypts the underlying files of the database rather than the data itself, adding a second layer of data encryption. TDE performs real-time I/O encryption and decryption of the data at the page level: each page is decrypted when it's read into memory and encrypted before being written to disk. TDE encrypts the storage of an entire database by using a symmetric key called the Database Encryption Key (DEK). When data is written to the database, it is organized into pages, and TDE encrypts each page with the DEK before it is written to disk, making it unreadable without the key. When a page is read from disk into memory, TDE decrypts it using the DEK, making the data readable for normal database operations.

Why do we call it transparent? Because the encryption and decryption processes are transparent to applications and users; they have no way of telling whether the data is encrypted, unless they simply don't have access to it.
This is because encryption and decryption happen at the database engine level, without requiring application awareness or involvement. By default, TDE protects the database encryption key (DEK) with a built-in server certificate managed by Azure. However, organizations can opt for Bring Your Own Key (BYOK), where the key is securely stored in Azure Key Vault, offering enhanced control over encryption keys.

Data in transit
Data encryption in transit is equally crucial to protect sensitive information as it moves between clients and servers. Azure Synapse uses Transport Layer Security (TLS) to secure data in motion. Azure Synapse, dedicated SQL pools, and serverless SQL pools use the Tabular Data Stream (TDS) protocol to communicate between the SQL pool endpoint and a client machine. TDS depends on TLS for channel encryption, ensuring all data packets are secured and encrypted between the endpoint and the client machine. It uses a signed server certificate from the Certificate Authority (CA) used for TLS encryption, managed by Microsoft. Azure Synapse supports data encryption in transit with TLS v1.2, using AES-256 encryption.
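As a quick check of the at-rest settings described above, a minimal sketch follows; it assumes you run it against the logical server's master database with sufficient permissions, and the database name is illustrative. For the transit side, also make sure client connection strings request an encrypted connection (for example, Encrypt=True) so TLS 1.2 is negotiated.

-- Check whether TDE is enabled for each database on the logical server
SELECT [name], is_encrypted
FROM sys.databases;

-- Enable TDE for a dedicated SQL pool database (run from master);
-- this is the T-SQL equivalent of the portal toggle
ALTER DATABASE [prac_dedicated] SET ENCRYPTION ON;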
Remove unwanted partition data in Azure Synapse (SQL DW)

Introduction to partition switching
Azure Synapse dedicated SQL pool, SQL Server, and Azure SQL Database allow you to create partitions on a target table. Table partitions divide your data into multiple chunks, which improves query performance by eliminating partitions that are not needed for a query. In most cases partitions are built on a date column.

Why don't we simply drop the unwanted partition? For several reasons:
- Clustered columnstore index (CCI): Dropping a partition directly can lead to performance degradation, because CCIs are optimised for data locality and dropping a partition disrupts that organisation. The CCI would need to be rebuilt after the drop, which can be time-consuming for a large table.
- Transaction safety: Directly dropping a partition might not be a transactional operation. If the drop operation fails midway, the partition might be left in an inconsistent state, potentially causing data integrity issues.

Requirement for partition switching
The definitions of the source and target tables must be the same.

Steps for partition switching in a Synapse SQL pool:

Step 1 (Optional) -> Create a credential
Skip this step if you're loading the Contoso public data; don't skip it if you're loading your own data. To access data through a credential, use the following script to create a database-scoped credential, then use it when defining the location of the data source.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'Managed Identity',
    SECRET = 'https://rnd-learning.vault.azure.net/secrets/synapselearningadls-accesskey/d94c967cb0c5452eafaf5d207afcb86a';

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

MASTER KEY is required to encrypt the credential secret in the next step. IDENTITY refers to the type of authentication you're using; here I am using Managed Identity so that the Azure Synapse workspace can securely connect to and authenticate with Azure Key Vault without embedding any credentials directly in the code. TYPE is HADOOP because PolyBase uses Hadoop APIs to access data in Azure Blob storage.

Step 2 -> Create the external data source
Use this command to store the location of the data and the data source type.

CREATE EXTERNAL DATA SOURCE AzureStorage_west_public
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://contosoretaildw-tables@contosoretaildw.blob.core.windows.net/'
);

Step 3 -> Configure the data format
The data is stored in text files in Azure Blob storage, and each field is separated with a delimiter. Use this command to specify the format of the data in the text files. The Contoso data is uncompressed and pipe delimited.

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);

Step 4 -> Create the schema for the external tables
To create a place to store the Contoso data in your database, create a schema.

CREATE SCHEMA [asb]
GO

Step 5 -> Create the external tables
Run the following script to create the FactOnlineSales external table. All you're doing here is defining column names and data types, and binding them to the location and format of the Azure Blob storage files.
The definition is stored in the data warehouse, and the data remains in Azure Blob storage.

CREATE EXTERNAL TABLE [asb].FactOnlineSales (
    [OnlineSalesKey] [int] NOT NULL,
    [DateKey] [datetime] NOT NULL,
    [StoreKey] [int] NOT NULL,
    [ProductKey] [int] NOT NULL,
    [PromotionKey] [int] NOT NULL,
    [CurrencyKey] [int] NOT NULL,
    [CustomerKey] [int] NOT NULL,
    [SalesOrderNumber] [nvarchar](20) NOT NULL,
    [SalesOrderLineNumber] [int] NULL,
    [SalesQuantity] [int] NOT NULL,
    [SalesAmount] [money] NOT NULL,
    [ReturnQuantity] [int] NOT NULL,
    [ReturnAmount] [money] NULL,
    [DiscountQuantity] [int] NULL,
    [DiscountAmount] [money] NULL,
    [TotalCost] [money] NOT NULL,
    [UnitCost] [money] NULL,
    [UnitPrice] [money] NULL,
    [ETLLoadID] [int] NULL,
    [LoadDate] [datetime] NULL,
    [UpdateDate] [datetime] NULL
)
WITH (
    LOCATION = '/FactOnlineSales/',
    DATA_SOURCE = AzureStorage_west_public,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

Step 6 -> Load the data
There are different ways to access external data. You can query data directly from the external tables, load the data into new tables in the data warehouse, or add external data to existing data warehouse tables.

Step 6.1 -> Create a new schema

CREATE SCHEMA [cso]
GO

Step 6.2 -> Load the data into new tables
To load data from Azure Blob storage into the data warehouse table, use the CREATE TABLE AS SELECT (CTAS) statement. CTAS creates a new table and populates it with the results of a SELECT statement; the new table has the same columns and data types as the results of that statement. If you select all the columns from an external table, the new table is a replica of the columns and data types of the external table.

CREATE TABLE [cso].[FactOnlineSales]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION (
        [DateKey] RANGE RIGHT FOR VALUES (
            '2007-01-01 00:00:00.000', '2008-01-01 00:00:00.000',
            '2009-01-01 00:00:00.000', '2010-01-01 00:00:00.000'
        )
    )
)
AS SELECT * FROM [asb].FactOnlineSales;

With this statement I have created 5 partitions in the [cso].[FactOnlineSales] table, each covering one year, except the first, which contains all rows with DateKey < 2007-01-01, and the last, which contains all rows with DateKey ≥ 2010-01-01.

Step 7 -> Create an empty partitioned table
Now do the same for the target table. Here I deliberately create an empty table to switch the partition into.

CREATE TABLE [cso].[FactOnlineSales_out]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION (
        [DateKey] RANGE RIGHT FOR VALUES (
            '2007-01-01 00:00:00.000'
        )
    )
)
AS SELECT * FROM [cso].[FactOnlineSales] WHERE 1 = 2;

NOTE: If you are switching out a partition (that is, removing its data from the source table), you can switch the data to another table regardless of how that table is partitioned, as long as the target partition (or the whole table, if it is unpartitioned) is empty and the table definitions match. If you are switching in a partition (replacing a partition with new data), there is a stricter criterion: the same partitioning boundary must be defined on both tables.

Step 8 -> Switch the partition
Here I switch out the partition. After the switch, [cso].[FactOnlineSales_out] holds the data from January 1st, 2007 through December 31st, 2007, while [cso].[FactOnlineSales] has no data in partition 2.
ALTER TABLE [cso].[FactOnlineSales] SWITCH PARTITION 2 TO [cso].[FactOnlineSales_out] PARTITION 2;

NOTE: The command is very simple, but there is one catch: it requires the partition numbers of the source and target tables to perform the switch.

Validate the partition switch on both the source and target tables.

Step 9 -> Delete the staging table
Based on your requirements, delete this table or archive its data as cold data.

DROP TABLE [cso].[FactOnlineSales_out];

What happens during the partition switch?

Before the switch: Imagine the data for FactOnlineSales is physically stored on disk, potentially spread across multiple files. Each partition in FactOnlineSales has its own metadata entry that keeps track of the specific locations of its data on disk.

During the switch (using partition X as the example): You identify partition X (containing old data) in FactOnlineSales. The ALTER TABLE SWITCH statement updates the metadata entries for both tables:
- In FactOnlineSales, the metadata entry for partition X is modified to point to an empty location on disk. This signifies that partition X is now "empty" within FactOnlineSales.
- In FactOnlineSales_out, a new metadata entry is created for partition X. This new entry points to the same physical location on disk where the data for partition X already resides (remember, the data itself doesn't move).

After the switch: Both FactOnlineSales and FactOnlineSales_out have metadata entries for partition X, but these entries point to different things:
- The FactOnlineSales entry points to an empty location, indicating the partition is no longer actively used within that table.
- The FactOnlineSales_out entry points to the actual data location, so FactOnlineSales_out now "owns" that partition.

How to check or verify the number of partitions
The SQL pool provides system views for querying the metadata of all the objects in the pool. One of the views that exposes partition information, including the number of rows in each partition, is sys.dm_pdw_nodes_db_partition_stats. Use this script to check the partitions and their row counts.

SELECT pnp.partition_number, SUM(nps.[row_count]) AS Row_Count
FROM sys.tables t
INNER JOIN sys.indexes i
    ON t.[object_id] = i.[object_id]
    AND i.[index_id] <= 1 /* HEAP = 0, CLUSTERED or CLUSTERED_COLUMNSTORE = 1 */
INNER JOIN sys.pdw_table_mappings tm
    ON t.[object_id] = tm.[object_id]
INNER JOIN sys.pdw_nodes_tables nt
    ON tm.[physical_name] = nt.[name]
INNER JOIN sys.pdw_nodes_partitions pnp
    ON nt.[object_id] = pnp.[object_id]
    AND nt.[pdw_node_id] = pnp.[pdw_node_id]
    AND nt.[distribution_id] = pnp.[distribution_id]
INNER JOIN sys.dm_pdw_nodes_db_partition_stats nps
    ON nt.[object_id] = nps.[object_id]
    AND nt.[pdw_node_id] = nps.[pdw_node_id]
    AND nt.[distribution_id] = nps.[distribution_id]
    AND pnp.[partition_id] = nps.[partition_id]
WHERE t.name = 'FactOnlineSales'
GROUP BY pnp.partition_number;
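Because ALTER TABLE ... SWITCH takes partition numbers rather than boundary values, it can help to look the number up from the boundary you care about. A minimal sketch follows, assuming the standard partition catalog views (sys.partition_schemes, sys.partition_functions, sys.partition_range_values) are available in your dedicated SQL pool; the table name matches the example above.

-- Map each boundary value of [cso].[FactOnlineSales] to a partition number.
-- With RANGE RIGHT, rows with DateKey >= boundary_value land in partition
-- boundary_id + 1, so the 2007-01-01 boundary corresponds to partition 2.
SELECT
    t.name              AS table_name,
    prv.boundary_id,
    prv.value           AS boundary_value,
    prv.boundary_id + 1 AS partition_number_for_range_right
FROM sys.tables t
JOIN sys.indexes i
    ON t.object_id = i.object_id AND i.index_id <= 1
JOIN sys.partition_schemes ps
    ON i.data_space_id = ps.data_space_id
JOIN sys.partition_functions pf
    ON ps.function_id = pf.function_id
JOIN sys.partition_range_values prv
    ON pf.function_id = prv.function_id
WHERE t.name = 'FactOnlineSales'
ORDER BY prv.boundary_id;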
Ways to control who can access data in Azure Synapse DW

Introduction
The assets of a bank are accessible only to a few high-ranking officials, and even they don't have access to individual customers' lockers. These privacy measures help build trust among customers. The same goes for our IT world: every user wants their sensitive data to be accessible only to themselves, and not even available to those with higher privileges within the company. So, as you move data to the cloud, securing your data assets is critical to building trust with your customers and partners. To enable this kind of protection, Azure Synapse supports a wide range of advanced access control features to control who can access what data. These features are:
- Object-level security
- Row-level security
- Column-level security
- Dynamic data masking
- Synapse role-based access control
In this blog we will explore these features.

Object-level security
In Azure Synapse, tables, views, stored procedures, and functions are created as objects. In a dedicated SQL pool these objects can be secured by granting specific permissions to database-level users or groups. For example, you can give SELECT permissions to user accounts or database roles to grant access to specific objects.

To assign a permission:

GRANT SELECT ON [schema_name].[table_name] TO [user_or_group];

To revoke a permission:

REVOKE SELECT ON [schema_name].[table_name] FROM [user_or_group];

Additionally, when you assign a user the Synapse Administrator RBAC role, they automatically gain full access to all dedicated SQL pools within that workspace, which allows them to perform any action (including managing permissions) across all databases. Similarly, when a user is assigned the Storage Blob Data Contributor role (READ, WRITE, and EXECUTE permissions) on a data lake that is connected to a workspace such as Synapse or Databricks, those permissions are automatically applied to Spark-created tables. This is known as Microsoft Entra pass-through. For example, while the Storage Blob Data Contributor role was assigned to me, I was able to query my Spark-created table; after I removed that role from myself, the same query returned an error.

Row-level security
RLS is a mechanism to restrict row-level access (read, write, and so on) based on the user's execution context. A typical use case is a common set of database tables used by multiple tenants to store their data; in such cases, we want each tenant to have access only to their own rows. RLS enables this fine-grained access control without having to redesign your data warehouse, and it eliminates the need to use views to filter out rows for access control.

NOTE: The access restriction logic lives in the database tier, and the database system applies the restrictions every time the data is accessed from any tier. This makes the security system more reliable and robust by reducing its surface area.

How to implement RLS
RLS is implemented using a SECURITY POLICY. RLS is a form of predicate-based access control that works by automatically applying a security predicate to all queries on a table. The security predicate binds a predicate function to the table, and the predicate function is a user-defined function that determines whether the user executing the query has access to a given row. There are two types of security predicates:
- Filter predicates: silently filter out rows that users shouldn't see during SELECT, UPDATE, and DELETE operations.
This is used when you want to hide data without disrupting the user experience. For example, in an employee database, a filter predicate can ensure that salespeople see only their own customer records; they wouldn't even know that records belonging to other salespeople exist.
- Block predicates: explicitly block write operations (INSERT, UPDATE, DELETE) that violate pre-defined rules. If a user tries to perform an action that breaks the rules, the operation fails with an error message. This is used where you want to prevent unauthorized modifications.

Implementing filter predicates

Step 1: Create dummy users and a table, and grant read access to these objects.

CREATE SCHEMA Sales
GO

CREATE TABLE Sales.Region (
    id int,
    SalesRepName nvarchar(50),
    Region nvarchar(50),
    CustomerName nvarchar(50)
);

-- Inserting data
INSERT INTO Sales.Region VALUES (1, 'Mann', 'Central Canada', 'C1');
INSERT INTO Sales.Region VALUES (2, 'Anna', 'East Canada', 'E1');
INSERT INTO Sales.Region VALUES (3, 'Anna', 'East Canada', 'E2');
INSERT INTO Sales.Region VALUES (4, 'Mann', 'Central Canada', 'C2');
INSERT INTO Sales.Region VALUES (6, 'Anna', 'East Canada', 'E3');

-- Creating users
CREATE USER SalesManager WITHOUT LOGIN;
CREATE USER Mann WITHOUT LOGIN;
CREATE USER Anna WITHOUT LOGIN;

-- Granting read access to the users
GRANT SELECT ON Sales.Region TO SalesManager;
GRANT SELECT ON Sales.Region TO Mann;
GRANT SELECT ON Sales.Region TO Anna;

Step 2: Create the security predicate function.

-- Creating a schema for the security predicate function
CREATE SCHEMA spf;

CREATE FUNCTION spf.securitypredicatefunc(@SaleRepName AS NVARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS securitypredicate_result
WHERE @SaleRepName = USER_NAME() OR USER_NAME() = 'SalesManager';

The function returns a table with a single value of 1 when the WHERE condition is satisfied. SCHEMABINDING ensures that the underlying objects (tables, views, etc.) referenced by the function cannot be dropped or altered while the function exists.

Step 3: Create the security policy that adds the filter predicate and binds the predicate function to the table.

CREATE SECURITY POLICY MySalesFilterPolicy
ADD FILTER PREDICATE spf.securitypredicatefunc(SalesRepName)
ON Sales.Region
WITH (STATE = ON);

Step 4: Test your RLS.

EXECUTE AS USER = 'Mann';
SELECT * FROM Sales.Region ORDER BY id;
REVERT;

When a user (e.g., 'Mann') executes a query on the table, SQL Server automatically invokes the security predicate function for each row as part of the query execution plan. The permissions required to execute the function are handled internally by the SQL Server engine, so there is no need to grant permissions on the function explicitly.

Step 5: You can disable RLS by altering the security policy.

ALTER SECURITY POLICY MySalesFilterPolicy WITH (STATE = OFF);

Column-level security
CLS is similar to RLS, but as its name suggests, it applies at the column level. For example, in financial services, only account managers have access to customer Social Security numbers (SSNs), phone numbers, and other personally identifiable information (PII). The method of implementing CLS also differs: it is implemented through object-level permissions granted on specific columns.

Implementing CLS

Step 1: Create a dummy user and table.
CREATE USER TestUser WITHOUT LOGIN;

CREATE TABLE Membership (
    MemberID int IDENTITY,
    FirstName varchar(100) NULL,
    SSN char(9) NOT NULL,
    LastName varchar(100) NOT NULL,
    Phone varchar(12) NULL,
    Email varchar(100) NULL
);

Step 2: Grant the user access to all columns except the sensitive column.

GRANT SELECT ON Membership (
    MemberID, FirstName, LastName, Phone, Email
) TO TestUser;

Step 3: Now, if the user tries to select all columns, the query returns an error.

EXECUTE AS USER = 'TestUser';
SELECT * FROM Membership;

Dynamic data masking
DDM limits the exposure of sensitive data to users who should not be able to view it. For example, customer service agents may need to access customer records but should not see full credit card numbers, which can therefore be masked. You may ask: why not use CLS, or why not completely restrict access? For these reasons:
- CLS completely restricts both reading and altering a column. Masking applied to a column, however, does not prevent updates to that column: users who see masked data when querying the column can still update it if they have write permissions.
- With masking, you can use SELECT INTO or INSERT INTO to copy data from a masked column into another table, where it will be stored masked (assuming it's exported by a user without UNMASK privileges). With CLS, you can't do anything at all with a column you don't have access to.

NOTE: Administrative users and roles (such as sysadmin or db_owner) can always view unmasked data via the CONTROL permission, which includes both the ALTER ANY MASK and UNMASK permissions. You can grant or revoke the UNMASK permission at the database, schema, table, or column level to a user, database role, Microsoft Entra identity, or Microsoft Entra group.

Implementing DDM

Step 1: Create a dummy user.

CREATE USER MaskingTestUser WITHOUT LOGIN;

Step 2: Create a table and apply masking to the required columns.

CREATE TABLE Data.Membership (
    FirstName VARCHAR(100) MASKED WITH (FUNCTION = 'partial(1, "xxxxx", 1)') NULL,
    LastName VARCHAR(100) NOT NULL,
    Phone VARCHAR(12) MASKED WITH (FUNCTION = 'default()') NULL,
    Email VARCHAR(100) MASKED WITH (FUNCTION = 'email()') NOT NULL,
    DiscountCode SMALLINT MASKED WITH (FUNCTION = 'random(1, 100)') NULL
);

-- Inserting sample data
INSERT INTO Data.Membership VALUES ('Kapil', 'Dev', '555.123.4567', 'kapil@team.com', 10);

Here I have applied both default and custom masking functions.

Step 3: Grant SELECT permission on the schema where the table resides; the user sees masked data.

GRANT SELECT ON SCHEMA::Data TO MaskingTestUser;

Step 4: Granting the UNMASK permission allows the user to see unmasked data.

GRANT UNMASK TO MaskingTestUser;

Step 5: Use the ALTER TABLE statement to add a mask to an existing column, or to change the mask on that column.

ALTER TABLE Data.Membership
ALTER COLUMN LastName ADD MASKED WITH (FUNCTION = 'partial(2, "xxxx", 0)');

ALTER TABLE Data.Membership
ALTER COLUMN LastName VARCHAR(100) MASKED WITH (FUNCTION = 'default()');

Synapse role-based access control
Synapse RBAC leverages built-in roles to assign permissions to users, groups, or other security principals to manage who can:
- Publish code artifacts and list or access published code artifacts.
- Execute code on Apache Spark pools and integration runtimes.
- Access linked (data) services that are protected by credentials.
- Monitor or cancel job executions, and review job output and execution logs.
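Tying back to the dynamic data masking section above: the post notes that UNMASK can be granted at the database, schema, table, or column level. A minimal sketch of what that looks like, assuming your SQL engine supports granular UNMASK grants (newer Azure SQL Database and SQL Server versions do; verify availability on your Synapse pool before relying on it). The user, schema, and column names match the earlier example.

-- Broad grant: the user sees every masked column unmasked (as in Step 4 above)
GRANT UNMASK TO MaskingTestUser;

-- Narrower alternatives: unmask only one schema, table, or column
GRANT UNMASK ON SCHEMA::Data TO MaskingTestUser;
GRANT UNMASK ON Data.Membership TO MaskingTestUser;
GRANT UNMASK ON Data.Membership(Phone) TO MaskingTestUser;

-- Take the permission away again
REVOKE UNMASK ON Data.Membership(Phone) FROM MaskingTestUser;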
Ways to load data in Synapse DW from External Data Source

Introduction
After transforming data, we need to save it to the data warehouse for further analysis. This is part of ETL (Extract, Transform, Load): extracting data from various sources, transforming it as the business requires, and then loading it into the destination. There are several ways to load data into the DW:
- PolyBase
- COPY INTO command
- ADF Copy activity
- Spark pool in Synapse
- Databricks
In this blog I will explain these approaches and discuss some of their limitations and workarounds.

NOTE: Whenever we use other tools to insert data into Synapse, such as ADF, Data Flow, Spark pools, or Databricks, it is highly recommended to enable staging (in some cases it is applied automatically under the hood). Staging allows for efficient data transfer and handles large datasets more effectively, minimizing the load on both the source and the destination during the transfer. Internally, the data is first loaded into a staging layer (a temporary storage layer) and then loaded from there into the DW. After the data is loaded into the DW, the temporary data in the staging layer is deleted.

PolyBase
PolyBase is a technology that uses metadata to allow SQL Server and dedicated SQL pools to query and import data from external data sources using T-SQL.

Step 1: Create a database-scoped credential. This defines how Synapse connects to the external data source (Azure Data Lake Storage Gen2). Here I am using Managed Identity, which means Synapse uses its own identity to access the data lake. You also have to give the Storage Blob Data Contributor role to the managed identity of the Synapse workspace.

CREATE DATABASE SCOPED CREDENTIAL adlsgen2synp
WITH IDENTITY = 'MANAGED IDENTITY'
GO

Step 2: Create an external data source. This is a definition of the external data source location.

CREATE EXTERNAL DATA SOURCE taxi_raw_data
WITH (
    LOCATION = 'abfss://raw@synapselearningadls.dfs.core.windows.net/',
    CREDENTIAL = adlsgen2synp,
    TYPE = HADOOP
);
GO

Step 3: Create an external file format. This defines the format of the file, such as CSV.

CREATE EXTERNAL FILE FORMAT csv_file_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = FALSE
    )
);
GO

Step 4: Create an external table so that the external data can be queried directly from Synapse without importing it.

CREATE EXTERNAL TABLE temp_table.ext_taxi_zone (
    LocationID SMALLINT,
    Borough VARCHAR(15),
    Zone VARCHAR(50),
    service_zone VARCHAR(15)
)
WITH (
    LOCATION = 'taxi_zone.csv',
    DATA_SOURCE = taxi_raw_data,
    FILE_FORMAT = csv_file_format
);
GO

Step 5: Use a CTAS statement to import the data into the Synapse SQL DW.

CREATE TABLE temp_table.taxi_zone
WITH (
    DISTRIBUTION = ROUND_ROBIN
)
AS SELECT * FROM temp_table.ext_taxi_zone;

NOTE: PolyBase does not support the DELTA file format. External tables can't be modified; to alter columns, you must drop the table and recreate it.

COPY INTO command
This is a newer way of loading data into the DW. It doesn't need any extra objects; it copies the data directly into the target table. It is faster and simpler compared to PolyBase.

COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)

NOTE: COPY INTO supports only the CSV, PARQUET, and ORC file types.

ADF Copy activity

Step 1: Give the ADF managed identity permissions to access the Synapse database.
Here EXTERNAL PROVIDER refers to Microsoft Entra.

CREATE USER [adf-rnd-learning] FROM EXTERNAL PROVIDER;
GO
EXEC sp_addrolemember db_owner, [adf-rnd-learning];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [adf-rnd-learning];

Step 2: Create linked services for ADLS Gen2 (the external data source) and Azure Synapse Analytics. Create a dataset for the source file in ADLS and one for the sink table in the Synapse database. Create a pipeline and use the Copy activity to copy data from ADLS to the Synapse database. Set the source and sink datasets in the Copy activity and enable staging. The staged data is stored temporarily and is deleted after it is loaded into the Synapse database.

NOTE: The ADF Copy activity doesn't support DELTA. A workaround is to use a Data Flow in ADF; while adding the source in the Data Flow, choose Inline as the source type.

Spark pool in Synapse
In a Spark pool we can work with Synapse DW objects with the help of the Spark connector, and the best part is that Synapse automatically manages all authentication for us, so we don't have to manage credentials. The sqlanalytics package contains a Spark connector that is used to connect to a dedicated SQL pool from a Spark pool. Under the hood, staging data is loaded automatically to a default location.

Databricks
In Databricks, to access Synapse DW objects we first have to manage credentials for both Synapse and ADLS.

Step 1: Give Databricks the required permissions to access ADLS Gen2. Create a service principal in Microsoft Entra ID and add a client secret to it. Assign the Storage Blob Data Contributor role on ADLS to this service principal. Then add the client_id, tenant_id, and client secret of the service principal to Azure Key Vault. Then create a secret scope in Databricks and add the required details of your key vault. Databricks keeps the secret scope UI hidden; to reveal it, append "#secrets/createScope" to the URL of the Databricks homepage.

Step 2: Give the service principal permission to access the Synapse database by executing the query below in the Synapse database.

CREATE USER [databricks_learning_appreg] FROM EXTERNAL PROVIDER;
GO
EXEC sp_addrolemember db_owner, [databricks_learning_appreg];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [databricks_learning_appreg];

Step 3: Fetch the credentials from the Databricks secret scope.

client_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-id')
tenant_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-tenant-id')
client_secret = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-secret')

Step 4: Set the ADLS configurations in Databricks.

spark.conf.set("fs.azure.account.auth.type.synapselearningadls.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.synapselearningadls.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.synapselearningadls.dfs.core.windows.net", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret.synapselearningadls.dfs.core.windows.net", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.synapselearningadls.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

Step 5: Set the Synapse SQL DW configurations in Databricks.
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", client_id) spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", client_secret) Step 6: Now load the data from ADLS into data frame and perform some transformations. Step 7: Now write the transformed data to Synapse SQL DW. df.write.format("sqldw") \ .option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \ .option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \ .option("forwardSparkAzureStorageCredentials", "false") \ .option("dbTable", "temp_table.taxi_zone") \ .option('enableServicePrincipalAuth', 'true') \ .mode('append') \ .save() Step 8: Now check the table, transformed data is inserted into Synapse SQL DW. df2 = spark.read.format("sqldw") \ .option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \ .option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \ .option("forwardSparkAzureStorageCredentials", "false") \ .option("dbTable", "temp_table.taxi_zone") \ .option('enableServicePrincipalAuth', 'true') \ .load() Here you notice that in PySpark code, I am defining staging location for both reading and writing.374Views0likes0CommentsProtect PII information in your Microsoft Fabric Lakehouse with Responsible AI
Protect PII information in your Microsoft Fabric Lakehouse with Responsible AI

Data analysts and data scientists need to protect the personally identifiable information (PII) of their clients, such as names, addresses, emails, phone numbers, or Social Security numbers, that they use to build reports and dashboards. PII can pose risks to both the data subjects and the data holders, and can introduce biases that affect the decisions made based on the data. One way to protect PII is to apply Responsible AI, a set of principles and practices that help mask PII with synthetic or anonymized data that preserves the statistical properties and structure of the original data but does not reveal the identity or attributes of the individuals.
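As a generic illustration of the masking idea (not the specific Responsible AI tooling the post refers to), a Fabric Lakehouse notebook could pseudonymize PII columns with Spark SQL before the data reaches reports; the table and column names are assumptions.

-- Run in a Fabric notebook against a Lakehouse table; writes a de-identified copy
CREATE OR REPLACE TABLE customers_masked AS
SELECT
    sha2(email, 256)                                AS email_pseudonym, -- stable hash preserves join/grouping structure
    regexp_replace(phone, '[0-9](?=[0-9]{4})', '*') AS phone_masked,    -- keep only the last 4 digits visible
    postal_code,                                                        -- coarse attributes can stay as-is
    total_spend
FROM customers;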
Discover the Future of Data Engineering with Microsoft Fabric for Technical Students & Entrepreneurs

Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place. This makes it an ideal platform for technical students and entrepreneurial developers looking to streamline their data engineering and analytics workflows.
Game of Learner Clinics for Data 2022

The Game of Learners Clinics for Data is a 5-week skilling initiative that aims to help you learn and gain technical skills for working with data in the cloud. This initiative also gives you the option to receive a free voucher to sit the DP-900 Azure Data Fundamentals exam. Register at https://aka.ms/dataclinics/register
Blobfuse is an open source project developed to provide a virtual filesystem backed by the Azure Blob storage.

First published on MSDN on Feb 20, 2019. Blobfuse uses the libfuse open source library to communicate with the Linux FUSE kernel module, and implements the filesystem operations using the Azure Storage Blob REST APIs.