HDInsight
Securing Azure HDInsight: ESM Support with Ubuntu 18.04, Cluster Updates, and Best Practices
Azure HDInsight, Microsoft's cloud-based big data analytics platform, continues to advance its features to provide users with a secure and efficient environment. In this article, we will explore the latest enhancements, focusing on Expanded Security Maintenance (ESM) support, the importance of regular cluster updates, and best practices recommended by Microsoft to fortify HDInsight deployments.

The foundation of a secure Azure HDInsight environment lies in its ability to address critical vulnerabilities promptly. Microsoft ensures this by shipping the latest HDInsight images with Expanded Security Maintenance (ESM) support, which provides a framework for ongoing support and stability with minimal changes, specifically targeting critical, high, and some medium-level fixes. This ensures that HDInsight users benefit from a continuously updated and secure environment.

ESM Support in Latest Images: Azure HDInsight 5.0 and 5.1 versions use the Ubuntu 18.04 Pro image. Ubuntu Pro includes security patching for all Ubuntu packages through Expanded Security Maintenance (ESM) for Infrastructure and Applications. Ubuntu Pro 18.04 LTS will remain fully supported until April 2028. For more information on what's new in the latest HDInsight images with ESM support, users can refer to the official release notes on the Azure HDInsight Release Notes Archive.

Periodic Cluster Updates: Maintaining a secure HDInsight environment requires diligence in keeping clusters up to date. Microsoft facilitates this process through the HDInsight OS patching mechanism. Periodically updating clusters using the procedures outlined in the official documentation ensures that users benefit from the latest features, performance improvements, and crucial security patches. Learn more about updating HDInsight clusters through the Azure HDInsight OS Patching documentation.

ESM and HDI Release Integration: Expanded Security Maintenance is seamlessly integrated into HDInsight releases. As part of each HDInsight release, the critical fixes provided by ESM are bundled, so users benefit from the latest security enhancements with each new release.

Customer Recommendation: Use the Latest Image: To maximize the benefits of the latest features and security updates, customers are strongly recommended to use the most recent HDInsight image number. By doing so, organizations ensure that their HDInsight clusters are fortified against the latest threats and vulnerabilities.

Accessing Fixed CVE Details: For users seeking detailed information about the fixed Common Vulnerabilities and Exposures (CVEs), the Ubuntu CVE site serves as a valuable resource. There, users can access comprehensive insights into the specific vulnerabilities addressed in each release, empowering them to make informed decisions about their security posture.
Migration of Apache Spark from HDInsight 5.0 to HDInsight 5.1

Azure HDInsight Spark 5.0 to HDI 5.1 Migration

A new version of HDInsight, 5.1, has been released with Spark 3.3.1. This release improves join query performance via Bloom filters, increases the Pandas API coverage with support for popular Pandas features such as datetime.timedelta and merge_asof, and simplifies migration from traditional data warehouses by improving ANSI compliance and supporting dozens of new built-in functions. In this article we will discuss the migration of user applications from HDInsight Spark 3.1 to HDInsight Spark 3.3.
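As a quick illustration of one behavior worth validating during migration, the sketch below (Scala, illustrative only; the application name and the cast expression are hypothetical, and enabling spark.sql.ansi.enabled is shown purely as an example of the stricter ANSI behavior referenced above, not as a required setting) confirms the runtime version on the new cluster and shows how an invalid cast surfaces as an error instead of a silent NULL once ANSI mode is on:

// Illustrative migration sanity check: confirm the Spark runtime version on the
// new cluster and observe ANSI-mode cast behavior. Names and values are examples only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdi51-migration-check")
  .getOrCreate()

// Expect a 3.3.x version string on HDInsight 5.1.
println(s"Spark version: ${spark.version}")

// With ANSI mode enabled, invalid casts raise runtime errors instead of
// silently returning NULL (enabling it is an optional, per-workload decision).
spark.conf.set("spark.sql.ansi.enabled", "true")

try {
  spark.sql("SELECT CAST('not-a-number' AS INT) AS v").show()
} catch {
  case e: Exception =>
    println(s"ANSI mode rejected the invalid cast: ${e.getMessage}")
}

Running the same statement with spark.sql.ansi.enabled left at its default returns NULL instead of failing, which is the kind of behavioral difference to keep in mind when validating applications after migration.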
HDInsight - Iceberg Open-Source Table Format

Author(s): Arun Sethia is a Program Manager in the Azure HDInsight Customer Success Engineering (CSE) team.

Introduction

In my previous blog, we talked about leveraging Delta Lake on Azure HDInsight; this blog is dedicated to the Iceberg open-source table format. At this moment, HDInsight does not support Iceberg out of the box, so in this blog we will see how you can leverage the Iceberg open table format with HDInsight. We will also see code examples that use tables from three data sources: Delta, Iceberg, and Spark native Parquet files.

A quick touch on Apache Iceberg - Apache Iceberg is an open-source table format that provides a transactional and versioned storage layer for big data analytics needs, originally developed to address issues in Apache Hive. It provides an alternative to Spark's default Parquet-based storage. It adds tables to compute engines, including Spark, Trino, Flink, Hive, etc., using a high-performance table format that works like a SQL table.

A few salient features:
- Schema evolution without side effects.
- Partition layout evolution based on data volume and query patterns.
- Hidden partitioning – users do not need to maintain a partition column (by transformation), so they do not need to supply partition layout information when querying Iceberg tables.
- Time travel, useful for examining changes over time. Maintaining versions helps users correct problems by resetting tables to a stable data version.

Apache Iceberg provides an easy way to extend Spark with table specifications. In the next few sections, we will understand how to extend Spark (limited to Spark SQL) and configure the Apache Iceberg catalog with HDInsight.

Spark Extension

Before we jump into how to use Iceberg with HDInsight 5.0 and Spark, let us spend some time understanding a few concepts related to Spark extensions. They help explain the why and the how of the Iceberg configuration with HDInsight.

SparkSessionExtensions provides various extension points to extend the Spark session. A Spark SQL extension is a custom class that extends the behavior of Spark SQL. Extensions are configured using the configuration property spark.sql.extensions, which accepts a comma-separated list of fully qualified class names of the extension classes you want to use in Spark SQL. With these extensions, you can extend and customize the behavior of Spark SQL. Common Spark SQL extensions include the CSV data source, the Avro data source, query optimization rules, etc.

Such application-specific Spark configuration can be supplied in multiple ways, depending on your deployment method:
- by passing the configuration as part of the spark-submit command,
- using the configure magic command from Jupyter, or
- programmatically from the Spark job code with the SparkSession object.

Spark Catalog

The Spark catalog is a centralized metadata repository that provides information on data and metadata in Spark. It stores information about data stored in Spark: tables, databases, partitions, etc. It supports SQL operations on metadata, such as CREATE TABLE, ALTER TABLE, etc. By default, Spark comes with two catalogs, hive and in-memory. The hive catalog stores metadata in a Hive Metastore, and the in-memory catalog stores metadata in memory. Spark and Hive use independent catalogs to access tables created using Spark SQL or Hive tables.
A table created by Spark resides in the Spark catalog, and a table created from Hive resides in the Hive catalog. We can change the Spark default catalog to the Hive catalog using the configuration property metastore.catalog.default with the value hive. In that case, your Spark tables are managed tables in the Hive catalog and must be transaction enabled, or you can have external tables without transactions enabled.

The Spark configuration property spark.sql.catalog configures the list of catalogs available in Spark SQL; it allows us to configure multiple catalogs, and the configuration property spark.sql.defaultCatalog allows you to set the default catalog. We can create custom catalogs by implementing the Catalog interface in Spark and registering the catalog with Spark using the spark.sql.catalog.<catalog name> configuration parameter, where <catalog name> is a unique identifier for your custom catalog. In addition to managing metadata, Spark persists Spark SQL data in the warehouse directory configured using spark.sql.warehouse.dir.

Iceberg Extension

Iceberg core library components enable integration with compute engines like Spark, Flink, etc. These connectors are maintained in the Iceberg repository and are built for multiple versions. Iceberg provides a runtime connector for each supported version of Spark. The runtime jar (iceberg-spark-runtime) is the only addition to the classpath needed to use the Iceberg open-source table format.

The Iceberg Spark connector provides an extension for Spark SQL via the IcebergSparkSessionExtensions class, which adds support for Iceberg tables. The extension allows users to interact with Iceberg tables in Spark SQL using the DataFrame and SQL APIs, just as for Parquet tables.

The following table provides the Iceberg runtime compatibility matrix per Spark version:

Spark Version | Latest Iceberg Support | Gradle
2.4           | 1.1.0                  | org.apache.iceberg:iceberg-spark-runtime-2.4:1.1.0
3.0           | 1.0.0                  | org.apache.iceberg:iceberg-spark-runtime-3.0_2.12:1.0.0
3.1           | 1.1.0                  | org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.1.0
3.2           | 1.1.0                  | org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
3.3           | 1.1.0                  | org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0

Using HDInsight, you can configure the connector either as part of your uber jar (as a dependent jar) or as part of the spark-submit command, and the SQL extension either as part of your application code or via spark-submit.

Configure Iceberg Runtime Connector & Extension

Using Spark-Submit

Provide the Iceberg runtime connector dependency by supplying a Maven coordinate with --packages, and provide the Spark SQL extension using the Spark configuration property:

spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.1.0 --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Using Application Code

Provide the Iceberg runtime connector as a Maven dependency in your application pom.xml file:

<dependency>
  <groupId>org.apache.iceberg</groupId>
  <artifactId>iceberg-spark-runtime-${spark.major.version}_${scala.major.version}</artifactId>
  <version>1.1.0</version>
</dependency>

We can use SparkConf to set up the Iceberg Spark SQL extension and the Iceberg runtime connector jar:

val sparkConf = new SparkConf()
sparkConf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
sparkConf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.1.0")

Use SparkConf to build the Spark session:
val spark = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()

Using Jupyter Notebook

%%configure
{ "conf": {"spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.1.0",
           "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
         }
}

Iceberg Spark SQL Catalog

Iceberg supplies the following Spark SQL catalog implementations:
- org.apache.iceberg.spark.SparkSessionCatalog – adds support for Iceberg tables to Spark's built-in catalog, and delegates to the built-in catalog for non-Iceberg tables.
- org.apache.iceberg.spark.SparkCatalog – supports a Hive Metastore or a Hadoop warehouse as a catalog.

Spark Session Catalog

Iceberg tables can use Spark's built-in catalog; we can configure spark_catalog to use Iceberg's SparkSessionCatalog. The Iceberg session catalog loads non-Iceberg tables using the Spark built-in catalog. The catalog type hive specifies that the Hive catalog is used for Iceberg tables, while non-Iceberg tables are created in the default Spark catalog.

spark.sql.catalog.spark_catalog = org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type = hive

You can provide these configurations based on the application development and deployment method set by your enterprise:
- If you use the spark-submit command from edge nodes or the Livy API to submit a job, you can use the 'conf' parameters.
- From your application code, you can use the SparkConf object to set these configurations and create the SparkSession using SparkConf.
- From a Jupyter notebook, you can use the configure magic command as follows:

%%configure
{ "conf": {"spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.1.0",
           "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
           "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
           "spark.sql.catalog.spark_catalog.type": "hive"
         }
}

For example, if you run the following code with the above configuration, you should see `iceberg_table` created in the Hive catalog as an external table and `spark_table` created in the Spark catalog as a managed table.

spark.sql("""CREATE TABLE IF NOT EXISTS iceberg_table (id string, creation_date string, last_update_time string) USING iceberg""")
spark.sql("""CREATE TABLE IF NOT EXISTS spark_table (id string, creation_date string, last_update_time string)""")

You can use the following query against the Hive Metastore (HMS) to find the catalog name for `iceberg_table` and `spark_table`:

SELECT t.TBL_NAME, dbc.ctgName
FROM TBLS t
INNER JOIN (SELECT c.Name as ctgName, d.DB_ID as dbId FROM CTLGS c, DBS d WHERE d.CTLG_NAME = c.NAME) dbc ON dbc.dbId = t.DB_ID
WHERE TBL_NAME IN ('iceberg_table','spark_table')

The output shows `iceberg_table` under the hive catalog and `spark_table` under the spark catalog.

Custom Catalog

Iceberg supports multiple data catalog types such as Hive, Hadoop, JDBC, or custom catalog implementations. These catalogs are configured using the Spark configuration property spark.sql.catalog.<<catalogname>> and the catalog type using spark.sql.catalog.<<catalogname>>.type.
Hive Catalog:

spark.sql.catalog.<<catalogname>> = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<<catalogname>>.type = hive
spark.sql.catalog.<<catalogname>>.uri = thrift://metastore-host:port
spark.sql.catalog.<<catalogname>>.warehouse = abfs://<<warehouse path>>

Hadoop Catalog:

spark.sql.catalog.iceberg_hadoop = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_hadoop.type = hadoop
spark.sql.catalog.iceberg_hadoop.warehouse = abfs://<<warehouse path>>

The catalog uri property is optional; by default, it uses hive.metastore.uris from hive-site.xml (Ambari -> Hive -> Advanced). The catalog warehouse property is also optional; by default, it uses the 'Hive Metastore Warehouse directory' from hive-site.xml (Ambari -> Hive -> Advanced).

You can provide these configurations based on the application development and deployment method set by your enterprise:
- If you use the spark-submit command from edge nodes or the Livy API to submit a job, you can use the 'conf' parameters.
- From your application code, you can use the SparkConf object to set these configurations and create the SparkSession using SparkConf.
- From a Jupyter notebook, you can use the configure magic command as follows (in this case we are using the Hadoop catalog):

%%configure
{ "conf": {"spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.1.0",
           "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
           "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
           "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
           "spark.sql.catalog.iceberg.type": "hadoop",
           "spark.sql.catalog.iceberg.warehouse": "/iceberg/warehouse"
         }
}

Iceberg Hive Tables With Custom Catalog

The current HDInsight supports Apache Hive version 3.1.1. If you want to use Iceberg tables from Hive (catalog type hive), the HiveIcebergStorageHandler and supporting classes need to be made available on Hive's classpath. These classes are available as part of the Iceberg-Hive runtime jar. You can add the Iceberg-Hive runtime jar by including the jar file in Hive's auxiliary classpath, so it is available by default, or, if you want to use the Hive shell, by adding it explicitly:

add jar /<<path>>/iceberg-hive-runtime-1.1.0.jar;

Iceberg Hive tables using a custom catalog can be created in two ways:

Using the Hive DDL command from hive-cli or beeline:

CREATE EXTERNAL TABLE customer (
  id bigint,
  name string
) PARTITIONED BY (
  state string
) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES ('iceberg.catalog'='iceberg');

The table property 'iceberg.catalog'='iceberg' sets the table catalog to 'iceberg'.

Using the Iceberg Hive Catalog Java API – example code is available at the git repository:
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.types.Types
import org.apache.iceberg.{PartitionSpec, TableProperties, Schema => IcebergSchema}
import org.apache.iceberg.CatalogProperties
import org.apache.spark.sql.SparkSession

val catalogName = "iceberg"
val nameSpace = "default"
val tableName = "customer"

def createTableByHiveCatalog(spark: SparkSession): Unit = {
  import scala.collection.JavaConverters._
  // table specification starts
  val schema = new IcebergSchema(
    Types.NestedField.required(1, "id", Types.IntegerType.get()),
    Types.NestedField.required(2, "name", Types.StringType.get()),
    Types.NestedField.required(3, "state", Types.StringType.get())
  )
  val spec = PartitionSpec.builderFor(schema).bucket("state", 128).build()
  import org.apache.iceberg.catalog.TableIdentifier
  val tableIdentifier: TableIdentifier = TableIdentifier.of(nameSpace, tableName)
  val tblProperties = Map(TableProperties.ENGINE_HIVE_ENABLED -> "true", "iceberg.catalog" -> "iceberg")
  // table specification ends
  val catalog = new HiveCatalog()
  catalog.setConf(spark.sparkContext.hadoopConfiguration)
  val properties = Map(CatalogProperties.WAREHOUSE_LOCATION -> "/iceberg/warehouse/")
  catalog.initialize(catalogName, properties.asJava)
  catalog.createTable(tableIdentifier, schema, spec, s"/iceberg/warehouse/${tableName}", tblProperties.asJava)
}

Example - Delta, Iceberg and Spark Parquet

Let us jump into the example code. In this example, we create sample data for a product master, sales, and return transactions using Mockneat. The data are stored in three file formats – Spark Parquet, Iceberg Parquet, and Delta Parquet. The sample code using a Jupyter notebook is available at hdinotebookexamples.

Limitation:

There are limitations; at this moment, Delta Lake and Hudi do not support custom catalogs out of the box. However, since Iceberg supports a custom catalog, the option is to set spark_catalog to DeltaCatalog and create a custom catalog for Iceberg.

References:
https://iceberg.apache.org/
https://github.com/apache/hudi/issues/5537

HDInsight Open Table Formats:
Hudi on HDInsight - https://murggu.medium.com/apache-hudi-on-hdinsight-8d981269a97a
Delta Lake on HDInsight - Delta Lake on HDInsight - Microsoft Community Hub
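As a brief follow-up to the Iceberg examples above, once a table exists in the custom `iceberg` catalog its snapshot history can be inspected and queried with time travel, one of the features called out in the introduction. The sketch below is illustrative only; the table name, snapshot id, and timestamp values are hypothetical, and it assumes the catalog configuration shown earlier in this article.

// Illustrative sketch: inspecting snapshots and performing time-travel reads on an
// Iceberg table registered in the custom "iceberg" catalog configured earlier.
// The table name, snapshot id, and timestamp values below are placeholders.

// List the snapshots recorded for the table via Iceberg's metadata tables.
spark.sql("SELECT committed_at, snapshot_id, operation FROM iceberg.default.customer.snapshots").show(false)

// Read the table as of a point in time (epoch milliseconds).
val asOfTime = spark.read
  .option("as-of-timestamp", "1673312461000")   // placeholder epoch millis
  .format("iceberg")
  .load("iceberg.default.customer")

// Or pin the read to a specific snapshot id taken from the snapshots metadata table.
val asOfSnapshot = spark.read
  .option("snapshot-id", 10963874102873L)       // placeholder snapshot id
  .format("iceberg")
  .load("iceberg.default.customer")

asOfSnapshot.show(false)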
Agile Data Vault 2.0 Projects with Azure DevOps

Having discussed the value of Data Vault 2.0 and the associated architectures in the previous articles of this blog series, this article will focus on the organization and successful execution of Data Vault 2.0 projects using Azure DevOps. It will also discuss the differences between standard Scrum, as used in agile software development, and the Data Vault 2.0 methodology, which is based on Scrum but also includes aspects of other methodologies. Other functions of Azure DevOps, for example the deployment of the data analytics platform, will be discussed in subsequent articles of this ongoing blog series.
HDInsight 5.0 with Spark 3.x – Part 2

Author(s): Arun Sethia is a Program Manager in the Azure HDInsight Customer Success Engineering (CSE) team.

Introduction

This blog is part of the Spark 3.x series of blogs and looks at two different aspects of AQE (Adaptive Query Execution): dynamically switching join strategies and dynamically optimizing skew joins, using Apache Spark in Azure HDInsight.

Adaptive Query Execution – Dynamically Switching Join Strategies

There are multiple business scenarios where we must join multiple datasets to generate business insight for end-user consumption. Spark chooses among different join strategies based on the nature of the datasets and the query. A few of the join strategies are:
- Broadcast Hash Join
- Shuffle Hash Join
- Shuffle Sort-Merge Join
- Cartesian Join
- Broadcast Nested Loop Join

We will not get into the details of these join strategies in this blog; they are explained here. The Broadcast Hash Join is the most performant when either join side fits well in memory. Spark plans a Broadcast Hash Join if the estimated size of a join relation is less than the configured spark.sql.autoBroadcastJoinThreshold value; the smaller DataFrame is broadcast to all executors to perform the join.

The example code used for this blog has two datasets:
- Customer
- Sales

The business would like to get sales by date for a given state (filter customer data for a given state), which can be done by joining the customer dataset with the sales dataset.

%%sql
SELECT tx_date, sum(tx_value) AS total_sales
FROM sale
JOIN customer ON customer.customer_id = sale.customer_id
WHERE address_state="IL"
GROUP BY tx_date

The filter by address state is not known during static planning, so the initial plan opts for a sort-merge join. The customer table after filtering is small (~10% of the original), so the query could do a broadcast hash join instead.

The broadcast join is a very high-performance join and a better option for smaller datasets: the smaller table is sent to every executor to execute a map-side join. We can use the broadcast Spark join hint to enforce that a specific join strategy is used during the join operation; the broadcast hint suggests that Spark use a broadcast join regardless of the configuration property autoBroadcastJoinThreshold. Without AQE, the estimated size of join relations comes from the statistics of the original table, which can go wrong in many real-world cases. Developers can only determine the table/dataset size after applying a filter, and using a hint without knowing the dataset's size may result in an OOM exception.

AQE converts a sort-merge join to a broadcast hash join when the runtime statistics of either join side are smaller than the broadcast hash join threshold. It is not as efficient as planning a broadcast hash join in the first place, but it is better than continuing with the sort-merge join, as we save the sorting of both join sides and can read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true).

AQE can be enabled using Spark configuration (from Spark 3.2.0 onwards, it is enabled by default). Let's run the earlier code with AQE enabled on HDInsight 5.0 using Spark 3.1.1. The source code example is available on GitHub.
// Enable AQE
sql("SET spark.sql.adaptive.enabled=true")
sql("SET spark.sql.adaptive.localShuffleReader.enabled=true")

As we can see, Spark optimizes the initial plan; it replans the join strategy dynamically from a sort-merge join to a broadcast hash join at runtime because the size fits spark.sql.autoBroadcastJoinThreshold. We can also see that a local shuffle reader (CustomShuffleReader) is used to avoid shuffle when AQE optimizes the sort-merge join to the broadcast join.

Adaptive Query Execution – Dynamically Handling Skew Joins

Data skew happens when data are unevenly distributed for a given key column, meaning a few column values have many more rows than others. For example, the number of orders for a particular day/month/year may be much higher than for others, a few customers may place a disproportionate number of orders, claims may be unevenly distributed across geo-locations, page hits may be uneven across the hours of the day, etc.

A Spark join operation on two datasets requires moving data with the same join key(s) to the same executor. If your business data are skewed across the partition key column values, one or more partitions will have more data than the others. Such data skew can cause Spark jobs to have one or more trailing tasks (larger partitions), severely degrading the query's overall elapsed time and wasting resources on the cluster through idle waiting for these trailing tasks to complete. Skewed partitions may not fit in the memory of the executors; in that case, such tasks can cause garbage-collection problems or further slowness because data may spill to disk, or, in the worst case, Out of Memory exceptions that cause jobs to fail.

The example code used for this example has two datasets:
- Sales
- Items

In the example code the sales data has been modified to generate data skew for item id "18". The source code example is available on GitHub. The business would like to get sales by date; we need to join sales with the item dataset to get the item's price.

%%sql
SELECT tx_date, sum(soldQty * unit_price) AS total_sales
FROM sale
JOIN item ON item.item_id = sale.item_id
GROUP BY tx_date

It is difficult to detect data skew from the Spark query execution plan; it shows the steps performed to execute a job but does not show the data distribution across tasks. For this, we can use the Spark UI from the history server. If we examine the job details in the Spark UI, we can see that Stage 4 has taken the longest time, 7 minutes. Drilling down this stage to the task level shows summary statistics for all 201 tasks of this stage. There is a significant difference between the max and the 75th percentile or median, which is a strong indicator of data skew. The number of records processed by one task is significantly higher than for the other tasks, and we can also see that this task cannot fit its data in memory, so data are spilled onto disk.

We can manage data skew problems in multiple ways, such as using a derived column to divide large partitions, or a Broadcast Hash Join if the dataset is not too large, but we may still see room for performance improvements. Adaptive Query Execution in Spark 3.x can come to the rescue with minimal code change; it dynamically handles skew in sort-merge joins by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both the spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled configurations are enabled.
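For reference, here is a minimal sketch of setting these flags on a SparkSession together with the two skew-tuning parameters described next. The numeric values are illustrative only, not recommendations, and the application name is a placeholder.

// Illustrative sketch: enabling AQE skew-join handling and its tuning knobs
// programmatically. Values are examples only.
import org.apache.spark.sql.SparkSession

val sparkSkewDemo = SparkSession.builder()
  .appName("aqe-skew-join-sketch")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  // A partition is treated as skewed if it is larger than this factor times the
  // median partition size...
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  // ...and also larger than this absolute size threshold.
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
  .getOrCreate()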
There are two additional parameters to tune skew joins in AQE:
- spark.sql.adaptive.skewJoin.skewedPartitionFactor – a partition is considered skewed if its size is larger than this factor multiplied by the median partition size.
- spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes – a partition is considered skewed if its size in bytes is larger than this threshold.

We will enable both AQE and skew join handling for our last example execution.

sql("SET spark.sql.adaptive.enabled=true")
sql("SET spark.sql.adaptive.skewJoin.enabled=true")

Looking at the execution plan and the SQL query plan, we can see that Spark optimizes the initial plan: the time taken by the query has dropped from 7 minutes to 1.5 minutes, and the number of tasks has been reduced from 200 to 70. If we drill down stage #6 to the task level, we can see summary statistics for all 70 tasks of this stage. There is not much difference between the max and the 75th percentile or median, because AQE has evened out the skewed partitions.

Looking at the Spark SQL plan from the Spark UI, we can find that:
- There is a skewed partition in the "sale" dataset.
- AQE splits the skewed partition into smaller partitions (in this case, 47 smaller partitions).
- Finally, the sort-merge join operator is marked with a skew join flag.

By default, AQE coalesce partitions (spark.sql.adaptive.coalescePartitions.enabled) is enabled; if not, you can enable it so that AQE coalesces smaller partitions into larger ones based on data statistics and processing resources, which reduces shuffle time and data transfer.

Summary

AQE in Spark optimizes joins, especially where the join involves a filter; it optimizes the initial plan from a sort-merge join to a broadcast hash join based on runtime statistics, replanning the join strategy dynamically at runtime. Similarly, AQE can help significantly in managing data skew; long-running jobs in particular should be analyzed for such opportunities so that developers can mitigate data-skew problems early, for better resource utilization and overall performance. AQE is a splendid feature added in Spark 3.x; migrating your HDInsight clusters to Spark 3.x can improve your business ROI and performance. Contact the HDInsight team if you need help migrating your workload from HDInsight 4.0 to HDInsight 5.x to take maximum advantage of these benefits.

References
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html
HDInsight 5.0 with Spark 3.x – Part 1 - Microsoft Community Hub
Enhanced autoscale capabilities in HDInsight clusters

HDInsight now has enhanced autoscale capabilities, which include improved latency and an improved feedback loop, alongside support for recommissioning NodeManagers in the case of load-aware autoscale. This improves cluster utilization massively and lowers the total cost of ownership significantly.