Analytics on Azure Blog

Decision Guide for Selecting an Analytical Data Store in Microsoft Fabric

SlavaTrofimov (Microsoft)
Jan 02, 2025

How to select an analytical data store in Microsoft Fabric based on your workload's data volumes, data type requirements, compute engine preferences, data ingestion patterns, data transformation needs, query patterns, and other factors.

Introduction

Microsoft Fabric is a unified data platform that offers a variety of data stores that can be used for analytical purposes, including SQL Databases, Data Warehouses, Lakehouses, and Eventhouses. Each data store offers a compelling blend of functionality and capabilities, which makes Microsoft Fabric uniquely capable of handling diverse analytical workloads. The choice of an optimal analytical data store for each workload is influenced by multiple factors, such as data volumes, required data types, preferred compute engines, data ingestion patterns, data transformation needs, data retrieval patterns, granular access controls, and integration with OneLake and other components of Microsoft Fabric. This document compares the analytical data stores, offers a decision guide for selecting a suitable data store for any workload, and illustrates common use cases and architectural patterns associated with Fabric’s analytical data stores.

Overview of Common Analytical Data Stores

SQL Databases

SQL Databases in Microsoft Fabric are designed for structured data and can support transactional and analytical workloads. They are ideal for scenarios where data volumes are moderate, typically ranging from several gigabytes to a few terabytes. SQL Databases support a wide variety of structured data types, ranging from integers, strings and dates to more specialized types, such as geometry, geography, JSON, XML and more. SQL Databases fully implement the mature and efficient T-SQL-based relational database engine, making them suitable for high-frequency updates and operations that require transactional consistency and referential integrity. SQL Databases support batch and transactional data ingestion patterns and provide data transformation capabilities through stored procedures, views, user defined functions and SQL queries. They are optimized for low-latency queries, highly selective lookups, and highly concurrent data retrieval patterns. SQL Databases provide robust granular access controls, including object-level, column-level and row-level security. Furthermore, thanks to automatic OneLake mirroring, SQL Databases in Fabric seamlessly integrate with the broader Fabric ecosystem to support data processing by any Fabric compute engine, cross-warehouse queries, Direct Lake mode semantic models in Power BI, and more.
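As a rough illustration of the transactional-consistency pattern described above (multi-statement updates that either fully succeed or fully roll back), the sketch below uses Python's built-in sqlite3 module as a generic stand-in for a relational engine. The table, column names, and transfer logic are invented for this example; a Fabric SQL Database would express the same pattern in T-SQL.

```python
import sqlite3

# In-memory database as a stand-in for any relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit together or neither does."""
    try:
        with conn:  # opens a transaction; rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        return False  # transaction was rolled back; balances unchanged
    return True

transfer(conn, 1, 2, 30.0)   # succeeds: both rows updated together
transfer(conn, 1, 2, 500.0)  # fails: rollback leaves both rows untouched
```

The point of the sketch is the all-or-nothing behavior: the failed second transfer leaves both account rows exactly as the first transfer left them, which is the guarantee a relational engine provides for high-frequency operational updates.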

Data Warehouses

Data Warehouses in Microsoft Fabric are optimized for large-scale analytical workloads and can effectively handle data volumes, ranging from gigabytes to petabytes. They primarily support structured data types and are optimized for diverse queries and analytical workloads. Data Warehouses use a relational engine and are designed for high-throughput batch data ingestion and flexible transformations using stored procedures, views and other T-SQL queries. Data Warehouses are optimized for analytical data retrieval patterns and excel at executing diverse queries over large volumes of data. Efficient workload management and burstable capacities allow Data Warehouses to reach high levels of concurrency. Data Warehouses provide extensive access controls including object-level, column-level and row-level security. Fabric Data Warehouses physically store data in OneLake and expose it in Delta format to any Fabric compute engine.

Lakehouses

Lakehouses combine the features of data lakes and data warehouses, providing a unified platform for both structured and unstructured data. They can manage up to petabytes of data and support a wide range of data types, including structured, semi-structured, and unstructured. Lakehouses use the flexible and scalable Spark compute engine, which supports PySpark, Spark SQL, Scala and R for data engineering and data science workloads. They support both batch and real-time data ingestion patterns, making their versatility an asset for diverse analytical workloads. Lakehouses provide data transformation capabilities through Spark-based processing, enabling complex data engineering and data science scenarios. Lakehouses are built on top of OneLake and typically store their data in the Delta format, which promotes sharing and interoperability across the entire enterprise. The integrated SQL Analytics Endpoint provides Lakehouses with the ability to query data that is physically stored in OneLake using the T-SQL language in a manner that resembles a relational data warehouse. Lakehouses tend to excel at analytical data retrieval patterns and can effectively query massive volumes of data. The SQL Analytics Endpoint of a Lakehouse provides granular access controls, including object-level, column-level and row-level security.

Eventhouses

Eventhouses in Microsoft Fabric are designed for real-time event processing and analytics, capable of handling high-velocity data streams and processing millions of events per second. They support structured, semi-structured, and unstructured data, tailored for streaming events and time series analysis. Eventhouses use Kusto Query Language (KQL) and a subset of T-SQL for real-time analytics and event processing. They support real-time data ingestion patterns, optimized for high-velocity data streams, but may also ingest data in a batch manner. Eventhouses provide data transformation capabilities through KQL queries (often implemented as update policies), enabling real-time data processing and analytics. They are optimized for efficient, scalable, and highly concurrent query patterns that are often required for real-time analytics. KQL databases in Eventhouses support granular access controls to ensure that data is only accessed by authorized users. Eventhouses can be configured to make their data automatically available in OneLake, making it suitable for consumption by other Fabric experiences, albeit with some additional latency.
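Conceptually, an update policy runs a transformation on each batch of events as it lands in a source table and writes the refined result to a target table. The following Python sketch mimics that flow in miniature. It is purely illustrative: real Eventhouses express update policies in KQL, and the event fields and thresholds here are invented.

```python
import json
from collections import defaultdict

raw_events = []     # stand-in for the source ("raw") table
parsed_events = []  # stand-in for the target table fed by the update policy

def on_ingest(payload: str) -> None:
    """Mimic an update policy: as each event lands in the source table,
    parse and enrich it, then append the result to the target table."""
    raw_events.append(payload)
    event = json.loads(payload)  # parse the semi-structured input
    event["severity"] = "high" if event["latency_ms"] > 250 else "normal"
    parsed_events.append(event)

for line in ['{"device": "a1", "latency_ms": 120}',
             '{"device": "b7", "latency_ms": 300}']:
    on_ingest(line)

# Downstream, real-time-style aggregation over the refined table.
worst = defaultdict(int)
for e in parsed_events:
    worst[e["device"]] = max(worst[e["device"]], e["latency_ms"])
```

The transformation happens at ingestion time rather than at query time, which is what lets downstream queries over the refined table stay fast and simple.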

Comparison of Analytical Data Stores

The following table provides a comparison of key features of analytical data stores in Microsoft Fabric. Throughout this table, a ✅ symbol indicates that the capability is readily supported and recommended. The ⚠️ symbol indicates that the capability is supported with some considerations or limitations. The ❌ symbol indicates that the capability is typically not supported or not recommended.

Note: there is no commonly accepted definition for the terms “small data” and “Big Data,” and definitions of these terms will continue to change over time along with evolving capabilities of data platforms. Nevertheless, in this decision guide at the time of this writing, “small data” generally refers to total data volumes ranging from megabytes to hundreds of gigabytes, with individual tables up to a hundred gigabytes in size and up to tens of gigabytes of data ingested per day. The term “Big Data” generally refers to data volumes measured in tens of terabytes to petabytes, individual tables that are multiple terabytes in size and data ingestion rates exceeding hundreds of gigabytes per day. Data volumes that fall between the "small data" and "Big Data" thresholds can be described as "moderate" or "medium" data.
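As a rough illustration, the working definitions above can be captured in a small Python helper. The exact cutoffs below are this article's informal thresholds, not official Fabric limits:

```python
def classify_data_volume(total_gb: float, largest_table_gb: float,
                         daily_ingest_gb: float) -> str:
    """Classify a workload's data volume per this article's working definitions.
    Thresholds are illustrative approximations, not official limits."""
    # "Big Data": tens of TB+ in total, multi-TB tables,
    # or ingestion exceeding hundreds of GB per day.
    if total_gb >= 10_000 or largest_table_gb >= 2_000 or daily_ingest_gb >= 200:
        return "big"
    # "Small data": up to hundreds of GB in total, tables up to ~100 GB,
    # and up to tens of GB ingested per day.
    if total_gb <= 500 and largest_table_gb <= 100 and daily_ingest_gb <= 50:
        return "small"
    # Everything between the two thresholds.
    return "moderate"
```

For example, a 2 TB estate with 300 GB tables and 80 GB/day of ingestion falls between the thresholds and would be classified as "moderate" data.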

 

| Capability | SQL Database | Data Warehouse | Lakehouse | Eventhouse |
| --- | --- | --- | --- | --- |
| Data volumes: | | | | |
| Small | ✅ | ✅1 | ✅1 | ✅1 |
| Moderate | ✅ | ✅ | ✅ | ✅ |
| Big Data | ❌ | ✅ | ✅ | ✅ |
| Supported types of data: | | | | |
| Structured | ✅ | ✅ | ✅ | ✅ |
| Semi-structured | ⚠️ | ⚠️ | ✅ | ✅ |
| Unstructured | ❌ | ❌ | ✅ | ✅ |
| Primary compute engine: | | | | |
| Write operations | T-SQL | T-SQL | Spark (PySpark, Spark SQL, Scala, R), Python | KQL |
| Read operations | T-SQL | T-SQL | T-SQL, Spark (PySpark, Spark SQL, Scala, R), Python | KQL, T-SQL |
| Data ingestion patterns: | | | | |
| Typical ingestion frequency | Moderate-High | Moderate | Moderate-High | High |
| Recommended batch size | Small-Medium | Medium-Large | Small-Large | Small-Large |
| Efficiency of appends | High | High | High | High |
| Efficiency of updates/deletes | High | Moderate | Moderate | Low |
| Data ingestion tools (in Microsoft Fabric): | | | | |
| Pipelines | ✅ | ✅ | ✅ | ✅ |
| Dataflows Gen 2 | ✅ | ✅ | ✅ | ✅ |
| Shortcuts | ❌ | ⚠️ | ✅ | ✅ |
| Eventstreams | ❌ | ❌ | ✅ | ✅ |
| Spark Connectors | ⚠️ | ⚠️ | ✅ | ⚠️ |
| T-SQL Commands | ✅ | ✅ | ❌ | ❌ |
| KQL Commands | ❌ | ❌ | ❌ | ✅ |
| Data transformation capabilities: | | | | |
| Variety of supported structured data types | High | Moderate | Moderate | Moderate |
| Parsing of semi-structured data | ⚠️ | ⚠️ | ✅ | ✅ |
| Parsing of unstructured data | ❌ | ❌ | ✅ | ⚠️ |
| SQL support (any dialect) | ✅ | ✅ | ✅ | ⚠️ |
| SQL surface area (any dialect) | Broad | Moderate | Broad | Limited2 |
| T-SQL surface area | Broad | Moderate | Limited2 | Limited2 |
| Python support | ❌ | ❌ | ✅ | ⚠️ |
| Spark support (PySpark, Spark SQL, Scala, R) | ❌ | ❌ | ✅ | ❌ |
| KQL support | ❌ | ❌ | ❌ | ✅ |
| Transformation extensibility3 | Moderate | Moderate | Very High | High |
| Single-table transaction support | ✅ | ✅ | ✅ | ✅ |
| Multi-table transaction support | ✅ | ✅ | ❌ | ⚠️ |
| Data retrieval patterns: | | | | |
| Optimized for selective lookups | ✅ | ❌ | ❌ | ✅ |
| Optimized for large scans and aggregations | ⚠️ | ✅ | ✅ | ✅ |
| Ideal query execution time4 | Milliseconds+ | Tens of milliseconds+ | Tens of milliseconds+ | Milliseconds+ |
| Realistic query execution time5 | Subsecond+ | Seconds+ | Seconds+ | Subsecond+ |
| Peak query concurrency6 | High | High | High | High |
| Peak query throughput7 | Very High | High | High | Very High |
| Granular access controls: | | | | |
| Object-level security | Yes | Yes | Yes | Yes8 |
| Column-level security | Yes | Yes | Yes9 | No |
| Row-level security | Yes | Yes | Yes9 | Yes |
| OneLake integration: | | | | |
| Data available in OneLake | Yes10 | Yes | Yes | Yes11 |
| Data stored in open format (Delta) | Yes10 | Yes | Yes | Yes11 |
| Can be a source of shortcuts | Yes10 | Yes | Yes | Yes11 |
| Access data via shortcuts | No | Yes12 | Yes | Yes |
| Cross warehouse/lakehouse queries | Yes13 | Yes | Yes | Yes11 |
| Compute management: | | | | |
| Ability to customize size and configuration of compute resources | Low | Low | High | Low |
| Administrative skillset needed to manage or tune compute resources | Low | Low | Moderate-High | Low |

Footnotes 

  1. Data Warehouses, Lakehouses and Eventhouses do not have minimum data volume requirements and offer equivalent functionality at all data volumes. Yet, some benefits offered by these highly-scalable systems may not be fully realized with small data volumes.
  2. Lakehouses and Eventhouses support a subset of T-SQL surface area and are limited to read-only operations.
  3. Refers to the ability to extend data transformations using user-defined functions, methods, references to external modules or libraries, etc.
  4. Represents lower bounds of execution times for light queries using small volumes of data from warm cache, excluding network latency or the time needed to render results in a client application. Note that query execution times can be heavily influenced by numerous factors and results may vary with your particular workload.
  5. Represents lower bounds of response times to mixed queries using moderate volumes of data, excluding network latency or the time needed to render results in a client application. Note that query execution times can be heavily influenced by numerous factors and results may vary with your particular workload.
  6. Peak number of queries that can be executing simultaneously, relative to other analytical data stores.
  7. Peak number of queries that can be completed over a given period of time, relative to other analytical data stores. This number is influenced by concurrency, query duration, and other factors.
  8. Partial object-level security is implemented using Restricted View Access Policies.
  9. Granular access controls are available for the SQL Analytics Endpoint.
  10. OneLake integration is implemented via automatic database mirroring.
  11. Via automatic sync from KQL Database to OneLake.
  12. Indirectly, via cross-database queries to Lakehouses.
  13. Available for mirrored data accessed via the SQL Analytics Endpoint.
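Footnote 7 notes that peak query throughput is influenced by concurrency and query duration. As a back-of-envelope relationship (a Little's-law-style identity, not a Fabric-specific formula), throughput can be estimated as concurrency divided by average query duration:

```python
def peak_query_throughput(concurrency: int, avg_query_seconds: float) -> float:
    """Rough steady-state estimate of queries completed per second:
    queries in flight divided by how long each one takes on average."""
    return concurrency / avg_query_seconds

# For example, 100 concurrent queries averaging 0.2 s each
# sustain roughly 500 completed queries per second.
estimate = peak_query_throughput(100, 0.2)
```

This is why a store with moderate concurrency but very fast queries (such as an Eventhouse serving millisecond-scale lookups) can still achieve very high throughput.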

Decision Guide for Selecting an Analytical Data Store

The following decision guide serves as a practical tool to help you select a suitable data store for a particular use case by answering a series of questions. In addition to selecting the primary data store, this decision guide may also suggest a complementary secondary data store to support your requirements more fully for a mixed analytical workload. This decision guide is intended to be applied for each use case or each individual data product with the understanding that multiple analytical data stores may be used to accommodate diverse workloads across your entire data estate.
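Although the full decision guide weighs many factors, its question-and-answer flow can be caricatured in a few lines of Python. The ordering and answers below loosely paraphrase this article's guidance; they are illustrative, not an official algorithm:

```python
def suggest_primary_store(streaming: bool, unstructured: bool,
                          volume: str, transactional: bool) -> str:
    """Toy decision flow paraphrasing this article's guidance.
    volume is 'small', 'moderate', or 'big' per the working definitions above."""
    if streaming:
        return "Eventhouse"          # real-time, high-velocity event ingestion
    if unstructured:
        return "Lakehouse"           # arbitrary data formats at large scale
    if transactional and volume in ("small", "moderate"):
        return "SQL Database"        # frequent updates, referential integrity
    if volume == "big":
        return "Data Warehouse or Lakehouse"  # large-scale SQL analytics
    return "SQL Database or Data Warehouse"
```

A real assessment would also weigh team skillsets, access-control needs, and integration requirements, and might pair the primary store with a complementary secondary one.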

Common Use Cases

Let’s examine several common analytical use cases and identify suitable analytical data stores in Microsoft Fabric for each of them. In this table, a ✅ symbol indicates that the data store is suitable for a use case. The ⚠️ symbol indicates that the data store may support the use case with some limitations or considerations. The ❌ symbol indicates that the data store is typically not recommended for a particular use case.

| Use Case | SQL Database | Data Warehouse | Lakehouse | Eventhouse |
| --- | --- | --- | --- | --- |
| Low-latency operational reporting | ✅ | ⚠️ | ⚠️ | ✅ |
| Enterprise data warehousing | ⚠️ | ✅ | ✅ | ⚠️ |
| Multi-hop (medallion) architecture | ⚠️* | ⚠️* | ✅* | ✅* |
| Departmental data marts | ✅ | ✅ | ✅ | ❌ |
| Real-time analytics | ⚠️ | ⚠️ | ⚠️ | ✅ |
| Ad-hoc analysis of event-based observational data | ⚠️ | ⚠️ | ⚠️ | ✅ |
| Ingestion and storage of arbitrary unstructured data | ❌ | ❌ | ✅ | ⚠️ |

*Assuming that a single data store is used to implement the entire use case. In practice, this use case is often implemented using multiple complementary data stores.

Low-Latency Operational Reporting

In low-latency operational reporting, data is constantly trickling in from source systems and is immediately viewed by large numbers of users. SQL Databases are an excellent analytical data store for small-to-moderate scale operational reporting due to their efficiency and flexibility. For large-scale operational reporting, Eventhouses provide a powerful combination of real-time ingestion and highly-performant, low-latency queries. Data Warehouses and Lakehouses may also be used for operational reporting scenarios, although they may be constrained by latency and may be less efficient for the highly selective query patterns often associated with operational reporting.

Enterprise Data Warehousing

For enterprise data warehousing, Data Warehouses are an obvious analytical data store due to their ability to handle large-scale analytical workloads and complex queries. Similarly, Lakehouses with their built-in SQL Analytics Endpoints are also an excellent choice, as they provide many of the capabilities and benefits of Data Warehouses along with other powerful capabilities for storing and processing data. SQL Databases are also suitable for data warehousing scenarios at enterprises with small to moderate data volumes. While Eventhouses provide a compelling set of capabilities for storing and processing large volumes of data, their limited support for SQL makes them less suitable for the role of a relational data warehouse.

Multi-Hop (Medallion) Architecture

Medallion architecture refers to an architectural pattern where data goes through multiple hops or transformations (e.g., bronze, silver, gold) on its journey from ingestion in raw form to refinement into a readily consumable form, such as a dimensional model. Solutions that follow the multi-hop architectural pattern can be implemented using a single data store or a combination of multiple complementary data stores.

Lakehouses are often a natural fit for the bronze and silver layers of the medallion architecture due to their support for diverse data types and large data volumes, as well as their ability to support complex transformations of data from multiple sources. While Lakehouses may serve as the final (i.e., "gold") layer in such solutions, it is not uncommon for Data Warehouses to be used as the gold layer due to their efficiency in serving refined data to large numbers of users.

Eventhouses can be an excellent choice for implementing the end-to-end multi-hop architectural pattern for structured or unstructured streaming data and real-time analytics.

SQL Databases may also be used to implement the multi-hop architectural pattern when data is predominantly structured and data volumes are small.

While the multi-hop architectural pattern can be implemented end-to-end in a Data Warehouse, it is less common due to the limited support for unstructured data. As discussed above, Data Warehouses are commonly used as the final (i.e., "gold") layer in a medallion architecture (in conjunction with one or more Lakehouses).
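To make the hop-by-hop refinement concrete, the sketch below walks a few records through bronze, silver and gold layers in plain Python. The records, field names and rules are invented for illustration; in Fabric these hops would typically be Spark or T-SQL transformations over Delta tables.

```python
# Bronze: raw records exactly as ingested, including malformed rows.
bronze = [
    {"order_id": "1001", "amount": "25.00", "country": "us"},
    {"order_id": "1002", "amount": "not-a-number", "country": "US"},
    {"order_id": "1003", "amount": "40.50", "country": "US"},
]

def to_silver(rows):
    """Silver: cleansed and conformed -- drop bad rows, normalize values."""
    silver = []
    for r in rows:
        try:
            silver.append({"order_id": int(r["order_id"]),
                           "amount": float(r["amount"]),
                           "country": r["country"].upper()})
        except ValueError:
            continue  # reject/quarantine malformed records
    return silver

def to_gold(rows):
    """Gold: aggregated into a readily consumable form (revenue by country)."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

gold = to_gold(to_silver(bronze))  # {"US": 65.5}
```

Each hop narrows and refines the data, which is why the raw-and-messy early layers favor a Lakehouse while the final aggregated layer is often served from a Data Warehouse.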

Departmental Data Marts

For departmental data marts, where data tends to be structured and data volumes tend to be moderate, SQL Databases are often an excellent analytical data store due to their efficiency in storing, processing and querying structured data. Data Warehouses are also a great choice for this use case, particularly when data volumes become larger. Lakehouses may also be used to implement departmental data marts, particularly when the data marts need to consolidate subsets of data from multiple Data Warehouses or Lakehouses using shortcuts.

Real-Time Analytics

For most real-time analytics scenarios, Eventhouses are the most suitable analytical data store due to their ability to handle high-velocity data streams, efficiently parse structured and unstructured data, and provide real-time insights at massive scale. Lakehouses may also serve as a data store for near real-time intelligence scenarios, since they support large-scale ingestion and processing of diverse data; yet, latency and query concurrency may be a concern in certain scenarios. When data volumes are small and data is structured, SQL Databases may also be a viable choice for real-time analytics due to their efficiency and support for low-latency queries. 

Ad-Hoc Analysis of Event-Based Observational Data

For ad-hoc analysis of event-based observational data, Eventhouses are the most suitable analytical data store due to their support for real-time data processing and low-latency queries along with a powerful, elegant and concise Kusto Query Language (KQL). SQL Databases, Data Warehouses and Lakehouses may also be used to analyze observational data; yet, these analytical data stores are not purposely designed for this use case and will suffer from certain limitations in accommodating this scenario.

Ingestion and Storage of Arbitrary Unstructured Data

For ingestion and storage of arbitrary unstructured data, Lakehouses are the most suitable analytical data store due to their support for storing data of any format at virtually unlimited scale. Eventhouses may also be effective at ingesting, storing and analyzing unstructured data that arrives as streams of events, such as free-form text. Yet, Eventhouses are not suitable for processing data in arbitrary binary file formats.

Reference Architectures for Sample Scenarios

The following diagram illustrates reference architectures for a few common scenarios, including operational reporting, enterprise data warehousing and real-time intelligence. Please note that Microsoft Fabric supports many additional capabilities that can accommodate business and technical requirements not reflected in these examples.

Operational Reporting

The operational reporting scenario prioritizes efficiency, latency and concurrency for small to medium volumes of structured or semi-structured data. In this scenario, data is ingested from source systems using Dataflows Gen 2 and Pipelines. Data is staged in a SQL Database, where it is cleansed and refined using T-SQL stored procedures. A Power BI semantic model in DirectQuery mode is built on top of the SQL Database (for efficiency, some tables in the semantic model may be imported). A combination of interactive and paginated Power BI reports delivers real-time operational reports to a large number of users. Optionally, data from the SQL Database can be exposed to external applications via the API for GraphQL in Microsoft Fabric or via external APIs directly connected to the SQL Database as a source of operational data. Note that data from the SQL Database is automatically mirrored into OneLake, which enables a set of complementary analytical capabilities described below as part of the enterprise data warehousing scenario.

Enterprise Data Warehousing

The enterprise data warehousing scenario prioritizes scalability and versatility in storing and analyzing medium to large volumes of structured, semi-structured or unstructured data. In this scenario, data is ingested from a variety of source systems using Pipelines and Dataflows Gen 2. Initially, the data is stored in a Lakehouse in its original form. The data is cleansed and refined using Spark Notebooks and Spark Job Definitions. The data is eventually loaded into a Data Warehouse, where it is combined with data from a mirrored SQL Database using cross-warehouse queries and further transformed using T-SQL stored procedures. A Power BI semantic model in Direct Lake mode is built on top of the Data Warehouse, which serves as a scalable and low-latency foundation for Power BI reports serving the analytical needs of end users.

Real-Time Analytics

The real-time analytics scenario prioritizes scalability, latency, concurrency, and versatility in ingesting large volumes of structured, semi-structured or unstructured data. In this scenario, data is pushed by the source application to the EventHub-compatible endpoint of an Eventstream, which processes and loads the data into a KQL Database in an Eventhouse. A set of update policies defined on tables in the KQL Database transforms and refines the streaming data. The data is visualized on Real-Time Dashboards. Similarly, a Power BI semantic model in DirectQuery mode powers Power BI reports and enables ad-hoc analysis in a no-code manner. A KQL Queryset is used by advanced users to perform sophisticated ad-hoc analysis and in-depth investigation of streaming data. Finally, Data Activator constantly monitors the streaming data for unexpected conditions and triggers notifications or actions in near real time.

Analytical Data Stores across the Enterprise Data Estate

We have previously focused on the merits of each individual analytical data store and the selection of a suitable data store for a particular use case. This section will help us understand how these individual choices may fit into the global context of the enterprise data estate.

Leading enterprises are adopting modern architectural patterns that maximize the value of their data by offering flexibility and agility to data practitioners while erasing the barriers for data exchange and collaboration. This often involves moving from a monolithic data repository to a distributed data mesh architecture where data is treated as a product and where each data domain is empowered to ingest, process and serve its data using technologies that are optimal for the requirements and preferences of the data domain. The variety of analytical data stores offered by Microsoft Fabric provides autonomy to each data domain in selecting its preferred technologies while enabling unified governance and seamless in-place sharing of data across domains. The following diagram illustrates an example of a data mesh architecture at a hypothetical company that takes advantage of multiple analytical data stores in Microsoft Fabric while allowing for effective data sharing across business units using shortcuts and other interfaces.

Please note that this diagram is oversimplified – it illustrates the core analytical data stores for each data domain without showing the full range of supporting items required to ingest, transform and analyze the data as part of an end-to-end analytical solution. Also note that data mesh topologies may vary significantly depending on the organizational size, structure, skill distribution, maturity level, governance requirements and other factors.

Conclusion

SQL Databases, Data Warehouses, Lakehouses and Eventhouses make Microsoft Fabric uniquely capable of handling diverse analytical workloads. Each of these analytical data stores provides a unique blend of capabilities and limitations that must be matched to the workload to achieve optimal results. While some use cases can be addressed using a single analytical data store, complex use cases involving mixed workloads are best served by multiple complementary analytical data stores, all readily available in Microsoft Fabric as a unified data platform. The data store comparisons, decision guide and reference architectures provided in this article can facilitate the selection of a suitable data store for each scenario and the integration of those data stores into an end-to-end solution.

Disclaimer

This article intentionally oversimplifies certain technical concepts in an attempt to provide practical and approachable guidance in selecting an analytical data store in Microsoft Fabric. Yet, this simplicity sacrifices certain nuances that may be relevant to your scenarios. While this document may serve as a starting point in the selection of an analytical data store, it should not be the final point. Please carefully review technical documentation for each data store to validate its suitability for your specific workload. 

Also note that this article discusses the characteristics of analytical data stores in Microsoft Fabric at the time when it was written (January 2025). As Microsoft Fabric continues to evolve and grow, new innovations and new capabilities will influence the completeness and validity of this decision guide.

Acknowledgements

I would like to express deep gratitude to Marcelo Silva, Anna Hoffman, Panos Antonopoulos, Joanna Podgoetsky, Shane Risk, Brad Schacht, Miles Cole, Buck Woody and others for their partnership in preparing and reviewing this article.

Updated Jan 28, 2025
Version 5.0