Big Data
41 Topics

Ingest, prepare, and transform using Azure Databricks and Data Factory
Today’s business managers depend heavily on reliable data integration systems that run complex ETL/ELT workflows (extract, transform, load and extract, load, transform). These workflows allow businesses to ingest data in various forms and shapes from different on-prem and cloud data sources, transform and shape that data, and gain actionable insights from it to make important business decisions. With the general availability of Azure Databricks comes support for doing ETL/ELT with Azure Data Factory. Read more about it in the Azure blog.

Run your PySpark Interactive Query and batch job in Visual Studio Code
We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. For PySpark developers who value the productivity of the Python language, VSCode HDInsight Tools offer a quick Python editor with a simple getting-started experience, and enable you to submit PySpark statements to HDInsight clusters with interactive responses. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights. Read more about it in the Azure blog.

Siphon: Streaming data ingestion with Apache Kafka
Data is at the heart of Microsoft’s cloud services, such as Bing, Office, Skype, and many more. As these services have grown and matured, the need to collect, process, and consume data has grown with them. Data powers decisions, from operational monitoring and management of services to business and technology decisions. Data is also the raw material for intelligent services powered by data mining and machine learning. Most large-scale data processing at Microsoft has been done using a distributed, scalable, massively parallelized storage and computing system that is conceptually similar to Hadoop. This system supported data processing using a batch processing paradigm. Over time, the need for large-scale data processing at near real-time latencies emerged, to power a new class of ‘fast’ streaming data processing pipelines. Siphon was created as a highly available and reliable service to ingest massive amounts of data for processing in near real-time. Apache Kafka is a key technology used in Siphon, serving as its scalable pub/sub message queue. Siphon handles ingestion of over a trillion events per day across multiple business scenarios at Microsoft. Initially, Siphon was engineered to run on Microsoft’s internal data center fabric. Over time, the service took advantage of Azure offerings, such as Apache Kafka for HDInsight, to operate on Azure. Read about it in the Azure blog.

Announcing Apache Kafka for Azure HDInsight general availability
Apache Kafka on Azure HDInsight was added last year as a preview service to help enterprises create real-time big data pipelines. Since then, large companies such as Toyota, Adobe, Bing Ads, and GE have been using this service in production to process over a million events per second to power scenarios for connected cars, fraud detection, clickstream analysis, and log analytics. HDInsight has worked very closely with these customers to understand the challenges of running a robust, real-time production pipeline at enterprise scale. Using what we learned, we have implemented key features in the managed Kafka service on HDInsight, which is now generally available. Running big data streaming pipelines is hard. Doing so with open source technologies for the enterprise is even harder. Apache Kafka, a key open source technology, has emerged as the de facto technology for ingesting large streaming events in a scalable, low-latency, and low-cost fashion. Enterprises want to leverage this technology; however, there are many challenges with installing, managing, and maintaining a streaming pipeline. Open source bits lack support, and in-house talent needs to be well versed in these technologies to ensure the highest levels of uptime. Every second an ingestion pipeline is down, data is lost. Read about it in the Azure blog.

Structured streaming with Azure Databricks into Power BI & Cosmos DB
In this blog we’ll discuss the concept of Structured Streaming and how a data ingestion path can be built using Azure Databricks to enable the streaming of data in near real-time. We’ll touch on some of the analysis capabilities that can be called directly from within Databricks utilising the Text Analytics API, and also discuss how Databricks can be connected directly to Power BI for further analysis and reporting. As a final step, we cover how streamed data can be sent from Databricks to Cosmos DB as the persistent storage. Structured Streaming is a stream processing engine that allows computation to be expressed on streaming data (e.g. a Twitter feed) in much the same way as batch computation is expressed on a static dataset. Computation is performed incrementally via the Spark SQL engine, which continuously updates the result as the streaming data flows in. Read more about it in the Azure blog.

Welcome our newest family member - Data Box Disk
Last year at Ignite, we talked to you about the preview of Azure Data Box, a ruggedized, portable, and simple way to move large datasets into Azure. So far, the response has been phenomenal. Customers have used Data Box to move petabytes of data into Azure. While our customers and partners love Data Box, they told us that they also wanted a lower-capacity, even easier-to-use option. They cited examples such as moving data from Remote Offices/Branch Offices (ROBOs), which have smaller data sets and minimal on-site tech support. They said they needed an option for recurring, incremental transfers for ongoing backups and archives. And they said it needed to have the same traits as Data Box: fast, simple, and secure. So, we're here today with our partners at Inspire 2018 to announce a new addition to the Data Box family: Azure Data Box Disk. Read about it in the Azure blog.

Azure HDInsight Performance Benchmarking: Interactive Query, Spark, and Presto
Fast SQL query processing at scale is often a key consideration for our customers. In this blog post we compare HDInsight Interactive Query, Spark, and Presto using the industry-standard TPC-DS benchmarks. These benchmarks are run using out-of-the-box default HDInsight configurations, with no special optimizations. Customers who want to run these benchmarks themselves can follow the easy-to-use steps outlined on GitHub. The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. According to TPC-DS, the benchmark provides a representative evaluation of performance as a general-purpose decision support system. A benchmark result measures query response time in single-user mode, query throughput in multi-user mode, and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as big data systems, to execute the benchmark. Please note that these are unaudited results. Read about it in the Azure blog.

Azure Time Series Insights API, Reference Data, Ingress, and Azure Portal Updates
Today we are announcing the release of several updates to Time Series Insights based on customer feedback. Time Series Insights is a fully managed analytics, storage, and visualization service that makes it simple to explore and analyze billions of IoT events simultaneously. It allows you to visualize and explore time series data streaming into Azure in minutes, all without having to write a single line of code. For more information about the product, pricing, and getting started, please visit the Time Series Insights website. We also offer a free demo environment so you can experience the product for yourself. We know that administrators want to plan for and manage their Time Series Insights environments with usage and health telemetry in the Azure Portal. To help them do this more effectively, we have added ingress and storage monitoring at the Time Series Insights environment level in the Portal. We are also working on adding metric alerts, so you can be automatically informed of critical information related to the status of your environment. We will continue to add more environment telemetry to the Azure Portal in the future, so be on the lookout for updates in the coming months. Read about it in the Azure blog.

Go serverless with R Scripts on Azure Function
Serverless is all the rage, and now you can get in on the action using R! Azure Functions supports a variety of languages (C#, F#, JavaScript, batch, PowerShell, Python, PHP, and the list is growing). However, R is not natively supported. In the following blog we describe how you can run R scripts on Azure Functions using the R site extension. Azure Functions can be used in several scenarios because of the broad choice of triggers offered. A Timer trigger executes a function on a schedule. An HTTP trigger executes a function after an HTTP call. Azure Queue Storage, Service Bus, and Blob Storage triggers execute the function when a new object or message is received. Learn about it in the Azure blog.

Power BI Embedded dashboards with Azure Stream Analytics
Azure Stream Analytics is a fully managed “serverless” PaaS service in Azure built for running real-time analytics on fast moving streams of data. Today, a significant portion of Stream Analytics customers use Power BI for real-time dynamic dashboarding. Support for Power BI Embedded has been a repeated ask from many of our customers, and today we are excited to share that it is now generally available. Read about it in the Azure blog.