Big Data
DP-900: Microsoft Azure Data Fundamentals Study Guide
Microsoft Azure provides an array of services that enable businesses and organizations to undergo digital transformation by making quick, informed decisions. The DP-900 Microsoft Azure Data Fundamentals exam evaluates learners’ understanding of core data concepts, including relational data, non-relational data, big data, and analytics. Candidates must demonstrate their knowledge of these concepts and of the Azure data services that implement them. Microsoft offers resources such as the Microsoft Learn self-paced curriculum, an instructor-led course, and related documentation to help students prepare for the exam. Understanding Azure data principles is also vital for more advanced Azure certifications such as Azure Database Administrator Associate and Azure Data Engineer Associate.
A Step-by-Step Guide to migrate data from Elasticsearch to Azure Data Explorer (ADX) using Logstash

Data migration is the process of transferring data from one source to another. It can be a complex and time-consuming task, especially when dealing with large amounts of data. This article extends an existing article on migrating data from Elasticsearch to Azure Data Explorer (ADX), walking through the Logstash pipeline as a step-by-step guide.
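The pattern such a migration follows pairs Logstash’s elasticsearch input with the logstash-output-kusto output plugin. Below is a minimal pipeline sketch, assuming that plugin is installed; the endpoints, credentials, index, database, table, and mapping names are all placeholders, and the option names should be verified against the plugin version you install.

```
input {
  elasticsearch {
    hosts => ["http://localhost:9200"]       # source Elasticsearch endpoint (placeholder)
    index => "my-source-index"               # index to export (placeholder)
    query => '{ "query": { "match_all": {} } }'
  }
}
output {
  kusto {
    # local staging files that the plugin batches before ingestion
    path => "/tmp/kusto/%{+YYYY-MM-dd-HH-mm}.txt"
    ingest_url => "https://ingest-mycluster.westeurope.kusto.windows.net"  # ADX ingestion endpoint (placeholder)
    app_id => "<app-id>"                     # AAD service principal used for ingestion
    app_key => "<app-key>"
    app_tenant => "<tenant-id>"
    database => "MyDatabase"                 # target ADX database (placeholder)
    table => "MyTable"                       # target ADX table (placeholder)
    json_mapping => "MyTableMapping"         # pre-created JSON ingestion mapping (placeholder)
  }
}
```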
Using Azure Data Factory to orchestrate Kusto query-ingest

In this blog post, we’ll explore how Azure Data Factory (ADF) can be used to orchestrate large query ingestions. With this approach you will learn how to split one large query-ingest into multiple partitions, orchestrated with ADF.
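In ADF the splitting is typically driven by a Lookup activity that enumerates partitions and a ForEach activity that issues one Azure Data Explorer command per partition. As a hedged sketch of what each iteration executes, here is the per-partition `.set-append` ingest-from-query command issued through the Python Kusto SDK instead of ADF; the cluster URL, database, table names, and date-based partitioning are hypothetical.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster and database names -- replace with your own.
CLUSTER = "https://mycluster.westeurope.kusto.windows.net"
DATABASE = "MyDatabase"

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)

# One day per iteration; in ADF, the ForEach activity would drive this loop.
partitions = ["2024-01-01", "2024-01-02", "2024-01-03"]

for day in partitions:
    # .set-append runs the query on the cluster and appends the result to the
    # target table -- one bounded ingestion per partition instead of one huge one.
    command = f"""
    .set-append TargetTable <|
        SourceTable
        | where ingestion_time() >= datetime({day})
          and ingestion_time() < datetime_add('day', 1, datetime({day}))
    """
    client.execute_mgmt(DATABASE, command)
```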
Empowering Startups: The Introductory Guide to Databricks for Entrepreneurs’ Data-Driven Success

Unlock the key to entrepreneurial success with Databricks: a journey where data empowers startups to thrive. Get ready to embark on a transformative quest for data-driven excellence!
Devito Book Summer Project with Imperial College London

If you are interested in numerical computation, programming in Python, and/or applied mathematics, and would like to contribute to our open-source textbook, feel free to reach out to us on Slack and check out the devito_book repository on GitHub.
Ingest, prepare, and transform using Azure Databricks and Data Factory

Today’s business managers depend heavily on reliable data integration systems that run complex ETL/ELT (extract-transform-load and extract-load-transform) workflows. These workflows allow businesses to ingest data in various forms and shapes from different on-premises and cloud data sources, transform and shape the data, and gain actionable insights to make important business decisions. With the general availability of Azure Databricks comes support for doing ETL/ELT with Azure Data Factory. Read more about it in the Azure blog.
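As a minimal sketch of the prepare-and-transform step such an ADF-triggered Databricks job might run (the storage paths, container names, and column names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adf-etl-example").getOrCreate()

# Ingest: read raw events landed by Data Factory (path is a placeholder).
raw = spark.read.json("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")

# Prepare/transform: clean the data and aggregate it into an analytics-friendly shape.
daily = (
    raw.filter(F.col("event_type").isNotNull())
       .withColumn("event_date", F.to_date("event_timestamp"))
       .groupBy("event_date", "event_type")
       .count()
)

# Load: write curated output for downstream reporting (path is a placeholder).
daily.write.mode("overwrite").parquet(
    "abfss://curated@mystorageaccount.dfs.core.windows.net/daily_event_counts/"
)
```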
Run your PySpark Interactive Query and batch job in Visual Studio Code

We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. For PySpark developers who value the productivity of the Python language, the VSCode HDInsight Tools offer a quick Python editor with a simple getting-started experience, and enable you to submit PySpark statements to HDInsight clusters with interactive responses. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights. Read more about it in the Azure blog.
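For instance, the kind of short statement you might submit interactively to the cluster looks like the sketch below; it assumes the interactive session exposes the usual pre-created `spark` SparkSession, and the storage path and column names are hypothetical.

```python
# A typical interactive PySpark statement: load a CSV from the cluster's
# default storage and aggregate it (path and columns are placeholders).
df = spark.read.csv(
    "wasbs://data@mystorageaccount.blob.core.windows.net/sales.csv",
    header=True,
    inferSchema=True,
)
df.groupBy("region").sum("amount").show(10)  # interactive response shown in VSCode
```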
Siphon: Streaming data ingestion with Apache Kafka

Data is at the heart of Microsoft’s cloud services, such as Bing, Office, Skype, and many more. As these services have grown and matured, the need to collect, process, and consume data has grown with them. Data powers decisions, from operational monitoring and management of services to business and technology decisions. Data is also the raw material for intelligent services powered by data mining and machine learning. Most large-scale data processing at Microsoft has been done using a distributed, scalable, massively parallelized storage and computing system that is conceptually similar to Hadoop. This system supported data processing using a batch paradigm. Over time, the need for large-scale data processing at near-real-time latencies emerged, to power a new class of ‘fast’ streaming data pipelines.

Siphon was created as a highly available and reliable service to ingest massive amounts of data for processing in near real time. Apache Kafka is a key technology used in Siphon, serving as its scalable pub/sub message queue. Siphon handles ingestion of over a trillion events per day across multiple business scenarios at Microsoft. Initially, Siphon was engineered to run on Microsoft’s internal data center fabric. Over time, the service took advantage of Azure offerings such as Apache Kafka for HDInsight to operate on Azure. Read about it in the Azure blog.
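Siphon’s internals are not public, but the pub/sub pattern it builds on is plain Kafka. As a hedged illustration, here is a minimal producer publishing one event, using the kafka-python client; the broker address, topic name, and event payload are hypothetical.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic are placeholders for your own Kafka deployment.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one telemetry event to the ingestion topic; a Siphon-style service
# fans events like these out to downstream near-real-time processors.
producer.send("telemetry-events", value={"service": "search", "latency_ms": 42})
producer.flush()  # block until the event has actually been handed to the broker
```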
Announcing Apache Kafka for Azure HDInsight general availability

Apache Kafka on Azure HDInsight was added last year as a preview service to help enterprises create real-time big data pipelines. Since then, large companies such as Toyota, Adobe, Bing Ads, and GE have been using this service in production to process over a million events per second, powering scenarios for connected cars, fraud detection, clickstream analysis, and log analytics. HDInsight has worked very closely with these customers to understand the challenges of running a robust, real-time production pipeline at enterprise scale. Using those learnings, we have implemented key features in the managed Kafka service on HDInsight, which is now generally available.

Running big data streaming pipelines is hard, and doing so with open-source technologies for the enterprise is even harder. Apache Kafka, a key open-source technology, has emerged as the de facto technology for ingesting large volumes of streaming events in a scalable, low-latency, and low-cost fashion. Enterprises want to leverage this technology, but there are many challenges with installing, managing, and maintaining a streaming pipeline: open-source bits lack commercial support, and in-house talent needs to be well versed in these technologies to ensure the highest levels of uptime. Every second an ingestion pipeline is down, data is lost. Read about it in the Azure blog.
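To make the consumption side of such a pipeline concrete, here is a minimal Kafka consumer sketch, again assuming the kafka-python client; the broker address, topic, and consumer group are hypothetical and would point at your HDInsight Kafka cluster’s broker endpoints.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Broker address, topic, and group id are placeholders.
consumer = KafkaConsumer(
    "telemetry-events",
    bootstrap_servers="broker1:9092",
    group_id="fraud-detection",    # consumers in one group share the topic's partitions
    auto_offset_reset="earliest",  # start from the oldest retained event if no offset exists
)

for message in consumer:
    # Each message is one event from the stream; a real pipeline would
    # deserialize and score it (e.g., for fraud or clickstream analysis).
    print(message.topic, message.partition, message.offset, message.value)
```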