observability
The Future of AI: Reduce AI Provisioning Effort - Jumpstart your solutions with AI App Templates
In the previous post, we introduced Contoso Chat, an open-source RAG-based retail chat sample for Azure AI Foundry that serves as both an AI App template (for builders) and the basis for a hands-on workshop (for learners). We also briefly covered the five stages of the developer workflow (provision, setup, ideate, evaluate, deploy) that take developers from the initial prompt to a deployed product. But how can that sample help you build your own app? The answer lies in developer tools and AI App templates that give you a fast start and a solid foundation to build on. In this post, we answer that question with a closer look at Azure AI App templates: what they are, and how a reuse-and-extend approach built on open-source samples for core application architectures can jumpstart your productivity.

Automating the Linux Quality Assurance with LISA on Azure
Introduction
Building on the insights from our previous blog on how Microsoft ensures the quality of Linux images, this article elaborates on the open-source tools that are instrumental in ensuring exceptional performance, reliability, and overall quality of virtual machines on Azure. While numerous testing tools are available for validating Linux kernels, guest OS images, and user-space packages across various cloud platforms, finding a comprehensive testing framework that addresses the entire platform stack remains a significant challenge. A robust framework is essential: one that integrates seamlessly with Azure's environment, provides coverage for major testing suites such as LTP and kselftest, and covers critical areas like networking, storage, and specialized workloads, including Confidential VMs, HPC, and GPU scenarios. Such a unified testing framework is invaluable for developers, Linux distribution providers, and customers who build custom kernels and images. This is where LISA (Linux Integration Services Automation) comes into play. LISA is an open-source tool specifically designed to automate and enhance the testing and validation processes for Linux kernels and guest OS images on Azure. In this blog, we cover the history of LISA, its key advantages, the wide range of test cases it supports, and why it is an indispensable resource for the open-source community. LISA is available under the MIT License, making it free to use, modify, and contribute to.
History of LISA
LISA was initially developed as an internal tool by Microsoft to streamline the testing of Linux images and kernel validation on Azure. Recognizing the value it could bring to the broader community, Microsoft open-sourced LISA, inviting developers and organizations worldwide to leverage and enhance its capabilities. This move aligned with Microsoft's growing commitment to open-source collaboration, fostering innovation and shared growth within the industry. LISA serves as a robust solution to validate and certify that Linux images meet the stringent requirements of modern cloud environments. By integrating LISA into the development and deployment pipeline, teams can:
Enhance Quality Assurance: Catch and resolve issues early in the development cycle.
Reduce Time to Market: Accelerate deployment by automating repetitive testing tasks.
Build Trust with Users: Deliver stable and secure applications, bolstering user confidence.
Collaborate and Innovate: Leverage community-driven improvements and share insights.
Benefits of Using LISA
Scalability: Designed to run at scale, from a single test case to 10,000 test cases in one command.
Multiple platform orchestration: LISA has a modular design that supports running the same test cases on various platforms, including Microsoft Azure, Windows Hyper-V, bare metal, and other cloud-based platforms.
Customization: Users can customize test cases, workflows, and other components to fit specific needs, allowing for targeted testing strategies, such as building kernels on the fly or sending results to a custom database.
Community Collaboration: Being open source under the MIT License, LISA encourages community contributions, fostering continuous improvement and shared expertise.
Extensive Test Coverage: It offers a rich suite of test cases covering Azure and Linux VM compatibility, from kernel, storage, and networking to middleware.
How it works
Infrastructure: LISA is designed to be componentized and to maximize compatibility with different distros, so test cases can focus purely on test logic. Once test requirements (machines, CPU, memory, etc.) are defined, you just write the test logic without worrying about environment setup or managing services on different distributions.
Orchestration: LISA uses platform APIs to create, modify, and delete VMs. For example, LISA uses the Azure API to create VMs, run test cases, and delete VMs. While test cases run, LISA uses the Azure API to collect serial logs and can hot add/remove data disks. If other platforms implement the same serial log and data disk APIs, the test cases can run on those platforms seamlessly.
Distro compatibility: LISA abstracts more than 100 commands used in test cases, allowing authors to focus on validation logic rather than per-distro differences.
Pre-processing workflow: Assists in building the kernel on the fly, installing the kernel from package repositories, or modifying all test environments.
Test matrix: One run can test everything. For example, a single run can test different VM sizes on Azure, different images, or even different VM sizes and different images together. Anything that is parameterizable can be tested in a matrix.
Customizable notifiers: Enable saving test results and files to any type of storage or database.
Agentless and low dependency
LISA operates test systems via SSH without requiring additional dependencies, ensuring compatibility with any system that supports SSH. Although some test cases require installing extra dependencies, LISA itself does not. This allows LISA to test systems with limited resources or even different operating systems; for instance, LISA can run on Linux, FreeBSD, Windows, and ESXi.
Getting Started with LISA
Ready to dive in? Visit the LISA project at aka.ms/lisa to access the documentation.
Install: Follow the installation guide provided in the repository to set up LISA in your testing environment.
Run: Follow the instructions to run LISA on a local machine, on Azure, or against existing systems.
Extend: Follow the documentation to extend LISA with test cases, data sources, tools, platforms, workflows, and more.
Join the Community: Engage with other users and contributors through forums and discussions to share experiences and best practices.
Contribute: Modify existing test cases or create new ones to suit your needs, and share your contributions with the community to enhance LISA's capabilities.
Conclusion
LISA offers an open-source, collaborative testing solution designed to operate across diverse environments and scenarios, effectively narrowing the gap between enterprise demands and community-led innovation. By leveraging LISA, customers can ensure their Linux deployments are reliable and optimized for performance. Its comprehensive testing capabilities, combined with the flexibility and support of an active community, make LISA an indispensable tool for anyone involved in Linux quality assurance and testing. Your feedback is invaluable, and we would greatly appreciate your insights.
Effective Cloud Governance: Leveraging Azure Activity Logs with Power BI
We all generally accept that governance in the cloud is a continuous journey, not a destination. There's no one-size-fits-all solution, and depending on the size of your Azure cloud estate, staying on top of things can be challenging even at the best of times. One way of keeping your finger on the pulse is to closely monitor your Azure Activity Log. This log contains a wealth of information, ranging from noise to interesting to actionable data. One could set up alerts for delete and update signals; however, that can result in a flood of notifications. To address this challenge, you could develop a Power BI report, similar to this one, that pulls in the Azure Activity Log and allows you to group and summarize data by various dimensions. You still need someone to review the report regularly; however, consuming the data this way makes it a whole lot easier. This by no means replaces the need for setting up alerts for key signals, but it does give you a great view of what's happened in your environment. If you're interested, this is the KQL query I'm using in Power BI:
let start_time = ago(24h);
let end_time = now();
AzureActivity
| where TimeGenerated > start_time and TimeGenerated < end_time
| where OperationNameValue contains 'WRITE' or OperationNameValue contains 'DELETE'
| project TimeGenerated, Properties_d.resource, ResourceGroup, OperationNameValue, Authorization_d.scope, Authorization_d.action, Caller, CallerIpAddress, ActivityStatusValue
| order by TimeGenerated asc
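If you want to extend the report with a summary view, a minimal sketch along the same lines (same AzureActivity table and time window; the columns below are standard Activity Log fields) could group write and delete operations by caller to surface unusual activity at a glance:
let start_time = ago(24h);
let end_time = now();
AzureActivity
| where TimeGenerated > start_time and TimeGenerated < end_time
| where OperationNameValue contains 'WRITE' or OperationNameValue contains 'DELETE'
// Count operations and distinct resources touched per caller and operation type
| summarize Operations = count(), ResourcesTouched = dcount(_ResourceId) by Caller, OperationNameValue, ResourceGroup
| order by Operations desc
This makes it easy to spot, for example, a single caller performing an unusually high number of deletes across resource groups.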
How Microsoft Ensures the Quality of Linux VM Images and Platform Experiences on Azure?
In the continuously evolving landscape of cloud computing and AI, the quality and reliability of virtual machines (VMs) play a vital role for businesses running mission-critical workloads. With over 65% of Azure workloads running Linux, our commitment to delivering high-quality Linux VM images and platforms remains unwavering. This involves overcoming unique challenges and implementing rigorous validation processes to ensure that every Linux VM image offered on Azure meets high standards of quality and reliability. Ensuring the quality of Linux images and the overall platform experience on Azure means addressing the challenges posed by a unique platform stack and the complexity of managing and validating multiple independent release cycles. High-quality Linux VMs are essential for ensuring consistent performance, minimizing downtime and regressions, and enhancing security by addressing vulnerabilities with timely updates.
Figure 1: Complexity of Linux VMs in Azure
VM Image Updates: Azure's Marketplace offers a diverse array of Linux distributions, each maintained by its respective publisher. These distributions release updates on their own schedules, independent of Azure's infrastructure updates.
Package Updates: Within each Linux distribution, numerous packages are maintained and updated separately, adding another layer of complexity to the update and validation process.
Extension and Agent Updates: Azure provides more than 75 guest VM extensions to enhance operating system capabilities, security, recovery, and more. These extensions are updated independently, requiring careful validation to ensure compatibility and stability.
Azure Infrastructure Updates: Azure regularly updates its underlying infrastructure, including components like Azure Boost, to improve reliability, performance, and security.
VM SKUs and Sizes: Azure provides thousands of VM sizes with various combinations of CPU, memory, disk, and network configurations to meet diverse customer needs.
Managing concurrent updates across all VMs poses significant QA challenges. To address this, Azure uses rigorous testing, gating, and validation processes to ensure all components function reliably and meet customer expectations.
Azure's Approach to Overcoming Challenges
To address these challenges, we have implemented a comprehensive validation strategy that involves testing at every stage of the image and kernel lifecycle. By adopting a shift-left approach, we execute Linux VM-specific test cases as early as possible. This strategy helps us catch failures close to the source of changes, before they are deployed to the Azure fleet. Our validation gates integrate with various entry points and provide coverage for a wide variety of scenarios on Azure.
Upstream Kernel Validation: As a founding member of KernelCI, Microsoft validates commits from the Linux next and stable trees using Linux VMs in Azure and shares results with the community via the KernelCI database. This enables us to detect regressions at an early stage.
Azure-Tuned Kernel Validation: Azure-tuned kernels provided by our endorsed distribution partners are thoroughly validated and signed off by Microsoft before they are released to the Azure fleet.
Linux Guest Image Validation: The quality team works with endorsed distribution partners on major releases to conduct thorough validation. Each refreshed image, including those from third-party publishers, is validated and certified before being added to the Marketplace.
Automated pipelines are in place to validate the images once they are available in the Marketplace.
Package Validation (Unattended Update): We validate package updates against the target distro to prevent regressions and to ensure that only tested snapshots are used for updating Linux VMs in Azure.
Guest Extension Validation: Every Azure-provided extension undergoes Basic Validation Testing (BVT) across all images and kernel versions to ensure compatibility and functionality amid any changes. Additionally, comprehensive release testing is conducted for major releases to maintain reliability and compatibility.
New VM SKU Validation: Any new VM SKU undergoes validation to confirm it supports Linux before its release to the Azure fleet. This process includes functionality, performance, and stress testing across various Linux distributions, as well as compatibility tests with existing Linux images in the fleet.
Azure Host OS & Host Agent Validation: Updates to the Azure Host OS and agents are thoroughly tested from the Linux guest OS perspective to confirm that changes in the Azure host environment do not cause regressions in compatibility, performance, or stability for Linux VMs.
At any stage where regressions or bugs are identified, we block those releases to ensure they never reach customers. All issues are resolved and rigorously retested before images, kernels, or extension updates are made available. Through these robust validation processes, Azure ensures that Linux VMs consistently meet customer expectations, delivering a reliable, secure, and high-performance environment for mission-critical workloads.
Validation Tools for VM Guest Images and Kernel
To ensure the quality and reliability of Linux VM images and kernels on Azure, we leverage open-source kernel testing frameworks like LTP, kselftest, and fstests, along with extensive Azure-specific test cases available in LISA, to comprehensively validate all aspects of the platform.
LISA (Linux Integration Services Automation): Microsoft is committed to open source, and that is no different with our testing framework, LISA. LISA is an open-source core testing framework designed to meet all Linux validation needs. It includes over 400 tests covering performance, features, and security, ensuring comprehensive validation of Linux images on Azure. By automating diverse test scenarios, LISA enables early detection and resolution of issues, enhancing the stability and performance of Linux VMs.
Conclusion
At Azure, Linux quality is a fundamental aspect of our commitment to delivering reliable VM images and platforms. Through comprehensive testing and strong collaboration with Linux distribution partners, we ensure the quality and reliability of VMs while proactively identifying and resolving potential issues. This approach allows us to continually refine our processes and maintain the quality that customers expect from Azure. Quality is a core focus, and we remain dedicated to continuous improvement, delivering world-class Linux environments to businesses and customers. For us, quality is not just a priority; it's our standard. Your feedback is invaluable, and we would greatly appreciate your insights.
AI reports: Improve AI governance and GenAIOps with consistent documentation
AI reports are designed to help organizations improve cross-functional observability, collaboration, and governance when developing, deploying, and operating generative AI applications and fine-tuned or custom models. These reports support AI governance best practices by helping developers document the purpose of their AI model or application, its features, potential risks or harms, and applied mitigations, so that cross-functional teams can track and assess production-readiness throughout the AI development lifecycle and then monitor it in production. Starting in December, AI reports will be available in private preview in a US and EU Azure region for Azure AI Foundry customers. To request access to the private preview of AI reports, please complete the Interest Form. Furthermore, we are excited to announce new collaborations with Credo AI and Saidot to support customers' end-to-end AI governance. By integrating the best of Azure AI with innovative and industry-leading AI governance solutions, we hope to provide our customers with choice and help empower greater cross-functional collaboration to align AI solutions with their own principles and regulatory requirements.
Building on learnings at Microsoft
Microsoft's approach to governing generative AI applications builds on our Responsible AI Standard and the National Institute of Standards and Technology's AI Risk Management Framework. This approach requires teams to map, measure, and manage risks for generative applications throughout their development cycle. A core asset of the first (and iterative) map phase is the Responsible AI Impact Assessment. These assessments help identify potential risks and their associated harms, as well as mitigations to address them. As development of an AI system progresses, additional iterations can help development teams document their progress in risk mitigation and allow experts to review the evaluations and mitigations and make further recommendations or requirements before products are launched. Post-deployment, these assessments become a source of truth for ongoing governance and audits, and help guide how to monitor the application in production. You can learn more about Microsoft's approach to AI governance in our Responsible AI Transparency Report and find a Responsible AI Impact Assessment Guide and example template on our website.
How AI reports support AI impact assessments and GenAIOps
AI reports can help organizations govern their GenAI models and applications by making it easier for developers to provide the information needed for cross-functional teams to assess production-readiness throughout the GenAIOps lifecycle. Developers will be able to assemble key project details, such as the intended business use case, potential risks and harms, model card, model endpoint configuration, content safety filter settings, and evaluation results, into a unified AI report from within their development environment. Teams can then publish these reports to a central dashboard in the Azure AI Foundry portal, where business leaders can track, review, update, and assess reports from across their organization. Users can also export AI reports in PDF and industry-standard SPDX 3.0 AI BOM formats for integration into existing GRC workflows. These reports can then be used by the development team, their business leaders, and AI, data, and other risk professionals to determine whether an AI model or application is fit for purpose and ready for production as part of their AI impact assessment processes.
Being versioned assets, AI reports can also help organizations build a consistent bridge across experimentation, evaluation, and GenAIOps by documenting what metrics were evaluated, what will be monitored in production, and the thresholds that will be used to flag an issue for incident response. For even greater control, organizations can choose to implement a release gate or policy as part of their GenAIOps that validates whether an AI report has been reviewed and approved for production. Key benefits of these capabilities include:
Observability: Provide cross-functional teams with a shared view of AI models and applications in development, in review, and in production, including how these projects perform in key quality and safety evaluations.
Collaboration: Enable consistent information-sharing between GRC, development, and operational teams using a consistent and extensible AI report template, accelerating feedback loops and minimizing non-coding time for developers.
Governance: Facilitate responsible AI development across the GenAIOps lifecycle, reinforcing consistent standards, practices, and accountability as projects evolve or expand over time.
Build production-ready GenAI apps with Azure AI Foundry
If you are interested in testing AI reports and providing feedback to the product team, please request access to the private preview by completing the Interest Form. Want to learn more about building trustworthy GenAI applications with Azure AI? Here's more guidance and exciting announcements to support your GenAIOps and governance workflows from Microsoft Ignite:
Learn about new GenAI evaluation capabilities in Azure AI Foundry
Learn about new GenAI monitoring capabilities in Azure AI Foundry
Learn about new IT governance capabilities in Azure AI Foundry
Whether you're joining in person or online, we can't wait to see you at Microsoft Ignite 2024. We'll share the latest from Azure AI and go deeper into capabilities that support trustworthy AI with these sessions:
Keynote: Microsoft Ignite Keynote
Breakout: Trustworthy AI: Future trends and best practices
Breakout: Trustworthy AI: Advanced AI risk evaluation and mitigation
Demo: Simulate, evaluate, and improve GenAI outputs with Azure AI Foundry
Demo: Track and manage GenAI app risks with AI reports in Azure AI Foundry
We'll also be available for questions in the Connection Hub on Level 3, where you can find "ask the expert" stations for Azure AI and Trustworthy AI.
Continuously monitor your GenAI application with Azure AI Foundry and Azure Monitor
Now, Azure AI Foundry and Azure Monitor seamlessly integrate to enable ongoing, comprehensive monitoring of your GenAI application's performance from various perspectives, including token usage, operational metrics (e.g., latency and request count), and the quality and safety of generated outputs. With online evaluation, now available in public preview, you can continuously assess your application's outputs, regardless of its deployment or orchestration framework, using built-in or custom evaluation metrics. This approach can help organizations identify and address security, quality, and safety issues in both the pre-production and post-production phases of the enterprise GenAIOps lifecycle. Additionally, online evaluations integrate seamlessly with the new tracing capabilities in Azure AI Foundry, now available in public preview, as well as Azure Monitor Application Insights. Tying it all together, Azure Monitor enables you to create custom monitoring dashboards, visualize evaluation results over time, and set up alerts for advanced monitoring and incident response. Let's dive into how all these monitoring capabilities fit together to help you be successful when building enterprise-ready GenAI applications.
Observability and the enterprise GenAIOps lifecycle
The generative AI operations (GenAIOps) lifecycle is a dynamic development process that spans all the way from ideation to operationalization. It involves choosing the right base model(s) for your application, testing and making changes to the flow, and deploying your application to production. Throughout this process, you can evaluate your application's performance iteratively and continuously. This practice can help you identify and mitigate issues early and optimize performance as you go, helping ensure your application performs as expected. You can use the built-in evaluation capabilities in Azure AI Foundry, which now include remote evaluation and continuous online evaluation, to support end-to-end observability into your app's performance throughout the GenAIOps lifecycle. Online evaluation can be used in many different application development scenarios, including:
Automated testing of application variants.
Integration into DevOps CI/CD pipelines.
Regularly assessing an application's responses for key quality metrics (e.g., groundedness, coherence, recall).
Quickly responding to risky or inappropriate outputs that may arise during real-world use (e.g., content that is violent, hateful, or sexual).
Production application monitoring and observability with Azure Monitor Application Insights.
Now, let's explore how you can use tracing for your application to begin your observability journey.
Gain deeper insight into your GenAI application's processes with tracing
Tracing enables comprehensive monitoring and deeper analysis of your GenAI application's execution. This functionality allows you to trace the process from input to output, review intermediate results, and measure execution times. Additionally, detailed logs for each function call in your workflow are accessible. You can inspect the parameters, metrics, and outputs of each AI model utilized, which facilitates debugging and optimization of your application while providing deeper insight into the functioning and outputs of the AI models. The Azure AI Foundry SDK supports tracing to various endpoints, including local viewers, Azure AI Foundry, and Azure Monitor Application Insights. Learn more about new tracing capabilities in Azure AI Foundry.
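As a rough illustration of what this trace data can look like once it reaches Application Insights, the following Log Analytics sketch lists recent model calls with their latency and token usage. This is a hedged example, not the exact schema every SDK emits: the gen_ai.* attribute names follow the OpenTelemetry generative AI semantic conventions and may differ depending on your SDK and instrumentation version.
dependencies
| where timestamp > ago(24h)
// Keep only spans produced by generative AI instrumentation
| where isnotempty(customDimensions["gen_ai.system"])
| extend model = tostring(customDimensions["gen_ai.request.model"]),
         inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
         outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| project timestamp, operation_Id, name, model, duration, inputTokens, outputTokens, success
| order by timestamp desc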
Continuously measure the quality and safety of generated outputs with online evaluation
With online evaluation, now available in public preview, you can continuously evaluate your collected trace data for troubleshooting, monitoring, and debugging purposes. Online evaluation with Azure AI Foundry offers the following capabilities:
Integration between Azure AI services and Azure Monitor Application Insights
Monitor any deployed application, agnostic of deployment method or orchestration framework
Support for trace data logged via the Azure AI Foundry SDK or a logging API of your choice
Support for built-in and custom evaluation metrics via the Azure AI Foundry SDK
Can be used to monitor your application during all stages of the GenAIOps lifecycle
To get started with online evaluation, please review the documentation and code samples.
Monitor your app in production with Azure AI Foundry and Azure Monitor
Azure Monitor Application Insights excels in application performance monitoring (APM) for live web applications, providing many experiences to help enhance the performance, reliability, and quality of your applications. Once you've started collecting data for your GenAI application, you can access an out-of-the-box dashboard view to help you get started with monitoring key metrics for your application directly from your Azure AI project. Insights are surfaced to you via an Azure Monitor workbook that is linked to your Azure AI project, helping you quickly observe trends for key metrics, such as token consumption, user feedback, and evaluations. You can customize this workbook and add tiles for additional metrics or insights based on your business needs. You can also share it with your team so they can get the latest insights as well.
Build enterprise-ready GenAI apps with Azure AI Foundry
Ready to learn more? Here are other exciting announcements from Microsoft Ignite to support your GenAIOps workflows:
New tracing and debugging capabilities to drive continuous improvement
New ways to evaluate models and applications in pre-production
New ways to document and share evaluation results with business stakeholders
Whether you're joining in person or online, we can't wait to see you at Microsoft Ignite 2024. We'll share the latest from Azure AI and go deeper into best practices for GenAIOps with these breakout sessions:
Multi-agentic GenAIOps from prototype to production with dev tools
Trustworthy AI: Advanced risk evaluation and mitigation
Azure AI and the dev toolchain you need to infuse AI in all your apps
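Returning to the monitoring workbook described above: a custom tile could, for example, chart daily token consumption per model. The sketch below makes the same assumptions as the earlier trace query (OpenTelemetry gen_ai.* attribute names, which may vary by SDK version), so treat the attribute keys as placeholders to adapt to your own telemetry.
dependencies
| where timestamp > ago(30d)
| where isnotempty(customDimensions["gen_ai.system"])
// Total tokens per call, then aggregated per day and per model
| extend model = tostring(customDimensions["gen_ai.request.model"]),
         totalTokens = toint(customDimensions["gen_ai.usage.input_tokens"]) + toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize TokensUsed = sum(totalTokens) by bin(timestamp, 1d), model
| render timechart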
Enhancing Observability with Inspektor Gadget
Thorough observability is essential to a pain-free cloud experience. Azure provides many general-purpose observability tools, but you may want to create custom tooling. Inspektor Gadget is an open-source framework that makes customizable data collection easy. Microsoft recently contributed new features to Inspektor Gadget that further enhance its modular framework, making it even easier to meet your specific systems-inspection needs. Of course, we also made it easy for Azure Kubernetes Service (AKS) users to adopt.
Azure Metric vs Performance counters show different values
The returned network traffic values are completely different, regardless of the time frame, between the portal metrics, the Log Analytics Perf query, and InsightsMetrics. See the screenshot from Excel. I opened the Log Analytics workspace, selected the time range Last 24 hours, and queried one day, 26/03/2024:
Perf
| where TimeGenerated between (datetime(2024-03-26) .. datetime(2024-03-27))
| where Computer == "**********"
| where ObjectName == "Network Interface" and CounterName == "Bytes Sent/sec" or CounterName == "Bytes Received/sec"
| summarize BytsSent = sum(CounterValue) by bin(TimeGenerated, 1d), CounterName

InsightsMetrics
| where TimeGenerated between (datetime(2024-03-26) .. datetime(2024-03-27))
| where Origin == "vm.azm.ms"
| where Computer == "*******"
| where Namespace == "Network"
| where Name == "ReadBytesPerSecond" or Name == "WriteBytesPerSecond"
| extend Tags = parse_json(Tags)
| extend BytestoSec = toreal(Tags.["vm.azm.ms/bytes"])
| sort by TimeGenerated
| project TimeGenerated, Name, Val, BytestoSec
| summarize AggregatedValue = sum(BytestoSec) by bin(TimeGenerated, 1d), Name

I don't know what I'm doing wrong or what I don't understand. The sample interval in the data collection rule is 15s, and the sample interval of the metric is 60s.
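One way to make the two sources easier to compare (a hedged sketch, not a confirmed diagnosis): both "Bytes Sent/sec"/"Bytes Received/sec" and ReadBytesPerSecond/WriteBytesPerSecond are per-second rates, so summing every sample over a day largely reflects how many samples were collected, which differs when one source samples every 15s and the other every 60s. Averaging the rate per bin, and adding parentheses so the ObjectName filter applies to both counter names, usually lines the numbers up better. The computer-name placeholder below is intentional.
Perf
| where TimeGenerated between (datetime(2024-03-26) .. datetime(2024-03-27))
| where Computer == "<computer name>"
| where ObjectName == "Network Interface" and (CounterName == "Bytes Sent/sec" or CounterName == "Bytes Received/sec")
| summarize AvgBytesPerSec = avg(CounterValue) by bin(TimeGenerated, 1d), CounterName

InsightsMetrics
| where TimeGenerated between (datetime(2024-03-26) .. datetime(2024-03-27))
| where Origin == "vm.azm.ms" and Namespace == "Network"
| where Computer == "<computer name>"
| where Name in ("ReadBytesPerSecond", "WriteBytesPerSecond")
// Val holds the per-second rate; the vm.azm.ms/bytes tag may represent a different quantity (e.g., total bytes)
| summarize AvgBytesPerSec = avg(Val) by bin(TimeGenerated, 1d), Name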