Enhancing Azure Private DNS Resiliency with Internet Fallback
Is your Azure environment prone to DNS resolution hiccups, especially when leveraging Private Link and multiple virtual networks? Dive into our latest blog post, "Enhancing Azure Private DNS Resiliency with Internet Fallback," and discover how to eliminate those frustrating NXDOMAIN errors and ensure seamless application availability. I break down the common challenges faced in complex Azure setups, including isolated Private DNS zones and hybrid environments, and reveal how the new internet fallback feature acts as a vital safety net. Learn how this powerful tool automatically switches to public DNS resolution when private resolution fails, minimizing downtime and simplifying management. Our tutorial walks you through the easy steps to enable internet fallback, empowering you to fortify your Azure networks and enhance application resilience. Whether you're dealing with multi-tenant deployments or intricate service dependencies, this feature is your key to uninterrupted connectivity. Don't let DNS resolution issues disrupt your operations. Read the full article to learn how to implement Azure Private DNS internet fallback and ensure your applications stay online, no matter what.
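As a hedged illustration of the steps the tutorial covers, here is a minimal sketch that enables internet fallback on a Private DNS zone virtual network link with the Azure CLI. The resource group, zone, and link names are placeholders, and it assumes a recent Azure CLI version in which the --resolution-policy flag is available on the link commands.

# Minimal sketch: enable internet fallback (NXDOMAIN redirect) on an existing
# Private DNS zone virtual network link. All resource names are placeholders.
az network private-dns link vnet update \
  --resource-group my-rg \
  --zone-name privatelink.blob.core.windows.net \
  --name my-vnet-link \
  --resolution-policy NxDomainRedirect

With NxDomainRedirect set, a failed private resolution falls back to public DNS instead of surfacing an NXDOMAIN error to the application.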
Learn to elevate security and resiliency of Azure and AI projects with skilling plans

In an era where organizations are increasingly adopting a cloud-first approach to support digital transformation and AI-driven innovation, learning skills to enhance cloud resilience and security has become a top priority. By 2025, an estimated 85% of companies will have embraced a cloud-first strategy, according to research by Gartner, marking a significant shift toward reliance on platforms like Microsoft Azure for mission-critical workloads. Yet according to a recent Flexera survey, 78% of respondents found a lack of skilled people and expertise to be one of their top three cloud challenges, along with optimizing costs and boosting security.

To help our customers unlock the full potential of their Azure investments, Microsoft introduced Azure Essentials, a single destination for in-depth skilling, guidance, and support for elevating the reliability, security, and ongoing performance of their cloud and AI investments. In this blog we'll explore this guidance in detail and introduce you to two new free, self-paced skilling Plans on Microsoft Learn to get your team skilled on building resiliency into your Azure and AI environments.

Empower your team: Learn proactive resiliency for critical workloads in Azure

Azure offers a resilient foundation to reliably support workloads in the cloud, and our Well-Architected Framework helps teams design systems that recover from failures with minimal disruption.

Figure 1: Design your critical workloads for resiliency, and assess existing workloads for ongoing performance, compliance, and resiliency.

The new resiliency-focused Microsoft Learn skilling Plan, "Elevate reliability, security, and ongoing performance of Azure and AI projects," shows teams how the Well-Architected Framework, coupled with the Cloud Adoption Framework, provides actionable guidelines to enhance resilience, optimize security measures, and ensure consistently high performance for Azure workloads and AI deployments. The Plan also covers cost optimization through the FinOps Framework, ensuring that security and reliability measures are implemented within budget.

This training also emphasizes Azure AI Foundry, a tool that allows teams to work on AI-driven projects while maintaining security and governance standards, which are critical to reducing vulnerabilities and ensuring long-term stability. The Plan guides learners in securely developing, testing, and deploying AI solutions, empowering them to build resilient applications that can sustain performance and data integrity.

The impact of Azure's resiliency guidance is significant. According to Forrester, following this framework reduces planned downtime by 30%, prevents 15% of revenue loss due to resilience issues, and achieves an 18% ROI through rearchitected workloads. Given that 60% of reliability failures result in losses of at least $100,000, and 15% of failures cost upwards of $1 million, these preventative measures underscore the financial value of resilient architecture.

Ensuring security in Azure AI workloads

AI adds complexity to security considerations in cloud environments. AI applications often require significant data handling, which introduces new vulnerabilities and compliance considerations. Microsoft's guidance focuses on integrating robust security practices directly into AI project workflows, ensuring that organizations adhere to stringent data protection regulations.
Azure's tools, including multi-zone deployment options, network security solutions, and data protection services, empower customers to create resilient and secure workloads. Our new training on proactive resiliency and reliability of critical Azure and AI workloads guides you in building fault-tolerant systems and managing risks in your environments. This Plan teaches users how to assess workloads, identify vulnerabilities, and deploy prioritized resiliency strategies, equipping them to achieve optimal performance even under adverse conditions.

Maximizing business value and ROI through resiliency and security

Companies that prioritize resiliency and security in their cloud strategies enjoy multiple benefits beyond reduced downtime. Forrester's findings suggest that a commitment to resilience has a three-year financial impact, with significant cost savings from avoided outages, higher ROI from optimized workloads, and increased productivity. Organizations can reinvest these savings into further modernization efforts, expanding their capabilities in AI and data analytics. Azure's tools, frameworks, and Microsoft's shared responsibility model give businesses the foundation to build resilient, secure, and high-performing applications that align with their goals.

Microsoft Learn's structured Plans, "Elevate Azure Reliability and Performance" and "Improve resiliency of critical workloads on Azure," provide self-paced modules and essential training to build skills in designing and maintaining reliable and secure cloud projects. As more companies embrace cloud-first strategies, Microsoft's commitment to proactive resiliency, architectural guidance, and cost management tools will empower organizations to realize the full potential of their cloud and AI investments.

Start your journey to a reliable and secure Azure cloud today.

Resources: Visit Microsoft Learn Plans
Resiliency Best Practices You Need For your Blob Storage Data

Maintaining Resiliency in Azure Blob Storage: A Guide to Best Practices

Azure Blob Storage is a cornerstone of modern cloud storage, offering scalable and secure solutions for unstructured data. However, maintaining resiliency in Blob Storage requires careful planning and adherence to best practices. In this blog, I'll share practical strategies to ensure your data remains available, secure, and recoverable under all circumstances.

1. Enable Soft Delete for Accidental Recovery (Most Important)

Mistakes happen, but soft delete can be your safety net. It allows you to recover deleted blobs within a specified retention period (a CLI sketch follows this list):
- Configure a soft delete retention period in Azure Storage.
- Regularly monitor your blob storage to ensure that critical data is not permanently removed by mistake.

Enabling soft delete in Azure Blob Storage does not come with any additional cost for the feature itself. However, it can potentially impact your storage costs because the deleted data is retained for the configured retention period, which means:
- The retained data contributes to the total storage consumption during the retention period.
- You will be charged according to the pricing tier of the data (Hot, Cool, or Archive) for the duration of retention.

2. Utilize Geo-Redundant Storage (GRS)

Geo-redundancy ensures your data is replicated across regions to protect against regional failures:
- Choose RA-GRS (Read-Access Geo-Redundant Storage) for read access to secondary replicas in the event of a primary region outage.
- Assess your workload's RPO (Recovery Point Objective) and RTO (Recovery Time Objective) needs to select the appropriate redundancy.

3. Implement Lifecycle Management Policies

Efficient storage management reduces costs and ensures long-term data availability:
- Set up lifecycle policies to transition data between hot, cool, and archive tiers based on usage.
- Automatically delete expired blobs to save on costs while keeping your storage organized.

4. Secure Your Data with Encryption and Access Controls

Resiliency is incomplete without robust security. Protect your blobs using:
- Encryption at Rest: Azure automatically encrypts data using server-side encryption (SSE). Consider enabling customer-managed keys for additional control.
- Access Policies: Implement Shared Access Signatures (SAS) and Stored Access Policies to restrict access and enforce expiration dates.

5. Monitor and Alert for Anomalies

Stay proactive by leveraging Azure's monitoring capabilities:
- Use Azure Monitor and Log Analytics to track storage performance and usage patterns.
- Set up alerts for unusual activities, such as sudden spikes in access or deletions, to detect potential issues early.

6. Plan for Disaster Recovery

Ensure your data remains accessible even during critical failures:
- Create snapshots of critical blobs for point-in-time recovery.
- Enable backup for blobs and turn on the immutability feature.
- Test your recovery process regularly to ensure it meets your operational requirements.

7. Use Resource Locks

Adding Azure locks to your Blob Storage account provides an additional layer of protection by preventing accidental deletion or modification of critical resources.

8. Educate and Train Your Team

Operational resilience often hinges on user awareness:
- Conduct regular training sessions on Blob Storage best practices.
- Document and share a clear data recovery and management protocol with all stakeholders.
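To make item 1 concrete, here is a minimal CLI sketch that enables blob and container soft delete with a 14-day retention window; the storage account and resource group names are placeholders.

# Enable blob soft delete with a 14-day retention window (names are placeholders).
az storage account blob-service-properties update \
  --account-name mystorageacct \
  --resource-group my-rg \
  --enable-delete-retention true \
  --delete-retention-days 14

# Container soft delete is configured separately and is worth enabling too.
az storage account blob-service-properties update \
  --account-name mystorageacct \
  --resource-group my-rg \
  --enable-container-delete-retention true \
  --container-delete-retention-days 14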
"Critical Tip: Do Not Create New Containers with Deleted Names During Recovery" If a container or blob storage is deleted for any reason and recovery is being attempted, it’s crucial not to create a new container with the same name immediately. Doing so can significantly hinder the recovery process by overwriting backend pointers, which are essential for restoring the deleted data. Always ensure that no new containers are created using the same name during the recovery attempt to maximize the chances of successful restoration. Wrapping It Up Azure Blob Storage offers an exceptional platform for scalable and secure storage, but its resiliency depends on following best practices. By enabling features like soft delete, implementing redundancy, securing data, and proactively monitoring your storage environment, you can ensure that your data is resilient to failures and recoverable in any scenario. Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn Data redundancy - Azure Storage | Microsoft Learn Overview of Azure Blobs backup - Azure Backup | Microsoft Learn Protect your Azure resources with a lock - Azure Resource Manager | Microsoft Learn774Views1like0CommentsAzure Extended Zones: Optimizing Performance, Compliance, and Accessibility
Azure Extended Zones: Optimizing Performance, Compliance, and Accessibility

Azure Extended Zones are small-scale Azure extensions located in specific metros or jurisdictions to support low-latency and data-residency workloads. They enable users to run latency-sensitive applications close to end users while maintaining compliance with data residency requirements, all within the Azure ecosystem.
Simplify IT management with Microsoft Copilot for Azure – save time and get answers fast

Today, we're announcing Microsoft Copilot for Azure, an AI companion that helps you design, operate, optimize, and troubleshoot your cloud infrastructure and services. Combining the power of cutting-edge large language models (LLMs) with the Azure Resource Model, Copilot for Azure enables rich understanding and management of everything that's happening in Azure, from the cloud to the edge.

The cloud management landscape is evolving rapidly: there are more end users, more applications, and more requirements demanding more capabilities from the infrastructure. The number of distinct resources to manage is rapidly increasing, and the nature of each of those resources is becoming more sophisticated. As a result, IT professionals spend more time looking for information and are less productive. That's where Copilot for Azure can help.

Microsoft Copilot for Azure helps you complete complex tasks faster, quickly discover and use new capabilities, and instantly generate deep insights to scale improvements broadly across the team and organization. In the same way GitHub Copilot, an AI companion for development, is helping developers do more in less time, Copilot for Azure will help IT professionals. Recent GitHub data shows that among developers who have used GitHub Copilot, 88 percent say they're more productive, 77 percent say the tool helps them spend less time searching for information, and 74 percent say they can focus their efforts on more satisfying work. 1

Copilot for Azure helps you:
- Design: create and configure the services needed while aligning with organizational policies
- Operate: answer questions, author complex commands, and manage resources
- Troubleshoot: orchestrate across Azure services for insights to summarize issues, identify causes, and suggest solutions
- Optimize: improve costs, scalability, and reliability through recommendations for your environment

Copilot is available in the Azure portal and will be available from the Azure mobile app and CLI in the future. Copilot for Azure is built to reason over, analyze, and interpret Azure Resource Manager (ARM), Azure Resource Graph (ARG), cost and usage data, documentation, support, best practice guidance, and more. Copilot accesses the same data and interfaces as Azure's management tools, conforming to the policy, governance, and role-based access control configured in your environment, all of it carried out within the framework of Azure's steadfast commitment to safeguarding customer data security and privacy. Azure teams are continuously enhancing Copilot's understanding of each service and capability, and every day that understanding will help Copilot become even more helpful.

Read on to explore some of the additional scenarios being used in Microsoft Copilot for Azure today.

Learning Azure and providing recommendations

Modern clouds offer a breadth of services and capabilities, and Copilot helps you learn about every service. It can also provide tailored recommendations for the services your workloads need. Insights are delivered directly to you in the management console, accompanied by links for further reading. Copilot is up to date with the latest Azure documentation, ensuring you're getting the most current and relevant answers to your questions. Copilot navigates to the precise location in the portal needed to perform tasks. This feature speeds up the process from question to action.
Copilot can also answer questions in the context of the resources you're managing, enabling you to ask about sizing, resiliency, or the tradeoffs between possible solutions.

Understanding cloud environments

The number and types of resources deployed in cloud environments are increasing. Copilot helps answer questions about a cloud environment faster and more easily. In addition to answering questions about your environment, Copilot also facilitates the construction of Kusto Query Language (KQL) queries for use within Azure Resource Graph (see the sample query after this section). Whatever your experience level, Copilot accelerates the generation of insights about your Azure resources and their deployment environments. While baseline familiarity with the Kusto Query Language can be beneficial, Copilot is designed to assist users ranging from novices to experts in achieving their Azure Resource Graph objectives, anywhere in the Azure portal. You can easily open generated queries with the Azure Resource Graph Explorer, enabling you to review them and ensure they accurately reflect the intended questions.

Optimizing cost and performance

It's critical for teams to get insights into spending, recommendations on how to optimize, and predictive scenarios or "what-if" analyses. Copilot aids in understanding invoices, spending patterns, changes, and outliers, and it recommends cost optimizations. Copilot can help you better analyze, estimate, and optimize your cloud costs. For example, if you prompt Copilot with questions like "Why did my cost spike on July 8?" or "Show me the subscriptions that cost the most," you'll get an immediate response based on your usage, billing, and cost data.

Copilot is integrated with the advanced AI algorithms in Application Insights Code Optimizations to detect CPU and memory usage performance issues at a code level and provide recommendations on how to fix them. In addition, Copilot helps you discover and triage available code recommendations for your .NET applications.

Metrics-based insights

Each resource in Azure offers a rich set of metrics available with Azure Monitor. Copilot can help you discover the available metrics for a resource, visualize and summarize the results, enable deeper exploration, and even perform anomaly detection to analyze unexpected changes and provide recommendations to address the issue. Copilot can also access data in Azure Monitor managed service for Prometheus, enabling the creation of PromQL queries.

CLI scripting

Azure CLI can manage all Azure resources from the command line and in scripts. With more than 9,000 commands and associated parameters, Copilot helps you easily and quickly identify the command and its parameters to carry out your specified operation. If a task requires multiple commands, Copilot generates a script aligned with Azure and scripting best practices. These scripts can be directly executed in the portal using Cloud Shell or copied for use in repeatable operations or automation.

Support and troubleshooting

When issues arise, quickly accessing the necessary information and assistance for resolution is critical. Copilot provides troubleshooting insight generated from Azure documentation and built-in, service-specific troubleshooting tools. Copilot will quickly provide step-by-step guidance for troubleshooting, while providing links to the right documentation. If more help is needed, Copilot will direct you to assisted support if requested.
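To illustrate the kind of Azure Resource Graph query referenced above, here is a small sketch you could run yourself from the Azure CLI. It assumes the resource-graph extension is installed, and the query is only an example of what Copilot might generate, not Copilot's actual output.

# Requires the Resource Graph extension: az extension add --name resource-graph
# Count virtual machines per region, largest first.
az graph query -q "Resources
| where type =~ 'microsoft.compute/virtualmachines'
| summarize vmCount = count() by location
| order by vmCount desc"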
Copilot is also aware of service-specific diagnostics and troubleshooting tools to help you choose the right tool to assist you, whether the issue relates to high CPU usage, networking, getting a memory dump, scaling resources to support increased demand, or more.

Hybrid management

IT estates are complex, with many workloads running across datacenters, operational edge environments like factories, and multicloud. Azure Arc creates a bridge between the Azure controls and tools and those workloads. Copilot can also be used to design, operate, optimize, and troubleshoot Azure Arc-enabled workloads. Azure Arc facilitates the transfer of valuable telemetry and observability data flows back to Azure. This lets you swiftly address outages, reinstate services, and resolve root causes to prevent recurrences.

Building responsibly

Copilot for Azure is designed for the needs of the enterprise. Our efforts are guided by our AI principles and Responsible AI Standard and build on decades of research on grounding and privacy-preserving machine learning. Microsoft's work on AI is reviewed for potential harms and mitigations by a multidisciplinary team of researchers, engineers, and policy experts. All of the features in Copilot are carried out within the Azure framework of safeguarding our customers' data security and privacy.

Copilot automatically inherits your organization's security, compliance, and privacy policies for Azure. Data is managed in line with our current commitments. Copilot large language models are not trained on your tenant data. Copilot can only access data from Azure and perform actions against Azure when the current user has permission via role-based access control. All requests to Azure Resource Manager and other APIs are made on behalf of the user. Copilot does not have its own identity from a security perspective. When a user asks, "How many VMs do I have?", the answer will be the same as if they went to Resource Graph Explorer and wrote and executed that query on their own.

What's next

Microsoft Copilot for Azure is already being used internally by Microsoft employees and with a small group of customers. Today, we're excited about the next step as we announce and launch the preview to you! Please click here to sign up. We'll onboard customers into the preview on a weekly basis. In the coming weeks, we'll continuously add new capabilities and make improvements based on your feedback.

Learn more

Azure Ignite 2023 Infrastructure Blog
Adaptive Cloud
Microsoft Copilot for Azure Documentation

1 Research: quantifying GitHub Copilot's impact on developer productivity and happiness, Eirini Kalliamvakou, GitHub. Sept. 7, 2022.
Web application routing, Open service mesh and AKS

AKS Web Application Routing with Open Service Mesh

The AKS product team announced a public preview of Web Application Routing this year. One of the benefits of using this add-on is the simplicity of adding an entry point for applications to your cluster with a managed ingress controller. This add-on works nicely with Open Service Mesh (OSM). In this blog, we investigate how this works, how to set up mTLS from the ingress controller to OSM, and how the two integrate. While we are using the AKS managed add-on for ingress, we are taking the open-source OSM approach for this explanation, but it's important to remember that AKS also has an add-on for OSM.

Web Application Routing add-on on Azure Kubernetes Service (AKS) (Preview) - Azure Kubernetes Service | Microsoft Learn

The reference link above focuses on the step-by-step process to implement Web Application Routing along with a few other add-ons, such as OSM and the Azure Key Vault secrets provider. The intention of this blog is not to repeat the same instructions but to dig into a few important aspects of OSM to illustrate connectivity from this managed ingress add-on to OSM. Enterprises prefer to leverage managed services and add-ons, but at the same time there is a vested interest in understanding the foundational building blocks of the open-source technologies used and how they are glued together to implement certain functionality. This blog attempts to provide some insight into how these two (OSM and Web Application Routing) work together, without drilling too much into OSM itself, since it is documented well at openservicemesh.io.

First step is creating a new cluster:

az aks create -g webapprg -n webappaks -l centralus --enable-addons web_application_routing --generate-ssh-keys

This creates a cluster along with the ingress controller installed. You can check this in the ingressProfile of the cluster. The ingress controller is deployed in a namespace called app-routing-system. The image is pulled from the MCR registry (not other public registries). Since this creates an ingress controller, it creates a public IP attached to Azure Load Balancer and used for ingress. You might want to change the 'Inbound security rules' in the NSG for the agent pool to allow only your own IP address (instead of the default Internet) for protection.

This managed add-on creates an ingress controller with ingress class 'webapprouting.kubernetes.azure.com', so any Ingress definition should use this ingress class. You can see that the Nginx deployment is running with an HPA config. Please understand that this is a reverse proxy: it sits in the data path and consumes resources such as CPU, memory, and lots of network I/O, so it makes perfect sense to set HPA. In other words, this is the place where all traffic enters the cluster and traverses through to application pods. Some refer to this as north-south traffic into the cluster. It's important to emphasize that there were several instances in my experience where customers used OSS Nginx, didn't set the right config for this deployment, and ran into unpredictable failures while moving into production. Obviously, this wouldn't show up in functional testing! So, use this managed add-on, where AKS manages it for you and maintains it with more appropriate config. You don't need to, and shouldn't, change anything in the app-routing-system namespace. As stated above, we are taking an under-the-hood approach to understand the implementation, not to change anything here.

In this diagram, the app container is a small circle and the sidecar (Envoy) is a larger circle.
We use a larger circle for the sidecar to have more space to show relevant text, so there is no significance to the sizing of the circle/ellipse! The top left side of the diagram is a copy of a diagram from the openservicemesh.io site to explain the relationship between different components in OSM. One thing to note here is that there is a single service certificate for all Kubernetes pods belonging to a particular service, whereas there is a proxy certificate for each pod. You will understand this much better later in this blog.

At this time, we have deployed a cluster with the managed ingress controller (indicated by A in the diagram). It's time to deploy the service mesh. Again, I'm reiterating that we are taking the open-source OSM installation approach to walk you through this illustration, but OSM is also available as a supported AKS add-on. Let's hydrate this cluster with OSM. OSM installation requires the osm CLI binaries installed on your laptop (Windows, Linux, or Mac). Link below.

Setup OSM | Open Service Mesh

Assuming that your context is still pointing to this newly deployed cluster, run the following command:

osm install --mesh-name osm --osm-namespace osm-system --set=osm.enablePermissiveTrafficPolicy=true

This completes the installation of OSM (ref: B in diagram) with permissive traffic policy, which means there are no traffic restrictions between services in the cluster. Here is a snapshot of the namespaces, along with the list of objects in the osm-system namespace. It's important to ensure that all deployed services are operational. In some cases, if a cluster is deployed with nodes with limited CPU/memory, this could cause deployment issues. Otherwise, there shouldn't be any other issues.

At this time, we've successfully deployed the ingress controller (ref: A) and the service mesh (ref: B). However, there are no namespaces in the service mesh. In the diagram above, imagine the dotted red rectangle with nothing inside it. Let's create new namespaces in the cluster and add them to OSM. One thing to notice from the osm namespace list output is the status of sidecar injection. Sidecar injection uses a Kubernetes mutating admission webhook to inject the 'envoy' sidecar into the pod definition before it is written to etcd. It also injects an init container into the pod definition, which we will review later.

Also create sample2 and add it to OSM. Commands below:

k create ns sample2
osm namespace add sample2

Deploy the sample1 application (deploy-sample1.yaml) with 3 replicas; a hedged sketch of what this manifest might look like follows at the end of this section. This uses the 'default' service account and creates a service with a ClusterIP. This is a simple hello-world deployment as found in Azure documentation. If you want to test, you can clone the code from git@github.com:srinman/webapproutingwithosm.git

Let's inspect the service account for Nginx (our Web Application Routing add-on in the app-routing-system namespace). As you can see, in app-routing-system, Nginx is using the nginx service account and, in the sample1 namespace, there is only one service account, which is the 'default' service account.

k get deploy -n app-routing-system -o yaml | grep -i serviceaccountname

This confirms that Nginx is indeed using the nginx service account and not the default one in app-routing-system. Let's also inspect the secrets in the osm-system and app-routing-system namespaces. Note that there is no Kubernetes TLS secret for talking to OSM yet. At this point, you have an ingress controller installed, OSM installed, sample1 and sample2 added to OSM, and an app deployed in the sample1 namespace, but there is no configuration defined yet for routing traffic from the ingress controller to the application.
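For reference, here is a minimal sketch of what deploy-sample1.yaml could look like. The actual file lives in the repo above and may differ; the image name is the hello-world image commonly used in Azure documentation.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aks-helloworld
  namespace: sample1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: aks-helloworld
  template:
    metadata:
      labels:
        app: aks-helloworld
    spec:
      # No serviceAccountName is set, so pods run as the namespace's
      # 'default' service account - this becomes the OSM identity.
      containers:
      - name: aks-helloworld
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: aks-helloworld-svc
  namespace: sample1
spec:
  type: ClusterIP
  selector:
    app: aks-helloworld
  ports:
  - port: 80
    targetPort: 80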
In the diagram, you can imagine that there is no connection #2 yet from ingress to the workload in the mesh.

User configuration in Ingress

We need to configure app-routing-system, our managed add-on, to listen for inbound (north-south) traffic and to know where to proxy connections to. This is done with an Ingress object in Kubernetes. Please notice some special annotations in the Ingress definition; these annotations are needed for proxying connections to an application that is part of OSM (a hedged sketch of this Ingress follows at the end of this section).

k apply -f ingress-sample1.yaml

Once this is defined, you can view nginx.conf updated with this ingress definition:

k exec nginx-6c6486b7b9-kg9j4 -n app-routing-system -it -- sh
cat nginx.conf

We've verified the configuration for Web Application Routing to listen and proxy traffic to the aks-helloworld-svc service in namespace sample1. In the diagram, configuration #A is complete for our traffic to the sample1 namespace. If the configuration were a simple Ingress definition without any special annotations, and if the target workload were not added to an OSM namespace, we would be able to route north-south traffic into our workload by this time, but that's not the case with our definition. We need to configure OSM to accept connections from our managed ingress controller.

User configuration in OSM

Let's review the OSM mesh configuration. You can notice that spec.certificate doesn't have an ingressGateway section yet.

kubectl edit meshconfig osm-mesh-config -n osm-system

Add the ingressGateway section as defined below:

certificate:
  ingressGateway:
    secret:
      name: nginx-client-cert-for-talking-to-osm
      namespace: osm-system
    subjectAltNames:
    - nginx.app-routing-system.cluster.local
    validityDuration: 24h
  certKeyBitSize: 2048
  serviceCertValidityDuration: 24h

Now, you can notice a new secret in osm-system. OSM issues and injects this certificate into the osm-system namespace. Nginx is ready to use this certificate to initiate connections to OSM.

Before we go further into this blog, let's understand a few important concepts in OSM. The Open Service Mesh data plane uses the Envoy proxy (https://www.envoyproxy.io/). This Envoy proxy is programmed (in other words, configured) by the OSM control plane. After adding the sample1 and sample2 namespaces and deploying sample1, you could have noticed two containers running in that pod: one is our hello-world app, and the other is injected by the OSM control plane via the mutating webhook. OSM also injects an init container, which changes IP tables to redirect traffic.

Now that Envoy is injected, it needs to be equipped with certificates for communicating with its mothership (the OSM control plane) and for communicating with other meshed pods. To address this, OSM injects two certificates. One is called the 'proxy certificate', used by Envoy to initiate connections to the OSM control plane (refer to B in the diagram), and the other is called the 'service certificate', used for pod-to-pod traffic (for meshed pods, i.e., pods in namespaces that are added to OSM). The service certificate uses the following for its CN:

<ServiceAccount>.<Namespace>.<trustdomain>

This service certificate is shared across pods that are part of the same service, hence the name service certificate. This certificate is used by Envoy when initiating pod-to-pod traffic with mTLS. As an astute reader, you might have noticed some specifics in our Ingress annotations. The proxy_ssl_name annotation defines who the target is. Here our target service identity is default.sample1.cluster.local: default is the 'default service account', and sample1 is the namespace. Remember, in OSM, it's all based on identities.
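Here is a minimal sketch of what such an Ingress could look like. It is a hypothetical reconstruction of ingress-sample1.yaml from the companion repo, using the standard ingress-nginx proxy-ssl annotations; the host name is the example one used later in this post.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: aks-helloworld
  namespace: sample1
  annotations:
    # Re-encrypt traffic from nginx to the backend (the Envoy sidecar).
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    # Verify the backend certificate and expect the OSM service identity.
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
    nginx.ingress.kubernetes.io/proxy-ssl-name: "default.sample1.cluster.local"
    # Client certificate OSM issued for nginx (namespace/secretName).
    nginx.ingress.kubernetes.io/proxy-ssl-secret: "osm-system/nginx-client-cert-for-talking-to-osm"
spec:
  ingressClassName: webapprouting.kubernetes.azure.com
  rules:
  - host: mysite.srinman.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: aks-helloworld-svc
            port:
              number: 80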
Get the pod name, replace -change-here with the pod name, and run the following command to check this:

osm proxy get config_dump aks-helloworld-change-here -n sample1 | jq -r '.configs[] | select(."@type"=="type.googleapis.com/envoy.admin.v3.SecretsConfigDump") | .dynamic_active_secrets[] | select(.name == "service-cert:sample1/default").secret.tls_certificate.certificate_chain.inline_bytes' | base64 -d | openssl x509 -noout -text

You can see CN = default.sample1.cluster.local in the cert. We are also instructing Nginx to use the secret from the osm-system namespace called nginx-client-cert-for-talking-to-osm. Nginx is configured to proxy the connection to default.sample1.cluster.local with the TLS secret nginx-client-cert-for-talking-to-osm. If you inspect this TLS secret (use the instructions below if needed), you can see CN = nginx.app-routing-system.cluster.local.

Extract cert info: use k get secret, take the tls.crt data and base64-decode it, then run openssl x509 -in file_that_contains_base64_decoded_tls.crt_data -noout -text

At this time, we have wired up everything from the client to the ingress controller listening for connections, and Nginx is set to talk to OSM. However, the Envoy proxy (OSM data plane) is still not configured to accept TLS connections from Nginx. Any curl to mysite.srinman.com will result in an error response:

HTTP/1.1 502 Bad Gateway

Please understand that we can route traffic all the way from the client to the Envoy running alongside our application container, but since traffic is forced to enter Envoy by our init container setup, Envoy checks and blocks this traffic. With our configuration osm.enablePermissiveTrafficPolicy=true, Envoy is programmed by OSM to allow traffic between namespaces in the mesh but not to let outside traffic enter. In other words, all east-west traffic is allowed within the mesh, and these communications automatically establish mTLS between services. Let's configure OSM to accept this traffic.

This configuration is addressed by IngressBackend. The definition (see the sketch at the end of this section) tells OSM to configure the Envoy proxies used for the backend service aks-helloworld-svc to accept TLS connections only from the listed sources. More information about IngressBackend:

https://release-v1-2.docs.openservicemesh.io/docs/demos/ingress_k8s_nginx/#https-ingress-mtls-and-tls

There are instructions in the link above for adding the nginx namespace to OSM. More specifically, the following command is not necessary, since we've already configured Nginx via the Ingress definition to use proxy_ssl_name and a proxy SSL TLS cert for connecting to the application pod's Envoy, i.e., OSM (#2 in the diagram; the picture shows a connection from only one Nginx pod, but you can assume that this could be from any Nginx pod). OSM doesn't need to monitor this namespace for our walkthrough. However, at the end of this blog, there is additional information on how OSM is configured and how IngressBackend should be defined when using the managed OSM and Web Application Routing add-ons together.

osm namespace add "$nginx_ingress_namespace" --mesh-name "$osm_mesh_name" --disable-sidecar-injection

Earlier, we verified that Nginx uses a TLS cert with CN = nginx.app-routing-system.cluster.local. IngressBackend requires that the source be an AuthenticatedPrincipal with the name nginx.app-routing-system.cluster.local; all others are rejected. Once this is defined, you should be able to see a successful connection to the app! Basically, the client connection is terminated at the ingress controller (Nginx) and proxied/resent (#2 in the diagram) from Nginx to application pods in the sample1 namespace.
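Here is a minimal sketch of such an IngressBackend, using the names from this walkthrough and assuming the service listens on port 80; the repo's actual file may differ.

apiVersion: policy.openservicemesh.io/v1alpha1
kind: IngressBackend
metadata:
  name: aks-helloworld-backend
  namespace: sample1
spec:
  backends:
  - name: aks-helloworld-svc
    port:
      number: 80
      protocol: https   # Envoy terminates the TLS connection from nginx
  sources:
  - kind: AuthenticatedPrincipal
    name: nginx.app-routing-system.cluster.local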
Envoy intercepts this connection and sends it to the actual application, which is still listening on plain port 80, but our Web Application Routing, along with Open Service Mesh, took care of accomplishing encryption in transit between the ingress controller and the application pod, essentially removing the need for application teams to manage and own this very critical security functionality. It's important to remember that we were able to accomplish this mTLS with very few steps, all managed by AKS (provided you use the add-ons for OSM and Web Application Routing). Once the traffic lands in the meshed data plane, Open Service Mesh provides lots of flexibility and configuration options to manage this (east-west) traffic within the cluster across OSM-ed namespaces.

Let's try to break this again to understand more! In our IngressBackend, let's make a small change to the name of the authenticated principal. Change it to something other than nginx. Sample below:

- kind: AuthenticatedPrincipal
  name: nginxdummy.app-routing-system.cluster.local

Apply this configuration and attempt to connect to our service:

* Trying 20.241.185.56:80...
* TCP_NODELAY set
* Connected to 20.241.185.56 (20.241.185.56) port 80 (#0)
> GET / HTTP/1.1
> Host: mysite.srinman.com
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 403 Forbidden
< Date: Fri, 25 Nov 2022 13:09:16 GMT
< Content-Type: text/plain
< Content-Length: 19
< Connection: keep-alive
<
* Connection #0 to host 20.241.185.56 left intact
RBAC: access denied

This means that we've told OSM to accept connections only from the identity nginxdummy in the app-routing-system namespace, but that's not the identity Nginx presents in our example. Envoy stops the connection in the application pod before it reaches the application container itself. Let's try to make it work, not by reverting the change, but by changing a different setting in IngressBackend:

skipClientCertValidation: true

It should work fine now, since we are configuring OSM to skip client certificate validation/verification. From a security viewpoint, if you think about this, traffic from a different app or ingress controller could now reach this application pod: basically unprotected. Let's change this back to false and also fix the nginx principal name. Apply the config and check that you can access the service.

Thus far, we've deployed an application in one namespace and configured the ingress controller to send traffic into our mesh. What would the process be for another app in a different namespace using our managed ingress controller? Let's create another workload to understand how to define the Ingress and the importance of the service account. Sample code is in deploy-sample2.yaml.

In this deployment, you can see that we are using serviceAccountName: sample2-sa, not the default service account. (Namespace and service account creation are not shown; it's implicit that you understand!) You can see how the Ingress definition is slightly different from the one above (for sample1): proxy_ssl_name is set to the sample2-sa identity in the sample2 namespace. However, it uses the same TLS secret that sample1 used, which is the TLS cert with CN = nginx.app-routing-system.cluster.local. The IngressBackend definition looks similar: it has the same sources definition with different backends.

We have established TLS between Nginx and the application pod (#2 in the diagram). However, traffic from the client to ingress is still plain HTTP (#1 in the diagram).
Enabling TLS for this is straightforward, and there are a few ways to do it, including an Azure-managed way, but we will build our own here. Let's create a certificate with CN=mysite.srinman.com:

openssl req -new -x509 -nodes -out aks-ingress-tls.crt -keyout aks-ingress-tls.key -subj "/CN=mysite.srinman.com" -addext "subjectAltName=DNS:mysite.srinman.com"

Use the command below to upload this cert as a Kubernetes secret in the sample1 namespace:

k create secret tls mysite-tls --key aks-ingress-tls.key --cert aks-ingress-tls.crt -n sample1

Sample code is in ingress-sample1-withtls.yaml (a hedged sketch follows at the end of this post, before the closing summary). This should enforce that all calls from the client use HTTPS.

Traffic flow:
1. Traffic enters the ingress managed LB.
2. TLS traffic is terminated at the ingress controller pods.
3. The ingress controller pods initiate a proxy connection to the backend service (specifically, to one of the pods that is part of that service, and even more specifically to the pod's Envoy proxy container; also remember the injected init container takes care of setting up IP tables to route requests to Envoy).
4. App pod: the Envoy container terminates the TLS traffic and initiates a connection to localhost on the app port (remember, the app container shares the same pod, and thus the same network namespace).
5. App pod: the app container listening on the port responds to the request.

As traffic enters the cluster, as seen above and in the diagram, it can be inspected in at least three different logs: Nginx, Envoy, and the app itself.

Nginx log (you might want to check both pods if you are not able to locate the call in one; there should be two):

nn.nn.nnn.nnn - - [20/Nov/2022:17:29:11 +0000] "GET / HTTP/2.0" 502 150 "-" "curl/7.68.0" 33 0.006 [sample1-aks-helloworld-svc-80] [] 10.244.1.13:80, 10.244.1.13:80, 10.244.1.13:80 0, 0, 0 0.000, 0.000, 0.004 502, 502, 502 3f9a310a3ebb314342b590dde11

Envoy log. Just to keep it simple, reduce replicas to 1 to probe the Envoy sidecar. Replace the pod name with yours in the command below:

k logs aks-helloworld-65ddbc869b-t8hwq -c envoy -n sample1

Paste the output into a JSON formatter, and you can see the traffic flowing through the proxy into the application pod.

App log. Let's look at the app container itself:

k logs aks-helloworld-65ddbc869b-bt9w8 -c aks-helloworld -n sample1

[pid: 13|app: 0|req: 1/1] 127.0.0.1 () {48 vars in 616 bytes} [Sun Nov 20 16:53:54 2022] GET / => generated 629 bytes in 12 msecs (HTTP/1.1 200) 2 headers in 80 bytes (1 switches on core 0)
127.0.0.1 - - [20/Nov/2022:16:53:54 +0000] "GET / HTTP/1.1" 200 629 "-" "curl/7.68.0" "nn.nn.nnn.nnn"

You can notice that the request comes from localhost. This is because the Envoy container sends the traffic from the same host (actually pod; remember, a pod acts like a host in the Kubernetes world: "A Pod models an application-specific 'logical host'" - reference link).

Lastly, when you opt in for the OSM add-on along with the Web Application Routing add-on, certain things are already taken care of; for example, the TLS secret osm-ingress-client-cert is generated and written to the kube-system namespace. It also automatically adds the app-routing-system namespace to OSM with sidecar injection disabled. This means that in the IngressBackend definition, kind: Service can be added for verifying source IPs in addition to identity (AuthenticatedPrincipal) when allowing traffic. This of course adds more protection. Check the file ingressbackend-for-osm-and-webapprouting.yaml in the repo.
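As promised above, here is a minimal, hedged sketch of what ingress-sample1-withtls.yaml could look like; only the tls section differs from the earlier ingress sketch, and the actual repo file may differ.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: aks-helloworld
  namespace: sample1
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
    nginx.ingress.kubernetes.io/proxy-ssl-name: "default.sample1.cluster.local"
    nginx.ingress.kubernetes.io/proxy-ssl-secret: "osm-system/nginx-client-cert-for-talking-to-osm"
spec:
  ingressClassName: webapprouting.kubernetes.azure.com
  tls:
  - hosts:
    - mysite.srinman.com
    secretName: mysite-tls   # the secret created with 'k create secret tls' above
  rules:
  - host: mysite.srinman.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: aks-helloworld-svc
            port:
              number: 80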
I hope that these manual steps helped to provide a bit more insight into the role of Web Application Routing and how it works nicely with Open Service Mesh. We also reviewed a few foundational components, such as Nginx, IngressBackend, Envoy, and OSM. Please check srinman/webapproutingwithosm (github.com) for sample code.