failover
DHCP Failover Issue – Standby Server Responding When It Should Not
Hi everyone, I'm encountering an issue with my DHCP failover setup in Hot Standby mode, and I need insights into why the standby server is providing DHCP leases when it shouldn't.

Setup Overview:
- I manage a network with over 100 sites worldwide, each with a dedicated local DHCP server running on the site's server VLAN.
- Clients reside on different VLANs, and IP helpers (DHCP relay) are configured on a Checkpoint firewall at each site.
- The IP helper forwards DHCP requests to:
  - The local DHCP server (primary) in the site's server VLAN.
  - The standby DHCP server (failover), located at an on-premises data center (DC).
- DHCP servers are configured in Hot Standby mode using Microsoft DHCP Failover.

Issue:
Despite the Hot Standby configuration, my Cisco Meraki dashboard frequently reports a new DHCP server detected, referring to the standby DHCP server, even though the primary DHCP server at the local site is available. Cisco Meraki triggers this alert when it detects DHCPACK packets from the standby DHCP server traversing the local networks. However, in Hot Standby mode, the failover server should only issue leases if the primary server is unreachable.

Example: Site-1's primary DHCP server (DHCP-1) has a failover partnership with Failover-1 at the DC. Site-1's connectivity to the DC is stable, yet Cisco Meraki occasionally detects DHCPACK packets from Failover-1, triggering alerts.

Troubleshooting Done So Far:
- Verified that failover mode is correctly set to Hot Standby (not Load Balance).
- Confirmed that the primary DHCP server is healthy and responding.
- Checked DHCP logs on both servers but found no clear failover events.
- Performed packet captures of DHCP traffic, but the results were inconclusive.
- Investigated whether the Checkpoint firewall's IP helper can prioritize the primary DHCP server, but it appears not to support this functionality.
- Created a PowerShell script to check for failover-related event logs (Event IDs 20254 and 20255; see the sketch after this post). This provided better visibility but did not correlate with the Meraki alerts.

Questions:
- Are there any known scenarios where a standby DHCP server in Hot Standby mode might mistakenly issue leases, even when the primary is active?
- Is there any detailed information on the failover "heartbeat" mechanism between primary and standby servers? I found that it uses TCP port 647, but I couldn't locate official documentation on the interval and failure conditions.
- Could failover state synchronization delays cause this behavior?
- Are there specific logs or PowerShell commands I should check to confirm why the standby server is responding?
- Is there a way to prevent the standby server from responding unless the primary is truly unreachable (e.g., registry settings, advanced configuration)?

Any guidance or troubleshooting steps would be greatly appreciated! Thanks in advance.
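A compact way to combine the failover-state check and the event-log check from the post above is sketched below in PowerShell. It assumes the DhcpServer RSAT module is available; the 'DhcpAdminEvents' channel name and the 'DHCP-1' server name are assumptions to verify in your environment:

```
# Requires the DhcpServer RSAT module on a management host or the DHCP server itself.
Import-Module DhcpServer

# Show the failover relationship and its current state
# (Normal, CommunicationInterrupted, PartnerDown, ...).
Get-DhcpServerv4Failover -ComputerName 'DHCP-1' |
    Select-Object Name, Mode, State, PartnerServer, MaxClientLeadTime

# Pull the failover-related events (20254/20255 as mentioned above) from the last 7 days.
# NOTE: 'DhcpAdminEvents' is an assumed channel name; verify it under
#       Event Viewer > Applications and Services Logs on your server.
Get-WinEvent -ComputerName 'DHCP-1' -FilterHashtable @{
    LogName   = 'DhcpAdminEvents'
    Id        = 20254, 20255
    StartTime = (Get-Date).AddDays(-7)
} | Format-Table TimeCreated, Id, Message -AutoSize
```

Correlating the timestamps of any state changes this returns with the Meraki alert times would show whether the standby is genuinely transitioning out of Normal state when it ACKs.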
Implementing Disaster Recovery for Azure App Service Web Applications

Starting March 31, 2025, Microsoft will no longer automatically place Azure App Service web applications in disaster recovery mode in the event of a regional disaster. This change emphasizes the importance of implementing robust disaster recovery (DR) strategies to ensure the continuity and resilience of your web applications. Here’s what you need to know and how you can prepare.

Understanding the Change
Azure App Service has been a reliable platform for hosting web applications, REST APIs, and mobile backends, offering features like load balancing, autoscaling, and automated management. However, beginning March 31, 2025, in the event of a regional disaster, Azure will not automatically place your web applications in disaster recovery mode. This means that you, as a developer or IT professional, need to proactively implement disaster recovery techniques to safeguard your applications and data.

Why This Matters
Disasters, whether natural or technical, can strike without warning, potentially causing significant downtime and data loss. By taking control of your disaster recovery strategy, you can minimize the impact of such events on your business operations. Implementing a robust DR plan ensures that your applications remain available and your data remains intact, even in the face of regional outages.

Common Disaster Recovery Techniques
To prepare for this change, consider the following commonly used disaster recovery techniques (a routing sketch follows this list):

- Multi-Region Deployment: Deploy your web applications across multiple Azure regions. This approach ensures that if one region goes down, your application can continue to run in another region. You can use Azure Traffic Manager or Azure Front Door to route traffic to the healthy region.
  - Multi-region load balancing with Traffic Manager and Application Gateway
  - Highly available multi-region web app
- Regular Backups: Implement regular backups of your application data and configurations. Azure App Service provides built-in backup and restore capabilities that you can schedule to run automatically.
  - Back up an app in App Service
  - How to automatically backup App Service & Function App configurations
- Active-Active or Active-Passive Configuration: In an active-active setup, both regions handle traffic simultaneously, providing high availability. In an active-passive setup, the secondary region remains on standby and takes over only if the primary region fails.
  - About active-active VPN gateways
  - Design highly available gateway connectivity
- Automated Failover: Use automated failover mechanisms to switch traffic to a secondary region seamlessly. This can be achieved using Azure Site Recovery or custom scripts that detect failures and initiate failover processes.
  - Add Azure Automation runbooks to Site Recovery recovery plans
  - Create and customize recovery plans in Azure Site Recovery
- Monitoring and Alerts: Implement comprehensive monitoring and alerting to detect issues early and respond promptly. Azure Monitor and Application Insights can help you track the health and performance of your applications.
  - Overview of Azure Monitor alerts
  - Application Insights OpenTelemetry overview
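To make the multi-region technique above concrete, here is a minimal sketch with Az PowerShell using priority (failover) routing through Traffic Manager. Every resource name here is a placeholder, and both web apps are presumed to already exist:

```
# Look up the two existing App Service apps (assumed names).
$primaryWebApp   = Get-AzWebApp -ResourceGroupName 'myapp-rg' -Name 'myapp-eastus'
$secondaryWebApp = Get-AzWebApp -ResourceGroupName 'myapp-rg' -Name 'myapp-westus'

# Priority routing: traffic goes to the priority-1 endpoint while its health
# probe passes, and fails over to priority 2 when it does not.
$tmProfile = New-AzTrafficManagerProfile -Name 'myapp-tm' -ResourceGroupName 'myapp-rg' `
    -TrafficRoutingMethod 'Priority' -RelativeDnsName 'myapp-dr-demo' -Ttl 30 `
    -MonitorProtocol 'HTTPS' -MonitorPort 443 -MonitorPath '/'

Add-AzTrafficManagerEndpointConfig -EndpointName 'primary' -TrafficManagerProfile $tmProfile `
    -Type 'AzureEndpoints' -TargetResourceId $primaryWebApp.Id -EndpointStatus 'Enabled' -Priority 1
Add-AzTrafficManagerEndpointConfig -EndpointName 'secondary' -TrafficManagerProfile $tmProfile `
    -Type 'AzureEndpoints' -TargetResourceId $secondaryWebApp.Id -EndpointStatus 'Enabled' -Priority 2

# Persist the endpoint configuration.
Set-AzTrafficManagerProfile -TrafficManagerProfile $tmProfile
```

For HTTP workloads, Azure Front Door offers the same failover behavior with richer layer-7 features; Traffic Manager is shown here because it makes the simpler sketch.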
Steps to Implement a Disaster Recovery Plan
1. Assess Your Current Setup: Identify all the resources your application depends on, including databases, storage accounts, and networking components.
2. Choose a DR Strategy: Based on your business requirements, choose a suitable disaster recovery strategy (e.g., multi-region deployment, active-active configuration).
3. Configure Backups: Set up regular backups for your application data and configurations (see the sketch at the end of this post).
4. Test Your DR Plan: Regularly test your disaster recovery plan to ensure it works as expected. Simulate failover scenarios to validate that your applications can recover quickly.
5. Document and Train: Document your disaster recovery procedures and train your team to execute them effectively.

Conclusion
While the upcoming change in Azure App Service’s disaster recovery policy may seem daunting, it also presents an opportunity to enhance the resilience of your web applications. By implementing robust disaster recovery techniques, you can ensure that your applications remain available and your data remains secure, no matter what challenges come your way. Start planning today to stay ahead of the curve and keep your applications running smoothly.

Further reading:
- Recover from region-wide failure - Azure App Service
- Reliability in Azure App Service
- Multi-Region App Service App Approaches for Disaster Recovery

Feel free to share your thoughts or ask questions in the comments below. Let's build a resilient future together! 🚀
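Step 3 above referenced a sketch; here is a minimal Az PowerShell version for scheduling daily App Service backups. The resource names and SAS URL are placeholders, and the app is assumed to be on a pricing tier that supports scheduled backups:

```
# SAS URL granting write access to a blob container for the backups (placeholder).
$sasUrl = '<storage container SAS URL>'

# Schedule a daily backup, retain 30 days, and always keep at least one backup.
Edit-AzWebAppBackupConfiguration -ResourceGroupName 'myapp-rg' -Name 'myapp-eastus' `
    -StorageAccountUrl $sasUrl -FrequencyInterval 1 -FrequencyUnit 'Day' `
    -RetentionPeriodInDays 30 -StartTime (Get-Date).AddHours(1) -KeepAtLeastOneBackup

# Verify what has been captured so far.
Get-AzWebAppBackupList -ResourceGroupName 'myapp-rg' -Name 'myapp-eastus'
```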
SQL Server Distributed AG's Forwarder Is Not Syncing After Primary AG's Internal Failover

I have set up a SQL Server Distributed Availability Group (DAG) in Kubernetes using SQL Server on Ubuntu images. The setup consists of two availability groups (AGs) across two separate clusters.

Setup Details:
- Primary Cluster (AG1): Pods ag1-0 (primary), ag1-1, ag1-2. The primary is exposed via a LoadBalancer service.
- Remote Cluster (AG2): Pods ag2-0 (the primary of AG2, acting as the forwarder of the DAG), ag2-1, ag2-2. The forwarder (ag2-0) is exposed via a LoadBalancer service.
- Distributed AG Configuration: AG1 and AG2 are part of the DAG. Each AG's primary is dynamically selected using the pod label role=primary. LISTENER_URL in the DAG configuration points to the LoadBalancer service of each AG.

Issue: DAG Not Syncing After AG1 Failover
For testing, I triggered a failover in AG1 using:

`ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;`

The global primary changed from ag1-0 to ag1-1, and I updated the role=primary label accordingly (removed from ag1-0, added to ag1-1). However, AG2 (the forwarder and its replicas) stopped syncing and became unhealthy. From the ag2-0 (forwarder) logs, I only see connection timeouts and disconnections from the global primary. AG2 is not automatically reconnecting to the new primary (ag1-1), even though the LoadBalancer service in LISTENER_URL now points to ag1-1.

Logs from ag2-0 (forwarder) show:

"A connection timeout has occurred while attempting to establish a connection to GLOBAL PRIMARY. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance."

Steps I Tried:
- Checked the DAG configuration – LISTENER_URL is correctly set to the LoadBalancer of AG1, which now points to ag1-1.
- Ran the resume command: `ALTER DATABASE [agtestdb] SET HADR RESUME;` – this did not resolve the issue.
- Verified network connectivity.

Questions:
- What steps are required to ensure AG2 correctly syncs with the new global primary (ag1-1) after AG1's internal failover?
- Is there a specific command that needs to be run on the forwarder (ag2-0) or the new global primary (ag1-1) to reestablish synchronization?
- Why isn't AG2 automatically reconnecting, even though the LoadBalancer service points to the correct primary?
- Are there any best practices for handling SQL Server DAG failovers in Kubernetes?

Any insights would be greatly appreciated!
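A diagnostic starting point (not a fix) is to compare what each side believes about the replica connections. Below is a hedged PowerShell sketch using the SqlServer module, with placeholder connection details; run it against both the forwarder (ag2-0) and the new global primary (ag1-1):

```
# Requires: Install-Module SqlServer.
# Newer module versions may also need -TrustServerCertificate.
$query = @"
SELECT ag.name                          AS ag_name,
       ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas  AS ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups    AS ag ON ars.group_id   = ag.group_id;
"@

# Placeholder endpoint: substitute each cluster's LoadBalancer address in turn.
Invoke-Sqlcmd -ServerInstance 'ag2-lb.example.com,1433' -Username 'sa' `
    -Password $env:SA_PASSWORD -Query $query
```

If the forwarder shows the distributed AG row as DISCONNECTED while its local replicas look healthy, the usual suspects are reachability of the database mirroring endpoint (conventionally port 5022) through the LoadBalancer and certificate/endpoint authentication after the role change, which would match the endpoint error quoted in the log above.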
ASR Failover network architecture

I'm new to Azure and I have a requirement to set up disaster recovery for an on-prem server. I am aware of the process for replicating the server to the cloud. However, I can't quite grasp how networking should work in a disaster situation. The server is on a 172.x.x.x network, and I know that a site-to-site (S2S) VPN should be set up between the Azure network and the on-prem network, and that the Azure network and the on-prem network can't be on the same subnet for S2S to work. So when I fail over to the cloud, how would the cloud server talk to the on-prem network? And how would devices on-prem talk to the server in the cloud?
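On the addressing question, here is a minimal sketch of the Azure side with Az PowerShell, using placeholder names: the recovery VNet simply gets an address space that does not overlap the on-prem 172.x range, and the S2S VPN routes between the two:

```
# Recovery-side VNet in an address space distinct from on-prem (placeholder values).
$subnet = New-AzVirtualNetworkSubnetConfig -Name 'failover-subnet' -AddressPrefix '10.1.1.0/24'
New-AzVirtualNetwork -Name 'dr-vnet' -ResourceGroupName 'dr-rg' -Location 'eastus' `
    -AddressPrefix '10.1.0.0/16' -Subnet $subnet
```

After failover, the server receives an IP from the Azure subnet (here 10.1.1.0/24). On-prem devices reach it over the S2S tunnel because the VPN gateway advertises that prefix, and the cloud server reaches the 172.x network the same way in reverse. The remaining task is updating DNS so clients find the server at its new address.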
Direct Routing failover behaviour and lack of recovery from MS Teams side

Hi, We have Direct Routing working fine towards two separate SBCs (each with its own wildcard cert) and had calls routing from MS Teams to both of them randomly for the first test customer tenant. All good.

- sbc1.domain.nz
- sbc2.domain.nz

We then wanted to understand the failover performance and recovery handling if we were to lose a regional SBC, so we performed the following test.

- Performed calls from MS Teams to PSTN; calls routed towards both SBC1 and SBC2 randomly, as both SBCs are within the customer tenant voice route defined below:

`New-CsOnlineVoiceRoute -identity "unrestricted.voiceroute" -NumberPattern ".*" -OnlinePstnGatewayList cust-tenant1.sbc1.domain.nz, cust-tenant1.sbc2.domain.nz -Priority 1 -OnlinePstnUsages "NZ.PU"`

- We then shut down the SBC public interface for sbc1.domain.nz, and calls were routed from MS Teams to the sbc2.domain.nz interface only, as expected.
- We then shut down the SBC public interface for sbc2.domain.nz, and calls continued to try to route towards sbc2.domain.nz, even though OPTIONS had been failing for a while towards this interface from MS Teams.
- We then activated the SBC public interface for sbc1.domain.nz. OPTIONS were sent and ACKed from sbc1 to MS Teams OK. However, MS Teams never started to send OPTIONS to sbc1.
- We waited 30 minutes and there was no change.
- We then activated the SBC public interface for sbc2.domain.nz. OPTIONS were quickly established in both directions, and calls started working from MS Teams towards sbc2 quite quickly.
- It has now been 4 days and I am still unable to get MS Teams to send OPTIONS messages to sbc1. The MS Teams admin portal shows the "SIP Options Status" for sbc1 as "Warning": "There is a problem with the SIP OPTIONS. The Session Border Controller exists in our database (your administrator created it using the command New-CSOnlinePSTNGateway). But we have difficulties determining SIP Options status. Please check in 15 minutes."

I have tried the following to recover this situation:
1. Deleted sbc1 from the MS Teams admin portal for 24 hours and recreated it. No difference.
2. Changed the SIP port used for sbc1. No difference.

My observations are as follows:
1. MS Teams appears to only try to recover SIP OPTIONS for the "last" SBC that was working.
2. Other SBCs that failed prior to the last working SBC are not recovered from the MS Teams side.
3. Don't perform this kind of controlled failure when everything is working fine .. you will regret it 🙂

If anyone has any advice on how we can get sbc1 working normally again (i.e. get MS Teams to send OPTIONS towards it), your help would be appreciated. Thanks, David
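One commonly suggested nudge, offered here as a hedged sketch rather than a confirmed fix, is to disable and re-enable the gateway object with the MicrosoftTeams PowerShell module and then watch the SBC for incoming OPTIONS; the wait time below is arbitrary:

```
Connect-MicrosoftTeams

# Confirm the gateway object still exists and is enabled.
Get-CsOnlinePSTNGateway -Identity 'sbc1.domain.nz'

# Toggle the gateway: disable, wait, re-enable, then watch the SBC for OPTIONS.
Set-CsOnlinePSTNGateway -Identity 'sbc1.domain.nz' -Enabled $false
Start-Sleep -Seconds 600   # arbitrary settling time, chosen for illustration
Set-CsOnlinePSTNGateway -Identity 'sbc1.domain.nz' -Enabled $true
```

Whether this restarts the service-side OPTIONS probing is not documented; if it does not, a support case with the OPTIONS timestamps from both SBCs is probably the faster path.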