
wellyhartanto
Copper Contributor
Mar 03, 2025

DHCP Failover Issue – Standby Server Responding When It Should Not

Hi everyone,

I'm encountering an issue with my DHCP failover setup in Hot Standby mode, and I need insights into why the standby server is providing DHCP leases when it shouldn’t.

Setup Overview:

I manage a network with over 100 sites worldwide, each having a local DHCP server.

  • Each site has a dedicated DHCP server running on the server VLAN.
  • Clients reside on different VLANs, and IP helpers (DHCP relay) are configured on a Checkpoint firewall at each site.
  • The IP helper forwards DHCP requests to:
    • The local DHCP server (primary) in the site's server VLAN.
    • The standby DHCP server (failover), located at an on-premises data center (DC).
  • DHCP servers are configured in Hot Standby mode using Microsoft DHCP Failover.
Issue:

Despite the Hot Standby configuration, my Cisco Meraki dashboard frequently reports that a new DHCP server has been detected. The server it refers to is the standby DHCP server, even though the primary DHCP server at the local site is available.

Cisco Meraki triggers this alert when it detects DHCPACK packets from the standby DHCP server traversing the local networks. However, in Hot Standby mode, the failover server should only issue leases if the primary server is unreachable.

Example:

Site-1's primary DHCP server (DHCP-1) has a failover partnership with Failover-1 at the DC.

Site-1's connectivity to the DC is stable, yet Cisco Meraki occasionally detects DHCPACK packets from Failover-1, triggering alerts.

Troubleshooting Done So Far:
  • Verified that failover mode is correctly set to Hot Standby (not Load Balance).
  • Confirmed that the primary DHCP server is healthy and responding.
  • Checked DHCP logs on both servers but found no clear failover events.
  • Performed packet captures of DHCP traffic, but the results were inconclusive.
  • Investigated whether Checkpoint firewall’s IP helper can prioritize the primary DHCP server, but it appears not to support this functionality.
  • Created a PowerShell script to check for failover-related event logs (Event IDs: 20254 and 20255). This provided better visibility but did not correlate with the Meraki alerts.
Questions:
  1. Are there any known scenarios where a standby DHCP server in Hot Standby mode might mistakenly issue leases, even when the primary is active?
  2. Is there any detailed information on the failover “heartbeat” mechanism between primary and standby servers? I found that it uses TCP port 647, but I couldn’t locate official documentation on the interval and failure conditions.
  3. Could failover state synchronization delays cause this behavior?
  4. Are there specific logs or PowerShell commands I should check to confirm why the standby server is responding?
  5. Is there a way to prevent the standby server from responding unless the primary is truly unreachable (e.g., registry settings, advanced configuration)?

Any guidance or troubleshooting steps would be greatly appreciated!

Thanks in advance.

  • LainRobertson
    Silver Contributor

    Hi wellyhartanto,

     

    The only reason the standby host would be issuing new leases is because the failover state has changed, which only happens when connectivity over TCP 647 is broken - no matter how briefly.

     

    Superficially, it sounds like you're dealing with the first state transition of "communication interrupted", which in turn is not lasting long enough to progress to the "partner down" state.
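To make that transition concrete, here is a toy Python model of how a brief interruption could leave the standby answering clients without ever reaching "partner down". It is a simplified sketch, not Microsoft's implementation: the class name, the `partner_down_after` threshold, and the instant return to normal are all assumptions made for illustration.

```python
from enum import Enum


class FailoverState(Enum):
    NORMAL = "normal"
    COMMUNICATION_INTERRUPTED = "communication interrupted"
    PARTNER_DOWN = "partner down"


class StandbyModel:
    """Toy model of the standby's view of the partnership (hypothetical)."""

    def __init__(self, partner_down_after=None):
        # Seconds of continuous interruption before moving to PARTNER_DOWN.
        # None models "no automatic switchover" (assumption).
        self.partner_down_after = partner_down_after
        self.state = FailoverState.NORMAL
        self.interrupted_for = 0.0

    def tick(self, link_up, elapsed):
        """Advance by `elapsed` seconds, given whether TCP 647 is up."""
        if link_up:
            # Simplification: real servers pass through recovery states.
            self.state = FailoverState.NORMAL
            self.interrupted_for = 0.0
            return self.state
        self.interrupted_for += elapsed
        if (self.partner_down_after is not None
                and self.interrupted_for >= self.partner_down_after):
            self.state = FailoverState.PARTNER_DOWN
        else:
            self.state = FailoverState.COMMUNICATION_INTERRUPTED
        return self.state

    def answers_clients(self):
        """In this model the standby only issues leases outside NORMAL."""
        return self.state is not FailoverState.NORMAL
```

In this model, a two-second drop (`tick(False, 2)`) is enough for `answers_clients()` to return `True`, and the next `tick(True, ...)` puts it back to normal, which would match short, sporadic DHCPACKs from the standby.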

     

    While you would expect to see events 20254 and/or 20255, the key event to look for is 20252.
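If you export the timestamps of those events and of the Meraki alerts, a small script can show whether they line up. This is a minimal Python sketch; the function name and the 120-second tolerance are assumptions, and you would need to supply your own timestamps (ideally all in the same time zone).

```python
from datetime import datetime, timedelta
from typing import Dict, List


def correlate(
    alerts: List[datetime],
    events: List[datetime],
    window: timedelta = timedelta(seconds=120),
) -> Dict[datetime, List[datetime]]:
    """Map each alert time to the event times within `window` of it."""
    matches: Dict[datetime, List[datetime]] = {}
    for alert in alerts:
        # Compare in both directions: the alert may lag or lead the event.
        matches[alert] = [e for e in events if abs(alert - e) <= window]
    return matches


def unmatched(matches: Dict[datetime, List[datetime]]) -> List[datetime]:
    """Return the alerts that had no event inside the window."""
    return [alert for alert, hits in matches.items() if not hits]
```

Alerts that come back from `unmatched` are the interesting ones: they suggest either clock skew between the systems, a window that is too narrow, or a cause other than a logged state change.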

     

    If you're running packet traces, look for any reason that traffic from/to the standby host over TCP 647 would fail. Given you have firewalls on both sides, this might manifest as resets (RST) being sent to either or both hosts, or it could just be that it's taking too long to receive a TCP reply (which encompasses planned activities such as host reboots, firewall reboots, etc.).
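Alongside packet traces, a simple probe run from one partner toward the other can timestamp brief reachability problems on TCP 647. This is a minimal Python sketch under stated assumptions: the function names, interval, and timeout are illustrative, and a successful connect only shows that the port is reachable at that moment, not that the existing failover session stayed up.

```python
import socket
import time
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def probe_tcp(host: str, port: int = 647, timeout: float = 3.0) -> Tuple[bool, str]:
    """Attempt one TCP connect and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except ConnectionRefusedError:
        # The SYN was answered with a reset (by the host or a device in the path).
        return False, "refused"
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        return False, f"error: {exc}"


def monitor(
    host: str,
    port: int = 647,
    interval: float = 5.0,
    attempts: Optional[int] = None,
    timeout: float = 3.0,
) -> List[Tuple[str, str]]:
    """Probe repeatedly; collect (UTC timestamp, detail) for each failure."""
    failures: List[Tuple[str, str]] = []
    count = 0
    while attempts is None or count < attempts:
        ok, detail = probe_tcp(host, port, timeout)
        if not ok:
            stamp = datetime.now(timezone.utc).isoformat()
            failures.append((stamp, detail))
            print(f"{stamp} {host}:{port} {detail}")
        count += 1
        time.sleep(interval)
    return failures
```

For example, `monitor("failover-1.example.com", attempts=720)` would probe a placeholder host name every five seconds for about an hour. The failure timestamps can then be compared against the Meraki alert times.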

     

    Timings aren't explicitly enumerated in the Microsoft documentation, but it doesn't really matter, since hot standby only behaves as intended where connectivity between the partners is entirely trustworthy. If it's not, then what you're seeing is an absolutely expected outcome.

     

    Relevant documentation:

     

    Cheers,

    Lain
