This post has been republished via RSS; it originally appeared at: Core Infrastructure and Security Blog articles.
Some things today you just take for granted. We can download an entire movie in the time it took to download a low-resolution .jpg file back in the dial-up days. I guess I’m feeling nostalgic since I just found an AOL 3.5-inch floppy while cleaning out the basement over the weekend. Yes, you read that correctly, a floppy…not a CD…a 3.5-inch floppy. Back then, 1.44 MB and a good phone line were the only things standing between you and the awesomeness of the 14.4 kbps internet…of course, you had to wait for the squawks and screeches of the modem handshake, which was audible back then, almost like the modem was proud of it (we even had volume control for it).
It’s funny to think about that era and fast forward to today, where DHCP assignments and 3-way handshakes happen in milliseconds and we don’t see or hear any of it…unless something goes wrong. The Microsoft team recently put a DHCP issue to bed that brought to mind how blazing fast the world is today and how little we appreciate the minutiae going on under the covers of that speed.
The Problem
Our story starts with a report that client machines were not receiving DHCP assignments from the DHCP server on a very wide scale. This is also where DORA comes into our adventure…minus the backpack. Any sort of DHCP troubleshooting requires an understanding of the DORA negotiation: Discover -> Offer -> Request -> Acknowledge. Prior to calling us in to assist, the customer had taken network captures and noticed that the “A” in DORA (the ACK, or Acknowledgement) was not arriving quickly enough to prevent the DHCP negotiation from timing out.
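To make the symptom concrete, here is a minimal Python sketch of how one might measure the Request-to-ACK delay per DHCP transaction from an exported capture. The `DhcpEvent` record, the `request_to_ack_delays` helper, and the 4-second `timeout_s` default are all hypothetical and chosen for illustration; real client timeouts vary, and this is not the tooling the customer used.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class DhcpEvent:
    """One DHCP message from a capture (hypothetical, simplified record)."""
    time_s: float   # capture timestamp in seconds
    xid: int        # DHCP transaction ID shared by the messages of one exchange
    msg_type: str   # "DISCOVER", "OFFER", "REQUEST" or "ACK"


def request_to_ack_delays(
    events: Iterable[DhcpEvent], timeout_s: float = 4.0
) -> Dict[int, Tuple[str, Optional[float]]]:
    """Map each transaction ID to (status, delay in seconds).

    status is "ok", "late" (ACK arrived after timeout_s) or "missing".
    timeout_s is an assumed value, not a figure from any specification.
    """
    first_request: Dict[int, float] = {}
    first_ack: Dict[int, float] = {}
    for ev in sorted(events, key=lambda e: e.time_s):
        if ev.msg_type == "REQUEST":
            first_request.setdefault(ev.xid, ev.time_s)
        elif ev.msg_type == "ACK" and ev.xid in first_request:
            first_ack.setdefault(ev.xid, ev.time_s)

    report: Dict[int, Tuple[str, Optional[float]]] = {}
    for xid, t_req in first_request.items():
        t_ack = first_ack.get(xid)
        if t_ack is None:
            report[xid] = ("missing", None)
        else:
            delay = t_ack - t_req
            report[xid] = ("ok" if delay <= timeout_s else "late", delay)
    return report
```

Fed with events parsed from a capture, a healthy server should report almost every transaction as "ok"; a pattern of "late" and "missing" entries matches what the customer saw.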
The Troubleshooting
First rule of troubleshooting: trust but verify. Until this point, no one was capturing traffic from the DHCP server, so we needed to see if the traffic was leaving the server. A quick analysis of the network capture from the DHCP server confirmed two things: DHCP ACKs were being sent, and they were being sent on an extremely delayed cycle. The network administrators verified that the network for this site had no strange load-balancing, split-route, or routing-loop problems, so we turned the focus to the performance of the DHCP server.
We checked all the usual suspects for performance, and nothing suggested that CPU, memory, disk, or network was a bottleneck. Performance counters are available for DHCP Server, so we took a quick look there. Below are the counters for our DHCP server. The first thing that jumped out was that Acks/Sec was abnormally low, only sporadically jumping above zero. Also note the Active Queue Length, which is not normal. Finally, the counter for Milliseconds per packet (Avg) was very high. So now we are starting to see a queue form on the server, but the real question is why?
Next, we moved about half of the DHCP scopes to another server to see whether the problem was specific to the original server. Half of the scopes were moved to a partner server in the same site via an ad-hoc failover relationship; the failover was then removed, leaving the scopes and their configuration on the partner server. We checked Perfmon and saw the same two counters running at elevated levels. So, the issue followed the scopes. Scopes were then moved back to the original server in blocks until the counters finally returned to normal, which isolated the groups that had the offending configuration.
With Perfmon running as the bad scope goes into place, you can see the Acks/Sec counter reflect the timing delay we are seeing in graph view:
The next step was to look at the scopes and the configured scope options to see if anything looked out of place. This environment has a well-defined policy for configuration, so anything out of the ordinary tends to stand out. Using some PowerShell Fu, the scope options for a user workstation scope were found to have a TFTP server configured, and the thing that caught our eye was that the name of the server was specified, not the IP address. Below is the command and output for the search:
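The block-wise isolation idea can be sketched in a few lines of Python. This is only one way to reason about it, not necessarily the exact procedure followed here: `find_suspect_blocks` and the `is_healthy` callback are hypothetical names, and in practice the "health check" was a person watching the Perfmon counters after each move.

```python
from typing import Callable, List, Sequence


def find_suspect_blocks(
    scopes: Sequence[str],
    block_size: int,
    is_healthy: Callable[[List[str]], bool],
) -> List[List[str]]:
    """Restore scopes block by block and return the blocks that break health.

    is_healthy(scopes_on_server) stands in for checking the DHCP counters
    after a block has been moved; it is an assumption of this sketch.
    """
    restored: List[str] = []
    suspects: List[List[str]] = []
    for start in range(0, len(scopes), block_size):
        block = list(scopes[start:start + block_size])
        if is_healthy(restored + block):
            restored.extend(block)   # counters stayed normal, keep the block
        else:
            suspects.append(block)   # counters degraded, set the block aside
    return suspects
```

Each suspect block can then be split further, or its scope options inspected directly, which is the route taken in this case.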
Get-DhcpServerv4Scope | ForEach-Object { Write-Host "`r`n$($_.Name)" -ForegroundColor Red; Get-DhcpServerv4OptionValue -ScopeId $_.ScopeId -OptionId 66 -ErrorAction SilentlyContinue }
The text in the “Value” column is what you would be looking for. The screenshot below is just an example and the “Value” displays the actual string values that were entered into Option 66 for the TFTP server.
In our case, we found one single-label name, which was out of the ordinary for their normal configuration. A quick check of the server name revealed that it could neither be resolved via DNS nor contacted on the network. The defunct option was removed, the DHCP service was restarted and, what do you know…the server’s Perfmon counters for Active Queue Length and Milliseconds per packet (Avg) returned to normal. Loosely translated, DORA is happy again.
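If the Option 66 values are exported to a simple scope-to-value mapping, the same review can be automated offline. The Python sketch below is illustrative only: `classify_tftp_value`, `resolves`, and `audit_option66` are hypothetical helper names, the input mapping is an assumed export format, and the resolver is injectable so the logic can be exercised without touching real DNS.

```python
import ipaddress
import socket
from typing import Callable, Dict, List, Tuple


def classify_tftp_value(value: str) -> str:
    """Return "ip", "single-label" or "fqdn" for an Option 66 string."""
    text = value.strip()
    try:
        ipaddress.ip_address(text)
        return "ip"
    except ValueError:
        pass
    # A name without any dot (ignoring a trailing root dot) is single-label.
    return "single-label" if "." not in text.rstrip(".") else "fqdn"


def resolves(name: str, resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """True if the resolver returns an address for the name."""
    try:
        resolver(name)
        return True
    except (OSError, UnicodeError):  # socket.gaierror is a subclass of OSError
        return False


def audit_option66(
    values_by_scope: Dict[str, str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> List[Tuple[str, str, str, bool]]:
    """List (scope, value, kind, resolvable) for every non-IP Option 66 value."""
    findings = []
    for scope_id, value in values_by_scope.items():
        kind = classify_tftp_value(value)
        if kind != "ip":
            findings.append((scope_id, value, kind, resolves(value, resolver)))
    return findings
```

Any entry whose last field is False is the kind of defunct name that caused the trouble here; entries that do resolve are still candidates for replacement with an IP address.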
The Takeaway
Keep it simple…when configuring this sort of thing, it is always a good idea to use IP addresses instead of building in a reliance on name resolution, especially when you factor in how early in the network configuration this process resides. In our case we also had a nonexistent server problem, but even assuming the server was still available, using the IP address takes one link out of the complexity chain in getting to that TFTP server. RFC 5859 (TFTP Server Address Option for DHCPv4) even specifies using the IP address to eliminate this complexity. Follow that advice and keep things as simple as possible.
The Outcome
In the IT field, you rarely hear accolades when things return to normal; the silence of the content has to be music to your ears, telling you that you have fixed things. In this case, the lightning-fast exchanges of DORA, 3-way handshakes, TLS negotiations, etc. are back in the background where they belong, with the users focusing on their work duties, web browsing, social media, or whatever normal users do in your case.
Take care and stay safe!