Today was a tour de force in unintended consequences. It started with an old coworker, a kind of boomerang: they came to work for us, they moved on, and now they have come back. That is the premise of this story, the start of it, a coworker boomerang.
The task was really straightforward: decompress the previously compressed user files belonging to this coworker, so that when they log in, they see exactly what they left behind. It was modest, about 36GB of data. The intended target had 365GB of free space, so plenty of room for this. I started with 7-Zip on Windows, opened the archive, and extracted it to the drive with all the space. Near the end of the extraction, 7-Zip threw an error: "Out of Disk Space." I frowned and scratched my head. 365GB of free space, and… this? It turns out that 7-Zip on Windows, at least this copy of it, unpacks the archive to a temporary folder in the location Windows assigns, which by default lands on the C: drive. The process was filling an already low-on-capacity primary OS drive. I chased down the temporary folder and removed it, correcting the issue. Or so I had thought.
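For the curious, the dodge here is to extract from the command line and name the output directory explicitly, so nothing gets staged through %TEMP% on C:. A minimal sketch of what I mean, with made-up paths and assuming 7z.exe is on the PATH:

```python
import subprocess

# Made-up paths for illustration; point these at your actual archive
# and the drive with the free space.
archive = r"E:\staging\coworker_files.7z"
target = r"E:\restored\coworker"

# "x" extracts with full paths preserved; -o<dir> writes straight to the
# given directory (no space between -o and the path), bypassing any
# staging through the Windows temp folder on C:.
subprocess.run(["7z", "x", archive, f"-o{target}"], check=True)
```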
An hour later, seemingly out of the blue, around 12:30pm today, all the VOIP desk phones suddenly went "NO SERVICE". I scrambled, naturally, feeling that rising panic: nothing had changed, there were no alarms, just sudden total phone failure. I called the VOIP support line, and the official word from support was to reboot my network. A stack of eight fully packed Cisco Catalyst switches, three servers, and a gaggle of networking gear providing at least a dozen vital services, and I was to reboot all of it. While talking with support, I opened a console to my Linux box running on Hyper-V on one of my servers, which is to say, plugged into the very network core I was being asked to reboot. I then found my out-of-service desk phone. Its IP was fine; the phone itself was totally functional. I grabbed the SIP password, logged into the phone, found where it lists the VOIP endpoint for our phone carrier, and then asked mtr to show me the packet flow across the network, from my humble little wooden box of an office to that endpoint. The utility was clear: no issues. 500 packets and counting, all arriving promptly, no flaws, no errors, and NO SERVICE.
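If you want to reproduce that sanity check, it amounts to running mtr in report mode against the carrier's endpoint. A rough sketch of the equivalent, with a hypothetical endpoint hostname standing in for the real one:

```python
import subprocess

# Hypothetical carrier endpoint; substitute the SIP host your phone lists.
sip_endpoint = "sip.example-carrier.net"

# -r: report mode (print a summary and exit), -w: wide/full hostnames,
# -c 500: send 500 probes, roughly what I watched tick by live.
# Healthy output shows 0% loss at the final hop.
result = subprocess.run(
    ["mtr", "-r", "-w", "-c", "500", sip_endpoint],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```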
So I was growing more vexed with support, really unwilling to reboot the entirety of my network core while mtr was merrily popping packets directly to the correct VOIP endpoint deep inside the carrier's network. My traffic could get where it needed to go, and the phones still said NO SERVICE. Support was flat-footed. I stopped myself, because I could feel the rage building, my old companion, the anger that comes when people aren't listening to what I am trying to tell them. I stopped. The call was going nowhere, and I had promised myself that I would fight this anger tooth and claw to the best of my ability. So I calmly asked for the ticket number on their side, thanked them for their time, and hung up my cell phone. I obviously muttered some choice phrases in a small voice, but otherwise I was very proud of myself. I derailed what could have become a very ugly scene.
Everything worked. I was not going to reboot the core. The phones simply said NO SERVICE. Then other reports rolled in: network faults, adjacent but not the same. Wi-Fi failures in Houston, Texas. Hmmm. What does Wi-Fi out in Houston have to do with dud phones in Kalamazoo?
I had this sinking feeling; my gut screamed at me that the PDC, the Wi-Fi, and the phones were all touching something common that had failed, but had failed silently. I chuckled to myself as the old IT chestnut occurred to me: "It's always DNS." So, in respect to that, I opened the Hyper-V management window on the PDC and looked for my twin OpenDNS resolvers, VMs that have run quietly and flawlessly for years on years without a peep deep within Hyper-V. And there it was, right in front of me. The two resolver VMs, and just to the right of their names, the quaint little status indicator from Hyper-V: "PAUSED."
The moment I saw that, I yelled out "PAUSED" and "NO SERVICE" and screamed. Right-click on both VMs, click Resume, and Hyper-V gleefully, in a heartbeat, resumed both little VMs. Just like that, after another reboot of the VOIP phone, bleep-bloop-blunk, the phone was functional and just fine.
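In hindsight, this is checkable without eyeballs on the Hyper-V window. A small sketch of the same find-and-resume, shelling out to the standard Get-VM and Resume-VM cmdlets (run elevated on the host itself; assume nothing here is exactly how I will deploy it):

```python
import subprocess

# Ask Hyper-V for the names of any paused VMs.
find_paused = (
    "Get-VM | Where-Object State -eq 'Paused' "
    "| Select-Object -ExpandProperty Name"
)
out = subprocess.run(
    ["powershell", "-NoProfile", "-Command", find_paused],
    capture_output=True, text=True, check=True,
)
paused = [name.strip() for name in out.stdout.splitlines() if name.strip()]

# Resume each one; Hyper-V picks them back up in a heartbeat.
for name in paused:
    print(f"Resuming paused VM: {name}")
    subprocess.run(
        ["powershell", "-NoProfile", "-Command", f"Resume-VM -Name '{name}'"],
        check=True,
    )
```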
It is always DNS. I have three resolvers, but two of them lived on the same host. That host had a wee panic over disk space, Hyper-V silently paused everything on it, and after a short while of cooking, the phones and the Wi-Fi, which also use those resolvers, all went kaput in one happy bunch.
Obviously the answer is to round-robin the resolvers: the primary on the PDC, then the resolver running in VMware nearby, then the secondary on the PDC. A sandwich right down the middle. I both thanked and kicked my past self: thanked him for having the wits to set up a third resolver, which for a short while was the only resolver there was, and kicked him because only choice parts of my network knew to use it.
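And since silence was the real villain here, the other half of the fix is a watchdog that actually asks each resolver a question now and then. A minimal sketch using dnspython, with made-up resolver IPs standing in for the PDC pair and the VMware box:

```python
import dns.resolver  # pip install dnspython

# Made-up IPs for illustration: PDC primary, VMware resolver, PDC secondary,
# in the same sandwich order the clients will receive them.
resolvers = ["10.0.0.11", "10.0.1.20", "10.0.0.12"]

for ip in resolvers:
    probe = dns.resolver.Resolver(configure=False)
    probe.nameservers = [ip]
    probe.timeout = probe.lifetime = 2.0  # fail fast; a paused VM never answers
    try:
        probe.resolve("example.com", "A")
        print(f"{ip}: OK")
    except Exception as exc:
        print(f"{ip}: FAILED ({exc!r})")  # wire this to an alert, not a print
```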
So, it ended happily; all's well that ends well. The next step is to spread this round-robin resolver correction throughout my network, to help keep this from ever happening again. But then I laughed as I considered the gamut of what had transpired: 7-Zip, well-meaning, purely accidentally caused an unintended disk space alert; Hyper-V silently and studiously paused its charges; the network kind of rolled on over the speed bumps; and at the end it all proved, again, "It's always DNS."