Since early November of 2015, I’ve been contending with the strangest behavior from my Cisco infrastructure. Only a few number of Cisco IP Phones appear to go out around 7:30 am and then pop back to life right at 8 am. The phones fail and fix all by themselves, a very strange turn of events.
My first thought was that it was power related, so I started to debug and play around with POE settings. That wasn’t the answer. I then moved some phones from the common switch to a different switch nearby and the problem went away for the moved phones. That right there proved to me that it wasn’t the phones, and it wasn’t the Cisco CallManager. It had something to do with the switch itself. So I purified the switch, moved everything that wasn’t a Cisco IP Phone off it and the problem continued.
I eventually got in touch with Cisco support, and they suggested a two-prong effort, set SPAN up on the switch and run a packet capture there, and set SPAN up on a phone and run a packet capture there as well. The capture on the switch showed a switch in distress, many ugly lines where TCP was struggling. The phone capture was immense, and I got it opened up in Wireshark and went to the start of the days phone failure event. The minute I got to the start, 7:33:00.3 am the first line appeared. It was an ICMPv6 “Multicast Listener Report” packet. One of many that filled the rest of the packet capture. Millions of packets, all the same.
The multicast packets could explain why I saw the same traffic curves on every active interface. When a switch encounters a multicast packet, every active port responds as if the packet was sent to that port. As it turns out, once I extracted the addresses of where all these offensive packets were coming from, sorted the list, and dropped the duplicates I ended up with a list of four computers. I poked around a little bit more on Google and discovered to my chagrin that there was a specific Intel Network Interface Controller, the I217-LM, which was uniquely centered in this particular network flood scenario. I looked at the affected machines, and all of them were the same, HP ProDesk 600 G1 DM’s. These tiny computers replaced a good portion of our oldest machines when I first started at Stafford-Smith and I never even gave them a second thought. Each of these systems had this very Intel NIC in them, with a driver from 2013. The fix is listed as updating the driver, and the problem goes away. So that’s exactly what I did on the workstations that were the source of the multicast packet storm.
I can’t believe that anyone would design a NIC like this, where there is a possibility of a multicast flood, which is the worst kind of flood I think. All it takes is a few computers to start flooding and it sets off a cascade reaction that drags the switch to the ground.
We will see in the days to come if this solved the issue or not. This has all the hallmarks of what was going on and has filled me with a nearly certain hope that I’ve finally overcome this three-month headache.