Today I had a chance to get to work earlier than usual, so I connected up all my equipment and logged into the switch that was causing all the grief with my phones. Everything looked good, and all the phones were up and behaving fine. Then I had a thought: Cisco devices have impressive debug features. If power is an issue, why not debug power?
So I turned on the debug traps, adjusted the sensitivity for the log shipper, and turned on the inline power debug for both the events manager and the controller. My syslog system started to flood with new debug logs from this switch. The phones continued to behave themselves, so I sat down and looked at the log output. There were notable sections where the switch was complaining about an IEEE short on a port. OMFG. DC power being sent down twisted-pair Ethernet, and we’ve got a short condition!? Why did Cisco never even look for short conditions? Upon further investigation, the ports connected to servers, computers, and printers were all randomly reporting shorts. These shorts were causing the PoE police system to scream debugs to the syslog system. So I found all the ports that did not have Cisco IP Phones on them and were also not uplinks to the backbone switch, and turned off their PoE.
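For anyone wanting to do the same cleanup, here is a rough sketch of how the port selection could be scripted. It is not what I actually ran. It assumes you already have lists of your phone ports and uplink ports, and the interface names below are made up for illustration. The only real switch syntax in it is the IOS interface command `power inline never`, which disables PoE on a port.

```python
def poe_disable_config(all_ports, phone_ports, uplink_ports):
    """Build IOS config lines that disable PoE on every port that is
    neither a phone port nor an uplink (those two groups are left alone)."""
    keep = set(phone_ports) | set(uplink_ports)
    lines = []
    for port in all_ports:
        if port in keep:
            continue  # phone or uplink: do not touch its PoE setting
        lines.append(f"interface {port}")
        lines.append(" power inline never")
    return lines


# Hypothetical port inventory, for illustration only.
ALL_PORTS = ["Gi1/0/1", "Gi1/0/2", "Gi1/0/3", "Gi1/0/48"]
PHONE_PORTS = ["Gi1/0/1"]
UPLINK_PORTS = ["Gi1/0/48"]

if __name__ == "__main__":
    for line in poe_disable_config(ALL_PORTS, PHONE_PORTS, UPLINK_PORTS):
        print(line)
```

The output is meant to be reviewed and pasted in config mode by hand. I would not let a script push it to a production switch without a human looking at the port list first.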
Now that PoE is off for all the devices that would never need it, the debug log has gone silent for shorts. The switch is still sending out debugs, but mostly that is the PoE system regularly talking back and forth to the Cisco IP Phones, and that output looks tame enough to ignore. I updated my Cisco TAC case, and now we will wait and see if the phones fail in the mornings. At least there can’t be any more PoE shorts in the system!