Cisco Phone Outage Solved!

Since early November of 2015, I’ve been contending with the strangest behavior from my Cisco infrastructure. A small number of Cisco IP Phones appear to go out around 7:30 am and then pop back to life right at 8 am. The phones fail and fix all by themselves, a very strange turn of events.

My first thought was that it was power related, so I started to debug and play around with POE settings. That wasn’t the answer. I then moved some phones from the common switch to a different switch nearby and the problem went away for the moved phones. That right there proved to me that it wasn’t the phones, and it wasn’t the Cisco CallManager. It had something to do with the switch itself. So I purified the switch, moved everything that wasn’t a Cisco IP Phone off it and the problem continued.

I eventually got in touch with Cisco support, and they suggested a two-pronged effort: set SPAN up on the switch and run a packet capture there, and set SPAN up on a phone and run a packet capture there as well. The capture on the switch showed a switch in distress, with many ugly lines where TCP was struggling. The phone capture was immense, and I got it opened up in Wireshark and went to the start of the day’s phone failure event. The minute I got to the start, 7:33:00.3 am, the first line appeared. It was an ICMPv6 “Multicast Listener Report” packet, one of many that filled the rest of the packet capture. Millions of packets, all the same.

The multicast packets could explain why I saw the same traffic curves on every active interface. When a switch that isn’t snooping multicast traffic receives a multicast packet, it floods that packet out of every active port, as if it had been sent to each one. As it turns out, once I extracted the addresses of where all these offensive packets were coming from, sorted the list, and dropped the duplicates, I ended up with a list of four computers. I poked around a little bit more on Google and discovered to my chagrin that there was a specific Intel Network Interface Controller, the I217-LM, which was uniquely centered in this particular network flood scenario. I looked at the affected machines, and all of them were the same, HP ProDesk 600 G1 DMs. These tiny computers replaced a good portion of our oldest machines when I first started at Stafford-Smith and I never even gave them a second thought. Each of these systems had this very Intel NIC in them, with a driver from 2013. The fix is listed as updating the driver, and the problem goes away. So that’s exactly what I did on the workstations that were the source of the multicast packet storm.
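The extract, sort, and de-duplicate step is easy to script. Here is a minimal Python sketch, assuming the capture has been exported from Wireshark as CSV with `Source` and `Info` columns; the function name and the sample rows are made up for illustration and are not from my actual capture.

```python
import csv
import io
from collections import Counter


def mld_report_sources(csv_text):
    """Count Multicast Listener Report packets per source address."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if "Multicast Listener Report" in row["Info"]:
            counts[row["Source"]] += 1
    return counts


# Hypothetical sample rows; a real export would run to millions of lines.
SAMPLE = """Source,Protocol,Info
fe80::1,ICMPv6,Multicast Listener Report Message v2
fe80::2,ICMPv6,Multicast Listener Report Message v2
fe80::1,ICMPv6,Multicast Listener Report Message v2
10.0.0.5,TCP,[TCP Retransmission] 2000 > 49152
"""

if __name__ == "__main__":
    # One line per unique talker, sorted by address.
    for source, count in sorted(mld_report_sources(SAMPLE).items()):
        print(source, count)
```

The `Counter` does the de-duplication and also shows how loud each talker is, which helps pick out the worst offenders first.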

I can’t believe that anyone would design a NIC like this, where there is a possibility of a multicast flood, which is the worst kind of flood I think. All it takes is a few computers to start flooding and it sets off a cascade reaction that drags the switch to the ground.

We will see in the days to come if this solved the issue or not. This has all the hallmarks of what was going on and has filled me with a nearly certain hope that I’ve finally overcome this three-month headache.

Sparks and Shorts

Today I had a chance to get to work earlier than usual and connected up all my equipment and logged into the switch that was causing all the grief with my phones. Everything looked good, and all the phones were up and behaving fine. I logged into the switch, and I had a thought, Cisco devices have impressive debug features. If power is an issue, why not debug power?

So I turned on the debug traps and adjusted the sensitivity for the log-shipper and turned on the inline power debug for both the events manager and the controller. My Syslog system started to flood with new debug logs from this switch. The phones continued to behave themselves, so I sat down and looked at the log output. There were notable sections where the switch was complaining about an IEEE short on a port. OMFG. DC power being sent down twisted-pair Ethernet, and we’ve got a short condition!? Why did Cisco never even look for short conditions? Upon further investigation, the ports connected to servers, computers, or printers were all randomly shorting out. These shorts were causing the POE police system to scream debugs to the Syslog system. So I found all the ports that did not have Cisco IP Phones on them and were also not uplinks to the backbone switch and turned off their POE.
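Picking out the ports that never need POE and switching it off can be scripted rather than typed port by port. This is a rough Python sketch that builds the IOS configuration lines from a port inventory. The inventory, the role labels, and the function name are hypothetical; `power inline never` is the interface command that disables inline power on a Catalyst port.

```python
def poe_off_config(ports):
    """Build IOS config lines that disable POE on ports that never need it.

    `ports` maps an interface name to a role string. The role labels
    ("phone", "uplink", "printer", ...) are this sketch's own convention.
    """
    keep_poe = {"phone", "uplink"}  # these ports were left untouched
    lines = []
    for name, role in sorted(ports.items()):
        if role not in keep_poe:
            lines.append(f"interface {name}")
            lines.append(" power inline never")
    return lines


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = {
        "GigabitEthernet0/1": "phone",
        "GigabitEthernet0/2": "printer",
        "GigabitEthernet0/3": "computer",
        "GigabitEthernet0/48": "uplink",
    }
    print("\n".join(poe_off_config(inventory)))
```

The output is meant to be reviewed and pasted into configuration mode by hand, not pushed to the switch blindly.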

Now that all the POE is off for devices that would never need it, the debug list has gone silent for shorts. It is still sending out debugs, but mostly that is the POE system regularly talking back and forth to the Cisco IP Phones, and that output looks tame enough to ignore. I updated my Cisco TAC case, and now we will wait and see if the phones fail in the mornings. At least, there can’t be any more POE shorts in the system!

Incommunicado

Here at work, I’ve got a peculiar sort of failure with my Cisco phones. In the mornings, sometimes, all the phones connected to two Cisco Catalyst 3560-X 48-port POE switches fail around 7:35 am and then un-fail, all by themselves, around 7:50 am.

I’ve tried to engage with Cisco TAC over this issue. I started a ticket when we first started noticing it, in November 2015. Yesterday I got in touch with the first Cisco TAC Engineer and was told that it was a CallManager fault, not a switching fault and that the ticket on the switches would close.

So I opened a new ticket for CallManager. Once I was able to prove my Cisco entitlements I was underway with a new Cisco TAC Engineer. So we shall see how this goes. What concerns me is that Cisco told me that obviously it wasn’t the Catalyst switches. I am a little at odds with this determination because as part of the early diagnosis of this problem we had a small group of users who couldn’t endure phone failures at all, so I moved their connections from the Catalyst 3560 to the other switch, a Catalyst 3850. For the phones that failed on the 3560, they stopped failing when attached to the other switch. That shows me that the issue isn’t with the phones, or the CallManager, but rather the switches themselves. But now that TAC has ruled out the switches, we’re looking at phones and CallManager.

My experience with Cisco so far is checkered. Their hardware is very handsome and works generally. That’s as far as I can go under the auspices of “If you can’t say anything nice, don’t say anything at all.” because the rest of what I have to say is unpleasant to hear. Alas, they have a name and public credibility, and one checkered customer isn’t going to alter the path of a machine as large and determined as Cisco.

We’ll see what TAC has for me next. I am eagerly in suspense, and I’ll update this blog if we find the answer. Holding the breath is probably inadvisable.

Apple’s Activation Lock

I just spent the last hour bashing my head against Apple’s Activation Lock on a coworker’s iPad 2. They brought it to me because it had nearly every assistive mode option turned on, and it was locked with an unknown iCloud account. I tried to get around the lock to no avail, even to return the device to factory specifications. Even the factory reset ends up crashing into the Activation Lock.

It’s heartening to know that Activation Lock took the guts out of the stolen devices market for Apple mobile devices, but in this particular case it’s creating a huge headache. There is no way for me to move forward with treating this issue because the iPad only refers to its owner by a guesstimate email address, b******@gmail.com. I don’t know what this is, and there is no way for me to figure it out. So this device is pretty much bricked, and I have no choice but to send the user directly to an Apple store with the instructions to throw the device on their mercy.

If you are going to give away or sell your Apple device, make sure you TURN OFF ACTIVATION LOCK. There is no way, not even DFU mode or a factory reset, to defeat the lock. There are some hacks that used to work, but Apple catches on quickly and updates iOS to close each possible hack soon after it appears.

I don’t pitch a fight with Apple over this, it was a clear and present requirement that they met, it just makes dealing with this particular issue impossible for people like me to resolve. The best way around this issue is to secure each and every device with an iCloud account and write the iCloud username and password down in a very legible and memorable safe place! Without the iCloud account details or a trip to the Apple Store, the device is so much plastic, metal, and glass.

Vexatious Microsoft

Microsoft never ceases to bring the SMH. Today I attempted to update a driver for a Canon 6055 copier here at the office. The driver I had was a dead duck, so out to get the “handy dandy UFR II driver”. I downloaded it, noted that it was for 64-bit Windows 2012 R2 server and selected it. Then I went to save it, and this is the error that greets me:

“Printer Properties – Printer settings could not be saved. This operation is not supported.”

So, what the hell does this mean? Suddenly the best and the brightest that Microsoft has to offer cannot save printer settings, and saving of printer settings is an operation that is not supported. Now step back and think about that for a second, saving your settings is not supported.

The error is not wrong, but it is massively misleading. The error doesn’t come from the print driver system but rather from the print sharing system. That there is no indication of that is just sauce for the goose. What’s the fix? You have to unshare the printer on the server, then update the driver, and then reshare the printer. The path is quick: just uncheck the option to share from the neighboring tab, go back, set your new driver, then turn sharing back on. It’s an easy fix; however, because the error is not written properly, you don’t know where to go to address it. A more elegant system would either tell you to disable sharing before changing drivers or, because you are already sharing and trying to install a new driver, programmatically unshare, save the driver, then reshare. Hide all of this from the administrator, as you do. That’s not what Microsoft does; they do awkward and poorly stated errors that lead you on a wild goose chase.
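The unshare, update, reshare sequence could also be scripted. I did it through the Printer Properties dialog, so treat this as an untested sketch: it is a small Python helper that only builds the three PowerShell commands, assuming the `Set-Printer` cmdlet from the PrintManagement module is available on the print server. The printer, driver, and share names below are placeholders.

```python
def ps_quote(value):
    """Single-quote a string for PowerShell, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def driver_swap_commands(printer, new_driver, share_name):
    """Return PowerShell commands: unshare, change driver, reshare."""
    name = ps_quote(printer)
    return [
        f"Set-Printer -Name {name} -Shared $false",
        f"Set-Printer -Name {name} -DriverName {ps_quote(new_driver)}",
        f"Set-Printer -Name {name} -Shared $true -ShareName {ps_quote(share_name)}",
    ]


if __name__ == "__main__":
    # Placeholder names, not the real ones from my print server.
    for line in driver_swap_commands("Office Copier", "Example UFR II Driver", "Copier"):
        print(line)
```

Printing the commands instead of running them keeps a human in the loop, which seems wise for anything that touches a shared printer during office hours.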

But now I know, so that’s half the battle right there. Dumb, Microsoft. So Dumb.

Network Monitoring

I’m in the middle of a rather protracted evaluation of network infrastructure monitoring software. I’ve started looking at Paessler’s PRTG and SolarWinds’ Orion product, and in January I’ll be looking at Ipswitch’s products.

I also started looking at Nagios and Cacti. That’s where the fun-house mirrors start. The first big hurdle is no cost vs. cost. The commercial products mentioned before are rather pricey while Nagios and Cacti are GPL, and open sourced, principally available for no cost.

With PRTG, it was an engaging evaluation; however, I ran into one of the first catch-22s with network monitoring software: Symantec Endpoint Protection considers network scanning to be provocative, and so the uneducated SEP client blocks the poller because it believes it to be a network scanner. I also ran into a bit of a headache with PRTG as the web client didn’t register changes as I expected. One of the things that I have come to understand about the cost-model network products is that each one of them appears to have a custom approach to licensing. Each company approaches it differently. PRTG is based on individual sensors, Orion is based on buckets, and I can’t readily recall Ipswitch’s design, but I think it was based on nodes.

Many of these vendors seem to throw darts at the wall when it comes to their products, sometimes hitting and sometimes missing. PRTG was okay, though it created a bumper crop of useless alarms, SolarWinds Orion has an exceptionally annoying network discovery routine, and I haven’t uncorked Ipswitch’s product yet.

I don’t know if I want to pay for this sort of product. Also, it seems that this is one of those arrangements that if I bite on a particular product, I’ll be on a per-year budget cost treadmill for as long as I use the product unless I try the no-cost options.

This project may launch a new blog series, or not, depending on how things turn out. Looking online didn’t pan out very much. There is somewhat of a religious holy war surrounding these products. Some people champion the GPL products; other people push the solution they went with when they first decided on a product. It’s funny but now that I care about the network, I’m coming to the party rather late. At least, I don’t have to worry about the hot slag of “alpha revision software” and much of the provider space seems quite mature.

I really would like anyone who works in the IT industry to please comment with your thoughts and feelings about this category if you have any recommendations or experiences. I’m keenly aware of what I call “show-stopper” issues.

Archiving and Learning New Things

As a part of the computing overhaul at my company, each particular workstation that we overhauled had its user profile extracted. This profile contains documents, downloaded files, anything on the Desktop, that sort of information. There never really was any centralized storage until I brought a lot of it to life, later on, so many of these profiles are rather heavy with user data. They range all the way up to about 144 gigabytes each. This user data primarily just serves as a backup, so while it’s not essential for the operation of the company, I want to keep as much as I can for long-term storage and maximally compress it.

The process started with setting up an Ubuntu server on my new VMware host and giving it a lot of RAM to use. Once the Ubuntu server was established, which on its own took a whole five minutes to install, I found a version of the self-professed “best compression software around”, 7zip, and got that installed on the virtual Ubuntu server. Then I did some light reading on 7zip and the general rule of thumb appears to be “throw as much as you can at it and it will compress better”, so I maxed out the application with word size, dictionary size, the works. Then I started to compress folders containing all the profile data that I had backed up earlier. Throwing 144 gigabytes of data at 7zip when it’s maxed out takes a really long time. Then I noticed the older VMware cluster and realized that nothing was running on it, so for its swan song I set up another Ubuntu server, duplicated the settings from the first one, and pressed that into service as well.

I then thought about notification on my phone when the compression routine was done, but by the time I had thought about it, I had already started the 7zip compressor on both servers. Both of these were far enough along that I didn’t want to cancel either operation and lose the progress I had made compressing all these user profiles. I am not a Bash Shell expert, so it took a little digging around to find that there already was a way to temporarily freeze a running application and queue more commands behind it, so that when the first application completes, the next one goes immediately into operation. You press Control-Z, which suspends the application, and then enter the command “bg %1 ; wait %1 ; extra command”. Then I thought about how I’d like to be notified and dug around for some sort of email method. None of these servers that I put together had anything at all in the way of email servers and I really wasn’t keen on screwing around with postfix or sendmail. I discovered a utility called ssmtp which did the trick. Once I configured it for use with my workplace Office365 account and did some testing, I had just the thing that I was looking for. I suspended the compression on both servers and queued the email utility to run as soon as the compression finished. When the compression is done, I will be emailed.
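If I had planned the notification before starting the job, the same “run this, then tell me” idea could be wrapped in a few lines of Python instead of the Control-Z trick. This is only a sketch of the pattern, not what I actually ran: the function names, the stand-in job, and the example.com addresses are all made up, and the actual sending step is left to whatever mail path you have.

```python
import subprocess
import sys
from email.message import EmailMessage


def build_notice(command, returncode, sender, recipient):
    """Compose the completion email for a finished command."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Job finished with exit code {returncode}"
    msg.set_content("Command: " + " ".join(command))
    return msg


def run_then_notify(command, notify):
    """Run `command` to completion, then call `notify(command, returncode)`."""
    returncode = subprocess.run(command).returncode
    notify(command, returncode)
    return returncode


if __name__ == "__main__":
    # Stand-in for the long-running compression job.
    job = [sys.executable, "-c", "print('compressing...')"]

    def show(command, returncode):
        notice = build_notice(command, returncode, "me@example.com", "me@example.com")
        print(notice["Subject"])

    run_then_notify(job, show)
```

To actually send the message, the `notify` callback could hand it to `smtplib.SMTP(...).send_message(msg)` or pipe it to a local mailer such as ssmtp; the server name and credentials would be specific to your own mail account.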

All in all, quite nifty and it only took a few minutes to set up. Once I’m done with this particular task, I can eliminate the “junky” Ubuntu server altogether on the old VMware host and trim back the Ubuntu server running on my new VMware host. I quite love Ubuntu: it’s quick and easy, set up what you want, tear it down when you don’t need it anymore, or put the VMware guest on ice as an appliance until you do need it sometime later. Very handy. Not having to worry about paying for it or licensing it is about as refreshing as it can get. I just need something to work a temporary job, not a permanent solution. Although considering how much malware is out there, the difficulty of Linux for end users may eventually be outweighed by the remarkable computing safety of using Linux as a primary user workstation operating system. There is still a long while before Linux is ready for end-user primetime. I sometimes wonder what it will take for the endless vulnerabilities of Windows to break Microsoft. Hope springs eternal!

Trials

A major Fortune 500 company has a world-renowned hiring trial for their new IT staff. There are all the usuals, the resumes, the interviews, but there is also a fully funded practical trial as part of the job application process. The job itself is cherry, practically autonomous, with real challenges and true financial backing so the winner can dig in and achieve serious results.

The trial is rather straightforward, given a property address, you must approach, perform an intake procedure to discover what is required and then plan and execute whatever is needed to solve the IT need.

The property has one person, a newly hired young woman who is sitting at a central desk on the ground floor. She has a folder, within it, a script that she reads to each candidate:

“Welcome to your trial, this building has everything required to run a branch of our company. Every computer, networking component, and server component is placed and wired properly. Your task is to configure all the equipment throughout the branch properly. You will find all the resources you need to complete this task within the building. You have one week to complete this task. Good Luck.”

The young woman then folds her hands together and waits.

Several candidates engage with the trial, hoping to get the cherry job and have learned about the young lady at the reception desk. They pass all the requirements, and they eagerly arrive to try their hand at the trial. They impatiently sit through her canned speech and quickly head off to the basement to start in the server room.

Candidates come and go, some pass and some fail. The trial is to get the branch fully operational and on the last day of the week the branch becomes staffed, and the candidate must ensure that all the preparations are in place and that everyone can work without a technological failure. The trial is winnable but very arduous.

The young lady sitting at the central desk on the ground floor has a secret. She has a shoebox locked in a drawer attached to her desk and around her neck is a key on a golden necklace. She has specific instructions: if a candidate approaches her, engages pleasantly, and shows sincere interest in her role in the branch, without it being a last-ditch effort, she is to pause the conversation, unlock the desk, and produce the shoebox to the candidate. Within the shoebox is the answer to the trial: every specific requirement written in clear, actionable text, with a memory stick containing every proper configuration and a full procedure list that will bring the branch to full operation without a single hiccup. Everything from networking configurations to the copier codes for the janitorial staff is covered, and once executed it virtually guarantees a win.

How many people would simply ignore the receptionist and get cracking on the trial and how many would take their time to get to know everyone and their roles in that particular branch? Either kind of candidate can win, either through a sheer act of will or simply being kind, careful, and honestly interested in the welfare of each of their coworkers. Nobody knows about the secret key, but sometimes the answer you need comes from a place you would never expect.

Peer to Peer File Transfer, Reep.io

I recently needed to move about ten gigabytes of data from me to a friend and we used a new website service called reep.io. It’s quite a neat solution. It relies on a technology called WebRTC that exists in many modern browsers, like Chrome, Firefox, and Opera.

The usual way to move such a large set of data from one place to another would probably best be mailing a USB memory stick or waiting to get together and then just sneaker-net the files from one place to another. The issue with a lot of online services that enable people to transfer files like this is that many of them are limited. Most of the online offerings cap out at around two gigabytes and then ask you to register either for a paid or free account to transfer more data. Services like Dropbox exist, but you need the storage space to create that public link to hand to your friend so they can download the data, plus it occupies the limited space in your Dropbox. With reep.io, there is no middleman. There are no limits. It’s browser to browser and secured by TLS. Is that a good thing? It’s better than nothing. The reason I don’t like any of the other services, even the free-to-use-please-register sites is because there is always this middleman irritation in the way, it’s inconvenient. Always having to be careful not to blow the limit on the transfer, or if it’s a large transfer like ten gigabytes, chopping up the data into whatever bite-sized chunk the service arbitrarily demands is very annoying.

To use this site, it’s dead simple. Visit reep.io, and then either click and drag the file you want to share or click on the File Add icon area to bring up a file open dialog box and find the file you want to share. Once set, the site generates a link that you can then send to anyone you wish to engage with a peer-to-peer file exchange. As long as you leave your browser running, the exchange will always work with that particular link. You don’t need any extra applications, and it works across platforms, so a Windows peer can send a file to a Mac client, for example. That there is no size limit is a huge value right there.

If you have a folder you want to share, you can ZIP it up and share that file. It’s easy to use, and because there are no middlemen, there aren’t any accounts to create, and thanks to TLS, nobody peeping over your shoulder.
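Zipping a folder before sharing it can be done from any file manager, but it is also a one-liner in Python’s standard library. A small sketch, with a made-up function name and a throwaway temporary folder standing in for the real data:

```python
import shutil
import tempfile
from pathlib import Path


def zip_folder(folder, archive_base):
    """Zip `folder` into `archive_base`.zip and return the archive path."""
    folder = Path(folder)
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=folder.parent, base_dir=folder.name
    )


if __name__ == "__main__":
    # Build a throwaway folder so the sketch runs anywhere.
    workdir = Path(tempfile.mkdtemp())
    (workdir / "photos").mkdir()
    (workdir / "photos" / "note.txt").write_text("hello")
    print(zip_folder(workdir / "photos", workdir / "share"))
```

The resulting single `.zip` file is then what you drop onto the reep.io page.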

Shifting Platforms

I go through cycles of having an interest, and then not having an interest, in social media. Twitter and Facebook are the core services that I’m thinking about here. Of these services, I’ve given up on Twitter. I no longer engage with anyone on Twitter and the leading edge of loud, noisy chatter has carried on without me. If I do run the Twitter application, it’s mostly to witness some event as it unfolds, like a news source, or to jump on some shame bandwagon when a public figure makes a terrible mess of their life by saying or doing something stupid.

I am about to give up on Facebook as well. There are many reasons for this renewed effort to leave the system. I am tired of the see-saw polarity between stories. The negative political stories mixed in with the positive reaffirming stories build up a kind of internal mental noise that clouds my day and keeps me from being focused. Another reason to leave is the interface has become somewhat moribund on its own. You can sometimes comment, sometimes not. The only option to express your reactions when it comes to feelings is “Like” and the entire service has become self-balkanized. I have friends and family on Facebook, but out of all of them, I only follow a few and I’ve muted the rest. I don’t really miss the engagement, but always having to think about tailoring my thoughts based on the audience has started to give me fatigue.

I think then that it may be time for me to go back to writing blog posts on my WordPress blog. The blog encourages longer format writing, and I expect that engagement will drop as I won’t be using Facebook. In a lot of ways, it is a kind of social addiction and the only way to break it is to wean off of it. Perhaps cold turkey is not right, but rather cool turkey.

I don’t expect anyone to follow me off of Facebook. I will share my blog posts to Facebook so people can still see what I write, but the engagement will drop off. Feel free to comment on my blog if you wish. Otherwise, that will be that.

On a more technical note, I changed how the stories are shared across systems. The original way was to publish a WordPress entry, which would share to Tumblr, and that would then share to Twitter and Facebook. I have torn that down and set it so that WordPress itself shares to Facebook, Google Plus, Tumblr, and Twitter. It’s a more direct path that doesn’t require people to slog through my Tumblr.