FreeBSD Crater

I started out looking at FreeBSD based on a draw from FreeNAS, which then led to ZFS, the file system that FreeNAS is built around and that FreeBSD supports natively. At work, I am looking at the regular handling of enormous archival files, and the further along I went the more I realized that I would also need storage that lasts a long time. There are a lot of ways to ensure that archival files remain viable: error-correcting codes, using the cloud, rotating media. So all of this has led me to learn more about ZFS.

I have to admit that at first, ZFS was very strange to me. I’m used to HFS, EXT3, and EXT4 type file systems with their usual vocabularies. You can mount it, unmount it, and check it with an option to repair it. ZFS adds a whole new universe of vocabulary to file systems. There are two parts: the zpool command defines the pool of devices and files you want to use for your file system, and the zfs command lets you manipulate the file systems in that pool, including mounting and unmounting them. When it comes to error-checking and repair, that is the feature called scrub. The commands themselves aren’t difficult to grasp, but the nature of this new file system is very different. It lets the administrator perform actions that other file systems just don’t offer.

You can create snapshots, manipulate them, and even draw older snapshots forward as clones, even out of order. So let us say that you have a file system, and you’ve been making regular snapshots every 15 minutes. If you need something from that file system at snapshot 5 out of 30, you don’t have to roll back the file system manually; you can just pluck snapshot 5 and create a clone. The cloning procedure feels a lot like “mounting” a snapshot so you can access it directly. If you destroy a clone, the snapshot is undamaged; it just goes back into the pile from whence it came.

The big claim to fame for ZFS is that it is regarded by many as the safest file system: if one of the devices in a redundant zpool fails, the file system can heal itself. You can tear out that bad part, put in a new part, and the file system will rebuild and recover. In a lot of ways, ZFS is a lot like RAID 1, 5, or 6. Apparently there is a flaw with RAID 5 when you get to big data volumes, and from what I can gather, ZFS is the answer to those problems.
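For anyone who wants to see the snapshot, clone, and scrub vocabulary as commands, here is a minimal sketch. The pool name `tank`, the file system `tank/data`, and the clone name are all made up for illustration. The script only prints the commands by default; setting `DRY_RUN=0` as root on a machine with ZFS would actually run them.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, on a machine with ZFS) to really run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical names: pool "tank", file system "tank/data".
FS="tank/data"

zfs_clone_demo() {
    run zfs snapshot "$FS@snap5"                 # take a point-in-time snapshot
    run zfs clone "$FS@snap5" tank/snap5-clone   # surface that snapshot as its own file system
    run zfs destroy tank/snap5-clone             # drop the clone; the snapshot itself remains
    run zpool scrub tank                         # verify every block in the pool, repairing where it can
}

zfs_clone_demo
```

The clone shows up as an ordinary mounted file system, which is why it feels like “mounting” the snapshot.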

So I have ZFS ported over to my MacBook Pro, and I’ve been playing around with it for a little while. It works as advertised, so I’ve been enjoying that. One of the biggest stumbling blocks I had to deal with was the concept of zfs mounting and unmounting and how they relate to zpool’s export and import commands. I started by building a fully functional ZFS file system: I created the zpool, then mounted it to the operating system. The next step was to unmount the file system and export the zpool, exploring the way you can fully disconnect a ZFS file system from a host machine and then reverse the process. While doing this, I was reluctant to use actual physical devices, so I instead used blank files as members in my zpool. I was able to create, mount, and then unmount the entire production, and then export the zpool. When I tried to reverse that and import the zpool I had just exported, the system told me that there weren’t any pools in existence to import. This had me thinking that ZFS was a crock. What is the point of exporting a zpool if there is no hope of importing it afterwards? It turns out there is a switch, -d, which tells zpool import which directory to search, and you have to use it when the pool lives in files; that’s the trick of it. So once I got that, I became much more comfortable using ZFS, or at least exploring it.
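The whole round trip with file-backed members can be sketched roughly like this. The directory `/tmp/zfs-lab`, the pool name `labpool`, the 128 MiB file size, and the mirror layout are my assumptions for the example, not what I necessarily typed at the time. As written it only prints the commands; `DRY_RUN=0` as root would execute them.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

LAB_DIR="/tmp/zfs-lab"   # hypothetical directory holding the blank files
POOL="labpool"           # hypothetical pool name

file_pool_demo() {
    run mkdir -p "$LAB_DIR"
    # Two blank 128 MiB files stand in for physical disks.
    run dd if=/dev/zero of="$LAB_DIR/disk0.img" bs=1048576 count=128
    run dd if=/dev/zero of="$LAB_DIR/disk1.img" bs=1048576 count=128
    # File members must be given as absolute paths.
    run zpool create "$POOL" mirror "$LAB_DIR/disk0.img" "$LAB_DIR/disk1.img"
    run zfs unmount "$POOL"
    run zpool export "$POOL"
    # A bare "zpool import" only searches the default device directories,
    # so it never sees files; -d names the directory to search instead.
    run zpool import -d "$LAB_DIR" "$POOL"
}

file_pool_demo
```

Running `zpool import -d /tmp/zfs-lab` with no pool name should just list what it finds there, which is a gentle way to confirm the exported pool is visible before importing it.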

So then today I thought I would explore the source of FreeNAS, which is FreeBSD. FreeBSD is a Unix-like operating system, a cousin of Linux rather than a flavor of it, and so I thought I would download an installation image and try it out in VirtualBox on my MacBook Pro. So, I started with the image FreeBSD-10.2-RELEASE-amd64-dvd1.iso and got VirtualBox up and running. The installation was very familiar and I didn’t run into any issues. I got the FreeBSD OS up and running and thought I should add the VirtualBox Guest Additions. I thought I could just have VirtualBox add the additions as an optical drive and that the OS would notice and mount it for me in /mnt or /media. No. So that was a no-go. I then looked online and searched for VirtualBox Guest Additions. I found references to procedures to follow in the “ports” section of the FreeBSD OS. I tried it, and it told me that it couldn’t proceed without the kernel sources. So then I searched for that. This turned into a fork/branch mess and I knew that familiar sinking feeling all too well. You try to fix something and that leads to a failure, so you look for help on Google and follow a fix, which leads to another failure, and then you keep on going. This branching/forking leads you on a day-wasting misadventure. The notion that you couldn’t get what you wanted from the start just sits there on your shoulder, reminding you that everything you do from this point forward is absurd. There is a lot of bullshit you are wading through, and the smart move would be to give up. You can’t give up because of the time investment, and you want to fight it out, to justify the waste of time. The battle with FreeBSD begins. At the start we need the kernel sources, okay, use svn. Not there, okay, how to fix that? Get svn. Sorry, can’t do it as a regular user. Try sudo, command doesn’t exist, look for su, nope, not that either. Try to fix that, can’t. Login as root and try, nope. So I pretty much just reached my limit on FreeBSD and gave up.
I couldn’t get VirtualBox Additions added, svn is impossible to load, sudo is impossible to load. Fine. So then I thought about just screwing around with ZFS on FreeBSD, to rescue some semblance of usefulness out of this experience. No, you aren’t root, piss off. I even tried SSH, but you can’t get in as root and without sudo there is no point to go forward.

So, that’s that for FreeBSD. We’re up to version 10 here, but it is still firmly bullshit. There are people who are massively invested in BSD and they no doubt are grumpy when I call out their OS for its obnoxiousness. Is it ready for prime time use? Of course not. No kernel sources included, no svn, no sudo, no su, no X for that matter, but honestly, I wasn’t expecting X.

It points to the same issues that dog Linux. If you don’t accept the basic spot where you land post-install then you are either trapped with Google for a long while or you just give up.

My next task will be to shut down the FreeBSD system and dump all the files. At least I only wasted two hours of my life screwing around with the bullshit crater of FreeBSD. What have I learned? Quite a lot. BSD I’m sure is good, but to use it and support it?

Thank god it’s free. I got exactly what I paid for. Hah.

Surprise! Scan-to-Folder is broken!

That’s what we faced earlier this week in our Grand Rapids office. It was a mystery as to why all of a sudden a Canon iR-3235 copier would stop working when it came to its “Scan to Folder” function. For Canon, the “Scan to Folder” function opens a CIFS connection to wherever you tell it to go and deposits a scanned PDF file to the destination. Everything up to Monday was working well for us.

After Monday, it was broken. Thanks to a Google Form linked to a Google Spreadsheet, I have a very convenient way to log changes I make to the network. I open up the form, enter my name and the change, and the Google spreadsheet catches the timestamp automatically. So what changed on Monday? I was using Wireshark and found a flurry of broadcast traffic using two protocols, LLMNR and NBNS. The first protocol, LLMNR, is only useful for small ad-hoc networks that don’t have a standard DNS infrastructure. Since we do have a fully-fleshed DNS system running, LLMNR is noisy and superfluous. NBNS is an old protocol, and turning it off system-wide is an accepted best practice. So I turned off NBNS for all the workstations and turned NBNS off on the servers also. It’s 2016, what could need NBNS?

Then we discovered that our older Canon iR-3235 copiers suddenly couldn’t save data to CIFS folders. We verified all the settings, and there was no reason whatsoever that the copiers couldn’t send data to the server, or so we thought. The error from the copier was #751, a vague error code, and nothing we could find online pointed to error #751 being a protocol problem.

I can’t recommend instituting some kind of change tracking system enough for any other IT shop. Having a log, and being able to pin down exactly what happened and when, was invaluable to solving this problem. As it turns out, the Canon copiers depend on NBNS being enabled, though not on the name service itself. When you turn off NBNS on a server, that closes port TCP/139. The other port for CIFS traffic, TCP/445, is used by modern implementations of CIFS. These Canon copiers only use TCP/139. So when I turned off NBNS to tamp down the broadcast traffic, I accidentally made the server deaf to the copiers. Turn NBNS back on, re-open TCP/139, and that fixes these old Canon copiers.
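A quick way to see which of the two CIFS ports a server is actually answering on, before and after a change like this, is a small probe along these lines. It is only a sketch: it assumes a netcat that supports `-z` (connect without sending data) and `-w` (timeout in seconds), and the server name in the usage comment is hypothetical.

```shell
#!/bin/sh
# Report whether the two SMB/CIFS listener ports answer on a host.
# Assumes a netcat ("nc") that supports -z and -w; if nc is missing,
# both ports will simply be reported as closed.
check_smb_ports() {
    host="$1"
    for port in 139 445; do
        if nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
            echo "TCP/$port open"
        else
            echo "TCP/$port closed"
        fi
    done
}

# Hypothetical server name; replace it with the real file server:
# check_smb_ports fileserver.example.com
```

If TCP/445 answers and TCP/139 does not, a device that only speaks the older port will behave exactly like our copiers did.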

Cisco Phone Outage Solved!

Since early November of 2015, I’ve been contending with the strangest behavior from my Cisco infrastructure. A small number of Cisco IP Phones appear to go out around 7:30 am and then pop back to life right at 8 am. The phones fail and fix themselves all on their own, a very strange turn of events.

My first thought was that it was power related, so I started to debug and play around with POE settings. That wasn’t the answer. I then moved some phones from the common switch to a different switch nearby and the problem went away for the moved phones. That right there proved to me that it wasn’t the phones, and it wasn’t the Cisco CallManager. It had something to do with the switch itself. So I purified the switch, moved everything that wasn’t a Cisco IP Phone off it and the problem continued.

I eventually got in touch with Cisco support, and they suggested a two-pronged effort: set SPAN up on the switch and run a packet capture there, and set SPAN up on a phone and run a packet capture there as well. The capture on the switch showed a switch in distress, with many ugly lines where TCP was struggling. The phone capture was immense, and I got it opened up in Wireshark and went to the start of the day’s phone failure event. The minute I got to the start, 7:33:00.3 am, the first line appeared. It was an ICMPv6 “Multicast Listener Report” packet, one of many that filled the rest of the packet capture. Millions of packets, all the same.

The multicast packets could explain why I saw the same traffic curves on every active interface. When a switch receives a multicast packet and doesn’t know which ports actually want it, it floods the packet out of every active port as if it had been addressed to each one. As it turns out, once I extracted the addresses of where all these offensive packets were coming from, sorted the list, and dropped the duplicates, I ended up with a list of four computers. I poked around a little bit more on Google and discovered to my chagrin that there was a specific Intel Network Interface Controller, the I217-LM, which was at the center of this particular network flood scenario. I looked at the affected machines, and all of them were the same, HP ProDesk 600 G1 DMs. These tiny computers replaced a good portion of our oldest machines when I first started at Stafford-Smith and I never even gave them a second thought. Each of these systems had this very Intel NIC in them, with a driver from 2013. The fix is listed as updating the driver, after which the problem goes away. So that’s exactly what I did on the workstations that were the source of the multicast packet storm.
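The extract, sort, and de-duplicate step can be done from the command line rather than by hand. Here is a rough sketch; the capture file name is hypothetical, and the filter covers both report types because I am not assuming which MLD version the packets were (ICMPv6 type 131 is the MLDv1 report, type 143 the MLDv2 one). Unlike a plain de-duplication, it also counts packets per source.

```shell
#!/bin/sh
# Count how many packets each source address sent, busiest first.
# Reads one address per line on stdin, prints "count address".
count_sources() {
    sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# With a capture file (hypothetical name phone.pcapng), tshark can feed it:
# tshark -r phone.pcapng \
#     -Y 'icmpv6.type == 131 || icmpv6.type == 143' \
#     -T fields -e ipv6.src | count_sources
```

For example, `printf 'fe80::1\nfe80::2\nfe80::1\n' | count_sources` prints `2 fe80::1` followed by `1 fe80::2`. A short list with huge counts is the signature of a handful of machines doing all the flooding.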

I can’t believe that anyone would design a NIC like this, where there is a possibility of a multicast flood, which is the worst kind of flood I think. All it takes is a few computers to start flooding and it sets off a cascade reaction that drags the switch to the ground.

We will see in the days to come if this solved the issue or not. This has all the hallmarks of what was going on and has filled me with a nearly certain hope that I’ve finally overcome this three-month headache.

Sparks and Shorts

Today I had a chance to get to work earlier than usual, connected up all my equipment, and logged into the switch that was causing all the grief with my phones. Everything looked good, and all the phones were up and behaving fine. While I was in there, I had a thought: Cisco devices have impressive debug features. If power is an issue, why not debug power?

So I turned on the debug traps, adjusted the sensitivity for the log-shipper, and turned on the inline power debug for both the events manager and the controller. My Syslog system started to flood with new debug logs from this switch. The phones continued to behave themselves, so I sat down and looked at the log output. There were notable sections where the switch was complaining about an IEEE short on a port. OMFG. DC power being sent down twisted-pair Ethernet, and we’ve got a short condition!? Why did Cisco never even look for short conditions? Upon further investigation, the ports connected to servers, computers, or printers were all randomly shorting out. These shorts were causing the POE police system to scream debugs to the Syslog system. So I found all the ports that did not have Cisco IP Phones on them and were also not uplinks to the backbone switch and turned off their POE.

Now that all the POE is off for devices that would never need it, the debug list has gone silent for shorts. It is still sending out debugs, but mostly that is the POE system regularly talking back and forth to the Cisco IP Phones, and that output looks tame enough to ignore. I updated my Cisco TAC case, and now we will wait and see if the phones fail in the mornings. At least, there can’t be any more POE shorts in the system!

Incommunicado

Here at work, I’ve got a peculiar sort of failure with my Cisco phones. In the mornings, sometimes, all the phones connected to two Cisco Catalyst 3560-X 48-port POE switches fail around 7:35 am and then un-fail, all by themselves, around 7:50 am.

I’ve tried to engage with Cisco TAC over this issue. I started a ticket when we first started noticing it, in November 2015. Yesterday I got in touch with the first Cisco TAC Engineer and was told that it was a CallManager fault, not a switching fault and that the ticket on the switches would close.

So I opened a new ticket for CallManager. Once I was able to prove my Cisco entitlements I was underway with a new Cisco TAC Engineer. So we shall see how this goes. What concerns me is that Cisco told me that obviously it wasn’t the Catalyst switches. I am a little at odds with this determination because as part of the early diagnosis of this problem we had a small group of users who couldn’t endure phone failures at all, so I moved their connections from the Catalyst 3560 to the other switch, a Catalyst 3850. For the phones that failed on the 3560, they stopped failing when attached to the other switch. That shows me that the issue isn’t with the phones, or the CallManager, but rather the switches themselves. But now that TAC has ruled out the switches, we’re looking at phones and CallManager.

My experience with Cisco so far is checkered. Their hardware is very handsome and works generally. That’s as far as I can go under the auspices of “If you can’t say anything nice, don’t say anything at all.” because the rest of what I have to say is unpleasant to hear. Alas, they have a name and public credibility, and one checkered customer isn’t going to alter the path of a machine as large and determined as Cisco.

We’ll see what TAC has for me next. I am eagerly in suspense, and I’ll update this blog if we find the answer. Holding the breath is probably inadvisable.

Apple’s Activation Lock

I just spent the last hour bashing my head against Apple’s Activation Lock on a coworker’s iPad 2. They brought it to me because it had nearly every assistive mode option turned on, and it was locked with an unknown iCloud account. I tried to get around the lock to no avail, even to return the device to factory specifications. Even the factory reset ends up crashing into the Activation Lock.

It’s heartening to know that Activation Lock took the guts out of the stolen devices market for Apple mobile devices, but in this particular case it’s creating a huge headache. There is no way for me to move forward with treating this issue because the iPad only refers to its owner by a partially masked email address, b******@gmail.com. I don’t know what this is, and there is no way for me to figure it out. So this device is pretty much bricked, and I have no choice but to send the user directly to an Apple store with instructions to throw themselves on its mercy.

If you are going to give away or sell your Apple device, make sure you TURN OFF ACTIVATION LOCK. There is no way, not even DFU mode or a factory reset, to defeat the lock. There are some hacks that used to work, but Apple catches on quickly and updates iOS to close each possible hack soon after it appears.

I don’t pick a fight with Apple over this; it was a clear and present requirement that they met. It just makes this particular issue impossible for people like me to resolve. The best way around this issue is to secure each and every device with an iCloud account and write the iCloud username and password down in a very legible and memorable safe place! Without the iCloud account details or a trip to the Apple Store, the device is so much plastic, metal, and glass.

Vexatious Microsoft

Microsoft never ceases to bring the SMH. Today I attempted to update a driver for a Canon 6055 copier here at the office. The driver I had was a dead duck, so out I went to get the “handy dandy UFR II driver”. I downloaded it, noted that it was for 64-bit Windows Server 2012 R2, and selected it. Then I went to save it, and this is the error that greeted me:

“Printer Properties – Printer settings could not be saved. This operation is not supported.”

So, what the hell does this mean? Suddenly the best and the brightest that Microsoft has to offer cannot save printer settings, and saving of printer settings is an operation that is not supported. Now step back and think about that for a second, saving your settings is not supported.

The error is not wrong, but it is massively misleading. The error doesn’t come from the print driver system but rather from the print sharing system. That there is no indication of that is just sauce for the goose. What’s the fix? You have to unshare the printer on the server, then update the driver, and then reshare the printer. The path is quick: just uncheck the option to share on the neighboring tab, go back, set your new driver, then turn sharing back on. It’s an easy fix; however, because the error is not written properly, you don’t know where to go to address it. A more elegant system would either tell you to disable sharing before changing drivers or, because you are already sharing and trying to install a new driver, programmatically unshare, save the driver, then reshare. Hide all of this from the administrator, as you do. That’s not what Microsoft does; they do awkward and poorly stated errors that lead you on a wild goose chase.

But now I know, so that’s half the battle right there. Dumb, Microsoft. So Dumb.

Network Monitoring

I’m in the middle of a rather protracted evaluation of network infrastructure monitoring software. I’ve started looking at Paessler’s PRTG and SolarWinds’ Orion product, and in January I’ll be looking at Ipswitch’s products.

I also started looking at Nagios and Cacti. That’s where the fun-house mirrors start. The first big hurdle is no cost vs. cost. The commercial products mentioned before are rather pricey while Nagios and Cacti are GPL, and open sourced, principally available for no cost.

With PRTG, it was an engaging evaluation; however, I ran into one of the first catch-22s of network monitoring software: Symantec Endpoint Protection considers network scanning to be provocative, and so the uneducated SEP client blocks the poller because it believes it to be a network scanner. I also ran into a bit of a headache with PRTG as the web client didn’t register changes as I expected. One of the things that I have come to understand about the paid network products is that each one of them appears to have a custom approach to licensing. PRTG is licensed per individual sensor, Orion is based on buckets, and I can’t readily recall Ipswitch’s design, but I think it was based on nodes.

Many of these vendors seem to throw darts at the wall when it comes to their products, sometimes hitting and sometimes missing. PRTG was okay, though it created a bumper crop of useless alarms; SolarWinds Orion has an exceptionally annoying network discovery routine; and I haven’t uncorked Ipswitch’s product yet.

I don’t know if I want to pay for this sort of product. Also, it seems that this is one of those arrangements that if I bite on a particular product, I’ll be on a per-year budget cost treadmill for as long as I use the product unless I try the no-cost options.

This project may launch a new blog series, or not, depending on how things turn out. Looking online didn’t pan out very much. There is somewhat of a religious holy war surrounding these products. Some people champion the GPL products; other people push the solution they went with when they first decided on a product. It’s funny but now that I care about the network, I’m coming to the party rather late. At least, I don’t have to worry about the hot slag of “alpha revision software” and much of the provider space seems quite mature.

I really would like anyone who works in the IT industry to please comment with your thoughts and feelings about this category if you have any recommendations or experiences. I’m keenly aware of what I call “show-stopper” issues.

Archiving and Learning New Things

As a part of the computing overhaul at my company, each particular workstation that we overhauled had its user profile extracted. This profile contains documents, downloaded files, anything on the Desktop, that sort of information. There never really was any centralized storage until I brought a lot of it to life, later on, so many of these profiles are rather heavy with user data. They range all the way up to about 144 gigabytes each. This user data primarily just serves as a backup, so while it’s not essential for the operation of the company, I want to keep as much as I can for long-term storage and maximally compress it.

The process started with setting up an Ubuntu server on my new VMware host and giving it a lot of RAM to use. Once the Ubuntu server was established, which on its own took a whole five minutes to install, I found a version of the self-professed “best compression software around”, 7zip, and got that installed on the virtual Ubuntu server. Then I did some light reading on 7zip, and the general rule of thumb appears to be “throw as much as you can at it and it will compress better”, so I maxed out the application with word size, dictionary size, the works. Then I started to compress folders containing all the profile data that I had backed up earlier. Throwing 144 gigabytes of data at 7zip when it’s maxed out takes a really long time. Then I noticed the older VMware cluster and realized that nothing was running on it, so for its swan song I set up another Ubuntu server, duplicated the settings from the first one, and pressed that into service as well.
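For reference, “maxed out” on the 7z command line looks roughly like the sketch below. The paths are made up, and the exact values are illustrative rather than a record of what I ran; the largest dictionary a given 7z build accepts varies, and a big dictionary needs many times its own size in RAM. The script prints the command by default; `DRY_RUN=0` would run it.

```shell
#!/bin/sh
# Dry-run by default: the command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

compress_profile() {
    src="$1"    # folder holding one extracted user profile
    dest="$2"   # archive to create
    # -t7z       use the 7z archive format
    # -mx=9      "ultra" compression level
    # -md=1536m  dictionary size (illustrative; lower it if RAM is tight)
    # -mfb=273   word size, at its maximum
    # -ms=on     solid archive, so similar files compress together
    run 7z a -t7z -mx=9 -md=1536m -mfb=273 -ms=on "$dest" "$src"
}

# Hypothetical paths for one profile.
compress_profile /srv/profiles/jsmith /srv/archive/jsmith.7z
```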

I then thought about getting a notification on my phone when the compression routine was done, but by the time I had thought about it, I had already started the 7zip compressor on both servers. Both of these were far enough along that I didn’t want to cancel either operation and lose the progress I had made compressing all these user profiles. I am not a Bash shell expert, so it took a little digging around to find that there already was a way to temporarily freeze an application and queue more commands after it, so that when the first application completes, the next one goes immediately into operation. You use Control-Z, which freezes the application, and then the command “bg %1 ; wait %1 ; extra command”. Then I thought about how I’d like to be notified and dug around for some sort of email method. None of these servers that I put together had anything at all in the way of email servers, and I really wasn’t keen on screwing around with postfix or sendmail. I discovered a utility called ssmtp which did the trick. Once I configured it for use with my workplace Office365 account and did some testing, I had just the thing that I was looking for. I suspended the compression on both servers, resumed it in the background, and queued the email utility to run as soon as it finishes. When the compression is done, I will be emailed.
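The same run-then-notify idea, written as a small script instead of typed at a prompt, might look like this. It is a sketch: `sleep 1` stands in for the hours-long 7zip run, the `notify` function just prints instead of mailing, and the email address in the comment is hypothetical. At an interactive prompt the equivalent is Control-Z followed by `bg %1 ; wait %1 ; notify`.

```shell
#!/bin/sh
# Stand-in notifier: prints instead of mailing so the sketch runs anywhere.
# With ssmtp configured, the body could instead be something like:
#   printf 'To: me@example.com\nSubject: compression done\n\nExit status: %s\n' "$1" \
#       | ssmtp me@example.com
notify() {
    echo "job exited with status $1"
}

run_then_notify() {
    "$@" &                    # start the long-running job in the background
    status=0
    wait "$!" || status=$?    # block until that job ends, keeping its exit status
    notify "$status"
}

run_then_notify sleep 1       # sleep stands in for the real compression job
```

Passing the exit status along means the email can say whether the compression actually succeeded, not just that it stopped.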

All in all, quite nifty, and it only took a few minutes to set up. Once I’m done with this particular task, I can eliminate the “junky” Ubuntu server altogether on the old VMware host and trim back the Ubuntu server running on my new VMware host. I quite love Ubuntu: it’s quick and easy, you set up what you want, tear it down when you don’t need it anymore, or put the VMware guest on ice as an appliance until you do need it sometime later. Very handy. Not having to worry about paying for it or licensing it is about as refreshing as it can get. I just need something to work a temporary job, not a permanent solution. Although, considering how much malware is out there, the difficulty of Linux for end users may eventually be outweighed by the remarkable computing safety of using it as a primary workstation operating system. There is still a long while before Linux is ready for end-user primetime. I sometimes wonder what it will take for the endless vulnerabilities of Windows to break Microsoft. Hope springs eternal!

Trials

A major Fortune 500 company has a world-renowned hiring trial for their new IT staff. There are all the usuals, the resumes, the interviews, but there is also a fully funded practical trial as part of the job application process. The job itself is cherry, practically autonomous, with real challenges and true financial backing so the winner can dig in and achieve serious results.

The trial is rather straightforward: given a property address, you must approach, perform an intake procedure to discover what is required, and then plan and execute whatever is needed to solve the IT need.

The property has one person, a newly hired young woman who is sitting at a central desk on the ground floor. She has a folder, within it, a script that she reads to each candidate:

“Welcome to your trial, this building has everything required to run a branch of our company. Every computer, networking component, and server component is placed and wired properly. Your task is to configure all the equipment throughout the branch properly. You will find all the resources you need to complete this task within the building. You have one week to complete this task. Good Luck.”

The young woman then folds her hands together and waits.

Several candidates engage with the trial, hoping to get the cherry job and have learned about the young lady at the reception desk. They pass all the requirements, and they eagerly arrive to try their hand at the trial. They impatiently sit through her canned speech and quickly head off to the basement to start in the server room.

Candidates come and go, some pass and some fail. The trial is to get the branch fully operational and on the last day of the week the branch becomes staffed, and the candidate must ensure that all the preparations are in place and that everyone can work without a technological failure. The trial is winnable but very arduous.

The young lady sitting at the central desk on the ground floor has a secret. She has a shoebox locked in a drawer attached to her desk, and around her neck is a key on a golden necklace. She has specific instructions: if a candidate approaches her, engages pleasantly, and shows sincere interest in her role in the branch, and she isn’t simply the destination of a last-ditch effort, she is to pause the conversation, unlock the desk, and produce the shoebox for the candidate. Within the shoebox is the answer to the trial: every specific requirement written in clear, actionable text, with a memory stick containing every proper configuration and a full procedure list that will bring the branch to full operation without a single hiccup. Everything from networking configurations to the copier codes for the janitorial staff is covered, and once executed it virtually guarantees a win.

How many people would simply ignore the receptionist and get cracking on the trial and how many would take their time to get to know everyone and their roles in that particular branch? Either kind of candidate can win, either through a sheer act of will or simply being kind, careful, and honestly interested in the welfare of each of their coworkers. Nobody knows about the secret key, but sometimes the answer you need comes from a place you would never expect.