There are times when linux frustrates me. Not with issues specific to one distro, but software packages in general which are written for linux.
My prime example here is Watch. In Gentoo, it's included in the procps package. I recently was confused to find that my ident daemon, which I keep running because I'm an avid IRC user, was being flooded with traffic near constantly. Netstat told me it was because of two IRC servers I ran. So I logged in, checked netstat there, and sure enough, it was them. But I had no idea what process on the servers was actually creating the connections.
I assumed it would be the ircd itself, but I wanted proof before investigating further. No problem, I thought, I'll just run
watch --differences=cumulative -n 0.1 'lsof +M -i4|grep auth', which according to watch's manpage, would show what's changed in a command's output, rather than clearing the screen and displaying the output every .1 seconds. It did do this, in a way, however, because the program creating the connection to my ident server only kept that connection for a fraction of a second, the output vanished, and thanks to the unhelpful way that watch handles output which only shows up once, all I got was some white blocks showing that there had at one point been text there.
My solution? Throw together a bash one-liner which looped infinitely until the offending program was identified:
while true; do UNR=$(lsof -M -i4|grep auth); if [ -n "$UNR" ]; then echo "$UNR"; break; fi; done
This did eventually work, and it turned out to be a runaway process on my personal box constantly creating connections to both IRC servers.
I run a number of different servers from a number of different providers. I also run servers for friends. In this post, I'll be discussing one server in particular, a friend's server that's primarily used for IRC, web hosting and minecraft. This server runs Arch.
Now I'm going to start by saying I don't always make fantastic decisions and I'm not always known for making sure everything's perfect before rebooting a server. I have, in the past, screwed servers up. But this case was a little different.
Yesterday I log into the server to update a firewall rule, and discover that, because I've never used the nat table in iptables before on that box, the module was never loaded. Of course, the box has fantastic uptime and hasn't been rebooted in over 160 days. Now I'll stop for a second to mention how Arch handles kernel upgrades. When the kernel's upgraded, the previous kernel is left on the system, and all kernels older than that are removed completely. This includes the currently running kernel, and all modules. And this box hadn't been rebooted since kernel 3 was released. I have the box set up to be as zero-maintenance as possible. Emails whenever anything happens, cronjobs taking care of updates and removal of old packages from the cache, scripts to reboot any services that have crashed, gone down or have stopped responding. But as I discovered during a routine check a week prior, the Arch box hadn't been upgrading any of its packages due to a recent change to the filesystem package. I manually started the update process, it informed me that it couldn't upgrade the system because of a file conflict (as the news article mentioned). No problem, I force-installed the update to filesystem and then upgraded the rest of the system as usual.
Now fast-forward back to yesterday, I'm telling my friend I need to reboot his server to apply the firewall rule. He gives the okay, and I reboot the server. Emails flood in saying that various services are down and that the server's offline, as always. But then a few minutes pass, there's no email that everything's back online again. I ping the server, nothing. Log into the server host's control panel, it's listed as online, so I VNC into it to see what the issue is. It can't find the harddrive and has thrown itself into a recovery terminal. What.
I figure a kernel change has messed with the partitions, so I boot the server into my usual recovery system - the gentoo live CD. Nothing seems out of place with grub or the fstab, so I look at the next culprit - the config file for mkinitcpio. It's blank.
Somehow, pacman disregarded the usual rules about protecting config files against being overwritten and messed with the config file, so the initramfs Arch needs in order to boot up properly was completely broken. No problem, I'll just chroot in -- OH WAIT. The Gentoo LiveCD runs kernel 2.6.31, and arch refuses to do /absolutely anything/ unless the kernel is new enough (In this case it had to be 2.6.32 or newer). The server host isn't exactly good, and doesn't provide any recent install media. Cue ten minutes of me googling for the kernel versions of everything in the media list, eventually settling for a Fedora 13 disk that had a recent-enough kernel. I get chrooted in, fix the mkinitcpio config and start generating the images. It complains that /dev isn't mounted. Okay. I look at the source for mkinitcpio, discover it's trying to access /dev/fd, which somehow doesn't exist. I symlink it over from /proc/self/fd and start it again. Everything seems to work, so I reboot.
This time, it recognises the hard drive, but the partition device names have changed. It's now seeing the disk as /dev/xvda. Bizarre, but has happened before. I boot the gentoo livecd again, edit grub's menu.lst and fstab, reboot back into Arch. It boots! But doesn't have a network connection. By this point, I'm close to pulling out my hair. I google around and find that because I included a bunch of xen drivers that mkinitcpio forgot before, everything's working a little differently. Specifically, it now works closer with Xen, and is trying to unplug everything at boot. No problem, I'll put
xen-emul-unplug=none in the kernel boot line. Reboot the server and the harddrive device name has changed again. I boot into gentoo, change menu.lst and fstab to use /dev/sda again, and reboot.
Finally, the server's booted and has a network connection. And this is why I no longer install arch on any of my boxen. I can't trust it to reboot without throwing a hissy fit and killing itself.
TL;DR Arch package issue sends me on an hour-long crusade to make a box boot again.
So for the millionth time, I'm starting a blog. Hopefully, this time I'll actually have stuff to write about. As a bit of an introduction, I'm Maff. I'm a computer science student, sysadmin and server support drone from Scotland. I'm also a bit of a designer, a developer and a security dork. I also love videogames, music and a whole host of other things.
I'm starting this blog in part because I want somewhere to document my pursuits (software or otherwise), but also as a place to post about the funny things that happen in real life, and as a place to rant. I may also post about the software I write, from time to time - I'm a semi-active developer over at github, and I'm the gentoo package maintainer for Monitorix.