It's not a bug, it's a feature...

Tue, 19 Jan 2010

Changes

Without change, something sleeps inside us, and seldom awakens.
-- Duke Leto Atreides, Dune

This past week, after nine and a half years with CSAIL and its predecessor, LCS, I announced my resignation. I prefer, however, the term "graduation." I've been with CSAIL for its entire history, and for my entire professional career. I started work at LCS in 2000, while still an undergrad at Northeastern. When the need to build an infrastructure from scratch presented itself with the formation of CSAIL, my group merged with its counterpart from the AI Lab to form TIG, and we set to work. With what I believe were very limited resources, particularly on the personnel front, we accomplished what I believe was a small miracle when we coordinated the new Lab's move to the Stata Center while also redefining how computers would be used and managed at the Lab in the coming years. In the time since then, I have worked on numerous projects within TIG, ranging from email service to GNU/Linux installation systems.

Preparing the Lab for life after my departure has been interesting. It's not the sort of thing I have any experience with. Every once in a while, when talking about long-term plans for some service or system, it will occur to me that I won't be here to see those plans to fruition. When interviewing candidates for the two open positions within TIG, I'll occasionally have to stop and remind myself that the candidates won't end up being my coworkers, but rather my successors. It's hard to come to grips with the thought of leaving behind a lot of things that I've built and maintained for years, to say nothing of the relationships that I've (slowly) cultivated with my coworkers and other CSAIL members.

It's been observed countless times over the years that one never really leaves CSAIL. Once the Lab gets its hooks into you, it doesn't let go easily. Whatever that means for me personally remains to be seen, but no matter what happens, you'll never be able to get all of me out of the Lab, or all of the Lab out of me.

My first day at LCS was August 14, 2000. My last day at CSAIL will be January 29, 2010. Next stop, sunny California.

Thu, 03 Dec 2009

Well isn't this fancy

I recently took an hour or so to apply a fancy new layout to my blog. Thanks go to free-css-templates.com for creating some nice CSS templates and distributing them under a Creative Commons license. Of course, just because my blog looks pretty doesn't mean I'll spend any more time updating it. I always mean to, though, because there's a lot of good stuff that goes on behind the scenes at CSAIL that most people don't get to see.

Sun, 01 Nov 2009

CSAIL Debian installer updates

I spent a couple hours this weekend reworking some components of the CSAIL Debian installer. The early user interface looks much nicer and is menu driven, which is going to make end-users much happier. Having to type (and remember to type!) something like "install64" to get a 64 bit installation was suboptimal. There's also an option to boot into grub, which is convenient for certain rescue situations. Grub also has default entries for useful tools like memtest86+.

I also rewrote the tools that TIG uses to create installation disks. The process is much less complex now, and in most cases will just amount to typing 'make'. That'll make life much easier for us when we need to update the installer disks. That's always nice.

Wed, 19 Aug 2009

A new CSAIL uptime record?

A research group at CSAIL just contacted TIG to see about getting some help upgrading their group server, which was running CSAIL Debian. I logged in to have a look and noted the following:

 09:26:40 up 1139 days, 15:28,  3 users,  load average: 1.08, 0.83, 0.54

The machine hadn't been rebooted since early July, 2006, and was still running Debian 3.1 (sarge), close to a year and a half after support for sarge was terminated upstream and by TIG. The reliability is certainly something to be appreciated, but I'm always concerned when I see machines in this state that haven't had much attention from professional sysadmins.
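
As a sanity check on that date, the boot time can be recovered from the uptime line with a little date arithmetic. This is a minimal Python sketch; pairing the clock time in the output with the date of this post is an assumption, since the uptime line itself carries no date.

```python
from datetime import datetime, timedelta

# Clock time from the uptime output, paired with the date of this post
# (assumption: the command was run on the day the post was written).
observed = datetime(2009, 8, 19, 9, 26, 40)

# "up 1139 days, 15:28" means 1139 days, 15 hours, 28 minutes.
uptime = timedelta(days=1139, hours=15, minutes=28)

boot_time = observed - uptime
print(boot_time.strftime("%a, %d %b %Y %H:%M"))
```

The result lands in the first week of July 2006, which agrees with the estimate above.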

The requested upgrade to lenny, with an intermediate stop in etch, is under way. I'll feel guilty typing reboot when the time comes, though.

Wed, 01 Jul 2009

Have I mentioned how much I love graphs?

Performance on the IMAP server had been getting progressively worse following a recent Cyrus upgrade. A periodic maintenance job on the IMAP server was taking an increasing amount of time to run, and was driving the system load progressively higher. One-minute load averages of over 70 were starting to show up with regularity, and the problem was getting more and more noticeable.

A fix was eventually found. Can you guess when it was put in place?

IMAP server load

Tue, 05 May 2009

Interesting spam trends

I'm not sure yet what it shows, but clearly something about the nature of the spam reaching CSAIL's mail exchanger changed a couple of weeks ago:

spam trends graph

Notice how the number of messages rejected outright (red line) usually tracks the number of received and completed messages fairly closely. Then things change fairly dramatically. I still don't know how to explain the increase in rejected spam, especially considering that there's no apparent increase in mail that actually gets processed. It's as though we're somehow rejecting an entire spam campaign, but that doesn't seem very likely unless the spammers are really clueless.

There are several reasons why a given message could be rejected by our servers. I haven't yet investigated deeply enough to see which mechanism is responsible for this increase.

Thu, 27 Nov 2008

Network Scanning

Recently there was a discussion about whether or not we should permit the IS&T security people to scan our network for a relatively new and quite dangerous Windows security vulnerability. The vulnerability was patched by Microsoft some time ago, and the only systems left without protection are those that are both poorly administered and that have automatic updates disabled. We need to know about machines on our network that fit that description. That alone is good enough to warrant this sort of active scanning. We asked a representative subsection of the lab if they thought we should go ahead with the scanning, and received close enough to a consensus that we went ahead with it. However, reflecting on this tonight leads me to believe that we should not have even bothered asking. The prime reason for this conclusion is the clear fact that the "bad guys" are doing this all the time, and with much more nefarious purposes in mind. We already identify many hundreds of unique attempts each day to actually probe our systems for vulnerabilities. If we're concerned that our own scans are somehow going to disrupt normal operation of the target systems, we'd better be doing a whole lot more to protect these systems from the rest of the Internet.

Mon, 25 Aug 2008

Back from vacation...

I took most of August off again, similar to last year. Today was my first day back, and wow, how different work is from relaxing at a small lake house in Maine! While things had mostly gone smoothly in my absence, there was plenty of stuff piled up for me to deal with. Naturally a lot of it is urgent. Much of it has to do with the new CSAIL web site, which is nearing deployment. Those are the most time-sensitive issues, anyway, since the administration wants to deploy the site very soon (a week, roughly, depending on various bug squashing cycles).

It's good to be back, though.

Sun, 06 Apr 2008

New IMAP hardware deployed.

Yesterday finally saw the deployment of the new IMAP server, and there was much rejoicing. Feedback from users ranged from "definite visible performance improvements" to "holy dogshit this new server is fast." Hopefully it'll continue to be fast. A big part of what slowed down the old system was the multiple I/O intensive tasks running simultaneously, and we haven't seen that happen on the new hardware yet. There are two main tasks that run periodically and really hit the disks hard. The first is the filesystem backup, which stats every file on the system and reads a whole bunch of them. The second is the server-side search index generator, called squatter, which does something fairly similar in order to build indexes of message contents so fast server-side searches can be performed. Think of this as somewhat similar to the Unix updatedb/locate commands, except that it indexes files by content. More like Google Desktop, I guess.

One thing that makes me somewhat optimistic, though, is that I manually triggered a squatter run against a subset of our mailbox list, and it took 2.5 hours on the new hardware. Looking at the most recent two automated runs over the same set of folders on the old hardware shows that they each took well over 13 hours! That's a huge difference! I don't know if the backup process was running at the time or not, but I suspect it was. It wasn't when I triggered the manual run. Either way, we should probably work out a way to avoid having backups and squatter run at the same time. That's somewhat difficult, though, since backups could take several days on the old hardware. Hopefully the newer system really is faster!
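
One simple way to keep backups and squatter from overlapping would be to have both jobs take the same advisory lock before they start. The following is only a sketch of that idea in Python, not what we actually run: the lock path and the run_exclusively helper are hypothetical names, and each job would need to be started through a wrapper like this.

```python
import fcntl
import subprocess

# Hypothetical lock file shared by the backup and squatter wrappers.
LOCK_PATH = "/tmp/io-heavy-job.lock"


def run_exclusively(command, lock_path=LOCK_PATH, wait=True):
    """Run command while holding an exclusive advisory lock.

    Returns the command's exit status, or None if wait is False and
    another job already holds the lock.
    """
    with open(lock_path, "w") as lock_file:
        flags = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.flock(lock_file, flags)
        except BlockingIOError:
            return None  # another job is running; skip this run
        try:
            return subprocess.call(command)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With wait=True the second job simply queues behind the first, which could mean a very long delay if a backup really takes days. With wait=False it gives up immediately and can try again on its next scheduled run, which is probably the better fit for squatter.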

Tue, 04 Mar 2008

15:25:42 up 472 days, 21:19, 1 user, load average: 0.11, 0.14, 0.15

It's sort of sad that most people never realize how reliable most of TIG's services really are. The above uptime is from a production server that holds slightly less than a terabyte of critical data. The system literally has never rebooted. That's pretty cool.

Mon, 04 Feb 2008

IMAP server stuff

I think we've got a nice solution to the IMAP server upgrade that will really minimize downtime, which is great news. Since the current IMAP server stores its disks on an external fibre-channel RAID box, we'll just connect this box to the new server (once we buy an HBA for it) and move the mailboxes to the new filesystem using the Cyrus IMAP software itself, which can move live mailboxes similar to the way AFS can move live volumes. This means that we really only need an outage lasting long enough to move the RAID array, which shouldn't take more than about 15 minutes. Very cool.

Fri, 01 Feb 2008

New hardware for the IMAP server

I mentioned recently that we've received new hardware for the lab's IMAP server. I've begun work toward migrating to the new system, but I think it's going to be more difficult than originally planned. I simply haven't figured out a way to synchronize all the mailboxes quickly enough. The IMAP service needs to be shut down completely for a period of time so we can make sure the filesystem on the new system matches the filesystem on the old system. Unfortunately, I don't see how we can keep this outage to an acceptably short period of time. There are simply too many files.

I've tried a few different approaches to synchronize the filesystem. It's tempting to try rsync, since it only copies the files that actually change, and in this case would leave most of the filesystem completely untouched. Unfortunately, rsync needs to construct a detailed index of the files before it can work, and this operation takes a very long time when dealing with ~14 million files. I've tried to optimize this by splitting the filesystem into smaller chunks (typically individual users' mailboxes) and synchronizing them individually. I've tried running rsync over multiple chunks in parallel with varying numbers of rsync processes. This has helped, but not nearly enough.
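
For the curious, the chunked, parallel approach looks roughly like the sketch below. This is an illustration in Python rather than the scripts I actually used; the function names, the worker count, and the destination format are assumptions, and the only rsync options used are the standard -a and --delete.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rsync_command(chunk, dest_root):
    """Build the rsync invocation for one chunk (e.g. one user's mailbox)."""
    # -a preserves permissions and times; --delete removes files that
    # no longer exist on the source side.
    return ["rsync", "-a", "--delete", f"{chunk}/", f"{dest_root}/{chunk.name}/"]


def sync_chunks(src_root, dest_root, workers=4, runner=subprocess.call):
    """Run one rsync per top-level directory, `workers` at a time.

    Returns a dict mapping chunk name to the exit status from runner.
    """
    chunks = sorted(p for p in Path(src_root).iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = pool.map(lambda c: runner(rsync_command(c, dest_root)), chunks)
        return {c.name: s for c, s in zip(chunks, statuses)}
```

Each rsync still has to walk and index its own chunk, so this only spreads the indexing cost across processes. If the disks are the bottleneck, adding workers stops helping fairly quickly.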

Doing a complete filesystem copy gets us away from the rsync overhead, but requires that we copy the entire filesystem contents. That's not a cheap operation, either, since the filesystem contains over half a terabyte of small files.

There are a couple of options left available to us, but they both involve Real Work. Kcr has been advocating that we switch to a Cyrus Murder configuration, which could help us here. With the new server and the old server configured as backend IMAP servers, we could serve IMAP mailboxes from both machines at the same time, taking individual (or small groups of) mailboxes offline to move to the new server. This would likely still involve downtime, but would allow us to spread the downtime across several days or weeks, keeping the individual outages very short. If we could find a way to limit the outages only to the specific mailboxes being copied, that would be even better, but I'm not sure that's possible.

Another option might be to disconnect the current IMAP server's RAID array and plug it in to the new server. The new server would then take over as the IMAP server, and we could synchronize to local disk via the direct fibre-channel connection, rather than over the network. I don't think this will help, though, because the network is not the bottleneck in the current setup. The bottleneck seems to be the RAID array itself (and the configuration of the filesystem on it), which we'd carry right over to the new system with us.

So, the point of all this, I think, is that we're not as close to rolling out the new server as I'd hoped.

Wed, 23 Jan 2008

Hardware avalanche

The production lifetime of a whole bunch of TIG hardware has passed, and we've started ordering replacement machines. Our storage room is currently overflowing with new systems that nobody has had time to unbox yet. We've got something like 8 terabytes of new AFS storage space, some new mail infrastructure hardware, and some new hardware to host Xen virtual machines. We'll probably be getting more before too long.

I'm really excited about finally replacing the IMAP server and really taking the time to tune and optimize the new system. Filesystem access on the current server is painfully slow at this point. The new system has much faster disks, and, having seen how bad things are with the current system, will get a lot more tuning before being deployed. The IMAP server's workload is particularly difficult to handle, because it involves lots of small files (one per message) with lots of random seeking. Disk seeks are expensive, so any optimization we can do there will help a whole lot. I'm not sure when this system will go online, but hopefully within the next couple of weeks.

Sun, 30 Dec 2007

spamhaus

We use spamhaus.org's blacklists as part of our anti-spam strategy for incoming mail. Unfortunately, we recently exceeded some threshold for accesses per day to the spamhaus.org public dnsbl servers and they blocked our access. This resulted in a higher than normal load on our spamassassin servers due to messages that otherwise would have been rejected before the content-examination phase being accepted. It's likely that this has also led to a higher than normal number of false negatives (spam not being tagged as such), simply because, if the percentages remain constant but the volume of mail increases, more spam will get through. Fortunately most of the mail that we were rejecting really is easily detectable as spam by spamassassin, so nearly all of the increased volume did get properly tagged. A quick examination of my Spam folder backs up this hypothesis. We automatically delete messages older than 2 weeks from Spam folders on the CSAIL IMAP server, and so normally I can expect to find just over 6000 messages in my spam folder at any given time. Tonight, however, I find 11,000 messages.

I've signed the lab up for a trial membership of the spamhaus data feed service, which essentially allows us to mirror the spamhaus DNS zones. I deployed this over the weekend and just re-enabled spamhaus checks on the incoming mail hub. It seems to be working very well, and I'm excited to see the mail delivery performance return to normal. We'll sign up for the paid service soon, if everything continues to go well.

Sat, 01 Dec 2007

AFS fileserver issue

One of our AFS fileservers lost a disk late this afternoon, resulting in a couple hours of downtime. A single disk failure shouldn't result in any downtime, but in this case it did. The disk was part of a mirror set hosting the machine's root filesystem and boot blocks, and for some reason the machine didn't seem to notice correctly that the disk had failed, so it continued trying to access it. This resulted in access attempts hanging, causing the machine to develop a backlog of AFS fileserver requests, which eventually triggered an alert to the TIG oncall people (which included me this weekend).

The dead disk has been replaced, and things are OK again...

Mon, 15 Oct 2007

Spam updates...

I just enabled SpamAssassin's Shortcircuit plugin on incoming.csail.mit.edu, which should really help with throughput in cases where ~900 messages suddenly show up from e.g. eventcalendar@csail.mit.edu.

I was recently asked for some statistics about how much mail we're seeing, what percent is identifiable as spam, etc. So here they are, the daily averages from week 40 of 2007:

* 181786 messages accepted per day

* 93656 messages rejected per day

* 46000 messages examined by spamassassin per day

* 30017 messages tagged as spam by spamassassin per day
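
To turn those counts into percentages, a few lines of Python will do. This is just arithmetic on the numbers above; the one assumption is that "accepted" and "rejected" are disjoint counts, so that their sum is the total number of delivery attempts.

```python
# Daily averages quoted above (week 40 of 2007).
accepted = 181786
rejected = 93656
examined = 46000
tagged_spam = 30017

# Assumption: "accepted" and "rejected" are disjoint, so their sum is
# the total number of delivery attempts per day.
attempts = accepted + rejected
rejected_share = rejected / attempts

# Share of content-scanned messages that spamassassin tagged as spam.
tagged_share = tagged_spam / examined

print(f"rejected outright: {rejected_share:.1%} of {attempts} attempts")
print(f"tagged as spam:    {tagged_share:.1%} of {examined} examined")
```

Under that assumption, roughly a third of all attempts are rejected outright, and about two thirds of the messages that reach spamassassin get tagged as spam.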

IMAP server upgraded to etch

imap.csail.mit.edu has been upgraded from Debian sarge to etch. This was mostly very simple, except for the odd situation where the SCSI controller refused to see the disks until the machine was powercycled. This change doesn't actually accomplish much, but it gets us ready for a future upgrade to Apache 2.2 and Cyrus 2.2, as well as a move to new hardware.

Thu, 27 Sep 2007

Power management: Progress

So, after applying the various settings and updates I mentioned in my previous post, I've generated a new graph of the power consumption by the same ThinkPad. The original plots are still included for reference. This time the machine wasn't sitting completely idle, since I was using it at the time, mostly to post my previous entry and to continue researching power tweaks, so the graph is noticeably less stable. But it's still very interesting:

graph of T60p with power management tweaks enabled

You can see that, while it's not yet quite as battery friendly as the T40p, the T60p running Linux 2.6.22 uses much less power than it used to!

Tue, 25 Sep 2007

Won't somebody please think of the electrons?

Some time ago, CSAIL bought me a new laptop to replace my aging, but still quite capable, ThinkPad T40p. The new laptop, a ThinkPad T60p, is basically a straight upgrade from the old one. Same physical size, but "bigger" in the abstract sense. Except when it comes to battery life, where it's clearly worse than the T40p. I lived with this for almost a year until stumbling across LessWatts.org, an Intel-run site with a bunch of information about power management and tweaks in Linux.

Since my goal is to make the T60p act more like the T40p in terms of power consumption, my first task was to quantify the actual difference between the two, so I don't need to rely on feel. Since both machines run Debian stable (etch) with the same kernel, (and thus the same power management code), this was fairly easy. Here's a graph comparing the power consumption of the two machines, in very similar configurations, sitting entirely idle:

power consumption rate graph

The graph shows that, without doing anything unusual to minimize power consumption, the T60p uses approximately twice the power of the T40p.

So now it's time to start actually trying to do something about what seems to me to be a horribly inefficient use of power. I'll start by switching to a newer Linux kernel supporting dynticks, being more aggressive in the use of wireless network device power management, and enabling the power saving CPU scheduler options. Since my laptop contains an Intel AHCI SATA controller, I can also experiment with SATA power management, which appears to have the most potential for dramatically reducing power consumption. I'll get to that when I have some idea about the effects of the first few tweaks.

Thu, 19 Jul 2007

Cool app of the day

I'm probably behind the times, but I just recently started using Synergy, a really slick little program for sharing keyboard and mouse input between multiple machines. It allows me to sit here typing at my laptop, as I'm doing right now, but by simply dragging my mouse pointer off the edge of the screen, I can magically start typing on my desktop machine. It also synchronizes the X clipboard between machines, so I can easily copy & paste from one desktop to another.

Fri, 06 Jul 2007

New SpamAssassin deployed

I just finished updating to a newer spamassassin on the mail servers. I also took a while to re-work the sources we use for our third-party updates, so we should do better at keeping up with changing rules in response to new tactics being employed by the spammers.

Thu, 15 Mar 2007

New kredentials packages on their way

Earlier this year I was sent a patch to kredentials from somebody at the Max Planck Institute in Germany. It's great to see that it's been adopted outside of CSAIL, and it's even better that they're sending patches! The patch is great because it was obviously written by somebody who writes a lot more C++ than I do. He fixed some memory leaks and wrote a new feature allowing Kredentials to prompt for the user's kerberos password and get new tickets. I've rolled his patch in to Kredentials and also fixed a bug that prevented autoconf from properly configuring the package on Debian GNU/kFreeBSD systems. (What's scarier than the fact that Debian has been ported to the FreeBSD kernel is that it actually sounds like the port is pretty mature at this point.) I'm testing these new packages at the lab now, and will upload to Debian soon...

Wed, 21 Feb 2007

Still waiting on Debian

We're still waiting for the etch release. The big blockers at this point are apparently the kernel and installer, though progress is definitely being made. The second release candidate of the main installer was announced recently, along with a new kernel upload.

Though it's not released yet, several people within the lab, myself included, are testing etch on our workstations. For the most part things are going very smoothly. Upgrades from sarge are still somewhat tricky, especially if you've got a whole lot of custom changes on your machine, and they also take a lot longer than clean installs. I can promise you that if you've got a dual-monitor configuration in X right now, we're going to break it when we upgrade your machine. It's not difficult to restore the working configuration, but it does require a little bit of intervention.

While working on CSAIL Debian I also took the time to update the INQUIR clients, fixing a number of bugs. I've started a little bit of work toward a web front end based on this tool, but keep getting sidetracked...

Wed, 24 Jan 2007

Spam filtering updates

I made some changes to the way we process spam on our servers. Since ai.mit.edu and some of the other "legacy" domains are entirely virtualized (i.e. all addresses in them simply forward to addresses at other domains) I disabled content-based spam filtering on mail to them. This won't affect most people, since they've undoubtedly got spam filtering at their final destination (e.g. csail.mit.edu or mit.edu). I've re-enabled rejection of messages to lists.csail.mit.edu that get SA scores of 20 or higher during the SMTP "data" phase. However, in order to avoid sending spurious bounces, we don't reject mail on the list server if it was received from another server on our network, since that server would end up sending a bounce to some (most likely completely innocent) third party. In those cases, we add the standard message headers so list admins can filter traffic to their lists as described at Mailing List Spam Filtering.

Another change that I made effectively whitelists mail from authenticated senders. That's helpful because some people send mail from hosts that might otherwise look like spam senders, but we can be sure that they're actually sending legit mail because they've provided a username and password or client certificate.
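
Put together, the decision logic looks something like the Python sketch below. This is an illustration of the policy, not our actual mail server configuration: the Message fields and the list_server_action function are hypothetical names, combining the two rules in one function is a simplification, and the handling of mail that scores below the threshold (plain acceptance) is an assumption.

```python
from dataclasses import dataclass

REJECT_SCORE = 20.0  # SA score threshold mentioned above


@dataclass
class Message:
    sa_score: float            # SpamAssassin score computed during DATA
    from_internal_relay: bool  # handed to us by another server on our network
    authenticated: bool        # sender gave a password or client certificate


def list_server_action(msg):
    """Return 'accept', 'tag', or 'reject' for mail arriving at the list server."""
    if msg.authenticated:
        return "accept"  # authenticated senders are effectively whitelisted
    if msg.sa_score >= REJECT_SCORE:
        if msg.from_internal_relay:
            # Rejecting here would make our own relay bounce to a
            # (probably forged) third-party sender, so add headers instead.
            return "tag"
        return "reject"
    return "accept"  # assumption: lower-scoring mail is delivered normally
```

The ordering matters: the authentication check comes first, so a legitimate user sending from a sketchy-looking host never reaches the score test at all.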

Wed, 17 Jan 2007

My ears!

This past weekend the cache batteries failed on our BlueArc Titan, setting off an audible alarm. BlueArc sent us replacement batteries, but until they arrived, the alarm could not be turned off. The "acknowledge" button worked, but a few seconds later the system would apparently "re-notice" that the battery had failed and trigger the alarm again. It was loud!

Finally replacing the batteries on the machine was a somewhat nerve-wracking experience. It involved actually removing the RAID controller from the Titan and opening the canister in which it's housed. There are two redundant RAID controllers in the Titan, so pulling one shouldn't be too much trouble. But still, it's not something one feels entirely comfortable doing! It seems to have gone OK, though. It's been about 30 hours and nothing has died yet.

Wed, 03 Jan 2007

CSAIL Debian etch installations

I've just completed the first successful from-scratch installation of etch using the FAI-based CSAIL Debian installer. There were a few updates I had to make in order to get our installer to work using newer versions of FAI, but nothing too complicated. Overall it took just a couple of days of work, and is now working quite well. I've fixed a number of outstanding bugs in the process, and have added the ability to install on machines without static IP addresses. I'm not sure we actually want to support running CSAIL Debian on dynamic IPs, but the ability to do the installation will come in handy.

Fri, 29 Dec 2006

More RAM added to mail servers

I've recently added 2 GB of RAM to a couple of the mail servers in CSAIL, doubling their memory to 4 GB. This is in part due to SpamAssassin 3, along with the rulesemporium.com rulesets, using significantly more RAM than previous versions. The volume of incoming mail (both spam and ham) has also grown since we first deployed these servers. Thus far the improvement in throughput on the spamassassin system is noticeable.

This re-emphasizes the need for better trend monitoring on these systems, though. I want to be able to point to a change in a graph and say "This is because we added more RAM to these servers." We need to come up with a good strategy for tracking the workload on these machines over time and presenting it in a meaningful way. This will be interesting for a number of reasons, and I'm disappointed that we haven't been keeping good historical performance data all along.

Wed, 27 Dec 2006

Still drowning in log files...

...But at least they don't taste so bad.

Like most sysadmins, I spend a fair bit of time reading log files. These come from roughly 3 dozen servers and a few hundred workstations. There's some helpful software out there (notably logcheck), but there's still a lot to read. Logcheck works by excluding certain patterns from log files, and mailing the rest of the content to the admin. The more time one spends tuning the logcheck database, the easier it gets to read the rest.

One thing I've always wished logcheck could do was use some sort of threshold system. There are many messages that, if they only happen once, are no big deal and can be ignored. If they happen many times, however, they are quite important. Logcheck doesn't have a mechanism for dealing with this sort of thing. So I suffer through a bunch more messages than I really need to.
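
To make the idea concrete, here's a rough Python sketch of the kind of threshold filter I have in mind. To be clear, this is not a logcheck feature or anything based on logcheck's code; the patterns, thresholds, and the summarize function are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical rules: (pattern, minimum occurrences before reporting).
RULES = [
    (re.compile(r"Failed password for \S+"), 5),
    (re.compile(r"I/O error"), 1),
]


def summarize(lines, rules=RULES):
    """Return {pattern: count} for rules whose count reached its threshold."""
    counts = Counter()
    for line in lines:
        for pattern, _ in rules:
            if pattern.search(line):
                counts[pattern.pattern] += 1
                break  # first matching rule wins
    return {p.pattern: counts[p.pattern] for p, t in rules if counts[p.pattern] >= t}
```

A handful of failed logins would stay out of the report, while a burst of them (or a single disk error) would show up along with a count.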

There are a number of other log analyzers that I'd like to investigate, some as a supplement to logcheck, and others as a replacement. splunk and logwatch are a couple of them. I use logwatch on a machine at home, and it generates decent summaries of logfiles. I've tried it here at the lab, though, and it doesn't seem to work well in an environment where it runs on a machine that is an aggregation point for logs from many machines.

Captchas enabled

Damn the spammers. I've been forced to enable captchas in an effort to combat spam on this blog. It's amazing how persistent the spammers are at finding and spamming blogs, even very obscure blogs based on relatively obscure software. It was sort of interesting to watch how the spammers probed the blog. They found it and posted a couple of individual comments the first few times, which I removed fairly quickly. Then, suddenly, several days after the first spam events they appear to have automated the posting process, and flooded the blog with several hundred spams.

Thu, 07 Dec 2006

LISA, day n

I'll write more about Cory Doctorow's keynote and that sort of thing later. For now, here's a link to the pictures I've taken so far.

Tue, 05 Dec 2006

LISA, day 3

It's hard to believe it's already Tuesday. The conference has certainly kept me very busy, and there have been numerous topics on which I've meant to write, only to find myself exhausted and in desperate need of sleep before I found the chance to do anything about it.

Sunday's Perl class focused on a topic that seems really popular among a certain group of hardcore Perl programmers. They all seem to really enjoy hacking on the language itself. (Look at modules like Coy.pm, which outputs error messages as haiku, or Language::l33t, a l33t-speak interpreter.) We spent a fair bit of time talking about accessing the various symbol tables and namespaces and that sort of thing. It was actually quite interesting, though I don't know how useful it'll ever be for me.

Monday was divided between two classes. I spent the morning in "The Latest Hacking Tools and Defenses", an interesting class that focused largely on web vulnerabilities like XSS, SQL injection, and code injection. The afternoon session was "Documentation Techniques for Sysadmins", a class that I felt had the potential to really be applicable to TIG.

The hacking class was interesting because a lot has changed since I last paid close attention to the security scene. A few older players like Nessus have gone closed-source, triggering forks of the previously open code, such as OpenVAS. There are a number of new packages out there that look either interesting or disturbing, depending on your perspective. BiDiBLAH is a good example of such a tool. It combines the scanning capabilities of Nmap with the probing of Nessus and the exploit framework of Metasploit. Damn. We also covered Bluetooth vulnerabilities and scanners, as well as some really far out stuff like the use of keystroke audio analysis to "sniff" passwords simply by recording the sound of somebody typing. (OK, it's more complex than that, using statistics and neural nets to learn what each key sounds like, but it has been accomplished, and the accuracy is impressive.)

The documentation class was interesting as well. It covered things that I have always had some sense of, but never quite thought out in depth, such as document lifecycles and presentation issues. It was pretty much all stuff that I'd learned in the technical writing class I had to take at Northeastern, if not even sooner, but it's good to refresh that memory, and hopefully I'll be able to take something back to work from that class.

One of the topics we discussed was the use of wikis to organize, manage, and present documentation. The topic was interesting enough that we decided we'd have a BoF later that night where we could discuss wikis in further detail. About 40 people showed up for over an hour and a half. That sort of event is what makes LISA so valuable. The discussion was active for the whole time, and we got to share loads of experience with wikis in various different environments, used by different sorts of people, solving different sorts of problems. We discussed the various social barriers that sometimes get in the way of wiki adoption, different ways to use wikis, and the technical details, features, and drawbacks of various wiki implementations. I was somewhat surprised to find that roughly half the people were using TWiki, which powers the TIG web site, while the rest were a mix of Confluence, MediaWiki, DokuWiki, and Moin Moin. I had never even heard of Confluence, which is apparently a commercial product, and was somewhat surprised to see relatively few MediaWiki users given its popularity within the lab.

I spent today in David Blank-Edelman's "Over the Edge System Administration" classes. They were fun, with a focus on creative problem solving. Examples include the now-famous lpd jukebox (which I can't seem to find using Google... am I being dense, or just overtired?) and the use of IRC or various IM services as automated alert channels.

Tomorrow begins the technical sessions and invited talks, which I'm really looking forward to. I need to sit down with the schedule and try to work out which ones I can go to. It sucks when multiple really interesting sounding sessions overlap.

I've already got a number of interesting ideas to take home. I just hope I get them all written down and organized so I remember them...

Sun, 03 Dec 2006

It's so we can get to slide 89!

While going over one of the slides in this morning's portion of his class, Tom Christiansen said something to the effect of "You're probably wondering when you'd ever use something like this. Well, it's so we can get to slide 89!" That's become a theme of the class so far. For the most part it's been good, though. Lots of material on typeglobs, symbol tables, and the like. It's stuff I haven't found myself using in my code, which may be because I've never been quite comfortable with how to use it. So this class is useful.

Sat, 02 Dec 2006

LISA 2006 day 0

So the conference hasn't actually started yet, but there have been a couple of introductory events. Sign-in has happened, and I was able to pick up my conference materials, schwag, t-shirt, and USENIX tote bag. The shirt and bag are nicer than the ones from past conferences I've attended. I might actually wear the shirt! The hotel leaves a little bit to be desired; it's probably the least interesting of all the hotels I've stayed at for USENIX conferences. But the neighborhood is interesting, so I might find myself spending more time outside the hotel than in the past. And that's probably a good thing.

Tomorrow is Tom Christiansen's Advanced Perl Programming class. It'll be interesting, but I really haven't written very much Perl code recently, and my next big project is likely to be written in something other than Perl. But advanced programming topics, in general, are interesting and fun to learn about, and adapting the high-level ideas to other languages will be fun. So I look forward to the class.

Thu, 30 Nov 2006

Writebacks enabled

I installed the writeback plugin on this blog, so if you see something interesting and want to reply, or want to point out an error or whatever, please click the comments link below each story!

Fri, 17 Nov 2006

LISA 2006

Approval has been given for me to attend the 2006 LISA conference in Washington, D.C. It's in just a few weeks, so I'm busy picking out the classes and tech sessions I want to attend, booking air travel and a hotel room, and generally getting myself ready for my first conference in four years.

LISA is a conference for sysadmins sponsored by USENIX. It's an annual conference that has been going on for the past 20 years and is basically the biggest gathering of sysadmins and related professionals and researchers in the world. My schedule will consist of three days of classes, followed by the conference itself (that is, the technical sessions, invited talks, "guru sessions", BoFs etc.). The keynote address this year is being given by Cory Doctorow of sci-fi, Boing Boing, and EFF fame. That will be a lot of fun.

I'm currently signed up for a number of interesting classes, including Advanced Perl Programming, Documentation Techniques for Sysadmins, The Latest Hacking Tools and Defenses, and Over the Edge System Administration, volumes 1 and 2. Following those classes will be a number of tech sessions on topics ranging from anti-spam techniques to VoIP to legal issues surrounding recent and proposed federal wiretapping legislation as they affect the system administrator.

More will follow...

Tue, 17 Oct 2006

It's not just your email, you know...

My last several posts here have related in some way to the CSAIL mail system. While that's a big part of what I work on, it's definitely not everything. One of my other big projects is the development and maintenance of CSAIL Debian, our custom variant of Debian GNU/Linux. Even with recent delays, Debian is getting ready to release a new version (4.0, codenamed "etch", after the Etch A Sketch character from Toy Story) somewhere around the end of this year. Once that happens, we need to start preparing to upgrade all the CSAIL Debian machines in the lab. There are several hundred of them, all in different states. One of the unique things about CSAIL is that computer owners here get root access on their machines, and we still support them. Most places give you a choice between root access and support, if they even give you the choice of root access at all.

In preparation for the upgrade, I first plan on upgrading to cfengine2, and that's the primary focus of my CSAIL Debian efforts right now. There's other random stuff, including kernel updates to support some very new hardware, worrying about some difficult-to-resolve security problems, and some work for Debian itself, mainly in my role as a security officer.

Mon, 16 Oct 2006

And the mystery remains unsolved...

I've again disabled all use of TLS on outgoing mail on outgoing.csail.mit.edu. It seems like it might still be misbehaving. OTOH, the queue runners I've seen crashing may be left over from before I made the change to disable client-side certificates. In either case, I was very disappointed to see that exim was still segfaulting all weekend.

For the record, it's worth noting that no mail is getting lost, or even bounced, when exim crashes in this case. The mail has already been accepted into exim's queue by the time we see the crashes. That means that our server has taken responsibility for handling the message delivery. When it encounters a message that causes an unexpected problem like these signal 11s, it marks the message as "frozen" and saves it to disk. It's then up to an administrator (me) to investigate the problem and get the message delivered.
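
For the curious, here's a rough sketch of how one might pick the frozen messages out of a queue listing. exim's "exim -bp" prints the queue and tags frozen messages with "*** frozen ***". The sample listing, addresses, and message IDs below are made up for illustration, and the function name is hypothetical. This is not the tooling actually used on the CSAIL servers.

```python
import re

# One queue entry header as printed by "exim -bp": age, size, message ID,
# sender in angle brackets, and an optional "*** frozen ***" marker.
# The layout is assumed from typical exim 4 output.
QUEUE_HEADER = re.compile(
    r"^\s*(?P<age>\d+[mhd])\s+(?P<size>\S+)\s+"
    r"(?P<msgid>[0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2})\s+"
    r"<(?P<sender>[^>]*)>(?P<frozen>\s+\*\*\* frozen \*\*\*)?\s*$"
)


def frozen_messages(listing):
    """Return (message ID, sender) pairs for frozen entries in a queue listing."""
    frozen = []
    for line in listing.splitlines():
        match = QUEUE_HEADER.match(line)
        # Recipient lines and blank lines don't match the header pattern.
        if match and match.group("frozen"):
            frozen.append((match.group("msgid"), match.group("sender")))
    return frozen


# Hypothetical listing: one frozen message, one ordinary queued message.
SAMPLE = """\
 2d  1.2K 1GYabc-0001Xy-7Q <alice@example.org> *** frozen ***
          bob@example.net

 5m  3.4K 1GYabd-0001Xz-2B <carol@example.org>
          dave@example.net
"""

if __name__ == "__main__":
    for msgid, sender in frozen_messages(SAMPLE):
        print(msgid, sender)
```

On a real server the listing would come from running "exim -bp" rather than from a canned string, and each ID found this way still needs a human to look at why it froze.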

Sat, 14 Oct 2006

The case of the sudden segfaults

The CSAIL mail servers run exim, a very flexible and powerful mail server. Typically a very stable one, too, at least until just recently. It was brought to my attention yesterday that CSAIL users were unable to send mail to addresses @broad.mit.edu. Investigation revealed that yes, there was a problem. The exim process responsible for handling the remote SMTP session was crashing when communicating with the Broad servers. Trial and error revealed that this was related to the SSL certificate exchange taking place. I disabled SSL for remote SMTP sessions to work around the problem while I investigated further. I copied outgoing.csail.mit.edu's exim configs over to a test system, only to find that exim would crash there as well. How is this possible? It had never done this before!

More testing revealed that the list server, tweety.csail.mit.edu, which runs exim and handles its own deliveries, was able to send mail over an SMTP+SSL session to Broad. But it runs the same version of exim. The exact same binaries, in fact. I determined that the only difference was that outgoing was configured to use certificates when negotiating an SSL session as a client, and tweety was not. So I was able to re-enable SSL in client mode on outgoing, but I had to leave the certificate-related directives out. Given that it isn't likely that anybody out there was actually verifying the certificates, this isn't likely to ever be an issue. But it's very weird.
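
To make the difference between the two hosts concrete, here's a minimal sketch in Python (not exim configuration, and not how I actually tested it) of an SMTP client that negotiates TLS either with or without presenting a client certificate. The function names, file paths, and host in the usage note are hypothetical.

```python
import smtplib
import ssl


def make_client_context(certfile=None, keyfile=None):
    """Build a TLS client context, optionally loading a client certificate."""
    ctx = ssl.create_default_context()
    if certfile:
        # With a certificate loaded, the client can present it if the server
        # asks for one (the "outgoing" style setup). Without one, the client
        # negotiates TLS anonymously (the "tweety" style setup).
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx


def probe_starttls(host, certfile=None, keyfile=None, port=25, timeout=30):
    """Connect, upgrade the session with STARTTLS, and report the TLS version."""
    ctx = make_client_context(certfile, keyfile)
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        smtp.starttls(context=ctx)
        smtp.ehlo()
        return smtp.sock.version()
```

Calling probe_starttls("mail.example.org") and then probe_starttls("mail.example.org", "client.pem", "client.key") against the same server exercises the two cases. Note that the default context here verifies the server's certificate, which is stricter than what most mail servers were doing at the time.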

At least my initial fear didn't turn out to be true. I released a security advisory for the Debian OpenSSL packages the other day. I was concerned that this was somehow related to the problem. However, reverting to the previous version didn't change anything. I guess that's good. Sort of.

Tue, 10 Oct 2006

Webmail upgrade on its way

The feedback on the new Horde/IMP/Turba installation has been generally very positive, and I'm about ready to go ahead with the upgrade on the production host. We'll see how that goes tonight.

Mon, 02 Oct 2006

horde upgrades

The CSAIL webmail system is based on a package called IMP, written in PHP using a framework called Horde. It was installed sometime in 2003, and hasn't been upgraded since. It's a couple of major revisions behind what the developers are currently supporting. So I've been spending the past few days preparing to upgrade it. The new version is up for testing at https://imap-stage.csail.mit.edu/horde/ and looks pretty good. Thus far the feedback has been positive. Check it out and let me know what you think...


Work blog by Noah Meyerhans is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.