It's not a bug, it's a feature...

Sat, 01 Dec 2007

AFS fileserver issue

One of our AFS fileservers lost a disk late this afternoon, resulting in a couple of hours of downtime. A single disk failure shouldn't cause any downtime, but in this case it did. The disk was part of a mirror set hosting the machine's root filesystem and boot blocks, and for some reason the machine didn't correctly notice that the disk had failed, so it kept trying to access it. Those access attempts hung, and the machine built up a backlog of AFS fileserver requests, which eventually triggered an alert to the TIG on-call people (which included me this weekend).

The dead disk has been replaced, and things are OK again...


Work blog by Noah Meyerhans is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.