At the end of 2012, Cybera (a nonprofit with a mandate to oversee the development of cyberinfrastructure in Alberta, Canada) deployed an updated OpenStack cloud for their DAIR project (http://www.canarie.ca/en/dair-program/about). A few days into production, a compute node locked up. Upon rebooting the node, I checked to see what instances were hosted on it so I could boot them on behalf of the customer. Luckily, only one instance.
The nova reboot command wasn't working, so I used virsh, but it immediately came back with an error saying it was unable to find the backing disk. In this case, the backing disk is the Glance image that is copied to /var/lib/nova/instances/_base when the image is used for the first time. Why couldn't it find it? I checked the directory, and sure enough it was gone.
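To retrace that check yourself, you can read the backing file straight off the instance's qcow2 disk and then look for it in _base. This is a reconstruction, not the exact session: the instance name is hypothetical, the output is trimmed, and the base file shown is the one that turns up in the logs later in this story:

$ qemu-img info /var/lib/nova/instances/instance-000004e7/disk   # hypothetical instance name
image: /var/lib/nova/instances/instance-000004e7/disk
file format: qcow2
backing file: /var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10
$ ls /var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10
ls: cannot access /var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10: No such file or directory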
I reviewed the nova database and saw the instance's entry in the nova.instances table. The image that the instance was using matched what virsh was reporting, so no inconsistency there.
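If you need to pull the same information, a query along these lines works against the nova schema of that era (column names from memory; dair-ua-c03 is the node that appears in the logs below):

mysql> SELECT uuid, image_ref, host, vm_state
    -> FROM nova.instances
    -> WHERE host = 'dair-ua-c03' AND deleted = 0;

The image_ref column holds the Glance image ID the instance was booted from; that is the value to compare against what virsh reports.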
I checked Glance and noticed that this image was a snapshot that the user created. At least that was good news — this user would have been the only user affected.
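The check itself is a single client call, though the exact syntax varied across the clients of that era. Nova tags snapshots it creates with an image_type=snapshot property, so something like this is enough to tell a snapshot from a standard image:

$ glance image-show <image-uuid>   # look for the image_type=snapshot property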
Finally, I checked StackTach and reviewed the user's events. They had created and deleted several snapshots, most likely experimenting. Although the timestamps didn't match up, my conclusion was that they launched their instance and then deleted the snapshot, and it was somehow removed from /var/lib/nova/instances/_base. None of that made sense, but it was the best I could come up with.
It turns out the reason that this compute node locked up was a hardware issue. We removed it from the DAIR cloud and called Dell to have it serviced. Dell arrived and began working. Somehow or other (or a fat finger), a different compute node was bumped and rebooted. Great.
When this node fully booted, I ran through the same scenario of seeing what instances were running so I could turn them back on. There were a total of four. Three booted and one gave an error. It was the same error as before: unable to find the backing disk. Seriously, what?
Again, it turns out that the image was a snapshot. The three other instances that successfully started were standard cloud images. Was it a problem with snapshots? That didn't make sense.
A note about DAIR's architecture: /var/lib/nova/instances is a shared NFS mount. This means that all compute nodes have access to it, which includes the _base directory. Another centralized area is /var/log/rsyslog on the cloud controller. This directory collects all OpenStack logs from all compute nodes. I wondered if there were any entries for the file that virsh was reporting.
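A grep across the per-node logs turned it up. The invocation here is reconstructed from memory; the output below is the real entry:

$ cd /var/log/rsyslog
$ grep 7b4783508212f5d242cbf9ff56fb8d33b4ce6166 */nova.log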
            
dair-ua-c03/nova.log:Dec 19 12:10:59 dair-ua-c03
2012-12-19 12:10:59 INFO nova.virt.libvirt.imagecache
[-] Removing base file:
/var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10
            
Ah-hah! So OpenStack was deleting it. But why?
A feature was introduced in Essex to periodically check whether there were any _base files not in use. If there were, Nova would delete them. This idea sounds innocent enough and has some good qualities to it. But how did this feature end up turned on? It was disabled by default in Essex. As it should be. It was decided to enable it by default in Folsom (https://bugs.launchpad.net/nova/+bug/1029674). I cannot emphasize enough that:
Actions which delete things should not be enabled by default.
Disk space is cheap these days. Data recovery is not.
Secondly, DAIR's shared /var/lib/nova/instances directory contributed to the problem. Since all compute nodes have access to this directory, all compute nodes periodically review the _base directory. If there is only one instance using an image, and the node that the instance is on is down for a few minutes, it won't be able to mark the image as still in use. Therefore, the image seems like it's not in use and is deleted. When the compute node comes back online, the instance hosted on that node is unable to start.
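If you run a similar shared-storage setup, one defensive option is to turn the cleanup back off on every compute node. This is a sketch against a Folsom-era nova.conf; check the documentation for your release:

# /etc/nova/nova.conf on each compute node
[DEFAULT]
# Stop the image cache manager from deleting _base files it believes are unused.
remove_unused_base_images = False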

