Chapter 11. Maintenance, Failures, and Debugging

Downtime, whether planned or unscheduled, is a certainty when running a cloud. This chapter aims to provide useful information for dealing proactively, or reactively, with these occurrences.

 Cloud Controller and Storage Proxy Failures and Maintenance

The cloud controller and storage proxy are very similar to each other when it comes to expected and unexpected downtime. One of each server type typically runs in the cloud, which makes them very noticeable when they are not running.

For the cloud controller, the good news is that if your cloud uses the FlatDHCP multi-host HA network mode, existing instances and volumes continue to operate while the cloud controller is offline. For the storage proxy, however, no storage traffic is possible until it is back up and running.

 Planned Maintenance

One way to plan for cloud controller or storage proxy maintenance is to simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy affects fewer users. If your cloud controller or storage proxy is too important to have unavailable at any point in time, you must look into high-availability options.

 Rebooting a Cloud Controller or Storage Proxy

To reboot, simply issue the reboot command. The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.

 After a Cloud Controller or Storage Proxy Reboots

After a cloud controller reboots, ensure that all required services were successfully started. The following commands use ps and grep to determine if nova, glance, keystone, and cinder are currently running:

# ps aux | grep nova-
# ps aux | grep glance-
# ps aux | grep keystone
# ps aux | grep cinder
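
The checks above can be wrapped in a small helper that reports which expected daemons are absent. This is only a sketch, not part of OpenStack: the function name missing_daemons and the daemon names in the usage comment are assumptions, so adjust the list to the services your controller actually runs. The helper reads a process listing on standard input, so it works on live ps output or on a saved listing.

```shell
# Sketch: print each expected daemon name that does not appear in a
# process listing read from standard input.
# missing_daemons is a hypothetical helper, not an OpenStack command.
missing_daemons() {
    local listing name
    listing=$(cat)    # capture the whole listing before searching it
    for name in "$@"; do
        # -q: quiet, -F: fixed string, --: the name may start with a dash
        if ! printf '%s\n' "$listing" | grep -qF -- "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example usage on a cloud controller (daemon names are assumptions):
#   ps -eo args | missing_daemons nova-api nova-scheduler glance-api \
#       glance-registry keystone cinder-api
```

An empty result means every listed name was found. Because the listing is captured before any grep runs, the grep process itself does not show up as a false match.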

Also check that all services are functioning. The following set of commands sources the openrc file, then runs some basic glance, nova, and keystone commands. If the commands work as expected, you can be confident that those services are in working condition:

# source openrc
# glance index
# nova list
# keystone tenant-list

For the storage proxy, ensure that the Object Storage service has resumed:

# ps aux | grep swift

Also check that it is functioning:

# swift stat
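
The functional checks in this section can also be run as one batch. The following is a minimal sketch: smoke_test is a hypothetical helper name, and it assumes that the command-line clients exit with a non-zero status when a request fails.

```shell
# Sketch: run each command line given as an argument, report OK or FAIL,
# and return non-zero if any of them failed.
# smoke_test is a hypothetical helper name.
smoke_test() {
    local cmd failed=0
    for cmd in "$@"; do
        # Output is discarded; only the exit status is examined.
        if bash -c "$cmd" >/dev/null 2>&1; then
            printf 'OK    %s\n' "$cmd"
        else
            printf 'FAIL  %s\n' "$cmd"
            failed=1
        fi
    done
    return "$failed"
}

# Example usage after a reboot (assumes openrc is in the current directory):
#   source openrc
#   smoke_test "glance index" "nova list" "keystone tenant-list" "swift stat"
```

Rerun any command reported as FAIL by hand to see its error output.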

 Total Cloud Controller Failure

The cloud controller could completely fail if, for example, its motherboard goes bad. Users will immediately notice the loss of a cloud controller since it provides core functionality to your cloud environment. If your infrastructure monitoring does not alert you that your cloud controller has failed, your users definitely will. Unfortunately, this is a rough situation. The cloud controller is an integral part of your cloud. If you have only one controller, you will have many missing services if it goes down.

To avoid this situation, create a highly available cloud controller cluster. This is outside the scope of this document, but you can read more in the draft OpenStack High Availability Guide.

The next best approach is to use a configuration-management tool, such as Puppet, to automatically build a cloud controller. This should not take more than 15 minutes if you have a spare server available. After the controller rebuilds, restore any backups taken (see Chapter 14, Backup and Recovery).

Also, in practice, the nova-compute services on the compute nodes do not always reconnect cleanly to RabbitMQ hosted on the controller when it comes back up after a long reboot. When that happens, restart the nova services on the compute nodes.
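
If many compute nodes need the restart, a loop over SSH saves time. This is a sketch under assumptions: the host names are placeholders, restart_nova_compute is a hypothetical helper, and the "service nova-compute restart" command depends on your distribution's init system and package naming.

```shell
# Sketch: restart nova-compute on each compute node over SSH.
# The host names and the "service nova-compute restart" command are
# assumptions; init systems and service names vary by distribution.
restart_nova_compute() {
    local host
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print what would be run instead of running it.
            printf 'ssh %s service nova-compute restart\n' "$host"
        else
            ssh "$host" service nova-compute restart
        fi
    done
}

# Example usage (hypothetical host names):
#   DRY_RUN=1 restart_nova_compute c01.example.com c02.example.com
```

Running with DRY_RUN=1 first lets you review the commands before anything is restarted.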

 Compute Node Failures and Maintenance

Sometimes a compute node either crashes unexpectedly or requires a reboot for maintenance reasons.

 Inspecting and Recovering Data from Failed Instances

In some scenarios, instances are running but are inaccessible through SSH and do not respond to any command. The VNC console could be displaying a boot failure or kernel panic error messages. This could be an indication of file system corruption on the VM itself. If you need to recover files or inspect the content of the instance, qemu-nbd can be used to mount the disk.

Warning

If you access or view the user's content and data, get approval first!

If you do not complete steps 4 and 5 of the following procedure, OpenStack Compute can no longer manage the instance. It fails to respond to any command issued by OpenStack Compute, and it is marked as shut down.

Once you mount the disk file, you should be able to access it and treat it as a normal directory tree. However, we do not recommend that you edit or touch any files, because this could change the access control lists (ACLs) that determine which accounts can perform which operations on files and directories. Changing ACLs can make the instance unbootable if it is not already.

To access the instance's disk (/var/lib/nova/instances/instance-xxxxxx/disk), use the following steps:

  1. Suspend the instance using the virsh command, taking note of the internal ID:

    # virsh list
    Id Name                 State
    ----------------------------------
    1 instance-00000981    running
    2 instance-000009f5    running
    30 instance-0000274a    running
    
    # virsh suspend 30
    Domain 30 suspended
  2. Connect the qemu-nbd device to the disk:

    # cd /var/lib/nova/instances/instance-0000274a
    # ls -lh
    total 33M
    -rw-rw---- 1 libvirt-qemu kvm  6.3K Oct 15 11:31 console.log
    -rw-r--r-- 1 libvirt-qemu kvm   33M Oct 15 22:06 disk
    -rw-r--r-- 1 libvirt-qemu kvm  384K Oct 15 22:06 disk.local
    -rw-rw-r-- 1 nova         nova 1.7K Oct 15 11:30 libvirt.xml
    # qemu-nbd -c /dev/nbd0 `pwd`/disk
  3. Mount the qemu-nbd device.

    The qemu-nbd device tries to export the instance disk's different partitions as separate devices. For example, if vda is the disk and vda1 is the root partition, qemu-nbd exports the device as /dev/nbd0 and /dev/nbd0p1, respectively:

    # mount /dev/nbd0p1 /mnt/

    You can now access the contents of /mnt, which correspond to the first partition of the instance's disk.

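    To examine the secondary or ephemeral disk, connect it to a second qemu-nbd device. The following example unmounts the primary disk and reuses /mnt; use an alternate mount point if you want both primary and secondary drives mounted at the same time: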

    # umount /mnt
    # qemu-nbd -c /dev/nbd1 `pwd`/disk.local
    # mount /dev/nbd1 /mnt/
    # ls -lh /mnt/
    total 76K
    lrwxrwxrwx.  1 root root    7 Oct 15 00:44 bin -> usr/bin
    dr-xr-xr-x.  4 root root 4.0K Oct 15 01:07 boot
    drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 dev
    drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
    drwxr-xr-x.  3 root root 4.0K Oct 15 01:07 home
    lrwxrwxrwx.  1 root root    7 Oct 15 00:44 lib -> usr/lib
    lrwxrwxrwx.  1 root root    9 Oct 15 00:44 lib64 -> usr/lib64
    drwx------.  2 root root  16K Oct 15 00:42 lost+found
    drwxr-xr-x.  2 root root 4.0K Feb  3  2012 media
    drwxr-xr-x.  2 root root 4.0K Feb  3  2012 mnt
    drwxr-xr-x.  2 root root 4.0K Feb  3  2012 opt
    drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 proc
    dr-xr-x---.  3 root root 4.0K Oct 15 21:56 root
    drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
    lrwxrwxrwx.  1 root root    8 Oct 15 00:44 sbin -> usr/sbin
    drwxr-xr-x.  2 root root 4.0K Feb  3  2012 srv
    drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 sys
    drwxrwxrwt.  9 root root 4.0K Oct 15 16:29 tmp
    drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
    drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var
  4. Once you have completed the inspection, unmount the mount point and release the qemu-nbd device. If you also connected the secondary disk to /dev/nbd1, release that device in the same way:

    # umount /mnt
    # qemu-nbd -d /dev/nbd0
    /dev/nbd0 disconnected
  5. Resume the instance using virsh:

    # virsh list
    Id Name                 State
    ----------------------------------
    1 instance-00000981    running
    2 instance-000009f5    running
    30 instance-0000274a    paused
    
    # virsh resume 30
    Domain 30 resumed
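
The procedure above can be collected into one function so that the cleanup steps are not forgotten. This is a sketch rather than a supported tool: inspect_disk and run are hypothetical helpers, and the default NBD device, the mount point, and the max_part value are assumptions. Unlike the manual steps, it attaches and mounts the disk read-only to reduce the chance of changing the user's files.

```shell
# Sketch: suspend an instance, attach its disk read-only, list it, then
# detach and resume. inspect_disk and run are hypothetical helpers; the
# NBD device, mount point, and max_part value are assumptions.
run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

inspect_disk() {
    local domain=$1 nbd=${2:-/dev/nbd0} mnt=${3:-/mnt}
    local disk=/var/lib/nova/instances/$domain/disk

    run modprobe nbd max_part=8           # load the NBD module with partition support
    run virsh suspend "$domain"
    run qemu-nbd -r -c "$nbd" "$disk"     # -r exports the image read-only
    run mount -o ro "${nbd}p1" "$mnt"     # first partition, mounted read-only

    run ls -lh "$mnt"                     # replace with your own inspection

    # Cleanup, corresponding to steps 4 and 5 above.
    run umount "$mnt"
    run qemu-nbd -d "$nbd"
    run virsh resume "$domain"
}

# Example usage:
#   DRY_RUN=1 inspect_disk instance-0000274a
```

A file system that was not cleanly shut down may refuse a read-only mount; that case is not handled here.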

 /var/lib/nova/instances

It's worth mentioning this directory in the context of failed compute nodes. This directory contains the libvirt KVM file-based disk images for the instances that are hosted on that compute node. If you are not running your cloud in a shared storage environment, the contents of this directory are unique to each compute node.

/var/lib/nova/instances contains two types of directories.

The first is the _base directory. This contains all the cached base images from glance for each unique image that has been launched on that compute node. Files ending in _20 (or a different number) are the ephemeral base images.

The other directories are titled instance-xxxxxxxx. These directories correspond to instances running on that compute node. The files inside are related to one of the files in the _base directory. They're essentially differential-based files containing only the changes made from the original file in the _base directory.

All files and directories in /var/lib/nova/instances are uniquely named. The files in _base are uniquely titled for the glance image that they are based on, and the directory names instance-xxxxxxxx are uniquely titled for that particular instance. For example, if you copy all data from /var/lib/nova/instances on one compute node to another, you do not overwrite any files or cause any damage to images that have the same unique name, because they are essentially the same file.

Although this method is not documented or supported, you can use it when your compute node is permanently offline but you have instances locally stored on it.

 Storage Node Failures and Maintenance

Because of the high redundancy of Object Storage, dealing with object storage node issues is a lot easier than dealing with compute node issues.

 Rebooting a Storage Node

If a storage node requires a reboot, simply reboot it. Requests for data hosted on that node are redirected to other copies while the server is rebooting.

 Handling a Complete Failure

A common way of dealing with the recovery from a full system failure, such as a power outage of a data center, is to assign each service a priority, and restore in order. Table 11.1, “Example service restoration priority list” shows an example.

Table 11.1. Example service restoration priority list

Priority  Services
--------  -----------------------------------------------------
1         Internal network connectivity
2         Backing storage services
3         Public network connectivity for user virtual machines
4         nova-compute, nova-network, cinder hosts
5         User virtual machines
10        Message queue and database services
15        Keystone services
20        cinder-scheduler
21        Image Catalog and Delivery services
22        nova-scheduler services
98        cinder-api
99        nova-api services
100       Dashboard node

Use this example priority list to ensure that user-affected services are restored as soon as possible, but not before a stable environment is in place. Of course, despite being listed as a single-line item, each step requires significant work. For example, just after starting the database, you should check its integrity, or, after starting the nova services, you should verify that the hypervisor matches the database and fix any mismatches.
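
One lightweight way to keep such a list usable during an outage is a plain text file that is sorted at restore time. The sketch below assumes a hypothetical file format of one "priority<TAB>service" pair per line; restoration_order is likewise a hypothetical helper name.

```shell
# Sketch: print services in restoration order from "priority<TAB>service"
# lines read on standard input. restoration_order is a hypothetical helper.
restoration_order() {
    # -n: numeric sort, so 5 sorts before 10
    # -s: stable sort, so entries with equal priority keep their input order
    sort -s -n -t $'\t' -k1,1 | cut -f2
}

# Example usage (priorities.tsv is a hypothetical file):
#   restoration_order < priorities.tsv
```

Gaps between the numbers, as in the example table, leave room to insert site-specific services later without renumbering.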

 Configuration Management

Maintaining an OpenStack cloud requires that you manage multiple physical servers, and this number might grow over time. Because managing nodes manually is error prone, we strongly recommend that you use a configuration-management tool. These tools automate the process of ensuring that all your nodes are configured properly and encourage you to maintain your configuration information (such as packages and configuration options) in a version-controlled repository.

Tip

Several configuration-management tools are available, and this guide does not recommend a specific one. The two most popular ones in the OpenStack community are Puppet, with available OpenStack Puppet modules, and Chef, with available OpenStack Chef recipes. Newer configuration tools include Juju, Ansible, and Salt; more mature configuration-management tools include CFEngine and Bcfg2.

 Working with Hardware

As with your initial deployment, you should ensure that all hardware is appropriately burned in before adding it to production. Run software that uses the hardware to its limits, maxing out RAM, CPU, disk, and network. Many options are available, and they normally double as benchmark software, so you also get a good idea of the performance of your system.

 Adding a Compute Node

If you find that you have reached or are reaching the capacity limit of your computing resources, you should plan to add additional compute nodes. Adding more nodes is quite easy. The process for adding compute nodes is the same as when the initial compute nodes were deployed to your cloud: use an automated deployment system to bootstrap the bare-metal server with the operating system and then have a configuration-management system install and configure OpenStack Compute. Once the Compute Service has been installed and configured in the same way as the other compute nodes, it automatically attaches itself to the cloud. The cloud controller notices the new node(s) and begins scheduling instances to launch there.
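
To confirm that a new node has attached itself, you can check the service listing on the cloud controller. The sketch below is an illustration only: compute_state is a hypothetical helper, and it assumes that "nova-manage service list" prints the columns Binary, Host, Zone, Status, State, and Updated_At, with ":-)" in the State column for a live service and "XXX" for one that has stopped checking in.

```shell
# Sketch: print the State column for nova-compute on one host, reading
# "nova-manage service list" output on standard input.
# Assumed columns: Binary Host Zone Status State Updated_At
compute_state() {
    awk -v host="$1" '$1 == "nova-compute" && $2 == host { print $5 }'
}

# Example usage on the cloud controller (hypothetical host name):
#   nova-manage service list | compute_state c03.example.com
```

No output at all means the host has not registered a nova-compute service yet.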

If your OpenStack Block Storage nodes are separate from your compute nodes, the same procedure still applies because the same queuing and polling system is used in both services.

We recommend that you use the same hardware for new compute and block storage nodes. At the very least, ensure that the CPUs in the compute nodes are similar so that live migration is not broken.

 Adding an Object Storage Node

Adding a new object storage node is different from adding compute or block storage nodes. You still want to initially configure the server by using your automated deployment and configuration-management systems. After that is done, you need to add the local disks of the object storage node into the object storage ring. The exact command to do this is the same command that was used to add the initial disks to the ring. Simply rerun this command on the object storage proxy server for all disks on the new object storage node. Once this has been done, rebalance the ring and copy the resulting ring files to the other storage nodes.

Note

If your new object storage node has a different number of disks than the original nodes have, the command to add the new node is different from the original commands. These parameters vary from environment to environment.
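
As an illustration of the ring update, the following sketch prints the swift-ring-builder commands for one ring instead of running them. Everything site-specific is an assumption: the zone number, IP address, port, weight, and device names are placeholders, and ring_add_commands is a hypothetical helper. The account and container rings need the same treatment with their own builder files and ports.

```shell
# Sketch: print the swift-ring-builder commands that add every disk of a
# new storage node to one ring and then rebalance it.
# ring_add_commands is a hypothetical helper; zone, IP address, port,
# weight, and device names are assumptions for illustration.
ring_add_commands() {
    local builder=$1 zone=$2 ip=$3 port=$4 weight=$5
    shift 5
    local dev
    for dev in "$@"; do
        printf 'swift-ring-builder %s add z%s-%s:%s/%s %s\n' \
            "$builder" "$zone" "$ip" "$port" "$dev" "$weight"
    done
    printf 'swift-ring-builder %s rebalance\n' "$builder"
}

# Example usage: review the output, then run it in the directory that
# holds the builder files.
#   ring_add_commands object.builder 4 10.0.0.14 6000 100 sdb1 sdc1
```

Printing the commands first makes it easy to compare them with the ones used for the original nodes before changing the ring.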

 Replacing Components

Failures of hardware are common in large-scale deployments such as an infrastructure cloud. Consider your processes and balance time saving against availability. For example, an Object Storage cluster can easily live with dead disks in it for some period of time if it has sufficient capacity. Or, if your compute installation is not full, you could consider live migrating instances off a host with a RAM failure until you have time to deal with the problem.

 Databases

Almost all OpenStack components have an underlying database to store persistent information. Usually this database is MySQL. Normal MySQL administration is applicable to these databases. OpenStack does not configure the databases out of the ordinary. Basic administration includes performance tweaking, high availability, backup, recovery, and repairing. For more information, see a standard MySQL administration guide.

You can perform a couple of tricks with the database to either more quickly retrieve information or fix a data inconsistency error—for example, an instance was terminated, but the status was not updated in the database. These tricks are discussed throughout this book.

 Database Connectivity

Review the component's configuration file to see how each OpenStack component accesses its corresponding database. Look for either sql_connection or simply connection. The following command uses grep to display the SQL connection string for nova, glance, cinder, and keystone:

# grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
    /etc/cinder/cinder.conf /etc/keystone/keystone.conf
sql_connection = mysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
sql_connection = mysql://glance:password@cloud.example.com/glance
sql_connection = mysql://glance:password@cloud.example.com/glance
sql_connection = mysql://cinder:password@cloud.example.com/cinder
    connection = mysql://keystone_admin:password@cloud.example.com/keystone

The connection strings take this format:

mysql:// <username> : <password> @ <hostname> / <database name>
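
When scripting against these settings, the format above can be split with shell parameter expansion. This is a minimal sketch: parse_connection is a hypothetical helper, and it does not handle optional extras such as a port number or a trailing query string.

```shell
# Sketch: split a connection string of the form
#   mysql://<username>:<password>@<hostname>/<database name>
# into its parts using shell parameter expansion.
# parse_connection is a hypothetical helper name.
parse_connection() {
    local url=$1 rest creds hostdb
    rest=${url#*://}        # drop the "mysql://" scheme
    creds=${rest%@*}        # everything before the last "@"
    hostdb=${rest##*@}      # everything after the last "@"
    printf 'user=%s\n' "${creds%%:*}"
    printf 'password=%s\n' "${creds#*:}"
    printf 'host=%s\n' "${hostdb%%/*}"
    printf 'database=%s\n' "${hostdb#*/}"
}

# Example usage:
#   parse_connection mysql://glance:password@cloud.example.com/glance
```

Splitting on the last "@" keeps passwords that themselves contain "@" intact.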

 Performance and Optimizing

As your cloud grows, MySQL is utilized more and more. If you suspect that MySQL might be becoming a bottleneck, you should start researching MySQL optimization. The MySQL manual has an entire section dedicated to this topic: Optimization Overview.

 HDWMY

Here's a quick list of various to-do items for each hour, day, week, month, quarter, and half year. These tasks are neither required nor definitive; treat them as helpful ideas:

 Hourly

  • Check your monitoring system for alerts and act on them.

  • Check your ticket queue for new tickets.

 Daily

  • Check for instances in a failed or weird state and investigate why.

  • Check for security patches and apply them as needed.
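
The daily check for failed instances can be scripted by filtering the table that nova list prints. This is a sketch: error_instances is a hypothetical helper, and because the table columns differ between releases, it treats any cell that equals ERROR as a match.

```shell
# Sketch: print the rows of a "nova list" table that contain a cell equal
# to ERROR. Reads the table on standard input. error_instances is a
# hypothetical helper; column layout varies between releases.
error_instances() {
    awk -F'|' '{
        for (i = 2; i < NF; i++) {
            cell = $i
            gsub(/^ +| +$/, "", cell)   # trim padding around the cell
            if (cell == "ERROR") { print; next }
        }
    }'
}

# Example usage with admin credentials loaded:
#   nova list --all-tenants | error_instances
```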

 Weekly

  • Check cloud usage:

    • User quotas

    • Disk space

    • Image usage

    • Large instances

    • Network usage (bandwidth and IP usage)

  • Verify your alert mechanisms are still working.

 Monthly

  • Check usage and trends over the past month.

  • Check for user accounts that should be removed.

  • Check for operator accounts that should be removed.

 Quarterly

  • Review usage and trends over the past quarter.

  • Prepare any quarterly reports on usage and statistics.

  • Review and plan any necessary cloud additions.

  • Review and plan any major OpenStack upgrades.

 Semiannually

  • Upgrade OpenStack.

  • Clean up after an OpenStack upgrade (any unused or new services to be aware of?).

 Determining Which Component Is Broken

OpenStack's components interact closely with one another. For example, uploading an image requires interaction from nova-api, glance-api, glance-registry, keystone, and potentially swift-proxy. As a result, it is sometimes difficult to determine exactly where problems lie. This section is meant to help you narrow that down.

 Tailing Logs

The first place to look is the log file related to the command you are trying to run. For example, if nova list is failing, try tailing a nova log file and running the command again:

Terminal 1:

# tail -f /var/log/nova/nova-api.log

Terminal 2:

# nova list

Look for any errors or traces in the log file. For more information, see Chapter 13, Logging and Monitoring.

If the error indicates that the problem is with another component, switch to tailing that component's log file. For example, if nova cannot access glance, look at the glance-api log:

Terminal 1:

# tail -f /var/log/glance/api.log

Terminal 2:

# nova list

Wash, rinse, and repeat until you find the core cause of the problem.
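
When it is not obvious which log to tail first, counting error lines across several logs can point to the busiest one. The sketch below assumes that problems are logged on lines containing ERROR or TRACE; error_counts is a hypothetical helper name.

```shell
# Sketch: count lines containing ERROR or TRACE in each given log file
# and print the files that have at least one, busiest first.
# error_counts is a hypothetical helper name.
error_counts() {
    local f n
    for f in "$@"; do
        [ -r "$f" ] || continue                       # skip missing or unreadable files
        n=$(grep -cE 'ERROR|TRACE' "$f" || true)      # -c: count matching lines
        if [ "$n" -gt 0 ]; then
            printf '%s %s\n' "$n" "$f"
        fi
    done | sort -rn
}

# Example usage:
#   error_counts /var/log/nova/*.log /var/log/glance/*.log
```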

 Running Daemons on the CLI

Unfortunately, sometimes the error is not apparent from the log files. In this case, switch tactics and run the service directly on the command line. For example, if the glance-api service refuses to start and stay running, try launching the daemon from the command line:

# sudo -u glance -H glance-api

This might print the error and cause of the problem.

Note

The -H flag is required when running the daemons with sudo because some daemons will write files relative to the user's home directory, and this write may fail if -H is left off.
