I was on-site in Kelowna, British Columbia, Canada, setting up a new OpenStack cloud. The deployment was fully automated: Cobbler deployed the OS on the bare metal, bootstrapped it, and Puppet took over from there. I had run the deployment scenario so many times in practice that I took for granted that everything was working.
On my last day in Kelowna, I was on a conference call from my hotel. In the background, I was fooling around on the new cloud. I launched an instance and logged in. Everything looked fine. Out of boredom, I ran ps aux and all of a sudden the instance locked up.
Thinking it was just a one-off issue, I terminated the instance and launched a new one. By then, the conference call had ended and I was off to the data center.
At the data center, I was finishing up some tasks and remembered the lock-up. I logged into the new instance and ran ps aux again. It worked. Phew. I decided to run it one more time. It locked up. WTF.
After reproducing the problem several times, I came to the unfortunate conclusion that this cloud did indeed have a problem. Even worse, my time was up in Kelowna and I had to return to Calgary.
Where do you even begin troubleshooting something like this? An instance just randomly locks when a command is issued. Is it the image? Nope — it happens on all images. Is it the compute node? Nope — all nodes. Is the instance locked up? No! New SSH connections work just fine!
We reached out for help. A networking engineer suggested it was an MTU issue. Great! MTU! Something to go on! What's MTU and why would it cause a problem?
MTU is maximum transmission unit. It specifies the maximum number of bytes that the interface accepts for each packet. If two interfaces have two different MTUs, bytes might get chopped off and weird things happen -- such as random session lockups.
Note: Not all packets have a size of 1500 bytes. Running the ls command over SSH might create only a single packet of less than 1500 bytes. However, running a command with heavy output, such as ps aux, requires several packets of 1500 bytes.
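To make the difference between short and long command output concrete, here is a rough Python sketch of how a TCP payload is split into IP packets. It assumes a 1500-byte MTU, a 20-byte IPv4 header, and a 20-byte TCP header with no options, and it ignores SSH's own framing overhead. The function name and the 200-byte and 10,000-byte output sizes are made up for illustration.

```python
def packet_sizes(payload_bytes, mtu=1500, ip_header=20, tcp_header=20):
    """Return the IP packet sizes needed to carry payload_bytes over TCP.

    Assumes fixed header sizes (no IP or TCP options), which is a
    simplification of what a real SSH session does.
    """
    mss = mtu - ip_header - tcp_header  # largest TCP payload per packet
    sizes = []
    remaining = payload_bytes
    while remaining > 0:
        chunk = min(remaining, mss)
        sizes.append(chunk + ip_header + tcp_header)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    # Illustrative sizes only: short output (like ls) vs. long output (like ps aux).
    print(packet_sizes(200))
    print(packet_sizes(10_000))
```

Under these assumptions, the short output fits in one small packet, while the long output fills several packets right up to the MTU. Only those full-size packets are at risk when something along the path adds a few bytes.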
OK, so where is the MTU issue coming from? Why haven't we seen this in any other deployment? What's new in this situation? Well, new data center, new uplink, new switches, new model of switches, new servers, first time using this model of servers… so, basically everything was new. Wonderful. We toyed around with raising the MTU in various places: the switches, the NICs on the compute nodes, the virtual NICs in the instances. We even had the data center raise the MTU for our uplink interface. Some changes worked, some didn't. This line of troubleshooting didn't feel right, though. We shouldn't have to be changing the MTU in these areas.
As a last resort, our network admin (Alvaro) and I sat down with four terminal windows, a pencil, and a piece of paper. In one window, we ran ping. In the second window, we ran tcpdump on the cloud controller. In the third, tcpdump on the compute node. And the fourth had tcpdump on the instance. For background, this cloud was a multi-node, non-multi-host setup.
One cloud controller acted as a gateway to all compute nodes. VlanManager was used for the network config. This means that the cloud controller and all compute nodes had a different VLAN for each OpenStack project. We used the -s option of ping to change the packet size. We watched as sometimes packets would fully return, sometimes they'd only make it out and never back in, and sometimes the packets would stop at a random point. We changed tcpdump to start displaying the hex dump of the packet. We pinged between every combination of outside, controller, compute, and instance.
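When varying the ping size like this, it helps to know how the -s value maps to the size of the resulting IP packet. The following Python sketch does that arithmetic, assuming IPv4 with a 20-byte header (no options) and the 8-byte ICMP echo header. The helper names and the sample sizes are illustrative, not taken from our actual session.

```python
IP_HEADER = 20    # IPv4 header without options (assumption)
ICMP_HEADER = 8   # ICMP echo request/reply header


def ip_packet_size(ping_payload):
    """Size of the IP packet produced by `ping -s ping_payload`."""
    return ping_payload + ICMP_HEADER + IP_HEADER


def max_ping_payload(mtu):
    """Largest -s value whose packet still fits in one MTU-sized packet."""
    return mtu - IP_HEADER - ICMP_HEADER


if __name__ == "__main__":
    mtu = 1500
    for payload in (56, max_ping_payload(mtu), max_ping_payload(mtu) + 1):
        size = ip_packet_size(payload)
        print(f"ping -s {payload}: {size}-byte IP packet, fits in {mtu}: {size <= mtu}")
```

With a 1500-byte MTU, the boundary sits at a payload of 1472 bytes, so sizes just below and just above that value are the interesting ones to test.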
Finally, Alvaro noticed something. When a packet from the outside hits the cloud controller, it should not be configured with a VLAN. We verified this as true. When the packet went from the cloud controller to the compute node, it should only have a VLAN if it was destined for an instance. This was still true. When the ping reply was sent from the instance, it should be in a VLAN. True. When it came back to the cloud controller and on its way out to the public internet, it should no longer have a VLAN. False. Uh oh. It looked as though the VLAN part of the packet was not being removed.
That made no sense.
While bouncing this idea around in our heads, I was randomly typing commands on the compute node:
$ ip a
…
10: vlan100@vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br100 state UP
…
"Hey Alvaro, can you run a VLAN on top of a VLAN?"
"If you did, you'd add an extra 4 bytes to the packet…"
Then it all made sense…
$ grep vlan_interface /etc/nova/nova.conf
vlan_interface=vlan20
In nova.conf, vlan_interface specifies what interface OpenStack should attach all VLANs to. The correct setting should have been vlan_interface=bond0, as this would be the server's bonded NIC.
vlan20 is the VLAN that the data center gave us for outgoing public internet access. It's a correct VLAN and is also attached to bond0.
By mistake, I had configured OpenStack to attach all tenant VLANs to vlan20 instead of bond0, thereby stacking one VLAN on top of another. The second VLAN tag added an extra 4 bytes to each packet, so a full-size packet went out at 1504 bytes, which caused problems when it arrived at an interface that accepted only 1500!
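The 4-byte arithmetic can be sketched in Python by building Ethernet frames with zero, one, and two 802.1Q tags. This is an illustration under stated assumptions rather than a capture from our network: the frame uses zeroed MAC addresses and a zeroed 1500-byte payload, the helper names are hypothetical, and only the VLAN IDs 100 and 20 come from the story.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag
VLAN_TAG_BYTES = 4   # 2-byte TPID + 2-byte tag control information
ETH_HEADER = 14      # destination MAC + source MAC + EtherType


def add_vlan_tag(frame, vlan_id):
    """Insert an 802.1Q tag right after the two MAC addresses."""
    tag = struct.pack("!HH", TPID_8021Q, vlan_id & 0x0FFF)
    return frame[:12] + tag + frame[12:]


def count_vlan_tags(frame):
    """Count consecutive 802.1Q tags following the MAC addresses."""
    count, offset = 0, 12
    while (len(frame) >= offset + 2
           and struct.unpack_from("!H", frame, offset)[0] == TPID_8021Q):
        count += 1
        offset += VLAN_TAG_BYTES
    return count


def outer_payload_len(frame):
    """Bytes carried inside the outermost layer (after the outer tag, if any)."""
    header = ETH_HEADER + (VLAN_TAG_BYTES if count_vlan_tags(frame) else 0)
    return len(frame) - header


# Hypothetical full-size frame: zeroed MACs, IPv4 EtherType, 1500-byte payload.
untagged = bytes(12) + struct.pack("!H", 0x0800) + bytes(1500)
single = add_vlan_tag(untagged, 100)  # vlan100 attached directly to bond0
double = add_vlan_tag(single, 20)     # vlan100 stacked on top of vlan20

if __name__ == "__main__":
    for name, frame in (("untagged", untagged), ("single", single), ("double", double)):
        print(name, count_vlan_tags(frame), "tag(s),",
              outer_payload_len(frame), "bytes inside the outer layer")
```

In the double-tagged case, the inner tag rides along as if it were payload, so the outer layer ends up carrying 1504 bytes where the other two cases carry 1500.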
As soon as this setting was fixed, everything worked.