my favorite outage... so far

2023-05-26 · 9min


That usage of favorite is doing some heavy lifting. I wouldn't ever say that I enjoyed any of them but I have one in mind and it's a good one because it had so many moving parts.

It stands out particularly because it was one of the few times where I was on-call over Christmas break. Normally, in an enterprise setting, these are the quietest times since no one else wants to be working either so no one is breaking anything. This is a key plot point. My peak angst and sadness was sitting in the basement of Grandma's house on Christmas Eve, on a laptop, on a call, with extremely bad internet coverage, having to answer why our most visible service was bouncing.

the setup

Around 10 years ago, VMware was still pretty big. The Cloud was looming ominously and containers were but a glint in Docker's eye. It was a time when you had a datacenter or two and, if you wanted to be doing the trendy thing, you had virtual machines everywhere. VMware had an attractive pitch and it appealed to everyone. Need a machine? Spin one up over there. It needs more resources? vMotion it to another host while it's running. It's still cool in my mind, even in today's environments, if you need that¹. If you didn't need the VM, you could hibernate it. No more CPU or RAM usage (you know, the expensive bits) and you could spin up some more.

This company had a particularly compelling product for collaboration. You could integrate it with pretty much everything (if an integration didn't exist, professional services could figure something out) and, while they cringed if you called it Facebook for Business (because Facebook did that slightly later and the stock took a huge dive), it was basically an internal Facebook for your company. Great for knowledge bases, inter-departmental updates, chat, etc. And they offered trials. If you liked it, you could pay and keep it going. If it was big enough, you could pay some more and they'd make you bigger instances to handle your data. If you didn't keep using it, your environment was hibernated. If you happened to come back, it could automatically be turned back on.

It's been a dog's age since I've thought about this, but to my recollection, they had something around 12,000 virtual machines in play for this service². We had several racks of large HP c7000 chassis with multiple 10G connections and many terabytes of NetApp appliances for storage. The network was using Juniper EX4500s for Top of Rack switching and they would connect back to some other large chassis switches, a fairly hefty Juniper SRX firewall pair, and some beefy F5 load balancers. The ToR switches were the default gateways for these VM networks so that we didn't have to do battle against spanning tree for the entire data center³, and they terminated pretty large network subnets. A /20 (~4,000 hosts) per switch pair sounds about right for the scale per rack.
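For a sense of scale, the /20 arithmetic can be checked with Python's standard `ipaddress` module. The `10.0.0.0/20` prefix below is a made-up example, not the addressing that was actually in use; only the prefix length comes from the story.

```python
import ipaddress

# Hypothetical prefix; only the /20 length comes from the text above.
net = ipaddress.ip_network("10.0.0.0/20")

total_addresses = net.num_addresses   # 2 ** (32 - 20)
usable_hosts = total_addresses - 2    # minus the network and broadcast addresses

print(total_addresses, usable_hosts)
```

That is a lot of potential neighbors for one switch pair to resolve at layer 2.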

graph TD
  core --- ex4500-1 & ex4500-2
  subgraph lb
    core --- ha-loadbalancer
  end
  subgraph layer-3: top of rack
    ex4500-1 & ex4500-2 --- vmware-esxi
    vmware-esxi --- vm1 & vm2 & vmN
  end

A small visualization to describe what was happening. This traffic was all being routed rather than plugging a VLAN directly into the load balancer.

So how was this all managed? A deployment had at least 2 hosts: 1 app server and 1 database, which made it easier to scale if the client needed to do so. The OS was all templated out with golden images in VMware and final adjustments on bring-up happened with some internal tooling. The switches were static since, once a chassis and the VMware cluster came up, they sat and moved packets back and forth. The F5s, which had the most intensely long configuration that I had ever worked with, were pre-populated with possible node IPs. It relied entirely on wildcard SSL certificates and a virtualhost-style lookup to get the traffic from the client over to the correct cluster. That process was managed by the internal automation tool. The F5 would do health checks on the nodes every minute so it could quickly return a pretty page to the user if the deployment was frozen, down, etc., and that was pretty standard. What didn't happen, though, was any communication to the load balancer when an environment was frozen. It would continue doing checks, noting the environment was down, and carrying on. That's fine, right?

it started getting weird

For a few months leading up to my Christmas misery, monitoring would start freaking out about large swathes of nodes being unreachable for a polling cycle or two and then clear up. It was irregular and left everyone scratching their heads as to the cause. We opened a ticket with Juniper to try and figure out what was happening because, on paper, everything should have been great and this must be a bug.

We started doing packet captures and were able to get some data while the monitoring was failing. We'd see an ARP request go out from the switch, a machine would respond, and the switch would ignore it. This meant that the switch was trying to get the Ethernet address to deliver some IP traffic to a host but, since it couldn't figure out how to deliver it, the traffic was dropped. Now, the 'why' here was something we hadn't figured out, and we were in the holiday freeze. Getting changes approved, especially experimental ones, was basically an act of Congress. We decided to let it sit until after the freeze. Someone did have the idea, though (and it was approved), to run a round of VM de-provisionings to shed some load on the VMware servers.

the big one came

So, Christmas Eve night, as I said: grandma's basement, on a call, on an internet connection where I had to perch the phone in just the right position, being read the riot act about the network being more broken in the last few hours than it had ever been. I wasn't in a great mindset to start out with, and it ended with some weird hacks. I think the temporary solution was this: we knew it was ARP related, so I increased the ARP timeout from 20 minutes to the maximum value and the network started to settle. The higher-ups begrudgingly let me go, satisfied that we were OK for now but insisting that we grill Juniper as hard as possible to get it sorted.

And Juniper did eventually give me a mysterious list of commands where I would telnet into the line cards and run a special command to reveal that indeed, yes, we were doing bad things. On a Juniper EX4500 (accurate as of 10 years ago at least), there's a 255-entry queue that holds onto in-flight ARP requests. If you get a flood of ARP requests, it can, at max, resolve 255 of them at once. If you have more than 255 pending, say because machines are offline and unable to respond, that queue can fill up and cause machines that are alive and responsive to be declared dead. So those F5 health checks, pinging thousands of nodes, a growing share of them frozen: that was the source of our sadness.
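To make the failure mode concrete, here is a toy model of a fixed-size pending-ARP queue. It is a sketch under loud assumptions, not Juniper's implementation: the function name and the rule that a frozen host holds its slot for the whole sweep are invented for illustration, and the real switch behavior was subtler (we saw replies being ignored rather than requests never being sent). Only the 255-slot limit comes from the story.

```python
QUEUE_SLOTS = 255  # pending-ARP capacity described above


def run_health_check_sweep(targets, alive, slots=QUEUE_SLOTS):
    """Toy model of one sweep of health checks passing through the switch.

    targets: host ids in the order the probes arrive.
    alive:   set of host ids that will answer ARP.
    Returns (reached, falsely_dead): live hosts that were reached, and live
    hosts that were dropped because the pending queue was already full.
    """
    pending = set()  # unresolved ARP entries currently holding a slot
    reached, falsely_dead = [], []
    for host in targets:
        if len(pending) >= slots:
            # Queue is full, so resolution fails and the probe is dropped.
            if host in alive:
                falsely_dead.append(host)
            continue
        if host in alive:
            reached.append(host)  # reply arrives, so no slot stays occupied
        else:
            pending.add(host)     # frozen VM never answers, slot stays held
    return reached, falsely_dead


# 300 frozen VMs probed first, then 10 perfectly healthy ones.
reached, falsely_dead = run_health_check_sweep(
    targets=list(range(310)), alive=set(range(300, 310))
)
print(len(reached), len(falsely_dead))
```

In this model the ten healthy hosts at the end of the sweep are all reported dead, purely because frozen hosts ahead of them used up every slot. With fewer than 255 frozen hosts in the sweep, nothing is falsely dropped.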

the permanent, temporary fix

Because we could see the ARP queue depth, I began to monitor it. It would vary between 150 and 200, and sometimes we'd see it top out at 255, though those spikes never developed into full-fledged outages. It still wasn't something that we could just sit on and hope would go away.

Ideally, the machinery that would freeze environments would have also been able to freeze the load balancer health checks but for reasons beyond my control, they had no engineering time available for that one. At least not right now. Ok then. I guess they were really trying to get that multi-tenant service out the door⁴.

Now what could I find in my magical bag of tools? I didn't have the ability to modify the F5 because the behavior as it existed was required. I did my spelunking and found a similar problem and use case where a bunch of ARP could clog up routers. Routers were (and still are, at least compared to a desktop CPU off the shelf today) particularly underpowered, so any time spent handling unnecessary traffic could be service affecting. Amsterdam IX (AMS-IX) knew this ARP pain specifically and they had made a tool, arpsponge, to shed that load.

It works by watching the wire and, if an ARP request without a response is seen X times, it will answer with its own bogus data. If it sees a response from an IP it was sponging, it stops. I got the tiniest of VMs provisioned in each rack, plumbed it into the VLANs, and started up arpsponge. The resulting graphs are long gone, but use your imagination to envision a line that was hovering around the 180 mark dropping down to 2. Potential future crisis averted.
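The sponging logic described above can be sketched as a small state table. This is not arpsponge's actual code or configuration, just the behavior from the paragraph; the class name, method names, and the default threshold of 5 are all hypothetical.

```python
from collections import defaultdict


class SpongeTable:
    """Tracks unanswered ARP queries per IP and decides when to sponge."""

    def __init__(self, threshold=5):
        self.threshold = threshold          # the "X" from the text; 5 is arbitrary
        self.unanswered = defaultdict(int)  # ip -> requests seen without a reply
        self.sponged = set()                # ips we are currently answering for

    def on_request(self, ip):
        """Call for each ARP request seen. Returns True if we should answer."""
        if ip in self.sponged:
            return True
        self.unanswered[ip] += 1
        if self.unanswered[ip] >= self.threshold:
            self.sponged.add(ip)
            return True
        return False

    def on_reply(self, ip):
        """Call when the real owner of the IP answers: stop sponging it."""
        self.unanswered.pop(ip, None)
        self.sponged.discard(ip)


table = SpongeTable(threshold=3)
decisions = [table.on_request("10.0.0.5") for _ in range(4)]
table.on_reply("10.0.0.5")  # the VM was thawed and answered for itself
after_thaw = table.on_request("10.0.0.5")
print(decisions, after_thaw)
```

The point of the design is that a frozen VM's address gets an answer from somewhere, so the switch stops holding a pending slot for it, and a thawed VM takes its address back as soon as it speaks up.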

takeaways

Getting automation done right is hard. The provisioning system was doing exactly as architected but it all collapsed in a way that it didn't account for. Keep in mind that if you provision something, you must also take all the steps to deprovision something as well. Accumulated cruft did some bad things here.

Unless you have an amazing amount of pull with a vendor, expect answers to come slowly. It took us a whole bunch of back and forth with Juniper to decipher our foot-gun. This wasn't the only time that I had to do battle with Juniper, either. Ask me about that one time where we had an EX8200 in a virtual chassis.

Freezing all those nodes before a holiday break was a bad idea.

Try not to be on call on Christmas Eve.

How about you, have a favorite outage?

¹: You shouldn't need that if you are building something today. Single points of failure are bad.
²: They were working on a multi-tenant architecture, but they weren't there yet.
³: I would have done unspeakable things for EVPN/VXLAN back then but MPLS wasn't something you did unless you were a service provider.
⁴: It was several years before it ever happened, if it did. IIRC, they were acquired by a competitor.