Diary of a network outage

From around 5am on the morning of Sunday 7 February (6pm 6 Feb UTC, 1pm 6 Feb US East), FastMail was unavailable for almost two hours. In this post we’ll take a look at what happened, why it happened and what we’re doing to restore our reputation as one of the most reliable email providers in the business.

Background

Back in November 2015 we were hit with a DDoS attack. As part of building out our defenses, our primary datacentre provider NYI moved us from our existing network-based DDoS protection service (Black Lotus) to a new one at Level 3. The main reason for this is capacity — Level 3 form a large part of the internet backbone, a global network with effectively unlimited capacity that the internet services we use every day sit on top of. Since a DDoS attack is fundamentally an attempt to overwhelm the target, being connected to a network that can’t be overwhelmed gives us a much better ability to identify and block the incoming traffic.

This kind of network-level DDoS protection is built from two main mechanisms. One is BGP, the protocol that the large networks making up the internet use to tell each other which segments of the address space they are responsible for. Before the change, NYI would use BGP to advertise our IP range (66.111.4.0/24) to their network providers, which would distribute that advertisement to their peers, and so on, until eventually every network provider and ISP in the world knew how to find a path through the network to our servers. After the change, Level 3 advertised our range as belonging to their own network, so all traffic to our servers was directed through their network first, where they could apply various filtering and blocking techniques to protect us.
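
To make the idea concrete, here is a toy sketch (in Python, purely illustrative — not our real routing setup) of why whoever advertises a prefix ends up receiving its traffic: routers keep a table of advertised prefixes and forward each packet along the most specific match. Real BGP path selection involves many more attributes; only the prefix and provider names below come from this post.

```python
import ipaddress

# Hypothetical view of who is advertising our prefix before and after the change.
routes_before = {ipaddress.ip_network("66.111.4.0/24"): "NYI"}
routes_after  = {ipaddress.ip_network("66.111.4.0/24"): "Level 3"}

def next_hop(table, dst):
    """Longest-prefix match: the most specific advertised prefix wins."""
    matches = [net for net in table if ipaddress.ip_address(dst) in net]
    if not matches:
        return "no route"
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]

print(next_hop(routes_before, "66.111.4.53"))  # -> NYI
print(next_hop(routes_after, "66.111.4.53"))   # -> Level 3
```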

The problem is that Level 3 now has all our traffic, and no way to get it to us, because the global network believes they are responsible for those packets! This is where another technology, GRE tunneling, comes in — it's kind of like a VPN for big networks. GRE tunnels are set up from Level 3 to NYI, and once packets destined for FastMail have been accepted and filtered, they are pushed down those tunnels and onto our network, where our servers process them as normal.
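
If you want to see what that encapsulation looks like, here's a rough sketch using scapy (not part of our stack, and all addresses below are made up): the original packet is wrapped, unchanged, inside an outer IP header addressed to the tunnel endpoint.

```python
from scapy.all import IP, GRE, TCP

# The packet a customer's client originally sent towards FastMail.
inner = IP(src="203.0.113.10", dst="66.111.4.53") / TCP(dport=443)

# After filtering, the scrubbing network wraps it in an outer IP header
# addressed to the tunnel endpoint at the datacentre, with a GRE header
# marking the payload as an IP packet.
outer = IP(src="192.0.2.1", dst="198.51.100.1") / GRE() / inner

outer.show()  # outer header -> GRE -> original packet, intact inside
```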

Before the change, our DDoS protection wasn’t “always on”. When monitoring showed a large spike in traffic, NYI would push out new BGP advertisements directing our traffic to Black Lotus, which had the same tunneling arrangement back to NYI already set up. When the attack subsided, they would push out advertisements again, returning the traffic directly to NYI. After the change, the protection was “always on”: all traffic, even at quiet times, went through Level 3 and down the tunnel to NYI, and NYI had no direct control to change this.
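
The old, reactive model boils down to a simple loop like the sketch below. Everything in it is a stand-in — the thresholds are invented and announce_via() and current_inbound_mbps() are hypothetical placeholders for NYI's monitoring and BGP tooling, which we don't operate ourselves.

```python
import time

ATTACK_THRESHOLD_MBPS = 10_000   # made-up figure for illustration
CALM_THRESHOLD_MBPS = 1_000

def announce_via(provider):
    print(f"pushing BGP advertisements for 66.111.4.0/24 via {provider}")

def current_inbound_mbps():
    return 500  # placeholder; really comes from traffic monitoring

def watch_traffic():
    protected = False
    while True:
        rate = current_inbound_mbps()
        if not protected and rate > ATTACK_THRESHOLD_MBPS:
            announce_via("scrubbing provider")   # divert traffic for filtering
            protected = True
        elif protected and rate < CALM_THRESHOLD_MBPS:
            announce_via("NYI")                  # return traffic to the direct path
            protected = False
        time.sleep(60)
```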

What happened

At 4.50am on Sunday (Melbourne time), the GRE tunnels between Level 3 and NYI went down, with the obvious result that our traffic stopped making it into our own network. NYI noticed quickly, but no longer had the ability to fix it themselves. They got on the phone to Level 3’s support and kept escalating until our public IP ranges were finally pointed back to NYI. Service was restored at 6.45am (7.45pm Sat UTC, 2.45pm Sat US East).

What’s next?

It’s clear that this kind of delay in restoring service is unacceptable. NYI are continuing to work with Level 3 to try to resolve the process problems that got us here.

FastMail and NYI were initially cautious about shifting to Level 3’s DDoS protection service last year, because a big part of their (and our) reliability over the years has come from NYI being in full control of their own infrastructure. At the time the change was necessary because we needed the capacity to deal with the DDoS attacks. Since then, our original provider Black Lotus has added enough capacity that they could now handle any of the DDoS attacks we’ve seen so far. So, for the moment at least, we’ve reverted to the old system, with Black Lotus providing the protection under the explicit control of NYI.

Meanwhile, we’re taking steps to improve our ability to respond to a network loss. Back in November, we set up a secondary network between our datacentres so that they could continue to function without the primary network at NYI being available. This has proved very reliable — all internal systems kept working just fine for the duration of the outage. (That, incidentally, is why no Pobox outage was recorded at the same time — Pobox Mailstore customers whose mail storage has already been migrated to FastMail come in over a different network, which continued working.) Additionally, DNS has been moved to CloudFlare’s Virtual DNS, which also connects to us over the secondary network, so DNS continued to work as well.

We’re in the process of building new frontends (nginx) at our Los Angeles datacentre, which talk back to the application and mail servers at NYI via the secondary network. If our primary network at NYI fails again in the future, the plan is to switch our DNS to point www.fastmail.com, mail.messagingengine.com, etc. at the LA frontends. That way customers can continue to access FastMail until the problem at NYI is resolved.
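
In rough pseudocode terms, the failover decision looks something like the sketch below. It's only an illustration of the plan described above: the IP addresses are made up, the health check is deliberately crude, and publish_records() is a placeholder for whatever mechanism actually updates our DNS.

```python
import socket

HOSTNAMES = ["www.fastmail.com", "mail.messagingengine.com"]
NYI_FRONTENDS = ["66.111.4.53"]   # primary frontends (illustrative address)
LA_FRONTENDS = ["192.0.2.80"]     # standby nginx frontends in LA (made up)

def primary_reachable(host="66.111.4.53", port=443, timeout=5):
    """Crude health check: can we open a TCP connection to a primary frontend?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def publish_records(records):
    # Stand-in for pushing updated A records to the DNS provider.
    for name, addrs in records.items():
        print(f"{name} -> {', '.join(addrs)}")

def choose_frontends():
    targets = NYI_FRONTENDS if primary_reachable() else LA_FRONTENDS
    publish_records({name: targets for name in HOSTNAMES})
```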

The other problem we have is not as noticeable to regular users. When our primary network is unavailable, our secondary incoming mail servers in Amsterdam have to take over accepting incoming mail. We don’t yet have enough server capacity there to handle everything coming in, so we defer a good amount of it. That’s acceptable — SMTP is designed to work that way — but it does mean mail delivery is delayed, because every sending server waits a while before trying again. So the plan there is to get more hardware into Amsterdam and make some internal changes so the delivery system doesn’t need to talk back to NYI quite so much.
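
For readers unfamiliar with how deferral works at the SMTP level, here's a minimal sketch: when the secondary MX is short on capacity it answers with a 4xx "try again later" code, and the sending server queues the message and retries. The capacity check below is a made-up placeholder, not our real delivery pipeline.

```python
def current_queue_depth():
    return 5000  # placeholder; really comes from the local delivery queue

MAX_QUEUE_DEPTH = 2000  # invented figure for illustration

def rcpt_response(recipient):
    if current_queue_depth() > MAX_QUEUE_DEPTH:
        # Temporary failure: the sender keeps the message and retries later,
        # which is why customers see delayed (not lost) mail.
        return "451 4.3.2 System not accepting network messages, try again later"
    return f"250 2.1.5 {recipient} OK"

print(rcpt_response("user@fastmail.com"))
```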

Finally, we’re working with NYI to better understand their internal processes for detecting and responding to network faults. There were a couple of minor communication mixups during this outage which made it harder to properly inform our customers about how things were progressing. In many ways this is the easiest bit though — NYI take the reliability of their service as seriously as we do, which is why we’ve been happy customers of theirs for so many years.

Conclusion

The last few months have seen us take a bit of a hit on our record for reliability. We’re not at all happy with that, since our reliability is one of our major selling points and one of the things we pride ourselves on. To all our customers we apologise, and we thank you for continuing to trust us with your email. We believe in open and honest communication. If you have any questions or concerns, please contact us via support or on Twitter and we’ll be happy to address them.
