We know how vital email is to all our customers, whether personal or business customers. We sincerely apologise for the disruption caused by downtime. We appreciate that not being able to access your email causes frustration and concern.
FastMail was unavailable to the vast majority of our customers for nearly 3 hours on August 1st.
This was due to a network equipment failure in our primary datacentre at NYI New Jersey. FastMail staff were working closely with NYI during the entire outage period; the failure was within NYI’s equipment, and the fix was done entirely by their staff.
No mail was lost during the outage because power remained on, and no servers were reset. Incoming mail to FastMail customers was either delayed until the New Jersey network came back up, or delivered to the inbound mail service in our Seattle datacenter.
12:28am PDT / 3:28am EDT / 7:28am GMT / 5:28pm AEST
We went offline to the majority of networks. During this time our servers could still contact some networks, some customers found they could route via VPNs in some parts of the world, and some smartphones still received push updates.
3:15am PDT / 6:15am EDT / 10:15am GMT / 8:15pm AEST
Full service was restored.
Total time of seriously degraded connectivity: 2 hours and 47 minutes.
A memory leak on an NYI BGP backbone router caused BGP sessions to re-initialize. The router was successfully deprioritized and taken out of production by the NYI automated systems, and traffic was diverted to alternate paths. However, this revealed an unrelated issue with advertisements of routes within NYI, which caused FastMail’s primary IP range (
18.104.22.168/24) to not be advertised correctly.
22.214.171.124/24 range is quite special. It was originally hosted at NYI’s New York datacentre, and was routed to the New Jersey datacentre when we moved our servers last year. It’s also automatically protected behind a Level 3 DDoS filter when needed. This complexity led to it taking the overnight staff longer to diagnose and repair the network issue caused by the router failure than it otherwise may have done.
The memory leak appears to be a bug in the underlying firmware, which NYI have reported to the vendor and are working with engineers on getting resolved and updated as quickly as possible. They do not anticipate the issue re-occurring prior to the firmware update and no downtime is expected when the update is applied.
Whenever we experience any outage, we receive a bunch of feedback asking if we have redundant servers or not.
We do have secondary IMAP servers, currently split between two locations – NYI Seattle and Switch in Amsterdam. Both these datacentres were up, and indeed able to communicate with our New York datacentre via VPN links which run over private IP ranges which aren’t advertised to the public, because they would become DDoS risks.
In theory we could have routed traffic via those existing links, however we currently have a 1 hour Time To Live (TTL) on all our DNS records, which means it would take 1 hour from us making that decision until our customers were back online via that link. Given that we expected NYI’s network to stabilise at any second, we figured that it would not be valuable to start the timer on rerouting all traffic to another datacentre.
In hindsight, we may have been able to shorten the outage slightly by switching, but hindsight is always easy, and it still would have been over an hour!
While this outage was entirely outside our control, there are steps we can take to reduce the impact of any similar issue in future. NYI have already put into place additional automated checks and regression tests to catch the secondary issue with the advertisements on alternate paths, and tested the automated scripts thoroughly to ensure full functionality.
They have also engaged with a third party to provide a full audit of the relevant network architecture to assure maximum reliability. Given that this has never happened before in the over 15 years we have been with NYI, we’re confident that it won’t happen again.
It would be easy to react in a knee-jerk fashion and add measures which actually reduced our reliability overall. Any course of action which works around a rare failure needs to be at least so reliable that it fails less often than the case being protected against! We have learned this lesson in the past from high availability solutions which failed more than our servers did.
Having said that, we will take actions which are valuable for a wider range of potential outage situations, such as:
126.96.36.199/24range, and be able to switch to it without long delays. This gives FastMail operations another option if our main range is hurt for some reason, and will be tested frequently.
You may still be tracked even while using a “private” window like Incognito or VPN. Here are the best private browsers to protect your privacy.
Introducing nine privacy-friendly tools to control more of the information you are sharing with third parties.