The Fastmail storage architecture
This is a highly detailed technical document explaining exactly how our email storage is configured to maximise speed and reliability.
Each mail server is a standalone 2U or 4U storage server. Storage on each server is split into multiple slots (currently 15, 16, or 40 per machine depending on hardware model).
All current slots are called "teraslots", consisting of:
- A 1 or 2 TB partition for long-term email storage.
- Some space on the shared high-speed SSD for indexes, caches, and recently delivered email, among other things.
- Some space on a separate shared partition for long-term search indexes.
- Some space on a RAM filesystem for search indexes of recently delivered email.
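The four pieces of a teraslot can be sketched as a simple data structure. The paths and naming convention below are illustrative assumptions, not our actual layout:

```python
from dataclasses import dataclass

@dataclass
class TeraSlot:
    """One 'teraslot': the four storage areas described above.
    Paths here are hypothetical, for illustration only."""
    slot_id: int
    spool_partition: str  # 1-2 TB partition for long-term email storage
    ssd_path: str         # shared SSD: indexes, caches, recent deliveries
    search_path: str      # shared partition for long-term search indexes
    tmpfs_path: str       # RAM filesystem for recent-delivery search indexes

def slot_paths(slot_id: int) -> TeraSlot:
    # Hypothetical naming scheme for the shared partitions.
    return TeraSlot(
        slot_id=slot_id,
        spool_partition=f"/mnt/spool{slot_id}",
        ssd_path=f"/mnt/ssd/slot{slot_id}",
        search_path=f"/mnt/search/slot{slot_id}",
        tmpfs_path=f"/var/run/searchtmp/slot{slot_id}",
    )
```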
All partitions are encrypted, either with LUKS or directly on the hardware if supported.
Every slot runs an entirely separate instance of the Cyrus IMAP server, complete with its own configuration files and its own internal IP address. Instances can be started and stopped independently. All configuration files are generated by a template system driven from a master configuration file for consistency and ease of management.
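The template-driven config generation can be sketched roughly like this. The option names are real Cyrus `imapd.conf` settings, but the template, addressing scheme, and directory layout are assumptions for illustration:

```python
# Minimal sketch of generating a per-slot Cyrus config from one master
# definition. Layout and naming are illustrative, not our real templates.
MASTER = {"partition_root": "/mnt"}

TEMPLATE = """\
servername: slot{n}.internal
configdirectory: {root}/ssd/slot{n}/conf
partition-default: {root}/spool{n}
"""

def render_config(n: int) -> str:
    """Render the imapd.conf fragment for slot number n."""
    return TEMPLATE.format(n=n, root=MASTER["partition_root"])
```

Because every file is derived from one source of truth, adding a slot or changing a shared path is a one-line change followed by regeneration, rather than hand-editing dozens of configs.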
A store is the logical "mail server" on which a user's email is stored. A store is made up of a number of slots, using Cyrus's built-in replication to keep them synchronised. At any time, only one is the master slot for a store, and it replicates changes to all the other slots.
A normal store currently consists of three slots, two in New Jersey and one in Seattle. To move a slot between machines, it's easiest to just set up a new replica slot, wait until all data is replicated and the slot is up to date, and then remove the old unneeded replica slot. In the case where we're moving things around, a store may consist of more slots — there's no real limit to how many replicas you can run at once.
We spread the related slots of different stores so that no one machine has too many "pairs" to another machine. This means that if a single machine fails, the load spreads to many other machines, rather than doubling the load on one other machine.
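One way to picture the pair-spreading idea is a greedy placement that always picks the replica set whose machine-to-machine pairs are currently least loaded. This is purely an illustration of the goal, not our real placement code:

```python
from collections import Counter
from itertools import combinations

def assign_replicas(stores, machines, replicas=3):
    """Greedy sketch: place each store's slots so that no two machines
    accumulate too many shared 'pairs'. Illustrative only."""
    pair_load = Counter()
    placement = {}
    for store in stores:
        # Choose the machine set whose pairs are least loaded so far.
        best = min(
            combinations(machines, replicas),
            key=lambda ms: sum(pair_load[p] for p in combinations(sorted(ms), 2)),
        )
        placement[store] = best
        for p in combinations(sorted(best), 2):
            pair_load[p] += 1
    return placement
```

With placements spread like this, losing one machine shifts each of its stores to a different surviving machine, instead of doubling the load on a single partner.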
For each store, one of the slots is marked in our database as the "master slot". In the past we bound another IP address to the master slot and used magic IP failover, but no more. Instead, all access is either via a Perl library, which knows which slot is the master, or via the nginx frontend proxy, which selects a backend using the same Perl library during login.
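The lookup the proxy performs at login time is conceptually tiny. Our real code is a Perl library; this Python translation, with invented names and data, just shows the shape of it:

```python
# Illustrative store metadata, as the proxy's status daemon might hold it.
STORE_DB = {
    "store12": {
        "slots": ["nj1:2001", "nj2:2001", "sea1:2001"],
        "master": "nj1:2001",
        "moving": False,
    },
}

def backend_for(store: str):
    """Return the backend to connect a client to, or None if the store
    is mid-failover and the caller should sleep and retry."""
    rec = STORE_DB[store]
    if rec["moving"]:
        return None
    return rec["master"]
```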
The Cyrus replication system doesn't record the actual changes to mailboxes: it just writes "something changed in mailbox X" to a log file (or in our multi-replica configuration, to a separate log file per replica).
A separate process, sync_client, runs on the master slot. It takes a working copy of the replication log files that other Cyrus processes create. If the sync process fails or is interrupted, it always starts from the last log file it was processing and re-runs all the mailboxes. This means that a repeating failure stops all new replication for that replica, which becomes relevant later.
The sync_client process combines duplicate events, then connects to a sync_server on the other end and requests a list of the current state for each mailbox named in the log. It compares the state on the replica with the state on the master, and determines which commands need to be run to bring the replica into sync, including potentially renaming mailboxes. Mailboxes have a unique identifier as well as their name, and that identifier persists through renames.
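The first step, combining duplicate events, amounts to deduplicating the log while preserving order. The log format shown (one `MAILBOX <name>` entry per line) is an assumption for illustration:

```python
def compact_log(lines):
    """Collapse duplicate 'something changed in mailbox X' events,
    keeping first-seen order, the way sync_client batches its
    working copy before talking to the replica. Sketch only."""
    seen, out = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out
```

A mailbox that received a hundred messages since the last sync appears a hundred times in the raw log but only needs one state comparison against the replica.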
We have extensively tested this replication process. Over the years we have fixed many bugs and added several features, including the start of progress towards a true master-master replication system. We still run a task every week which compares the state of all folders on the master and replica, to ensure that replication is working correctly.
Our failover process is slightly slow, but fairly seamless. If everything goes well, the only thing users see is a disconnection of their IMAP/POP connection, followed by a slightly slow reconnection. There's no connection error and no visible downtime. Ideally it takes under 10 seconds. In practice it's sometimes up to 30 seconds, because we give long-running commands 10 seconds to complete, and replication can take a moment to catch up as well.
Failover works like this (failover from slot A to slot B):
- Check the size of log files in replica-channel directories on slot A. If any are more than a couple of KB in size, abort. We know that applying a log file of a few KB usually only takes a few seconds. If they're much bigger than that, replication has fallen behind and we should wait until it catches up, or work out what's wrong. It's possible to override this with a "FORCE" flag. We also check that there's a valid target slot to move to, and perform other sanity checks.
- Mark the store as "moving" in the database.
- Wait 2 seconds. Every proxy re-reads the store status every second from a local status daemon which gets updates from the database, so within 1 second, all connection attempts to the store from our web UI or from an external IMAP/POP connection will be replaced by a sleep loop. The sleep loop just holds off responding until the moving flag gets cleared, and it can get a connection to the new backend.
- Shut down master slot A. At this point, all existing IMAP/POP connections are dropped. The shutdown waits up to 10 seconds for connections to close cleanly before killing any remaining processes.
- Inspect all channel directories for log files on slot A again. Run sync_client on each log file with flags to ensure that they sync completely. If there are any failures, bail out by restarting with the master still on slot A.
- Shut down the target slot B.
- Change the database to label the new slot B as the master for this store.
- Start up slot A as a replica.
- Start up slot B as the master (this also marks the store as "up").
Within a second, the proxies will know that the backend is up again and they will continue the login process for all the waiting connections.
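The steps above can be condensed into a runnable sketch. The in-memory state, helper names, and exact threshold are illustrative assumptions, not our real tooling:

```python
import time

MAX_LOG_BYTES = 4096  # stand-in for "more than a couple of KB"

def make_store():
    """Initial state: slot A is master, slot B is an up-to-date replica."""
    return {"master": "A", "role": {"A": "master", "B": "replica"},
            "moving": False, "log_sizes": {"A": [512]}}

def failover(s, force=False, final_sync_ok=True):
    # 1. Abort if slot A's replication logs show we've fallen behind.
    if not force and any(n > MAX_LOG_BYTES for n in s["log_sizes"]["A"]):
        raise RuntimeError("replication behind; wait, investigate, or FORCE")
    s["moving"] = True             # 2. proxies start holding new logins
    time.sleep(0.01)               # 3. stand-in for the real 2-second wait
    s["role"]["A"] = "down"        # 4. existing connections are dropped
    if not final_sync_ok:          # 5. run sync_client on leftover logs
        s["role"]["A"] = "master"  #    bail out: A stays the master
        s["moving"] = False
        return False
    s["role"]["B"] = "down"        # 6. stop the target slot
    s["master"] = "B"              # 7. database now names B as master
    s["role"]["A"] = "replica"     # 8. restart A as a replica
    s["role"]["B"] = "master"      # 9. start B as master; store is "up"
    s["moving"] = False
    return True
```

Note that the bail-out path in step 5 leaves slot A as master, matching the rule that a failed final sync must never hand over a stale replica.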
In the case of a clean failover, all the log files are run, and the replica is 100% up-to-date. If for some reason we need to do a forced failover (say, a machine has failed completely and we need to get new hardware in), then we can have what's called a "split brain", where changes were written to one machine, but have not been seen by the other machine, so it makes its own changes without knowledge of the other machine.
Split-brain is a common problem in distributed systems. It can prove particularly problematic with IMAP because there are many counters with very strict semantics. There's the UID number, which must increase without gaps (at least in theory), and there's the MODSEQ number, which similarly must increase without changes ever being made "in the past", otherwise clients will be confused.
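The invariants clients rely on can be stated as a one-line check. This is an illustration of the semantics, not Cyrus code:

```python
def valid_transition(old, new):
    """Clients assume the next UID only ever grows and MODSEQ never
    moves backwards; any state change violating this confuses them.
    Mailbox state is a dict of counters, for illustration."""
    return (new["uidnext"] >= old["uidnext"]
            and new["highestmodseq"] >= old["highestmodseq"])
```

After a split brain, both sides may have advanced these counters independently, so any reconciliation has to land on values at least as high as either side's, never in between.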
Recovering from split brain without breaking any of the guarantees to clients was a major goal of the rewrite of both the replication protocol and mailbox storage which was done in 2008-2009. These changes eventually led to Cyrus version 2.4.
We also want to be safe against random corruption, which we see a couple of times per year across all our servers. The cause is hard to pin down, but from what we've seen it is most likely hard drive or RAID controller issues. We have also been bitten hard in the past by a particularly horrible Linux kernel bug which wrote a few zeros into our metadata files during a repack.
Since then we have added checksums everywhere: a separate CRC32 for each record in a Cyrus index, cache, or DB file, and a SHA1 of each message file. Cyrus also sends a CRC32 of the "sync state" along with every mailbox, allowing it to determine whether a set of changes actually produced the same mailbox state during a sync, and to trigger a full comparison of all data on a mismatch.
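The two checksum layers can be sketched with standard-library primitives. The record framing below (CRC32 appended to the payload) is an assumption; Cyrus's actual on-disk formats differ:

```python
import hashlib
import zlib

def record_with_crc(payload: bytes) -> bytes:
    """Append a CRC32 to a record, as an index/cache/DB file might.
    The 4-byte big-endian framing is an illustrative assumption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify_record(blob: bytes) -> bool:
    """Detect silent corruption: recompute the CRC and compare."""
    payload, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
    return zlib.crc32(payload) == crc

def message_guid(message: bytes) -> str:
    """SHA1 of the raw message file, used to verify message integrity."""
    return hashlib.sha1(message).hexdigest()
```

A CRC32 per record catches the "few zeros in the middle of a file" class of corruption cheaply on every read, while the SHA1 ties each message file to a stable identity that replication can cross-check.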