This is the third post in the FastMail 2015 Advent Calendar. Stay tuned for another post tomorrow.
A few months ago we happened to be looking at graphs of CPU load on some of our backend mail servers, and we noticed that over time our servers have gotten busier. As you may remember, our servers are optimised for disk throughput over CPU, since most of the time the CPU is waiting for IO to complete. That’s still the case, but we noticed that the CPUs were more active than they used to be a few years ago. Not enough to present any kind of problem, but enough to make us wonder why.
With the assistance of Linux’s perf utility, we found that most of our CPU time (~10%) was spent in one of the many Cyrus processes, in a function called crc32()
. This function computes a checksum (using the common CRC32 algorithm) of some arbitrary chunk of data. The idea is to store the data and the checksum separately and then later, when you read the data, you recompute the checksum and compare with the original. If they’re different, then you know that either the data or the checksum have been corrupted and you can take appropriate action. Over the years, we’ve added checksums all over Cyrus, particularly in its data storage engine (known as twoskip [PDF]), and they’ve saved us more than once.
At that point it became obvious – we calculate billions of CRC32 checksums, and when you add it all up, that’s a lot of CPU time. So we started looking into alternative implementations, because even a small gain will translate into a big win once you run it a few billion times.
By default, Cyrus used the CRC32 from the widely-available zlib library. In our case, that was the one that shipped with Debian because we’ve never had a reason to change it. I spent some time researching alternatives, and came up with the following list:
PCLMULQDQ
) (actually, they took it from the Linux kernel).PCLMULQDQ
implementation of CRC32 (the same one CloudFlare uses) as well as a table-based one. That would be handy for deployment; we have a small number of machines with no CPU support for PCLMULQDQ
.There were also a few non-contenders:
crc32()
, you get the regular zlib version. It’s not easily lifted out because it’s implemented as a kind of memcpy-with-checksum, which apparently the compression engine needs.For anyone following along at home, the code used in the tests is available on GitHub: crc32-bench.
The actual test method probably has a few more variables than true science would allow, but it didn’t take too long to put together and we’re only looking for indications and trends anyway. I ran them on my mostly-idle laptop, on AC power with no weird CPU scaling or similar shenanigans in the way. I didn’t make any effort to stop the tests being pre-empted, and times reported are wall-clock times.
My laptop is a 8-core Haswell i7 with 16GB RAM. It has more grunt than some of our IMAP servers! It’s not particularly important for the test itself because we’re looking at relative performance, but I mention it in case you’re running this yourself and getting wildly different numbers.
The "large" test is 1000 rounds with a 64MB data buffer. We don’t do anything like this in Cyrus, but it was a good test to make sure everything was working properly. The really interesting tests are the ones with small buffers. The buffers we checksum are small, the minimum being 24 bytes (a twoskip header), average perhaps 32 bytes. This is our target case.
All implementations except debian were compliled with GCC 5.4 with -O3 -march=sandybridge -mtune=intel
. That’s about the nicest "easy" set of optimisations I can get for my laptop, and roughly matches the production hardware too.
Column headers show input buffer size and number of rounds. Results is wall-clock time in seconds (lower is better).
64MB x 1000 | 256B x 100M | 128B x 100M | 64B x 100M | 32B x 100M | 24B x 100M | 16B x 100M | |
debian | 60.79 | 22.76 | 11.32 | 5.47 | 2.71 | 1.94 | 1.24 |
zlib | 57.82 | 21.56 | 10.19 | 4.85 | 2.15 | 1.49 | 0.98 |
cloudflare | 6.25 | 3.02 | 1.89 | 4.85 | 2.22 | 1.61 | 1.07 |
cyrus | 166.01 | 62.55 | 30.32 | 14.31 | 6.34 | 4.31 | 2.49 |
slice4 | 58.25 | 22.20 | 10.69 | 5.00 | 2.16 | 1.49 | 0.98 |
slice8 | 38.94 | 14.40 | 6.86 | 3.24 | 1.63 | 1.15 | 0.78 |
slice16 | 18.92 | 7.11 | 3.63 | 1.97 | 6.56 | 4.60 | 2.67 |
slice16-prefetch | 18.65 | 63.63 | 30.66 | 14.75 | 6.59 | 4.66 | 2.78 |
kernel | 9.94 | 42.61 | 41.39 | 37.06 | 35.80 | 35.59 | 35.09 |
From this we learn that:
Based on this data, I’m pretty sure that the best immediate plan is to have Cyrus use slice16 and slice8, chosen by the size of the input buffer. While not the absolute fastest we can do, it will still be much faster than zlib, and fairly easy to maintain (no assembly).
I did consider trying to extend the kernel/cloudflare code to have good assembly implementations for very small buffers, but my assembly isn’t that good, the license isn’t friendly (GPL code can’t go into Cyrus) and we have some old servers that don’t have the PCLMULQDQ
instructions so we’d need a fallback there anyway.
We have been considering switching to a different algorithm that does a better job on small buffers. There’s a license-friendly assembler implementation of CRC32 that uses the Intel CPU support and has a special codepath for buffers smaller than 256 bytes. That may very well prove to be the fastest of all. Because it produces different results, it would render all our stored checksums invalid, so we’d have to recalculate them all. That’s a big job, so we’d need to do a lot of testing first to make sure that it’s so much better that it’s actually worth the effort.
So now it’s time to write a replacement CRC32 function for Cyrus. That’s actually very simple – it’s just bringing the slice16 and slice8 into a single file, with the main entry point selecting one based on the size of the buffer. You can read the whole crc32.c in the Cyrus source, but the only interesting bit in there apart from the CRC32 functions themselves is the entry point function:
static uint32_t crc32(uint32_t prev, const void *data, size_t length)
{
if (length < 64)
return crc32_slice8(prev, data, length);
return crc32_slice16(prev, data, length);
}
Running our tests against this version, we get (original zlib for comparison):
64MB x 1000 | 256B x 100M | 128B x 100M | 64B x 100M | 32B x 100M | 24B x 100M | 16B x 100M | |
zlib | 57.68 | 21.17 | 10.24 | 4.83 | 2.15 | 1.48 | 0.98 |
combo | 18.82 | 7.00 | 3.70 | 2.01 | 1.36 | 1.07 | 1.06 |
So 16 bytes is slower than the original, which is understandable because we know that slice-by-8 falls over down there, and zlib uses slice-by-4 internally. Since our minimum buffer size is 24 bytes, this should never be a problem. If it ever became one, then we can add a crc32_slice4()
as well.
Next, we have to get our optimisation settings right. We compile Cyrus with no optimisations and full debugging because it makes it really easy to work with crash dumps. Lately we’ve looked at enabling compiler optimisations for specific hot functions via GCC attributes. Because that’s convenient, I tried adding #pragma GCC optimize("O3")
to the top of crc32.c
to cover the whole file. The thing is, it doesn’t really cover the whole file, it just makes that the default for each function. Which means cross-function optimisations (ie inlining) don’t happen, as we can see from a symbol table dump.
With #pragma GCC optimize("O3")
:
00000000000004f8 T crc32
00000000000000cc t crc32_slice16
0000000000000000 t crc32_slice8
With -O3
:
0000000000000000 T crc32
The difference is an extra function call. That’s negligible (under 0.1s on the 24-byte test), but it would have bothered me to have done nothing about it. A build system change is the solution.
(Meanwhile, Bron added a pile of tests to make sure all this new stuff actually returned the same checksum values. It did, phew).
Rerunning the profiling on a busy server this morning, it looks like we’re down to ~7% CPU spent calculating checksums, so perhaps a 3% gain. It’s not huge, but it is faster than before, and in our experience the way to make a really fast service is to make lots of tiny performance improvements across the board. This is just another one of those changes that contribute to a great experience for our users.
Upgrade your privacy and productivity and join the best in email.
Want more information? Visit our side-by-side comparison chart to learn more about why Fastmail is
a great alternative to Gmail.
Fastmail customer, Xavier, gets the most out of his email account by using Fastmail Identities, Aliases, and Masked Email!