December 3, 2015

Dec 3: The search for a faster CRC32

Advanced

Rob N ★

Alum

This is the third post in the FastMail 2015 Advent Calendar. Stay tuned for another post tomorrow.

A few months ago we happened to be looking at graphs of CPU load on some of our backend mail servers, and we noticed that over time our servers have gotten busier. As you may remember, our servers are optimised for disk throughput over CPU, since most of the time the CPU is waiting for IO to complete. That’s still the case, but we noticed that the CPUs were more active than they used to be a few years ago. Not enough to present any kind of problem, but enough to make us wonder why.

With the assistance of Linux’s perf utility, we found that most of our CPU time (~10%) was spent in one of the many Cyrus processes, in a function called crc32(). This function computes a checksum (using the common CRC32 algorithm) of some arbitrary chunk of data. The idea is to store the data and the checksum separately and then later, when you read the data, you recompute the checksum and compare with the original. If they’re different, then you know that either the data or the checksum have been corrupted and you can take appropriate action. Over the years, we’ve added checksums all over Cyrus, particularly in its data storage engine (known as twoskip [PDF]), and they’ve saved us more than once.

At that point it became obvious - we calculate billions of CRC32 checksums, and when you add it all up, that’s a lot of CPU time. So we started looking into alternative implementations, because even a small gain will translate into a big win once you run it a few billion times.

The contenders

By default, Cyrus used the CRC32 from the widely-available zlib library. In our case, that was the one that shipped with Debian because we’ve never had a reason to change it. I spent some time researching alternatives, and came up with the following list:

debian: the stock zlib1g from Debian. On my machine, that’s 1:1.2.8.dfsg-2+b1.
zlib: zlib 1.2.8 from upstream. This should always perform roughly the same as the Debian one, but I wanted to include it specifically so I could compile with the same optimisations as the other implementations being tested.
cloudflare: CloudFlare’s version of zlib, where they reimplement the CRC32 functions (among other things) using modern CPU instructions (mostly PCLMULQDQ) (actually, they took it from the Linux kernel).
cyrus: the fallback implementation inside Cyrus, used when zlib is not available. It’s a very simple algorithm, and very slow.
slice4, slice8, slice16: table-based slice-by-N, all from Stephan Brumme’s CRC32 collection. Each increase in N adds more tables and unrolls loops further, relying more and more on a modern CPU’s ability to parallelise work.
slice16-prefetch: same as slice16 but with a strategically-placed instruction to force the CPU to bring the next 256 bytes into the cache while it’s doing the math.
kernel: crc32-kernel.c. The kernel implements a load of crypto stuff, partly for its internal use (eg crypto filesystems), partly to expose crypto devices to userspace. This is appealing, because it has a PCLMULQDQ implementation of CRC32 (the same one CloudFlare uses) as well as a table-based one. That would be handy for deployment; we have a small number of machines with no CPU support for PCLMULQDQ.

There were also a few non-contenders:

Intel’s zlib with SSE4.2/PLCMUL CRC32: Unfortunately they hooked their new CRC32 code into the compression code, rather than doing what CloudFlare did and changing the stock CRC32 implementation. So when you call crc32(), you get the regular zlib version. It’s not easily lifted out because it’s implemented as a kind of memcpy-with-checksum, which apparently the compression engine needs.
Google’s crcutil library: I wanted to try something not based on zlib, but I can’t actually follow their code well enough to figure out how to lift out just the important bits.
Anything using Intel’s CRC32 CPU instructions. It uses different inputs to the CRC32 algorithm (known as the polynomial) which is apparently more robust, and is used in networks, filesystems, that sort of thing. It gives different results to the “standard” polynomial, typically used in compression.

Test method

For anyone following along at home, the code used in the tests is available on GitHub: crc32-bench.

The actual test method probably has a few more variables than true science would allow, but it didn’t take too long to put together and we’re only looking for indications and trends anyway. I ran them on my mostly-idle laptop, on AC power with no weird CPU scaling or similar shenanigans in the way. I didn’t make any effort to stop the tests being pre-empted, and times reported are wall-clock times.

My laptop is a 8-core Haswell i7 with 16GB RAM. It has more grunt than some of our IMAP servers! It’s not particularly important for the test itself because we’re looking at relative performance, but I mention it in case you’re running this yourself and getting wildly different numbers.

The “large” test is 1000 rounds with a 64MB data buffer. We don’t do anything like this in Cyrus, but it was a good test to make sure everything was working properly. The really interesting tests are the ones with small buffers. The buffers we checksum are small, the minimum being 24 bytes (a twoskip header), average perhaps 32 bytes. This is our target case.

All implementations except debian were compliled with GCC 5.4 with -O3 -march=sandybridge -mtune=intel. That’s about the nicest “easy” set of optimisations I can get for my laptop, and roughly matches the production hardware too.

Results

Column headers show input buffer size and number of rounds. Results is wall-clock time in seconds (lower is better).

64MB x 1000

256B x 100M

128B x 100M

64B x 100M

32B x 100M

24B x 100M

16B x 100M

debian

60.79

22.76

11.32

5.47

2.71

1.94

1.24

zlib

57.82

21.56

10.19

4.85

2.15

1.49

0.98

cloudflare

6.25

3.02

1.89

4.85

2.22

1.61

1.07

cyrus

166.01

62.55

30.32

14.31

6.34

4.31

2.49

slice4

58.25

22.20

10.69

5.00

2.16

1.49

0.98

slice8

38.94

14.40

6.86

3.24

1.63

1.15

0.78

slice16

18.92

7.11

3.63

1.97

6.56

4.60

2.67

slice16-prefetch

18.65

63.63

30.66

14.75

6.59

4.66

2.78

kernel

9.94

42.61

41.39

37.06

35.80

35.59

35.09

From this we learn that:

zlib and debian mostly match, which tells us that recompiling the Debian package with more optimisations aren’t going to help much.
cloudflare is amazing until the input buffer gets under 80 bytes. That’s the point where it stops using the optimised implementation and falls back to the regular zlib implementation (slice-by-4). I’m not sure why (no explanatory comments I could find), but it’s a showstopper for our uses. A shame :(
cyrus is obviously rubbish, not much to say there :)
slice4/8/16 do pretty much what you’d expect, getting faster each time because the CPU can do more of the work in parallel. I was impressed at how consistent it is on large and particularly small buffers. Once the buffers get very small small, then things degenerate. I don’t fully understand why but my suspicion is that whatever optimisations the compiler has done in the checksum loop, something is now having to wait on memory more often than not, slowing everything down. I didn’t pursue this further because it intuitively makes sense to me, and there’s not much I could have done about it anyway.
slice16-prefetch seems like a waste of time. On the small buffers it makes sense that waiting for the next 256 bytes only to not use them is pointless, though I didn’t expect quite this much difference. On the large buffer it made no particular difference. Maybe it makes a lot of difference if you’re checksumming enormous data sets?
kernel, oh, how I wanted to like this. Every CRC32 operation consists of a write to and read from a socket. It’s all zero-copy, but still two context switches per round, and it shows. On larger buffers that cost is probably acceptable (remember this is the same code that cloudflare used). On small buffers, you lose everything to those switches. The 64B and 32B numbers are interesting because, like the cloudflare version, the kernel switches to an alternative implementation. In the kernel’s case, it’s a slice-by-8, so it’s giving good results apart from the context switching overhead.

Based on this data, I’m pretty sure that the best immediate plan is to have Cyrus use slice16 and slice8, chosen by the size of the input buffer. While not the absolute fastest we can do, it will still be much faster than zlib, and fairly easy to maintain (no assembly).

I did consider trying to extend the kernel/cloudflare code to have good assembly implementations for very small buffers, but my assembly isn’t that good, the license isn’t friendly (GPL code can’t go into Cyrus) and we have some old servers that don’t have the PCLMULQDQ instructions so we’d need a fallback there anyway.

We have been considering switching to a different algorithm that does a better job on small buffers. There’s a license-friendly assembler implementation of CRC32 that uses the Intel CPU support and has a special codepath for buffers smaller than 256 bytes. That may very well prove to be the fastest of all. Because it produces different results, it would render all our stored checksums invalid, so we’d have to recalculate them all. That’s a big job, so we’d need to do a lot of testing first to make sure that it’s so much better that it’s actually worth the effort.

The replacement

So now it’s time to write a replacement CRC32 function for Cyrus. That’s actually very simple - it’s just bringing the slice16 and slice8 into a single file, with the main entry point selecting one based on the size of the buffer. You can read the whole crc32.c in the Cyrus source, but the only interesting bit in there apart from the CRC32 functions themselves is the entry point function:

static uint32_t crc32(uint32_t prev, const void *data, size_t length)
{
    if (length < 64)
        return crc32_slice8(prev, data, length);
    return crc32_slice16(prev, data, length);
}

Running our tests against this version, we get (original zlib for comparison):

64MB x 1000

256B x 100M

128B x 100M

64B x 100M

32B x 100M

24B x 100M

16B x 100M

zlib

57.68

21.17

10.24

4.83

2.15

1.48

0.98

combo

18.82

7.00

3.70

2.01

1.36

1.07

1.06

So 16 bytes is slower than the original, which is understandable because we know that slice-by-8 falls over down there, and zlib uses slice-by-4 internally. Since our minimum buffer size is 24 bytes, this should never be a problem. If it ever became one, then we can add a crc32_slice4() as well.

Next, we have to get our optimisation settings right. We compile Cyrus with no optimisations and full debugging because it makes it really easy to work with crash dumps. Lately we’ve looked at enabling compiler optimisations for specific hot functions via GCC attributes. Because that’s convenient, I tried adding #pragma GCC optimize("O3") to the top of crc32.c to cover the whole file. The thing is, it doesn’t really cover the whole file, it just makes that the default for each function. Which means cross-function optimisations (ie inlining) don’t happen, as we can see from a symbol table dump.

With #pragma GCC optimize("O3"):

00000000000004f8 T crc32
00000000000000cc t crc32_slice16
0000000000000000 t crc32_slice8

With -O3:

0000000000000000 T crc32

The difference is an extra function call. That’s negligible (under 0.1s on the 24-byte test), but it would have bothered me to have done nothing about it. A build system change is the solution.

(Meanwhile, Bron added a pile of tests to make sure all this new stuff actually returned the same checksum values. It did, phew).

The end

Rerunning the profiling on a busy server this morning, it looks like we’re down to ~7% CPU spent calculating checksums, so perhaps a 3% gain. It’s not huge, but it is faster than before, and in our experience the way to make a really fast service is to make lots of tiny performance improvements across the board. This is just another one of those changes that contribute to a great experience for our users.