How fast can you sort arrays of integers in Java?

Programming languages come with sorting functions by default. We can often do much better. For example, Downs has showed that radix sort can greatly surpass default sort functions in C++. Radix sort is you friend if you want to sort large arrays of integers.

What about Java? Richard Startin and Gareth Andrew Lloyd have been working hard to improve the sorting function used inside the RoaringBitmap library. Though we use a custom radix sort function, it is not difficult to make it more generic, so that it can sort any array of integers. I came up with the following code:

public static void radixSort(int[] data) {
  int[] copy = new int[data.length];
  int[] level0 = new int[257];
  int[] level1 = new int[257];
  int[] level2 = new int[257];
  int[] level3 = new int[257];
  for (int value : data) {
    value -= Integer.MIN_VALUE;
    level0[(value & 0xFF) + 1] ++;
    level1[((value >>> 8) & 0xFF) + 1] ++;
    level2[((value >>> 16) & 0xFF) + 1] ++;
    level3[((value >>> 24) & 0xFF) + 1] ++;
  for (int i = 1; i < level0.length; ++i) {
    level0[i] += level0[i - 1];
    level1[i] += level1[i - 1];
    level2[i] += level2[i - 1];
    level3[i] += level3[i - 1];
  for (int value : data) {
    copy[level0[(value - Integer.MIN_VALUE) & 0xFF]++] = value;
  for (int value : copy) {
    data[level1[((value - Integer.MIN_VALUE)>>>8) & 0xFF]++] 
       = value;
  for (int value : data) {
    copy[level2[((value - Integer.MIN_VALUE)>>>16) & 0xFF]++] 
       = value;
  for (int value : copy) {
    data[level3[((value - Integer.MIN_VALUE)>>>24) & 0xFF]++] 
      = value;

It is about as unsophisticated as it looks. We compute four histograms, one per byte in an integer: Java stores integers using 4-byte words. Then we do 4 passes through the data. We could make it more sophisticated by examining the histogram: if the higher-level histograms are trivial, we can skip some passes. We could extend it to Java longs though we would then need 4 extra passes. It is also possible to generalize to floating-point numbers.

The strange subtraction with MIN_VALUE are to accommodate the fact that Java has signed integers (positive and negative) under a two complement’s format.

Let us compare it against the default Arrays.sort function in Java. We want to sort 1 million integers, generated uniformly at random. Using Java 8 on an Apple M1 processor, we get that RadixSort is ten times faster than Arrays.sort.

Arrays.sort 60 ms
RadixSort 5 ms

There are some caveats. The radix sort function is likely to use more memory. Furthermore, the results are sensitive to the input data (both its size and its distribution). Nevertheless, for some systems, radix sort can be a net win.

My code is available.

Daniel Lemire, "How fast can you sort arrays of integers in Java?," in Daniel Lemire's blog, April 9, 2021.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “How fast can you sort arrays of integers in Java?”

  1. FWIW for large data arrays I’d recommend a 3-pass radix sort (using 11, 11 and 10 bits of the input values). This requires a bit more space, with the histogram/offset array adding up to 20 KB but that still fits within L1 reasonably comfortably and as such usually wins significantly on moderately large inputs.

    1. At least with a C++ LSD radix sort, I found 8 bits generally better than 11 bits: performance started dropping off heavily right after 8 bits as there are too many buckets to be easily held in the L1 cache (typically has 512 cache lines on x86) and the L1 TLB.

      Of course, this depends a lot on the hardware, the size of the data being sorted, and other details of the implementation (e.g., if there is an intermediate buffering step where smaller buckets are accumulated before being copied out to the final buckets, larger radices may perform better).

      Especially, it depends on the data distribution: if many fewer than available the 2^radix buckets are actually used, larger radices are better since the caching penalty is reduced or eliminated. In principle, one could look at the histogram and try to pick the radix dynamically based on what’s likely to be good for that distribution.

      1. I want to revoke my use of “generally better” in the first sentence. I should really say “I found 8 bits better in my specific scenario of sorting uniformly random values”.

        1. This is my experience also with 11-bit radix. It was surprisingly bad (on random input). I’m surprised to see it recommended, honestly.

          I think you’re better off using 8-bits and relying on column-skipping to shave off (likely high-order bits) for when the data isn’t random.

  2. It’s possible to save some memory and only keep two levels’ histograms in memory at one time by creating the next level’s histogram as you’re distributing the current one.

    But this probably isn’t worth it. Each histogram requires only 1KiB (it’s not hard to adjust the prefix sum loop to get rid of the 257th element), but you need 256 cache lines (16 KiB) for the active head of each bucket.

    That’s probably why going to 11 bits is a disaster; a typical 64K L1D only has 1024 lines. So randomly writing to 2048 buffers is going to thrash the hell out of it.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.