Optimizing polymorphic code in Java

Oracle’s Java is a fast language… sometimes just as fast as C++. In Java, we commonly use polymorphism through interfaces, inheritance or wrapper classes to make our software more flexible. Unfortunately, when polymorphism comes with lots of function calls, Java’s performance can suffer. Part of the problem is that Java is shy about fully inlining code, even when it would be entirely safe to do so. (This problem might be alleviated in the latest Java revisions; see my update at the end of this blog post.)

Consider the case where we want to abstract out integer arrays with an interface:

public interface Array {
    public int get(int i);
    public void set(int i, int x);
    public int size();
}

Why would you want to do that? Maybe because your data can be in a database, on a network, on disk or in some other data structure. You want to write your code once, and not have to worry about how the array is implemented.

It is not difficult to produce a class that is effectively equivalent to a standard Java array, except that it implements this interface:

public final class NaiveArray implements Array {
    protected int[] array;
    
    public NaiveArray(int cap) {
        array = new int[cap];
    }
    
    public int get(int i) {
        return array[i];
    }
    
    public void set(int i, int x) {
        array[i] = x;  
    }
    
    public int size() {
        return array.length;
    }
}

At least in theory, this NaiveArray class should not cause any performance problem: the class is final and all its methods are short.

Unfortunately, on a simple benchmark, you should expect NaiveArray to be over 5 times slower than a standard array when used as an Array instance, as in this example:

public int compute() {
   for(int k = 0; k < array.size(); ++k) 
      array.set(k,k);
   int sum = 0;
   for(int k = 0; k < array.size(); ++k) 
      sum += array.get(k);
   return sum;
}
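
For concreteness, here is a minimal sketch of how such a benchmark might be driven. The field name, the array size and the timing loop are illustrative assumptions, not the exact code from my benchmark program:

public class PolymorphismBenchmark {
    // The field is declared with the interface type, so every call in
    // compute() goes through interface dispatch.
    Array array = new NaiveArray(10_000_000);

    public int compute() {
        for(int k = 0; k < array.size(); ++k)
            array.set(k, k);
        int sum = 0;
        for(int k = 0; k < array.size(); ++k)
            sum += array.get(k);
        return sum;
    }

    public static void main(String[] args) {
        PolymorphismBenchmark bench = new PolymorphismBenchmark();
        for(int i = 0; i < 10; ++i)
            bench.compute();                      // warm up the JIT
        long start = System.nanoTime();
        int sum = bench.compute();
        long elapsed = System.nanoTime() - start;
        System.out.println(elapsed / 1.0e6 + " ms (sum = " + sum + ")");
    }
}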

You can alleviate the problem somewhat by declaring the variable as a NaiveArray rather than as an Array (avoiding polymorphism). Unfortunately, the result is still going to be more than 3 times slower than a standard array, and you have just lost the benefit of polymorphism.
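
Concretely, the only change is the declared type of the variable (a sketch; the field name and size are assumptions): the interface dispatch disappears, but get, set and size remain method calls that the JIT must still inline.

// Monomorphic variant: no interface dispatch, but still one method call
// per get/set/size unless the JIT inlines them.
NaiveArray array = new NaiveArray(10_000_000);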

So how do you force Java to inline function calls?

A viable workaround is to inline the functions by hand. You can use the instanceof keyword to provide optimized implementations for specific classes, falling back on a (slower) generic implementation otherwise. For example, if you use the following code, NaiveArray does become just as fast as a standard array:

public int compute() {
     if(array instanceof NaiveArray) {
        int[] back = ((NaiveArray) array).array;
        for(int k = 0; k < back.length; ++k) 
           back[k] = k;
        int sum = 0;
        for(int k = 0; k < back.length; ++k) 
           sum += back[k];
        return sum;
     }
     //...
}
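
The elided part is simply the generic code path that goes through the Array interface; a sketch of what that fallback might look like, mirroring the earlier compute() method:

// Generic (slower) fallback, used when array is not a NaiveArray:
// every access goes through the Array interface.
for(int k = 0; k < array.size(); ++k)
   array.set(k, k);
int sum = 0;
for(int k = 0; k < array.size(); ++k)
   sum += array.get(k);
return sum;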

Of course, this introduces a maintenance problem, as the same algorithm needs to be implemented more than once… but when performance matters, it is an acceptable trade-off.

As usual, my benchmarking code is available online.

To summarize:

  • Some Java versions may fail to fully inline frequent function calls even when it would be safe and beneficial to do so. This can become a serious performance problem.
  • Declaring classes as final does not seem to alleviate the problem.
  • A viable workaround for expensive functions is to optimize the polymorphic code by hand, inlining the function calls yourself. Using the instanceof keyword, you can write code for specific classes and, thus, preserve the flexibility of polymorphism.

Update

Erich Schubert ran a similar benchmark with double arrays that appeared to contradict my results: in effect, he sees no difference between the various implementations. I was able to confirm his results by updating to the very latest OpenJDK revision. The next table gives the average time, in nanoseconds per integer, when processing arrays of 10 million integers:

Function               Oracle JDK 8u11   OpenJDK 1.8.0_40   OpenJDK 1.7.0_65
straight arrays        0.92              0.71               0.87
with interface         5.9               0.70               6.3
with manual inlining   0.98              0.71               0.93

As one can see, the latest OpenJDK (1.8.0_40) is much smarter and makes the performance overhead of polymorphism go away. If you are lucky enough to be using this JDK, you do not need to worry about the problems described in this blog post. However, the general idea is still worthwhile because, in more complicated scenarios, the JDK might still fail to give you the performance you expect.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

23 thoughts on “Optimizing polymorphic code in Java”

  1. Did you check the generated machine code? I suspect these methods are in fact inlined, but another stage of optimization is missing: special-casing arrays to eliminate bounds checking on every element access. That’s what you get when you write those manually inlined loops over back.length.

  2. I realize this point is not directly relevant to your topic of optimizing polymorphism, but optimization is always a sum of several efforts. For example, changing the loops in your BasicSummer to

    for(int k = array.size() - 1; k >= 0; k--)

    cuts execution time by almost 40%. People seriously underestimate reverse loops and the savings they offer.

  3. @Chris

    That is a distinct possibility.

    Of course, there can be different ways to “inline” code in Java… but I had this in mind when I qualified “inline” with “fully”.

    I am not exactly sure how to access the machine code in Java since it is JIT-compiled… and javac will not, as far as I can tell, inline functions.

  4. In Java, people can play tricks with the class loader and reflection, so the runtime cannot make sure that other implementations of Array don’t exist.
    As one more Array implementation may be loaded later on without its class name being part of the code, the VM is unable to optimize the code.

    In this situation you could use a bytecode optimization library like Soot. Here, since your object has a single, effectively final member, it can be completely optimized away into static calls with that one object as an extra parameter.

  5. @Dmitry Platonoff great observation: we get a 1.5x reduction for BasicSummer and a 1.2x reduction for FastSummer (Java 8, Core i7).

    Do you have any idea why there is a much larger difference for BasicSummer?

    @Lemire This is what I hate in Java and some other languages. It is all great until you have to create atrocities like parallel arrays.

  6. @Leonid

    It is indeed quite surprising that, simply by looping in reverse, you can go 40% faster with BasicSummer. This makes little sense, as very little of the computational burden has to do with the loop itself in this case.

    In C, reverse loops are not typically faster; they may even be slower at times. So I am not even sure why reverse loops are faster in Java.

  7. Reverse loops only need to evaluate the method call once, for the starting index. The ending index must be evaluated for every cycle unless the optimizer can prove that the returned value won’t change. I’m guessing this is the cause.

    That’s another built-in optimization for arrays, by the way, so there should be no difference when looping over array.length.

  8. The gain with reverse loops is pretty trivial, it helps you eliminate an extra method call per iteration. As you demonstrated in the original post, function calls add a lot of overhead, and it really adds up. It’s common not to think twice about adding that size() or length() call in the middle of the “for” statement. In our example, removing it halves the number of method invocations per loop.
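
(For illustration, a sketch of the alternative described in the two comments above: hoisting the size() call into a local variable, assuming the size cannot change during the loop, saves the same per-iteration call without reversing the loop.)

final int n = array.size();   // evaluate size() once, before the loop
int sum = 0;
for(int k = 0; k < n; ++k)
   sum += array.get(k);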

  9. @Chris

    Good point, but we do see a 20% gain with reverse loops on straight arrays.

    That is, this is faster:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2014/12/17/ReverseFastSummer.java#L8-L15

    than this…

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2014/12/17/FastSummer.java#L8-L15

    I realize that at the assembly level, a reverse loop can save an instruction sometimes, by checking the overflow bit (or something of the sort), but recent Intel processors can go just as fast when iterating forward…

    I find this very mysterious.

  10. @Daniel Sorry, I was replying to Leonid. Should’ve refreshed the page before posting as Chris chimed in already…

    Oddly enough, I had also not observed any gains with reverse loops on straight arrays. They were in fact a little bit slower on my PC (Java 1.8.0_25-b18, Win8.1, AMD FX 8-core CPU).

  11. @Dmitry

    Can you post your results using the latest version of the code from GitHub?

    Here is what I get…

    $ java Benchmark
    refast fast basic smart resmart silly fixed rebasic
    0.8537767 0.933717 5.8813856 0.9737157 0.9292436 5.8855361 3.7857766 3.3170692 3.7141652
    
  12. PS: the array.length call is inlined or something like this. There’s no point in caching it yourself.

    Regarding doing things backward: I really wonder how well this is supported by various CPU and memory “predictors”. For example, if you read long arrays backward, will the prefetcher understand what you are going to read next?

  13. Daniel, I immediately see one reason why the reverse could be faster than forward. In the forward case, the call to array.size() has to be evaluated in each loop iteration, which might not be optimized away in Java in the way it would be for C.

    In C, both cases would compile to something like initializing a register with the number of iterations and looping the operation until the register hits zero. For the reverse case in Java, where the loop-termination test is against an immediate literal, I would expect Java to produce virtually identical code to the C. For the forward case, there is room for Java to do something less optimal, especially if the optimizer is shy about inlining, as you demonstrate. It doesn’t make much sense, but Java has a lot of odd edges in practice.

  14. This is what I get

    refast fast basic smart resmart silly fixed
    1.6447691 1.6076136 7.4859653 1.6031647 1.6433024 7.4883121 4.7948762 4.9081517

  15. @Dmitry

    I have pushed the code, sorry.

    I was able to reproduce the negative results on an AMD processor. It seems that backward loops on standard arrays are faster on Haswell processors, but not necessarily on other processors.
