Setting aside graphics processors, most commodity processors support at least two distinct parallel execution models.
Most programmers are familiar with the multicore model, where we split the computation into distinct parts that get executed on distinct cores using threads or goroutines.
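For illustration only (my sketch, not code from any benchmark), splitting a sum across POSIX threads in C looks like this:

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define THREADS 4

static double data[N];
static double partial[THREADS];

// Each thread sums its own contiguous slice of the array.
static void *sum_slice(void *arg) {
    long t = (long)arg;
    long begin = t * (N / THREADS);
    long end = begin + (N / THREADS);
    double s = 0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial[t] = s;
    return NULL;
}

// Compile with -pthread (GCC/Clang).
int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 0.5;
    pthread_t threads[THREADS];
    for (long t = 0; t < THREADS; t++)
        pthread_create(&threads[t], NULL, sum_slice, (void *)t);
    double total = 0;
    for (long t = 0; t < THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    printf("total = %f\n", total);
    return 0;
}
```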
Speeding up your program by using many cores is not exactly free, however. The cores you are using are not available to the other programs on your machine. Thus you may be able to complete the processing faster, but it is unclear if, all things said, you are going to use less energy. Moreover, on a busy server, multicore processing could offer little gain.
More critically, as a rule, you should not expect your program to go N times faster even if you have N cores. Even when your problems are conveniently embarrassingly parallel, it can be difficult to scale in practice with the number of cores you have.
It is not uncommon for threaded programs to run slower.
There is another parallelization model: vectorization. Vectorization relies on single instruction, multiple data (SIMD) instructions. It is relatively easy for hardware engineers to design SIMD instructions, so most processors have them.
Sadly, unlike multicore programming, there is very little support within programming languages for vectorization. For the most part, you are expected to hope that your compiler will automatically vectorize your code. It works amazingly well, but compilers simply do not reengineer your algorithms… which is often what is needed to benefit fully from vectorization.
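To make this concrete (an illustrative example of mine, not code from the benchmark): compilers will happily vectorize a simple loop such as the one below when you enable optimizations (e.g., gcc -O3), but they will not restructure an algorithm around such loops for you.

```c
#include <stddef.h>

// GCC and Clang typically turn this loop into SIMD instructions at
// -O2/-O3 (e.g., 256-bit AVX2 operations on recent x64 processors).
void saxpy(float *y, const float *x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Loops with branches, cross-iteration dependencies, or irregular memory accesses are where auto-vectorization gives up, and where redesigning the algorithm pays off.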
My belief is that the performance benefits of vectorization are underrated.
As an experiment, I decided to look at the fasta benchmark from the Computer Language Benchmarks Game. It is a relatively simple “random generation” benchmark. As is typical, you get the best performance with C programs. I ran the current best-performing programs for single-threaded (one core) and multithreaded (multicore) execution. The multicore approach shaves about 30% of the running time, but it uses the equivalent of two cores fully:
| | elapsed time | total time (all CPUs) |
|---|---|---|
| single-threaded | 1.36 s | 1.36 s |
| multicore | 1.00 s | 2.00 s |
Is the multicore approach “better”? It is debatable. There are many more lines of code, and the gain is not large.
The numbers on the Computer Language Benchmarks Game page differ from mine, but the conclusion is the same: the multicore version is faster (up to twice as fast), but it uses twice the total running time (accounting for all CPUs).
I decided to reengineer the single-threaded approach so that it would be vectorized. Actually, I only vectorized a small fraction of the code (what amounts to maybe 20 lines in the original code).
| | elapsed time | total time (all CPUs) |
|---|---|---|
| vectorized | 0.31 s | 0.31 s |
I am about four times faster, and I am only using one core.
My version is not strictly equivalent to the original, so extra work would be needed to qualify as a drop-in replacement, but I claim that it could be done without performance degradation.
I make my code available so that you can easily reproduce my results and correct them as needed. Your results will vary depending on your compiler, hardware, system, and so forth.
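To give a flavor of what explicit vectorization looks like (a hedged sketch using AVX2 intrinsics; the actual fasta code vectorizes different operations), here is a sum written with 256-bit registers, eight 32-bit integers at a time:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Sums 32-bit integers eight at a time using 256-bit AVX2 registers.
// Assumes the per-lane partial sums do not overflow 32 bits.
// Compile with -mavx2 (GCC/Clang).
int64_t sum_avx2(const int32_t *a, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        acc = _mm256_add_epi32(acc, v);
    }
    // Reduce the eight lanes to a single scalar.
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int64_t total = 0;
    for (int j = 0; j < 8; j++) total += lanes[j];
    // Scalar tail for the leftover elements.
    for (; i < n; i++) total += a[i];
    return total;
}
```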
I use SIMD a lot in assembly language. If you are doing arithmetic, you see a 4× or 8× speedup depending on the SIMD slot size. If you have to do something not covered by the SIMD instruction set and can’t find a bit hack, then you are back to 1×. Also, L1/L2 cache issues can limit SIMD when processing large arrays.
SIMD is very easy to program, though, compared to GPU multicore.
I think genuine CPU multicore with, say, 1000 RISC cores on one chip with good caching and memory behavior would be both relatively easy to program and a lot faster for, say, AI applications. There were only about 8,500 transistors on a Z80 CPU in the 1970s. Actually, Intel are killing themselves with super-high-complexity CPUs that are now full of bugs:
https://hothardware.com/news/intel-cpu-bug-kernel-memory-isolation-linux-windows-macos
Memory access problems can make all forms of CPU optimizations irrelevant. You have to first keep your processors fed with data before anything else comes into play.
The more cores you add, the more likely that memory access will become an issue.
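A hedged back-of-the-envelope to make this concrete (my numbers, for illustration): a single core streaming one 32-byte AVX2 load per cycle at 3 GHz asks for roughly 3 × 10⁹ × 32 B ≈ 96 GB/s, while a typical dual-channel desktop memory system delivers a few tens of GB/s shared across all cores. A handful of streaming cores can saturate the memory bus on their own.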
Why are your software samples so Yuuuge for such tiny programs?
(also wanted to see your Roaring bitmaps but deleted it without even unzipping when I saw the size)
> Why are your software samples so Yuuuge for such tiny programs?
What is a software sample? If you are alluding to the source code, I wrote very little of it and just modified existing programs. The single-threaded C program is quite short. All these programs are very short, though the multithreaded code takes up more lines.
Myself, I wrote maybe 20 lines of code in preparation for this blog post.
Regarding the size of the Roaring Bitmap code releases… The source code for the Java Roaring Bitmap library is 150 kB. That’s a long-running piece of software written in a verbose language, with lots and lots of features and two dozen contributors from diverse organizations. This source code literally fits in the cache of one of the cores of your CPU. The average web page is 2.3 MB (https://www.wired.com/2016/04/average-webpage-now-size-original-doom/), so more than ten times larger.
You can grab the released jars on maven:
http://central.maven.org/maven2/org/roaringbitmap/RoaringBitmap/0.6.66/
What about readability?
In a lot of (corporate) settings, it is preferable to have readable code over pure performance; that’s why we see the trend toward threads and goroutines.
I think what would be more appealing is if we could somehow make SIMD stupid simple.
I agree, but I don’t think multithreaded programming is stupid simple. It has been made simpler, but that’s the result of a lot of effort.
When we run benchmarks, we generally assume that the system is otherwise idle. This may be a reasonable assumption in mobile/client code, but it tends to underestimate the benefits of multithreaded server code.
Most HPC/cloud systems have little free capacity. If you’re not using it, you don’t want to pay for it. If your code isn’t using all CPU cores, some other code probably is. The CPU runs slower than in single-threaded benchmarks, and even single-threaded code is competing with other processes for memory, caches, and memory bandwidth. If 32-threaded code gives you an 18x speedup over single-threaded code in benchmarks, you probably won’t get any better performance with 32 independent single-threaded jobs.
Running independent jobs is the easiest way to take advantage of many CPU cores. Unfortunately, at least in bioinformatics, you often run out of memory before you can utilize all CPU cores. Multithreading is the second-best solution. If many jobs share the same read-only data structures, you can often combine them easily into a single job. With well-designed code, it might only require a few OMP statements or similar constructs. On the other hand, SIMD instructions and other low-level improvements can be expensive, as you need to understand the code in order to use them.
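To illustrate how little code the OMP route can require (a generic sketch of mine, not bioinformatics code), a single pragma parallelizes a loop over shared read-only data:

```c
#include <stddef.h>

// One pragma spreads the loop across all available cores; the
// read-only input array is shared between threads at no extra cost.
// Compile with -fopenmp (GCC/Clang).
double sum_scores(const double *scores, size_t n) {
    double total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        total += scores[i];
    return total;
}
```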
Designing algorithms for SIMD is hard. I agree. I am not sure it needs to be hard, however.
> Thus you may be able to complete the processing faster, but it is unclear if, all things said, you are going to use less energy
The AVX2 unit uses a different amount of energy to the ALU, so it’s not strictly better either 🙂
For a more extreme example, the new AVX512 instructions can draw so much power that the CPU has to throttle back its frequency to fit within the TDP.
> The AVX2 unit uses a different amount of energy to the ALU, so it’s not strictly better either
This has been well documented.
> For a more extreme example, the new AVX512 instructions can draw so much power that the CPU has to throttle back its frequency to fit within the TDP.
Are you aware of a formal study that reviewed the energy usage of AVX-512 code?
Another reason parallel programs may not scale so well is techniques such as Intel Turbo Boost: the clock frequency of your processor is typically higher when you are using fewer cores.