I sometimes consult with bright colleagues from other departments who build advanced statistical models or run simulations. They come from economics, psychology, and so forth. Quite often, their code is slow. As in “it takes weeks to run”. That’s not good.
Given the glacial pace of academic research, you might think that such long delays are nothing to worry about. However, my colleagues are often rightfully concerned. If it takes weeks to get your results back, you can only iterate over your ideas a few times a year. This drastically limits how deeply you can investigate issues.
These poor folks are often sent my way. In an ideal world, they would have a budget so that their code can be redesigned for speed… but most research is not well funded. They are often stuck with whatever they put together.
Too often they hope that I have a powerful machine that can run their code much faster. I do have a few fast machines, but they are often not as helpful as my colleagues expect.
- Powerful computers tend to be really good at parallelism. Maybe counter-intuitively, these same computers can run non-parallel code slower than your ordinary PC. So dumping your code on a supercomputer can even make things slower!
- In theory, you would think that software could be “automatically” parallelized so that it can run fast on supercomputers. Sadly, I cannot think of many examples where the software automatically tries to run using all available silicon on your CPU. Programmers still need to tell the code to run in parallel (though, often, it is quite simple; see the sketch after this list). Some software libraries are clever and do this work for you… but if you wrote your code without care for performance, it is likely you did not select these clever libraries.
- If you just grabbed code off the Internet, and you do not fully understand what is going on… or you don’t know anything about software performance… it is quite possible that a little bit of engineering can make the code run 10, 100 or 1000 times faster. So messing with a supercomputer could be entirely optional. It probably is.
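To show how simple the explicit step can be, here is a minimal Python sketch, assuming the work splits into independent tasks; the `simulate` function is a hypothetical stand-in for your own computation:

```python
from concurrent.futures import ProcessPoolExecutor

def simulate(seed):
    # Hypothetical stand-in for one independent unit of work,
    # e.g., one run of a simulation with a given seed.
    total = 0
    for i in range(10_000_000):
        total += (seed * 31 + i) % 7
    return total

if __name__ == "__main__":
    seeds = range(8)
    # Serial version: results = [simulate(s) for s in seeds]
    # Parallel version: one worker process per available core.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, seeds))
    print(sum(results))
```

The difference between the serial and parallel versions is essentially two lines, which is the point: someone has to write those two lines.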
More than a few times, by changing just a single dependency, or just a single function, I have been able to switch someone’s code from “too slow” to “really fast”.
How should you proceed?
- I recommend making back-of-the-envelope computations. A processor can do billions of operations a second. How many operations are you doing, roughly? If you are doing a billion simple operations (like a billion multiplications) and it takes minutes, days or weeks, something is wrong and you can do much better. (I give a small sketch after this list.)
If you genuinely require millions of billions of operations, then you might need a supercomputer.
Estimates are important. A student of mine once complained about running out of memory. I stupidly paid for much more RAM. Yet all I had to do to establish that the machine was not at fault was to compare the student’s code with a standard example found online. The example ran much, much faster than the student’s code on the same machine, and yet it did much more work with not much more code. That was enough to establish where the problem was: I encouraged the student to study the example code.
- You often do not need fancy tools to make code run faster. Once you have determined that you could run your algorithm faster, you can often inspect the code and determine at a glance where most of the work is being done. Then you can search for alternative libraries, or just think about different ways to do the work.
In one project, my colleague’s code was generating many random integers, and this was a bottleneck: random number generation is slow in Python by default. I just proposed a faster random number generator written in C (see my blog post Ranged random-number generation is slow in Python… for details; a sketch follows this list). Most times, I do not need to work so hard; I just need to propose trying a different software library.
If you do need help finding out the source of the problem, there are nifty tools such as line-by-line profilers in Python (sketched after this list). There are also profilers in R.
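To make the back-of-the-envelope idea concrete, here is a small sketch in Python: it measures how many multiplications per second your machine actually delivers, a number you can compare against the size of your own computation. The exact figure will vary by machine; the order of magnitude is what matters.

```python
import time
import numpy as np

n = 100_000_000  # a hundred million multiplications (~0.8 GB of data)
x = np.arange(n, dtype=np.float64)

start = time.perf_counter()
y = x * 1.0001  # one vectorized pass over the array
elapsed = time.perf_counter() - start

print(f"{n / elapsed / 1e9:.1f} billion multiplications per second")
```

If this reports on the order of a billion multiplications per second, and your own billion-operation job takes days, the machine is not the problem.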
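For the random-integer bottleneck, the fix can be as simple as generating the numbers in bulk with a library backed by C. A sketch comparing Python’s built-in random module with NumPy’s generator (the timings you see will depend on your machine):

```python
import random
import time
import numpy as np

n = 10_000_000  # ten million random integers in [0, 99]

start = time.perf_counter()
slow = [random.randint(0, 99) for _ in range(n)]  # one Python call per integer
print("random.randint:", round(time.perf_counter() - start, 2), "s")

start = time.perf_counter()
rng = np.random.default_rng()
fast = rng.integers(0, 100, size=n)  # one C-backed call for all the integers
print("numpy integers: ", round(time.perf_counter() - start, 2), "s")
```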
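And when inspection at a glance is not enough, a line-by-line profiler does the looking for you. A sketch using the third-party line_profiler package (the `crunch` function is a made-up example):

```python
# pip install line_profiler
# Run with: kernprof -l -v script.py
# kernprof injects the @profile decorator; running the file with
# plain python will fail because @profile is then undefined.

@profile
def crunch(data):
    cleaned = [x for x in data if x >= 0]   # which line dominates?
    squared = [x * x for x in cleaned]      # the report tells you,
    return sum(squared)                     # line by line

if __name__ == "__main__":
    crunch(list(range(-1000, 1_000_000)))
```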
My main insight is that most people do not need supercomputers. Some estimates and common sense are often enough to get code running much faster.
It would be helpful if you could provide some specific examples and explain how you narrowed things down to a single dependency and/or a specific function. Were you using a profiler to determine the bottleneck? Can the techniques be automated? For example, is there a tool that can take “too slow” code and make it “really fast”?
I have updated my blog post with more concrete recommendations and examples.
Thank you very much. That is helpful.
Great point. And one should take into account that a supercomputer nowadays is usually just a bunch of loosely connected GPUs or CPUs residing in different boxes. Not only would I disagree with your claim that ||-ization is easy, I would say it is super hard. For example, taking a framework like MapReduce and running your || code on it can easily take as long or longer. The problem is that once you have loosely connected computing units, communication becomes an annoying bottleneck.
Parallel processing is sometimes easy because other people made it easy for us.
PS: regarding the joys of ||-ization, there is a good joke: knock-knock. Race condition. Who’s there?
“Some software libraries are clever and do this work for you… but if you wrote your code without care for performance, it is likely you did not select these clever libraries.”
Daniel, can you please mention the clever libraries you referred to? Would be helpful to know.
I am not sure it is of general interest, but here are some examples.
Under R, the ‘boot’ package makes it really easy to parallelize the processing; one just needs to add a flag.
In Python, some numpy functions automatically parallelize their work (e.g., numpy.dot); a sketch follows.
And so forth.
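To illustrate the numpy.dot case: the matrix product below is delegated to whatever BLAS library your NumPy build links against (OpenBLAS, MKL, …), which typically uses several cores without you writing any threading code. How much it parallelizes depends on that underlying library.

```python
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

# The heavy lifting happens inside the BLAS library; on most builds
# this single call will keep all your cores busy.
c = np.dot(a, b)
print(c.shape)
```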
At one place I worked we had a data-crunching program that we thought ran reasonably well and took 45 minutes. In the process of adding a feature, a coworker cleaned up the code (mostly rearranging do loops and the like). Afterwards the program ran in 3 minutes — surprised everyone, including the coworker.
The same is true for Spark. https://lnkd.in/fCsrKXj
Using modern languages like Go and Rust, where memory management and concurrency are built in, is the best way to go for new projects. Also note that parallelism is not the only route to concurrency. Simple goroutines can bring a lot of performance gain.