# Compression is never worth slowing down your code?

Should you ever compress your data to reduce memory usage?

We are in 2050. Nobody uses tapes or disks anymore. The NSA has 1 billion servers. It needs so many to hold all of the data in memory.

A contractor offers a marvellous algorithm that can compress data by a factor of 1,000,000,000. The problem is that it decreases all processing by a factor of two.

Does that sound like a good deal?

Let us put it in practice. The president wants the NSA to take all data and change lower-case letters by upper-case letters, and vice versa.

Without fancy compression, it would take all of the 1 billion servers one hour to complete the task.

With the new compression technology, you can use just one server to hold all of the data and do the processing. Unfortunately, your throughput is half, and you have just one machine, so it would take you 2 billion hours. To complete the task in 1 hour, you would need 2 billion servers. That’s twice as many as you have.

Thus the NSA rejects the contractor offer. He goes on to sell the technology to North Korean where it is used for evil purposes we shall discuss another day.

Where did I go wrong in my story?

Credit: This fun puzzle was given to me in a different form by Leonid Boytsov.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 16 thoughts on “Compression is never worth slowing down your code?”

1. Peter Boothe says:

You assumed a fixed amount of CPU available per meg of storage.

2. Double the throughput is on the compressed data, not the raw data.
So it’s 1h * 1B / 1B *2 = 2h on one server

3. eduarrrd says:

Multiple approaches …

“All processing” doesn’t make much sense. I may take some time for (de)compression, but there is no reason for decompressed data to be processed slower.

You’re assuming that all the data is already available to the servers. You would probably see significant gains because of reduced transfer times.

As an aside, hypothetically, you could invent something like homomorphic case-conversion. That should make things pretty fast :-).

4. Keith Trnka says:

I can’t imagine the situation you describe; as time goes on, storage capacity grows but the amount of data grows as well. It seems like you’re asking if there’s any reason to compress if you have practically infinite resources?

5. The available and needed storage are independent from the available CPU processing power (at least in the first order and in your example).

Example: if you buy a harddisk of twice the size, you still don’t have a faster CPU and your data will not be processed faster.

Apart from that I suppose that most of the NSA algorithms are based on map/reduce (Hadoop like) and therefore very suitable to be split up to as many cores as one likes. This again favors many computers, since the overhead of distributing the data to the servers and recollecting the computation results is low.

6. @Peter it is rather a fixed amount of CPU per bit of information processed, isn’t it?

7. Nathan Kurz says:

Your story isn’t “wrong”, but it’s not particularly representative of a real world situation. The main issue is that you are presuming the the [typo intentional for future testing] server farm already exists and must be used in its current form, and that the costs of keeping it running are constant per hour.

Assume instead that you were provisioning such a system, and that because of the new algorithm, instead of putting \$1000 of RAM into each machine you only had to put in \$1 worth. If the cost of the RAM is more than half of the machine cost, you could use the savings to buy more than 2x the number of servers and come out ahead.

Also, RAM is power hungry and generates heat. If you have machines very little RAM, they cost less to run, and your data center can support more of them. If each server running the new algorithm uses 1/10th the power and generates 1/10th the heat, you can afford to run more than twice as many.

I think that more usually you want to figure out what total level of performance you want to provide, and then decide most effective way to deploy your resources to do this. If RAM is free but floorspace is expensive or some per CPU cost dominates, the new algorithm isn’t much use to you.

It would be no better than an algorithm “saves on instructions” by running only half as many but takes twice as long to do so. But if buying and powering RAM is your major operating expense, the equation changes, and the new algorithm might allow you to achieve a lower total cost of operations while still meeting your target level of performance.

8. Milind Changire says:

The place where you went wrong is replaced the speed-up of processing power with halving of throughput, which is contradictory.

If the compressed data were to be maintained respectively at the 1 billion servers, then it would take half the time i.e. 0.5 hours, for each server.

I wonder, then why would NSA reject the proposal!

9. @Nathan, the example here, I believe is intentionally extreme. Imagine the case when your compression reduces the size by 10%.

Regarding the greedy memory: wouldn’t the memory in 2BS machines (in this examples) will be using twice as much energy? as compared to 1BS machines that operate on uncompressed data? I assume that it is not easy to run a computer that provides electricity to only 1 billionth of its memory cells.

Though it is certainly not impossible, especially if we are in the future 🙂

10. KWillets says:

Just compress it in 1,000,000,000-bit blocks and use a lookup table.

11. @Milind I believe this is true only if you process uncompressed data. Compressed data is process twice as long.

12. Paul says:

I can’t tell if my suggestion to change the font so the cases are swapped was lost in the electronic ether or moderated, but I’ll restate it in non-joke form. Don’t spin up 2 billion machines for this problem. Don’t modify your massive compressed data stores. Changing individual records, fine. But if you decide you want different casing, or different format for decimals, or to replace coded words with ***, just do that on the fly. Then you’re already paying the decompression cost, already paying the memory footprint, and are looking at an extra 50 ns of computation during display.

Even without compression, the next president is going to ask for the data lower-cased actually. Now you’re rerunning that algorithm across 1 billion servers again. Versus commenting out the line that swaps the casing when the document goes out.

Where you went wrong in this story was in entertaining the idea that recasing all your data was a sane way of meeting the random whims of bosses 😉

13. Justin Dossey says:

When evaluating storage compression, you also have to take into account the amount of time it will take to read the data from media into memory. Reading a billionth as much data from disk will take a billionth as long, and so in the real world, that constant factor could easily dominate the entire calculation. If you’re talking about slowing every data access by a factor of two (which implies that actual processing time is zero), your conclusion is correct.

14. Nathan Kurz says:

@Leo #9: Yes, if you already have the memory installed and are paying the cost of running it, there is no downside to using it. If “use it or lose it” applies, the compression scheme does not help. But the ‘paradox’ is false if you are able to optimize the system in advance and physically put less RAM in each box.

Yes, Daniel’s example is extreme, but I was offering a general principle: if a different algorithm allows a reallocation of resources that results in a lower total cost of operation, it’s a win. Changing the tradeoff between compression and throughput to be more realistic adjusts the final answer, but not the principle.

15. Good point that the actual cost may not necessarily scale linearly with the #of nodes used. There is no arguing that this example considers the actual cost.

Here, though, having extra billion greedy CPUs, fans and power adapters will likely outweigh savings on memory. And in many cases, you do have this “lose it or use it” approach, because you cannot get an optimal configuration for each task.

16. PS: no arguing that this example DOESN’T consider an actual cost.

You may subscribe to this blog by email.