Should you ever compress your data to reduce memory usage?

The year is 2050. Nobody uses tapes or disks anymore. The NSA has 1 billion servers: it needs that many to hold all of its data in memory.

A contractor offers a marvellous algorithm that can compress data by a factor of 1,000,000,000. The problem is that it slows down all processing by a factor of two.

Does that sound like a good deal?

Let us put it into practice. The president wants the NSA to take all of the data and replace lower-case letters with upper-case letters, and vice versa.
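
(As an aside, the transformation itself is a one-liner in most languages; here is a minimal Python sketch, using a made-up example string.)

    # Swap the case of every letter: lower-case becomes upper-case and vice versa.
    text = "Attack at Dawn"
    print(text.swapcase())  # prints "aTTACK AT dAWN"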

Without fancy compression, it would take all of the 1 billion servers one hour to complete the task.

With the new compression technology, you can use just one server to hold all of the data and do the processing. Unfortunately, your throughput is halved, and you have just one machine, so it would take you 2 billion hours. To complete the task in 1 hour, you would need 2 billion servers. That’s twice as many as you have.
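
For those who want to check the arithmetic, here is a small Python sketch using the numbers from the story; note that the compression factor never enters the calculation as the story presents it.

    # Arithmetic from the story: 1 billion servers finish the job in 1 hour on raw data.
    servers = 1_000_000_000
    baseline_hours = 1
    slowdown = 2                                         # the contractor's algorithm halves processing speed

    total_server_hours = servers * baseline_hours        # 1 billion server-hours of work
    one_server_hours = total_server_hours * slowdown     # 2 billion hours on a single machine
    servers_for_one_hour = one_server_hours // baseline_hours

    print(one_server_hours)       # 2000000000 hours
    print(servers_for_one_hour)   # 2000000000 servers, twice the existing fleet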

Thus the NSA rejects the contractor’s offer. He goes on to sell the technology to North Korea, where it is used for evil purposes we shall discuss another day.

Where did I go wrong in my story?

Credit: This fun puzzle was given to me in a different form by Leonid Boytsov.

16 Comments

  1. You assumed a fixed amount of CPU available per meg of storage.

    Comment by Peter Boothe — 3/2/2014 @ 23:37

  2. The factor-of-two slowdown is on the compressed data, not the raw data.
    So it’s 1h × 1B / 1B × 2 = 2h on one server.

    Comment by Julien Le Dem — 4/2/2014 @ 0:00
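
A small Python sketch of the arithmetic in the comment above, assuming, as the comment does, that the factor-of-two slowdown applies to the compressed bytes rather than to the raw bytes:

    # One server touches 1/1,000,000,000th as many bytes, at half the speed.
    raw_server_hours = 1_000_000_000 * 1    # 1 billion server-hours of work on raw data
    compression_ratio = 1_000_000_000
    slowdown = 2

    one_server_hours = raw_server_hours / compression_ratio * slowdown
    print(one_server_hours)  # 2.0 hours on a single server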

  3. Multiple approaches …

    “All processing” doesn’t make much sense. It may take some time for (de)compression, but there is no reason for decompressed data to be processed slower.

    You’re assuming that all the data is already available to the servers. You would probably see significant gains because of reduced transfer times.

    As an aside, hypothetically, you could invent something like homomorphic case-conversion. That should make things pretty fast :-) .

    Comment by eduarrrd — 4/2/2014 @ 0:21

  4. I can’t imagine the situation you describe; as time goes on, storage capacity grows but the amount of data grows as well. It seems like you’re asking if there’s any reason to compress if you have practically infinite resources?

    Comment by Keith Trnka — 4/2/2014 @ 0:27

  5. The available and needed storage are independent from the available CPU processing power (at least in the first order and in your example).

    Example: if you buy a hard disk of twice the size, you still don’t have a faster CPU and your data will not be processed faster.

    Apart from that I suppose that most of the NSA algorithms are based on map/reduce (Hadoop like) and therefore very suitable to be split up to as many cores as one likes. This again favors many computers, since the overhead of distributing the data to the servers and recollecting the computation results is low.

    Comment by Peter — 4/2/2014 @ 0:31

  6. @Peter it is rather a fixed amount of CPU per bit of information processed, isn’t it?

    Comment by Leonid Boytsov — 4/2/2014 @ 7:40

  7. Your story isn’t “wrong”, but it’s not particularly representative of a real-world situation. The main issue is that you are presuming that the server farm already exists and must be used in its current form, and that the costs of keeping it running are constant per hour.

    Assume instead that you were provisioning such a system, and that because of the new algorithm, instead of putting $1000 of RAM into each machine you only had to put in $1 worth. If the cost of the RAM is more than half of the machine cost, you could use the savings to buy more than 2x the number of servers and come out ahead.

    Also, RAM is power hungry and generates heat. If your machines have very little RAM, they cost less to run, and your data center can support more of them. If each server running the new algorithm uses 1/10th the power and generates 1/10th the heat, you can afford to run more than twice as many.

    I think that more usually you want to figure out what total level of performance you want to provide, and then decide the most effective way to deploy your resources to do this. If RAM is free but floor space is expensive or some per-CPU cost dominates, the new algorithm isn’t much use to you.

    It would be no better than an algorithm that “saves on instructions” by running only half as many but taking twice as long to do so. But if buying and powering RAM is your major operating expense, the equation changes, and the new algorithm might allow you to achieve a lower total cost of operations while still meeting your target level of performance.

    Comment by Nathan Kurz — 4/2/2014 @ 7:44
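
A minimal Python sketch of the provisioning argument in the comment above; the total budget and the per-server base cost are made-up figures, and the $1000/$1 RAM costs are the hypothetical ones from the comment:

    # Hypothetical provisioning comparison using the comment's illustrative figures.
    budget = 2_000_000.0        # total hardware budget (made up)
    base_cost = 1_000.0         # server cost excluding RAM (assumed)
    ram_uncompressed = 1_000.0  # RAM cost per server without compression (from the comment)
    ram_compressed = 1.0        # RAM cost per server with the compression algorithm

    servers_uncompressed = budget / (base_cost + ram_uncompressed)  # 1000 servers
    servers_compressed = budget / (base_cost + ram_compressed)      # about 1998 servers

    # The compressed servers run at half speed, so halve their effective capacity.
    effective_uncompressed = servers_uncompressed   # about 1000 servers' worth of throughput
    effective_compressed = servers_compressed / 2   # about 999 servers' worth of throughput

    print(effective_uncompressed, effective_compressed)

With RAM at exactly half the machine cost, the two options are roughly break-even; push the RAM share above one half and the compressed configuration comes out ahead, which is the comment’s point.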

  8. The place where you went wrong is that you replaced a speed-up in processing with a halving of throughput, which is contradictory.

    If the compressed data were maintained on the respective 1 billion servers, then it would take half the time, i.e., 0.5 hours, for each server.

    I wonder, then, why the NSA would reject the proposal!

    Comment by Milind Changire — 4/2/2014 @ 9:00

  9. @Nathan, the example here, I believe, is intentionally extreme. Imagine the case where your compression reduces the size by 10%.

    Regarding the power-hungry memory: wouldn’t the memory in the 2 billion servers (in this example) be using twice as much energy as compared to the 1 billion servers that operate on uncompressed data? I assume that it is not easy to run a computer that provides electricity to only one billionth of its memory cells.

    Though it is certainly not impossible, especially if we are in the future :-)

    Comment by Leonid Boytsov — 4/2/2014 @ 9:24

  10. Just compress it in 1,000,000,000-bit blocks and use a lookup table.

    Comment by KWillets — 4/2/2014 @ 10:19

  11. @Milind I believe this is true only if you process uncompressed data. Compressed data takes twice as long to process.

    Comment by Leonid Boytsov — 4/2/2014 @ 10:20

  12. I can’t tell if my suggestion to change the font so the cases are swapped was lost in the electronic ether or moderated, but I’ll restate it in non-joke form. Don’t spin up 2 billion machines for this problem. Don’t modify your massive compressed data stores. Changing individual records, fine. But if you decide you want different casing, or different format for decimals, or to replace coded words with ***, just do that on the fly. Then you’re already paying the decompression cost, already paying the memory footprint, and are looking at an extra 50 ns of computation during display.

    Even without compression, the next president is actually going to ask for the data lower-cased. Now you’re rerunning that algorithm across 1 billion servers again. Versus commenting out the line that swaps the casing when the document goes out.

    Where you went wrong in this story was in entertaining the idea that recasing all your data was a sane way of meeting the random whims of bosses ;)

    Comment by Paul — 4/2/2014 @ 11:04

  13. When evaluating storage compression, you also have to take into account the amount of time it will take to read the data from media into memory. Reading a billionth as much data from disk will take a billionth as long, and so in the real world, that constant factor could easily dominate the entire calculation. If you’re talking about slowing every data access by a factor of two (which implies that actual processing time is zero), your conclusion is correct.

    Comment by Justin Dossey — 4/2/2014 @ 15:15

  14. @Leo #9: Yes, if you already have the memory installed and are paying the cost of running it, there is no downside to using it. If “use it or lose it” applies, the compression scheme does not help. But the ‘paradox’ is false if you are able to optimize the system in advance and physically put less RAM in each box.

    Yes, Daniel’s example is extreme, but I was offering a general principle: if a different algorithm allows a reallocation of resources that results in a lower total cost of operation, it’s a win. Changing the tradeoff between compression and throughput to be more realistic adjusts the final answer, but not the principle.

    Comment by Nathan Kurz — 4/2/2014 @ 17:37

  15. Good point that the actual cost may not necessarily scale linearly with the # of nodes used. There is no arguing that this example considers the actual cost.

    Here, though, having an extra billion greedy CPUs, fans, and power adapters will likely outweigh the savings on memory. And in many cases, you do have this “use it or lose it” situation, because you cannot get an optimal configuration for each task.

    Comment by Leonid Boytsov — 4/2/2014 @ 18:15

  16. PS: no arguing that this example DOESN’T consider an actual cost.

    Comment by Leonid Boytsov — 4/2/2014 @ 21:19
