Compressing document-oriented databases by rewriting your documents

The space utilization of relational databases can be estimated quickly. If you create a table made of three columns, each containing an integer, you can expect the database to use roughly 12 bytes per row, plus some overhead. Unless your database is tiny, how you name your columns is irrelevant to the space utilization.

Document-oriented databases such as MongoDB are not so simple: because every document stores its own attribute names, there is room for optimization, and shorter attribute names use less space.

For example, consider JSON tuples made of a few attributes. Giving one attribute a name that is 12 characters longer increases the space utilization per tuple by 12 bytes (from 105 to 117 bytes per tuple).
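The one-byte-per-character arithmetic can be checked against the BSON specification. The sketch below uses hypothetical key names (the post's actual documents are not reproduced, and its 105/117-byte figures also include MongoDB's per-record overhead, which is ignored here); it computes the raw BSON size of a flat document whose values are all int32:

```python
def bson_int32_doc_size(keys):
    # Per the BSON spec, a flat document of int32 values takes:
    # a 4-byte length prefix, then for each element 1 type byte +
    # the key bytes + a NUL terminator + 4 value bytes, then a
    # final 1-byte document terminator.
    return 4 + sum(1 + len(k) + 1 + 4 for k in keys) + 1

short_names = bson_int32_doc_size(["index", "value1", "value2"])
# the same tuple after making one attribute name 12 characters longer
longer_name = bson_int32_doc_size(["index", "value1", "value2_with_suffix"])
print(longer_name - short_names)  # 12: one byte per extra character
```

Every character added to a key name costs one byte in every document that uses it.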

The converse also holds: shortening the attribute names brings the space utilization per tuple down to 80 bytes (from 105 bytes), a saving of over 20%.
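The saving follows from the same layout: each character removed from a key saves one byte per document. A sketch with hypothetical names, again counting raw BSON sizes only (MongoDB's per-record overhead is ignored):

```python
def bson_int32_doc_size(keys):
    # BSON flat document of int32 values: 4-byte length prefix,
    # then 1 type byte + key + NUL + 4 value bytes per element,
    # then a 1-byte terminator.
    return 4 + sum(1 + len(k) + 1 + 4 for k in keys) + 1

verbose = bson_int32_doc_size(["index", "value1", "value2"])  # 40 bytes
terse = bson_int32_doc_size(["i", "v", "w"])                  # 26 bytes
print(verbose - terse)  # 14: one byte saved per character removed
```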

It is tempting to do away with the attribute names entirely and store the data as arrays. Yet the space utilization remains at 80 bytes, because the binary format used by MongoDB (BSON) does not store arrays concisely.
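The reason is that, per the BSON specification, an array is encoded as an embedded document whose keys are the decimal index strings "0", "1", "2", and so on. An array of three integers therefore costs exactly as much as a document with one-letter keys, as this sketch shows:

```python
def bson_int32_doc_size(keys):
    # Same raw-BSON size calculation: 4-byte length prefix,
    # 1 type byte + key + NUL + 4 value bytes per int32 element,
    # plus a 1-byte terminator.
    return 4 + sum(1 + len(k) + 1 + 4 for k in keys) + 1

# BSON arrays are documents keyed by their decimal index strings.
as_array = bson_int32_doc_size(["0", "1", "2"])
as_doc = bson_int32_doc_size(["a", "b", "c"])
print(as_array == as_doc)  # True: arrays buy nothing over one-letter keys
```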

Should we worry about this issue? We live in an era of abundant storage and memory. MongoDB pre-allocates storage to avoid disk fragmentation: even the tiniest collection will use 128 MB, and larger collections are stored in 2 GB files, so MongoDB is unafraid to waste close to 2 GB. In fact, we might say that it is precisely because we live in such abundance that we can afford to use document-oriented databases. However, engineers still face problems with space utilization. Hence, it is useful to be aware of the effect that the names you choose will have, especially if you come from a relational database context where name length is irrelevant.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

5 thoughts on “Compressing document-oriented databases by rewriting your documents”

  1. @David

    If you have short names, it may not automatically save much to replace the name (as a string) by a pointer to the name in a dictionary, and it may even take more space (and more memory). It would certainly introduce a (small) computational overhead.

    So a more reasonable implementation would only use a dictionary for the long names.

    This being said, a clever implementation could end up being superior to what MongoDB currently does.

  2. This seems particularly bizarre as I’d have thought interning your keys was a really easy storage optimisation to do and would basically always be a large win. Any idea why this isn’t done?

  3. Would it really lower the insert speed much? With sensible in-memory caching (which is probably free, given that Mongo does everything in memory anyway), the cost of looking up the key would be tiny compared to the cost of writing to disk (and writes of many objects might even win due to less data being written).
