What is the space overhead of Base64 encoding?

Many Internet formats, from email (MIME) to the Web (HTML/CSS/JavaScript), are text-only. If you send an image or executable file by email, it often first gets encoded using base64. The trick behind base64 encoding is that it uses 64 different ASCII characters: all letters, upper and lower case, all ten digits, and two additional symbols (typically ‘+’ and ‘/’).

Not all non-textual documents are shared online using base64 encoding, but it is quite common. Load up google.com or bing.com and look at the HTML source code: you will find base64-encoded images. On my blog, I frequently embed figures using base64: it is convenient for me to have the blog post content be one blob of data.
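For illustration, here is a minimal sketch of how an image gets inlined into HTML as a base64 data URI. The payload below is a placeholder (just the PNG signature bytes), not a real image:

```python
import base64

# Placeholder standing in for real image bytes (PNG signature only).
image_bytes = b"\x89PNG\r\n\x1a\n"

payload = base64.b64encode(image_bytes).decode("ascii")
data_uri = "data:image/png;base64," + payload
img_tag = f'<img src="{data_uri}" alt="inline image">'
```

The resulting `img_tag` can be pasted directly into an HTML document, so the page is a single blob of data with no separate image request.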

Base64 is apparently wasteful because we use just 64 different values per byte, whereas a byte can represent 256 different values. That is, we use bytes (which are 8-bit words) as 6-bit words, wasting 2 bits out of every 8 bits transmitted. To send three bytes of information (3 times 8 is 24 bits), you need four bytes (4 times 6 is again 24 bits). Thus the base64 version of a file is 4/3 the size of the original: we use 33% more storage than we need to.
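The 4/3 ratio is easy to verify with Python's standard library:

```python
import base64

data = bytes(range(30))           # 30 bytes of sample binary data
encoded = base64.b64encode(data)  # 4 output characters per 3 input bytes

# 30 input bytes become 40 ASCII bytes: a ratio of exactly 4/3.
print(len(data), len(encoded), len(encoded) / len(data))
```

(When the input length is not a multiple of 3, base64 pads the output with `=` characters, so small files carry a few extra bytes beyond the 4/3 ratio.)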

That sounds bad. How can engineers tolerate such wasteful formats?

It is common for web servers to provide the content in compressed form. Compression partially offsets the wasteful nature of base64.

To assess the effect of base64 encoding, I picked a set of images used in a recent research paper. There are many compression formats, but an old and widely supported one is gzip. I encode the images using base64 and then compress them with gzip. I report the number of bytes. I make the files available.
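The measurement can be sketched as follows; as a baseline I also compute the size of the gzipped original, which several commenters below ask about. The random bytes stand in for image data (like PNG and JPEG content, they are essentially incompressible):

```python
import base64
import gzip
import os

def sizes(raw: bytes):
    """Sizes of: the raw data, its base64 encoding, the gzipped
    base64 encoding, and (as a baseline) the gzipped raw data."""
    b64 = base64.b64encode(raw)
    return len(raw), len(b64), len(gzip.compress(b64)), len(gzip.compress(raw))

# Stand-in for already-compressed image bytes:
raw, b64, b64_gz, raw_gz = sizes(os.urandom(100_000))
```

On such data, gzip recovers most of the base64 inflation: the base64 output uses only 64 symbol values, so the entropy coder can pack each symbol back into roughly 6 bits.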

File name            Size     Base64 size   Base64 gzip size
bing.png             1355     1832          1444
googlelogo.png       2357     3186          2477
lena_color_512.jpg   105764   142876        108531
mandril_color.jpg    247222   333970        253868
peppers_color.jpg    9478     12807         9798

As you can see, the gzip sizes are within 7% of the original sizes. And for the larger files, the difference is closer to 3%.

Thus you can safely use base64 on the Web without too much fear.

In some instances, base64 encoding might even improve performance, because it avoids the need for distinct server requests. In other instances, base64 can make things worse, since it tends to defeat browser and server caching. Privacy-wise, base64 encoding can have benefits since it hides the content you access in larger encrypted bundles.

Further reading. Faster Base64 Encoding and Decoding using AVX2 Instructions, ACM Transactions on the Web 12 (3), 2018. See also Collaborative Compression by Richard Startin.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

23 thoughts on “What is the space overhead of Base64 encoding?”

  1. Thank you for the interesting post. What do you mean by “Privacy-wise, base64 encoding can have benefits since it hides the content you access in larger encrypted bundles.”?

    Base64 itself is an encoding scheme and not an encryption algorithm. Therefore it does not provide secrecy, but merely obscurity. Am I missing something here?

  2. What’s the original file size after gzip? In other words, compare gzip(original) with gzip(base64(original)) and gzip(base64(gzip(original))).

    Jon

  3. Thanks for the good primer, easy to follow.

    For clarity though, would you consider adding a column which shows the gzipped version of the original? Given most servers and browsers enable this transparently by default these days it might be misleading for some to omit it.

    Also, if you’re thinking of a follow-up it would be great to compare typical CPU cycles for each approach. With IoT all the rage, storage and bandwidth often gives way to processing budget.

  4. Why would you use Base64 and then gzip? The point of using Base64 is so that the output only uses a safe subset of ASCII. But gzip will turn that into binary. If you’re going to compress a file, there is no value in using Base64 first.

    A more common use case is to compress the file first (using gzip, JPEG, or whatever is appropriate for the file) and then use Base64 to make the compressed file safe for transmission via email.

    1. The author was speaking about web servers. In current HTTP traffic, most content is automatically gzip compressed when sent out from a webserver.

      However, that means that the comparison should be done between the binary compressed and the base64 compressed. That’s the true comparison for real world situations.

      1. That’s what I was thinking too; you can’t just say “Oh, but compressed Base64 is almost as small as uncompressed binary”. That’s beside the point. That being said, many binary formats like JPEG are already compressed, so gzipping those may not help much, but after reducing the entropy by base64 encoding the data, it makes sense that it becomes easier to compress again.

        Ultimately I don’t see much of a point though, as images easily get much larger than text-based formats like HTML. And the more of your data is static content, the more you can profit by caching it.

        Also base64 wastes 1/4 of the bits, not 1/3, plus a few bits depending on how the data aligns. So for large amounts of data, it’s essentially 25% of wasted bits.

        1. Also base64 wastes 1/4 of the bits, not 1/3

          My blog post is explicit as to what I mean: the base64 version of a file is 4/3 larger than it might be. You send 4 bytes for 3 bytes of actual information.

    2. The test here approximates the result of Base64 encoding to place an image in an HTML document, then using (negotiated) HTTP compression when exchanging the document.

      That is, for various reasons people may want to embed an image in an HTML document rather than provide a hypertext reference to the image. Most image formats contain binary data. HTML does not support embedding arbitrary binary objects, so the data must be encoded. Most people embed images as a data URI for the ‘img’ element. The data URI supports “base64” as the only available encoding scheme, so most people embed using Base64 encoding.

      The HTTP document is then transferred over HTTP. HTTP supports automatic compression, if the client/server can agree on a compression scheme. The most widely used scheme is ‘gzip’, which is the same method used in the gzip command-line program.

      Thus, it is reasonable to approximate the payload overhead of base64-escaped data URIs followed by gzip HTTP compression, by taking the image file, Base64-encoding it, and using gzip to compress the result.

      This was described in the text as “look at the HTML source” and “It is common for web servers to provide the content in compressed form”. This is a well-known topic that typically doesn’t deserve the level of detail I just gave.

        1. Since the bitstream is already compressed, you’re probably seeing entropy compression for 64 uniformly distributed values. You might try inlining the images into a typical document (with more skew) to see if the compression holds up, i.e. their mutual information.

  5. You compare the size of the original content to the size of the base64 + gzipped result, but isn’t a more interesting comparison the one between the only-gzipped original content vs base64 + gzipped content? After all, we expect that independently from whether base64 is needed in the protocol, compression will be applied.

    For your corpus of .png and .jpeg files, I don’t expect you to see much of a difference, since both png and jpeg are already backed by at-least-as-good-as-deflate coding, so re-compression is generally minimal. So all gzip is doing is undoing (via entropy coding, as the matching portion is probably useless) specifically the base64 inflation (and the fact that it still has a 5% overhead shows that it isn’t a particularly efficient entropy coder).

    For files that actually can be compressed, however, the results may be very different – and in realistic cases I think the result could be a penalty larger than 33% for base64, as the encoding can interfere with the compression.

    1. Do we have any statistics on gzip usage? I have tried a few well-known sites, and they all appear to serve the content in compressed form. For example, GMail uses gzip. You would think that Google would be on top of things, security-wise. Or is that a security issue that is specific to some form of secure layers and not others?

      1. Interesting question. So far as I understand it, it applies to all layers, but I’m not an expert in that type of thing.

        I don’t have access to the Alexa top 500 list, but Moz has a list of 500, https://moz.com/top500. I wrote some very rough code (https://gist.github.com/twirrim/877bcaf373aa1fec99c102b7c84ea1ce), using python3 and the requests library, to go through and check for Content-Encoding appearing in the headers of responses for them:

        {False: 53, True: 390, ‘Unknown’: 57}

        Unknown is a catchall for “Something didn’t go right” rather than indicative of any confusion about whether compression is enabled.

        So more use it than don’t, by a good margin.

  6. Just don’t rely on size estimations to limit ingress traffic: Some base64 encodings allow comments, which can be used to amplify ratio bytes_in/bytes_decoded.

    1. ASCII spaces are certainly possible within base64-encoded text, but I have never seen comments. Do you have an example in the wild or a reference to the part of the specification that allows comments?
