Downloading files faster by tweaking headers

I was given a puzzle recently. Someone was parsing JSON files downloaded over the network from a bioinformatics URI. One JSON library was twice as fast as the other one.

Unless you are on a private high-speed network, the time required to parse a file will always be small compared to the time required to download a file. Maybe people at Google have secret high-speed options, but most of us have to make do with speeds below 1 GB/s.

So how could it be?

One explanation might have to do with how the client (such as curl) and the web server negotiate the transmission format. Even if the actual data is JSON, what is transmitted is often in compressed form. Thankfully, you can tell your client to request a specific encoding. In my particular case, out of all the encodings I tried, gzip was much faster. The reason seems clear enough: when I requested gzip, I got 82 KB back, instead of 766 KB.

command                                   time    size
curl -H 'Accept-Encoding: gzip' $URL      0.5 s   82 KB
curl -H 'Accept-Encoding: deflate' $URL   1.0 s   766 KB
curl -H 'Accept-Encoding: br' $URL        1.0 s   766 KB
curl -H 'Accept-Encoding: identity' $URL  1.0 s   766 KB
curl -H 'Accept-Encoding: compress' $URL  1.0 s   766 KB
curl -H 'Accept-Encoding: *' $URL         1.0 s   766 KB
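
Measurements of this kind can be collected with curl's --write-out option, which reports the number of bytes transferred and the total time. What follows is a minimal sketch, not necessarily the commands used for the numbers above: the function name measure_encodings is made up for illustration, and URL is assumed to hold the address of the file you want to test.

```shell
# Hypothetical helper: request the same URL once per encoding and print
# the encoding, the bytes actually transferred, and the elapsed time.
measure_encodings() {
  url=$1
  for enc in gzip deflate br identity compress '*'; do
    # -s: silent, -o /dev/null: discard the body,
    # -w: print transfer statistics once the request completes
    curl -s -o /dev/null -H "Accept-Encoding: $enc" \
         -w "$enc\t%{size_download} bytes\t%{time_total} s\n" "$url"
  done
}

# Run only when URL is set, e.g. URL=https://example.org/data.json
if [ -n "${URL:-}" ]; then
  measure_encodings "$URL"
fi
```

Timings from a single run are noisy, so it is probably worth repeating the loop a few times before drawing conclusions.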

Sure enough, if you look at the downloaded file, it is 766 KB, but if you gzip it, you get 82 KB.
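
You can run the same kind of check locally, without touching the network. The sketch below builds a stand-in JSON file with repetitive records (invented data, not the bioinformatics file discussed here) and compares its size before and after gzip.

```shell
# Create a stand-in JSON file with repetitive records (not the real data).
json=$(mktemp)
awk 'BEGIN {
  printf "["
  for (i = 0; i < 5000; i++)
    printf "%s{\"id\": %d, \"gene\": \"GENE%d\", \"score\": 0.5}", (i ? "," : ""), i, i
  print "]"
}' > "$json"

# Size in bytes of the raw file and of its gzip-compressed form.
raw=$(wc -c < "$json")
packed=$(gzip -c "$json" | wc -c)
echo "raw: $raw bytes, gzip: $packed bytes"
rm -f "$json"
```

For a file you have already downloaded, the last step reduces to comparing `wc -c < file.json` with `gzip -c file.json | wc -c`. How much you save depends on how repetitive the data is.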

What I find interesting is that my favorite tools (wget and curl) do not request gzip by default. At least in this instance, requesting it would make the download much faster. The curl tool has a --compressed flag that requests compressed content and decompresses it for you.
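
As a sketch, here are two ways to get a compressed download with curl: set the header by hand and decompress the result yourself, or let --compressed handle both steps. The function names are hypothetical, and the URL and output file are passed in as arguments.

```shell
# Hypothetical helper functions; "$1" is the URL, "$2" the output file.

# Manual route: ask for gzip, then decompress the response ourselves.
# This assumes the server honors the request; if it replies with
# uncompressed data, gunzip fails.
fetch_gzip_manual() {
  curl -s -H 'Accept-Encoding: gzip' "$1" | gunzip -c > "$2"
}

# Convenient route: --compressed sends Accept-Encoding for the formats
# this curl build supports and decompresses the response automatically.
fetch_compressed() {
  curl -s --compressed -o "$2" "$1"
}
```

The second form is the safer default, since it still works when the server chooses to send the data uncompressed.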

Of course, the point is moot if the data is already in compressed form on the server.

  1. Hi @lemire, thanks for following up on this issue.
    Using the R curl package, you can specify headers to accept gzip content, and curl will automatically download and decompress it for you:

    "Accept-Encoding": "deflate, gzip"

    I had to use it in my own package to support fast downloads in regions with low internet speed (developing countries). More importantly, using curl opens the door to more speed enhancements, including asynchronous downloads, which cut the download time even further, especially if the download server allows concurrent downloads. I am not a network guru, but I think we can combine the goodies of curl and RcppSimdJson in a new package.

