Regular Visual Studio versus ClangCL

If you are programming in C++ using Microsoft tools, you can use the traditional Visual Studio compiler. Or you can use Clang/LLVM through the ClangCL toolset.

Let us compare their performance characteristics with a fast string transcoding library (simdutf). I use an up-to-date Visual Studio (2022) with the latest ClangCL component (based on LLVM 15), and I build the library with the latest version of CMake. I abstain from parallelizing the build: I use default settings. Hardware-wise, I use a Microsoft Surface Laptop Studio: it has a Tiger Lake Intel processor (i7-11370H @ 3.3 GHz).

After grabbing the simdutf library from GitHub, I prepare the build directory for the standard Visual Studio compiler:

> cmake -B buildvc

I do the same for ClangCL:

> cmake -B buildclangcl -T ClangCL

You may also build directly with LLVM:

> cmake -B buildfastclang -D CMAKE_LINKER="lld" -D CMAKE_CXX_COMPILER=clang++ -D CMAKE_C_COMPILER=clang -D CMAKE_RC_COMPILER=llvm-rc

For each build directory, I can build in Debug mode (--config Debug) or in Release mode (--config Release) with commands such as

> cmake --build buildvc --config Debug
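
Or, for an optimized build:

> cmake --build buildvc --config Release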

The project builds an extensive test suite by default. I often rely on my Apple MacBook, and I build a lot of software using Amazon (AWS) nodes. Here, I use an AWS c6i.large node (Intel Ice Lake running at 3.5 GHz, 2 vCPUs).

The simdutf library and its test suite build in a reasonable time, as illustrated by the following table (Release builds). For comparison purposes, I also build the library using WSL (Windows Subsystem for Linux) on the Microsoft laptop.

System                   Processor          Compiler   Build time
MacBook Air              ARM (Apple M2)     LLVM 14    25 s
AWS/Linux                Intel Ice Lake     GCC 11     54 s
AWS/Linux                Intel Ice Lake     LLVM 14    54 s
WSL (Microsoft laptop)   Intel Tiger Lake   GCC 11     1 min*
WSL (Microsoft laptop)   Intel Tiger Lake   LLVM 14    1 min*

* Timings taken after copying the source tree to the WSL file system (see the update below).

On Intel processors, we build multiple kernels to support the various families of processors; on a 64-bit ARM processor, we only build one kernel. Taking this into account, the build times of the AWS/Linux system and the MacBook are somewhat comparable.

Let us switch back to Windows and build the library.

Compiler                         Debug        Release
Visual Studio (default)          2 min        2 min 15 s
ClangCL                          2 min 51 s   3 min 12 s
Windows LLVM (direct with lld)   2 min        2 min 4 s

Let us run an execution benchmark. We pick a UTF-8 Arabic file, load it in memory, and transcode it to UTF-16 using a fast AVX-512 algorithm. (The exact command is benchmark -P convert_utf8_to_utf16+icelake -F Arabic-Lipsum.utf8.txt.)
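
Concretely, the invocation looks like the following. The path to the benchmark executable is an assumption on my part: it depends on your build tree, and with the Visual Studio generators the binaries typically land in a Debug or Release subdirectory.

> cd buildclangcl
> .\benchmarks\Release\benchmark.exe -P convert_utf8_to_utf16+icelake -F Arabic-Lipsum.utf8.txt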

Compiler                  Debug        Release
Visual Studio (default)   0.789 GB/s   4.2 GB/s
ClangCL                   0.360 GB/s   5.9 GB/s
WSL GCC 11                (omitted)    6.3 GB/s
WSL LLVM 14               (omitted)    5.9 GB/s
AWS server (GCC)          (omitted)    8.2 GB/s
AWS server (clang)        (omitted)    7.7 GB/s

I draw the following tentative conclusions:

  1. There may be a significant performance difference between Debug and Release code (e.g., a 5x to 15x difference).
  2. Compiling your Windows software with ClangCL may lead to better performance (in Release mode). In my test, I get a 40% speedup with ClangCL. However, compiling with ClangCL takes much longer. I have recommended that Windows users build their libraries with ClangCL, and I maintain this recommendation.
  3. In Debug mode, the regular Visual Studio compiler produces faster code and compiles more quickly than ClangCL.

Thus it might make sense to use the regular Visual Studio compiler in Debug mode, as it builds quickly and offers other benefits while testing the code, and then to switch to ClangCL in Release mode for performance.
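
Concretely, with the build directories defined above, that workflow would look like the following. For day-to-day testing with the Visual Studio compiler:

> cmake -B buildvc
> cmake --build buildvc --config Debug

And for the optimized binaries with ClangCL:

> cmake -B buildclangcl -T ClangCL
> cmake --build buildclangcl --config Release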

You may bypass ClangCL under Windows and build directly with clang and the LLVM linker (lld) for faster builds. However, I have not verified that the resulting code runs as fast.

No matter what I do, however, I seem to be getting slower builds under Windows than I expect. I am not exactly sure why building the code takes so much longer under Windows. It is not my hardware, since building with Linux under Windows (WSL), on the same laptop, is fast.

Of course, parallelizing the build is the most obvious way to speed it up. Appending -- -m to my CMake build command (so that MSBuild uses multiple processes) could help. I deliberately avoided parallel builds in this experiment.
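
For instance, the following would let MSBuild spread the Release build over several cores:

> cmake --build buildvc --config Release -- -m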

WSL Update. Reader Quan Anh Mai asked me whether I had made a copy of the source files on the Windows Subsystem for Linux drive, and I had not. Doing so multiplied the build speed by a factor of three. I have included the better timings in the table above.
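
For reference, the fix amounts to copying the sources from the Windows mount into the WSL file system before configuring and building. From a WSL shell, with hypothetical paths:

$ cp -r /mnt/c/Users/daniel/simdutf ~/simdutf
$ cd ~/simdutf
$ cmake -B build && cmake --build build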

21 thoughts on “Regular Visual Studio versus ClangCL”

  1. Did you build in WSL from its native file system or from the Windows one? WSL access to the Windows file system is notoriously slow, so it may be better to make a copy of the project in the WSL file system to measure the build speed. Thanks.

  2. Have you excluded your Windows source and build directories from antivirus scanning? In my experience that’s the most common reason for slow builds on Windows.

      1. Sure, but the point was that there’s telemetry in the build process that many don’t expect, and that it might be related to slower builds.

    1. I think I provide enough information for reproduction in my blog post. I already formally reported that Visual Studio has non-competitive performance with SIMD kernels like simdutf.

  3. If you’re using a recent-ish version of LLVM and CMake on Windows, you don’t need to use the clang-cl driver at all. A trivial toolchain file for cmake like:

    set(CMAKE_C_COMPILER clang)
    set(CMAKE_CXX_COMPILER clang++)
    set(CMAKE_RC_COMPILER llvm-rc)

    run in the VS development prompt will get it working with the normal clang frontend, so you can use the normal GCCish arguments.
    If it’s linking that’s taking the time, adding the "-fuse-ld=lld" argument will speed it up a lot, as lld is much faster than link.exe.

    1. That’s useful.

      I tried the following…

      cmake -D CMAKE_LINKER="lld" -D CMAKE_CXX_COMPILER=clang++ -D CMAKE_RC_COMPILER=llvm-rc -B buildfastclang3

      A debug build then takes 2 minutes, and a release build slightly over 2 minutes.

  4. I don’t know if it applies here but one thing I’ve noticed in particular is that clang-cl tends to inline a lot more aggressively than MSVC.

    1. Note that with Visual Studio 2019+, you can use the /Ob3 flag for more aggressive inlining.
      /Ob2 is the default when using /O2.
      I haven’t tested if this actually makes a difference though.

  5. Process creation on Windows is really expensive. It was bad enough for the Chrome build that they finally changed the clang driver to call its internal drivers directly in-process instead of in a new process.

    And file system operations are slow for many small files. Someone said that’s only because of hooks like antivirus, but it still feels like NTFS is slower in general than ext4.
