Precision, recall and why you shouldn’t crank up the warnings to 11

Recently, the code hosting site GitHub widely deployed a tool called CodeQL with rather aggressive settings. It does static analysis on the code and attempts to flag problems. I use the phrase “static analysis” to refer to an analysis that does not run the code. Static analysis is limited: it can identify a range of actual bugs, but it also tends to catch false positives: code patterns that it thinks are bugs but are not.

Several Intel engineers proposed code to add AVX-512 support to a library I help support. We got the following scary warnings:

CodeQL is complaining that we are taking as input a pointer to 8-byte words and treating it as if it were a pointer to 64-byte words. If you work with AVX-512 and are providing optimized replacements for existing functions, such code is standard. And no compiler that I know of, even at the most extreme settings, will ever issue a warning, let alone a scary “High severity Check Failure”.

On its own, this is merely a small annoyance that I can ignore. However, I fear that it is part of a larger trend where people come to rely more and more on overbearing static analysis to judge code quality. The more warnings, the better, they think.

And indeed, surely, the more warnings that a linter/checker can generate, the better it is?

No. It is incorrect for several reasons:

  1. Warnings and false errors may consume considerable programmer time. You may say ‘ignore them’ but even if you do, others will be exposed to the same warnings and they will be tempted to either try to fix your code, or to report the issue to you. Thus, unless you work strictly alone or in a closed group, it is difficult to escape the time sink.
  2. Training young programmers to avoid non-issues may make them less productive. The two most important features of software are (in order): correctness (whether it does what you say it does) and performance (whether it is efficient). Fixing shallow warnings is easy work, but it often contributes neither to correctness (i.e., it does not fix bugs) nor to performance (it does not make the code any faster). You may feel productive, and it may look like you are changing much code, but what are you gaining?
  3. Modifying code to fix a non-issue has a non-zero chance of introducing a real bug. If you have code that has been running in production for a long time, without error… trying to “fix it” (when it is not broken) may actually break it. You should be conservative about fixing code without strong evidence that there is a real issue. Your default behaviour should be to refuse to change the code unless you can see the benefits. There are exceptions but almost all code changes should either fix a real bug, introduce a new feature or improve the performance.
  4. When programming, you need to clear your mental space. Distractions are bad. They make you dumber. So your programming environment should not have unnecessary features.

Let us use some mathematics. Suppose that my code has bugs, and that a static checker has some probability of catching a bug each time it issues a warning. In my experience, this probability can be low… but the exact percentage is not important to the big picture. Let me use a reasonable model. Given B bugs per 1000 lines, the probability that a warning has caught a bug follows a logistic function, say 1/(1+exp(10 – B)). So if I have 10 bugs per 1000 lines of code, then each warning has a 50% probability of being useful. That is quite optimistic.
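This per-warning precision can be computed directly; here is a minimal sketch, with the constant 10 taken from the logistic function above:

```python
import math

def warning_precision(bugs_per_kloc: float) -> float:
    """Probability that a given warning flags a real bug,
    modeled as a logistic function of the bug density B."""
    return 1.0 / (1.0 + math.exp(10.0 - bugs_per_kloc))

# With 10 bugs per 1000 lines, each warning is a coin flip:
print(warning_precision(10))  # 0.5
```

The denser the bugs, the more useful each warning: with 20 bugs per 1000 lines, the per-warning precision under this model approaches 1.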

The recall is the fraction of the bugs that I have caught. If I have 20 bugs per 1000 lines of code, then issuing a million warnings will almost ensure that all bugs are caught. But human beings would need to do a lot of work to sift through them.

So given B, how many warnings should I issue? Of course, in the real world I do not know B, and I do not know that the usefulness of the warnings follows a logistic function, but humour me.

A reasonable answer is that we want to maximize the F-score: the harmonic mean between the precision and the recall.

I hastily coded a model in Python where I vary the number of warnings. The recall always increases while the precision always falls. The F-score follows a unimodal curve: having no warnings is terrible, but having too many is just as bad. With a small number of warnings, you can maximize the F-score.
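The model can be sketched as follows. This is a reconstruction, not my original script, and the recall side is an assumption: warnings catch bugs at random from a shrinking pool of uncaught bugs, so each extra warning has diminishing returns.

```python
import math

def model(bugs: float, warnings: int, base_precision: float = 0.5):
    """Toy model of a static checker issuing `warnings` warnings
    against code containing `bugs` bugs (per 1000 lines)."""
    # Expected number of distinct bugs caught: diminishing returns
    # as the pool of uncaught bugs shrinks.
    caught = bugs * (1.0 - math.exp(-base_precision * warnings / bugs))
    precision = caught / warnings   # falls as warnings grow
    recall = caught / bugs          # rises toward 1
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Sweep the number of warnings: the F-score peaks at a moderate count.
best = max(range(1, 200), key=lambda w: model(10, w)[2])
```

With 10 bugs per 1000 lines and a 50% base precision, the sweep finds its F-score maximum at a few dozen warnings, and the score decays on both sides of that peak.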

A more intuitive description of the issue is that the more warnings you produce, the more likely you are to waste programmer time. You are also more likely to catch bugs. One is negative, one is positive. There is a trade-off. When there is a trade-off, you need to seek the sweet spot in the middle.

The trend toward an ever increasing number of warnings does not improve productivity. In fact, at the margin, disabling the warnings entirely might be just as productive as having the warnings: the analysis has zero value.

I hope that it is not a symptom of a larger trend in which programming becomes bureaucratic. Software programming is one of the key industries where productivity gains have been fantastic and where we have been able to innovate at great speed.

Daniel Lemire, "Precision, recall and why you shouldn’t crank up the warnings to 11," in Daniel Lemire's blog, March 15, 2023.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

14 thoughts on “Precision, recall and why you shouldn’t crank up the warnings to 11”

  1. I presume there’s also a balance between how concise a warning is, versus providing all the valid reasons you might encounter such an error. Such as the example you gave.

  2. I see a similar issue with code style checkers; recently I feel the industry has adopted quite a dogmatic approach – the more, the better.

    On one of our projects we have black, isort, flake… I had quite a few commits dealing with things like long lines with SQL queries being rejected… Having one tool that is integrated with IDEs makes sense… but having three that I have to run through Docker…

    But regarding the initial situation in the blog post: I think it’s fine writing something like // noqa: this is safe because blabla
    It solves the problem of multiple people dealing with the same warning… Of course, if you have a million of these, your point still stands.

  3. Having a static analyzer in your IDE from the very beginning of the project tuned for your current project has low overhead and excellent ROI. Maintaining near zero warnings is easy.

    Checking an existing code base is often counterproductive. Even trying to come up with one set of personal fine-tuned warning settings doesn’t really work. Every project tends to have its own kind of typical false warnings, and you don’t want to turn them all off because cumulatively they have a good chance of catching some real issues.

    1. many languages also have a “Ignore x warning for this region of code” pragma/attribute. That way you can get to zero without torturing your code to meet some warning.

  4. I noticed that they were moving toward that and thought it was a bad idea to offer additional checks by default based on my own experience with the security-and-quality checks, but I did not say anything.

    Last year, I arranged for OpenZFS to start running CodeQL on every commit and it rarely complained, but we had a number of bugs that were being caught that I felt a somewhat more aggressive version would catch. That resulted in the following experimental branch where I have been slowly disabling buggy / pointless queries while filing bug reports about them:

    https://github.com/ryao/zfs/blob/codeql-experiments/.github/codeql.yml

    Not that long ago, it caught a bug in a patch that reviewers missed, which vindicated my thoughts that a more aggressive analyzer would be useful. However, it is still not quite ready yet for me to open a PR against the main repository, since there is still a fair amount of noise. Some of it comes from tests while the rest comes from our python bindings. Unfortunately, I am not yet well versed enough in python to make judgment calls on those, but I imagine I will eventually.

    That said, I often find making code changes to silence false positive reports from static analyzers to be worthwhile since the result is often significantly cleaner code than what was there originally. This is true for Clang’s static analyzer’s false positives, although I also often find them to be opportunities to add missing assertions. Unfortunately, false positives from CodeQL’s security-and-quality checks are an exception to this trend since I find that those checks rarely identify opportunities to improve code cleanliness. I often look at the false positive reports from CodeQL’s security-and-quality checks and think “the reviewers will think I have gone insane if I submit a patch based on this”.

    There is one saving grace to GitHub’s move. GitHub’s code scanning only reports new findings in PRs, so even if you ignore the initial deluge of reports, you can still get useful information from a noisier static analyzer, so this could be less terrible than it seems. However, getting people into the habit of ignoring reports is not good practice, so I highly suggest anyone deploying those checks do some tuning to turn off the checks with the worst signal-to-noise ratio first.

    People could even use what I have done in my codeql-experiments branch as a basis for that, although I suggest at least spot checking reports in that category before turning it off, since maybe by the time you actually do this, CodeQL will have fixed some of the checkers. I plan to separate the checks being disabled into two groups, which are completely pointless checks that we should never enable and broken checks. After I have finished getting that branch ready to be merged and it is merged into OpenZFS, I plan to re-check the broken checks periodically. That would be maybe once or twice a year. That way, when the checks are fixed, we can start using them. That is probably something that others who adopt the security-and-quality checks should do too.

    1. In hindsight, I should have written:

      “we had a number of bugs that were being caught outside of CodeQL that I felt a somewhat more aggressive version would catch.”

  5. Great read as usual. While you talk about false-positives, here’s one more (and IMO a bigger) issue to consider: Checks that blindly mark certain pattern/functions without even attempting to check whether the usage was correct or not.

    For example, clang-tidy by default will mark any memcpy call (regardless of whether it was correct or not) as “unsafe” and recommend using the non-portable memcpy_s function.

    This is significantly more harmful (especially when enabled by default) because not only is it not catching (or even attempting to catch) any actual bugs, but it’s going to trick newbies who don’t know any better into writing non-portable code with a false sense of security.

    This trend is very unfortunate and concerning because I regularly use clang-tidy (and cppcheck) and even recommend it to others as they have in the past caught real bugs.

    1. I opened an issue with LLVM to request more specific checks and that issue has morphed into a more general complaint about that checker in general:

      https://github.com/llvm/llvm-project/issues/61098

      Hopefully, they will drop that check in favor of more intelligent checks in the future.

      That said, there is likely room for someone to open an issue asking for that check to be disabled in clang-tidy by default. It is deeply flawed.

  6. Great to see this frank feedback, and the discussion of the precision/recall trade off is spot on. I like the use of the F1 score.

    The problems of precision and recall are well recognised at GitHub, where I work, which is why there are different suites of queries for different precision/recall appetites.

    There is also the ability to filter out specific problematic queries (such as the one you describe), modify them or even write your own.

    CodeQL is opt-in for repos and can be configured quite a lot – you can change which “suite” is used (e.g. security-extended ) and you can use “query filters” to further include or exclude specific queries.

    The GitHub docs on configuring Code Scanning with CodeQL are here.

    If you find security-and-quality too noisy you can move to security-extended instead, or go back down to the default set, which is tuned to be low false positives – high precision – vs the quality set, and the extended set, which increase recall at the expense of precision.

    The default set is included in the extended set, and the extended set is included in the quality set.

    You can really crank it up to 11 with the security-experimental suite, but that’s for auditing or where you really need a rule that hasn’t been fully tested yet to meet the standards for the other suites – you can just pick and choose one rule from a suite.

    I work in “Field” at GitHub, rather than directly on the product, but a big part of my role is helping get this balance right for customers using Advanced Security (the paid version of what open source get for free). We even have our own set of queries that we maintain (and move into the product, when they are ready) to help customers and demonstrate what’s possible.

    If you get in touch with me by email or on Fosstodon I’d be happy to take a look at the repo you’re contributing to and see what can be done about the level of false positives.

    1. Thanks. My blog post wasn’t a criticism of CodeQL per se.

      If you find security-and-quality too noisy you can move to security-extended instead, or go back down to the default set, which is tuned to be low false positives – high precision – vs the quality set, and the extended set, which increase recall at the expense of precision.

      I removed the security-and-quality setting. Please be mindful that once you promote some level of static analysis, this tends to propagate: “I just added your code to my project and CodeQL has found all these bugs”.

  7. And now if you vary your “B” parameter, how does the optimal number of warnings change? The world is moving toward hordes of not-so-well-trained programmers because that model seems to scale better and has a faster response time. And in that world, issuing tons of warnings is actually a good thing.

  8. Back when I worked on Windows Vista, the Windows team introduced static analysis tools that operated in conjunction with source code annotations. The vast majority of flagged issues were false positives, but the problem wasn’t just wasted time from investigating non-issues. Some manager had the brilliant idea of outsourcing all the “trivial fixes” for issues flagged by static analysis to a large IT contractor in India. You can probably guess how well that went. Novice programmers completely unfamiliar with one of the world’s most complex codebases introduced so many bugs (I wish I had statistics), which the Windows developers then had to fix, that I’m sure it would have been cheaper to leave the investigation and fixes to the original developers. The original “bugs” were mostly illusory, but the bugs introduced in the “fixes” certainly were not. (Not that I have anything against static analysis: the Vista codebase was far more robust than XP as a result. But this was definitely the wrong way to implement it.)
