Some useful regular expressions for programmers

In my blog post, My programming setup, I stressed how important regular expressions are to my programming activities.

Regular expressions can look intimidating and outright ugly. However, they should not be underestimated.

Someone asked for examples of regular expressions that I rely upon. Here are a few.

  1. It is commonly considered a faux pas to include ‘trailing white space’ in code. That is, your lines should end with the line-return control characters and nothing else. In a regular expression, the end of the string (or line) is marked by the ‘$’ symbol, a white-space character is indicated with ‘\s’, and a sequence of one or more white-space characters is ‘\s+’. Thus if I search for ‘\s+$’, I will locate all offending lines.
  2. It is often best to avoid non-ASCII characters in source code. Indeed, in some cases, there is no standard way to tell the compiler about your character encoding, so non-ASCII characters might trigger problems. To find all non-ASCII characters, you may search for [^\x00-\x7F].
  3. Sometimes you insert too many spaces between a variable and an operator. Multiple spaces are fine at the start of a line, since they can be used for indentation, but other repeated spaces are usually in error. You can check for them with the expression \b\s{2,}. The \b indicates a word boundary, so the match must begin right after a word character, which excludes indentation.
  4. I use spaces to indent my code, but I always use an even number of spaces (2, 4, 8, etc.). Yet I might get it wrong and insert an odd number of spaces in some places. To detect these cases, I use the expression ^(\s\s)*\s[^\s]. To delete the extra space, I can select it with look-ahead and look-behind expressions such as (?<=^(\s\s)*)\s(?=[^\s]).
  5. I do not want a space after the opening parenthesis nor before the closing parenthesis. I can check for such a case with (\(\s|\s\)). If I want to remove the spaces, I can detect them with a look-behind expression such as (?<=\()\s.
  6. Suppose that I want to identify all instances of a variable. I can search for \bmyname\b. By using word boundaries, I ensure that I do not catch instances of the string inside other function or variable names. Similarly, if I want to select all variables that end with some expression, I can do it with an expression like \b\w*myname\b.
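
As a rough illustration, the checks above could be scripted along the following lines in Python. This is only a sketch under some assumptions: the names CHECKS, find_issues, tidy_line and count_identifier are hypothetical, the text is processed one line at a time, and the look-ahead used for the closing parenthesis is my own addition. Python's re module rejects the variable-length look-behind from item 4, so the sketch uses a capturing group in its place.

```python
import re

# Patterns taken from the list above. The dictionary keys are
# hypothetical labels chosen for this sketch.
CHECKS = {
    "trailing whitespace": re.compile(r"\s+$"),
    "non-ASCII character": re.compile(r"[^\x00-\x7F]"),
    "repeated spaces": re.compile(r"\b\s{2,}"),
    "odd indentation": re.compile(r"^(\s\s)*\s[^\s]"),
    "space inside parentheses": re.compile(r"(\(\s|\s\))"),
}


def find_issues(text):
    """Return (line number, label) pairs, checking one line at a time."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            if pattern.search(line):
                issues.append((number, label))
    return issues


def tidy_line(line):
    """Apply three of the fixes to a single line (no newline expected)."""
    # Item 1: drop trailing white space.
    line = re.sub(r"\s+$", "", line)
    # Item 4: remove the one extra space of an odd indentation.
    # Python's re only accepts fixed-width look-behind, so the even
    # part of the indentation is captured and put back instead.
    line = re.sub(r"^((?:\s\s)*)\s(?=\S)", r"\1", line)
    # Item 5: remove spaces after '(' (look-behind from the text) and
    # before ')' (look-ahead, an assumption of this sketch).
    line = re.sub(r"(?<=\()\s+|\s+(?=\))", "", line)
    return line


def count_identifier(text, name):
    """Item 6: count whole-word occurrences of an identifier."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return len(pattern.findall(text))
```

For instance, find_issues("   y = f( x );") reports an odd indentation and a space inside parentheses, and tidy_line turns that same line into "  y = f(x);".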

The great thing with regular expressions is how widely applicable they are.

Many of my examples have to do with code reformatting. Some people wonder why I do not simply use code reformatters. I do use such tools all of the time, but they are not always a good option. If you are going to work with other people who have other preferences regarding code formatting, you do not want to trigger hundreds of formatting changes just to contribute a new function. It is a major faux pas to do so. Hence you often need to keep your reformatting in check.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

9 thoughts on “Some useful regular expressions for programmers”

    1. As I understand it, the idea of the article is to teach useful regular expressions, using code formatting as an example.

  1. Assuming that most of your research is done in C or C++, I wonder why you’re not considering using clang-format for these tasks as regular expressions will only get you so far?

  2. Nice tips. You can use \S instead of [^\s] to shorten some of these.

    To delete the extra space, I can select it with look-ahead and look-behind expressions such as <(?<=^(\s\s)*)\s(?=[^\s]).

    I think you’ve got an extra < in there, unless that’s some sort of new metacharacter.

    I do not want a space after the opening parenthesis nor before the closing parenthesis. I can check for such a case with (\(\s|\s\)). If I want to remove the spaces, I can detect them with a look-behind expression such as (?<=\()\s.

    This is probably the desired behaviour, but just to note, \s will also match newlines. If you wanted to preserve those, you could use [ \t] instead.

    Your use of lookbehind & lookahead is interesting. When I’m doing search-and-replace on code, I always include the prefix and suffix in capturing groups and account for those in the replacement.

    …on further consideration, that’s probably because I’m doing it in Emacs, whose native regex engine is rather primitive.

    1. This is probably the desired behaviour, but just to note, \s will also match newlines.

      It depends on whether the regular expression is applied to the whole document or to lines. Many editors match regular expressions on a line-by-line basis by default.

    2. Your habit of using capture groups instead of lookarounds is good. A lot of regex engines don’t support variable-length lookarounds.

  3. These are some really useful regular expressions. I use some all the time, but I used your blog as an opportunity to delete all the annoying trailing spaces in our code base (like really, who does that?).
