Some useful regular expressions for programmers

In my blog post, My programming setup, I stressed how important regular expressions are to my programming activities.

Regular expressions can look intimidating and outright ugly. However, they should not be underestimated.

Someone asked for examples of regular expressions that I rely upon. Here a few.

  1. It is commonly considered a faux pas to include ‘trailing white space’ in code. That is, your lines should end with the line-return control characters and nothing else. In a regular expression, the end of the string (or line) is marked by the ‘$’ symbol, and a white-space can be indicated with ‘\s’, and a sequence of one or more white space is ‘\s+’. Thus if I search for ‘\s+$‘, I will locate all offending lines.
  2. It is often best to avoid non-ASCII characters in source code. Indeed, in some cases, there is no standard way to tell the compiler about your character encoding, so non-ASCII might trigger problems. To check all non-ASCII characters, you may do [^\x00-\x7F].
  3. Sometimes you insert too many spaces between a variable or an operator. Multiple spaces are fine at the start of a line, since they can be used for indentation, but other repeated spaces are usually in error. You can check for them with the expression \b\s{2,}. The \b indicate a word boundary.
  4. I use spaces to indent my code, but I always use an even number of spaces (2, 4, 8, etc.). Yet I might get it wrong and insert an odd number of spaces in some places. To detect these cases, I use the expression ^(\s\s)*\s[^\s]. To delete the extra space, I can select it with look-ahead and look-behind expressions such as (?<=^(\s\s)*)\s(?=[^\s]).
  5. I do not want a space after the opening parenthesis nor before the closing parenthesis. I can check for such a case with (\(\s|\s\)). If I want to remove the spaces, I can detect them with a look-behind expression such as (?<=\()\s.
  6. Suppose that I want to identify all instances of a variable, I can search for \bmyname\b. By using word boundaries, I ensure that I do not catch instances of the string inside other functions or variable names. Similarly, if I want to select all variable that end with some expression, I can do it with an expression like \b\w*myname\b.

The great thing with regular expressions is how widely applicable they are.

Many of my examples have to do with code reformatting. Some people wonder why I do not simply use code reformatters. I do use such tools all of the time, but they are not always a good option. If you are going to work with other people who have other preferences regarding code formatting, you do not want to trigger hundreds of formatting changes just to contribute a new function. It is a major faux pas to do so. Hence you often need to keep your reformatting in check.

Slope One Predictors for Online Rating-Based Collaborative Filtering (SDM’05 / April 20-23th 2005)

I’m very proud of this little paper called Slope One Predictors for Online Rating-Based Collaborative Filtering. The paper report on some of the core collaborative filtering research leading to the inDiscover web site. I’ll be presenting it at SIAM Data Mining 2005 in April (Newport Beach, California).

This is a case where, with Anna Maclachlan, we did something that few researchers do these days: we looked for something simpler. The main result of the paper is that you can use extremely simple and easy to implement algorithms and get very competitive results.

The current trend, in academia, is to develop crazy algorithms that require not 10 lines of code, not 100 lines of code, but several thousands. I think the same is true in some industries: think of Web Services or Java (with the infinite number of new acronyms).

Well, I like complex algorithms and as a math guy, I like a challenge, but once in a while, I think it pays to go I think “wait! what if the average Joe wants to implement this?”

So, if you write real code and are interested in collaborative filtering, go check this paper.