Regular Expressions (part 2)

The first regexp came about after someone on IRC asked for help: they wanted to remove the comments from a C++ source file. I tried to help:

(/\*([^*]|[\n]|(\*+([^*/]|[\n])))*\*+/)|(//.*)

Note: this is not the exact expression I came up with at the time; this one is richer and better than my original answer.

Let’s analyze it:

  1. /\* – matches the opening /* of a block comment.
  2. [^*]|[\n] – matches any character except *, or a newline character.
  3. \*+ – matches one or more * in the middle of a comment.
  4. [^*/]|[\n] – matches any character except * and /, or a newline.
  5. \*+/ – matches one or more * followed by the / character.
  6. //.* – matches // followed by any characters up to the end of the line.

After matching the opening /* the expression becomes a bit harder to understand. What happens next is that we match anything (including new lines) except a *, or we match one or more * followed by anything that does not complete the end of comment */. Finally, we match one or more * followed by a /, which closes the comment.
The second part, after the |, matches only single-line comments in C++.
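
To try the expression outside of an editor, here is a minimal sketch in Python (my choice of language for illustration; the original question had nothing to do with Python). The names COMMENT_RE and strip_comments are made up for this example, and the pattern itself is copied unchanged from above.

```python
import re

# The expression from the article, unchanged. A raw string keeps the backslashes intact.
COMMENT_RE = re.compile(r"(/\*([^*]|[\n]|(\*+([^*/]|[\n])))*\*+/)|(//.*)")


def strip_comments(source: str) -> str:
    """Remove /* ... */ and // comments from C++ source text (hypothetical helper)."""
    return COMMENT_RE.sub("", source)


sample = "int a = 1; /* first\n * second **/ int b = 2; // trailing\nint c = 3;\n"
cleaned = strip_comments(sample)
print(cleaned)
```

Keep in mind that the expression knows nothing about string literals, so a // inside a string (a URL, for instance) would be treated as the start of a comment.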

Have you ever received an email full of HTML garbage? It has happened to me more than once, and it’s extremely annoying to have to pick the text out of the middle of the HTML. So I decided to create a regular expression that would help me remove this kind of garbage. If you didn’t understand what I meant by garbage, here is an example of one of these emails:

<html><div style='background-color:'><DIV>
<DIV>
<P class=MsoNormal><FONT color=navy face=Impact size=5><SPAN style="BACKGROUND: #f7f7f7; COLOR: navy;
FONT-FAMILY: Impact; FONT-SIZE: 18pt">This is extremely&nbsp</SPAN></FONT><FONT color=#9966ff
face=Impact size=5 FAMILY="SANSSERIF"> <SPAN style="BACKGROUND: #f7f7f7; COLOR: #9966ff; FONT-FAMILY: Impact;
FONT-SIZE: 18pt">annoying&nbsp;</SPAN></FONT> <FONT color=navy face=Tahoma FAMILY="SANSSERIF">
<SPAN style="BACKGROUND: #f7f7f7; COLOR: navy; FONT-FAMILY: Tahoma">&nbsp;</SPAN></FONT>

In this case I used this regexp:

(\<[^\<]*\>)|&nbsp;

This one is quite easy to grasp: we just grab everything between a < and a >, but we have to add a safeguard that excludes any further < inside the match, since regular expressions are pretty greedy and like to match whatever they can. Without it, a single match could run from the start of one tag into the next.
The &nbsp; alternative matches the HTML entity for a non-breaking space. We could filter other similar entities as well, but this one seems to do the trick in most situations.
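
As a rough sketch of how this could be applied, here is the same expression in Python. The names TAG_OR_NBSP_RE and strip_html are hypothetical, and turning &nbsp; into a plain space instead of deleting it is my own assumption, made so that neighbouring words do not get glued together.

```python
import re

# The expression from the article, unchanged.
TAG_OR_NBSP_RE = re.compile(r"(\<[^\<]*\>)|&nbsp;")


def strip_html(text: str) -> str:
    """Drop tags; replace &nbsp; with a space (an assumption, not the article's rule)."""
    return TAG_OR_NBSP_RE.sub(
        lambda m: " " if m.group(0) == "&nbsp;" else "", text
    )


html = (
    "<P class=MsoNormal><FONT color=navy>This is extremely&nbsp;</FONT>"
    '<SPAN style="COLOR: #9966ff">annoying&nbsp;</SPAN></P>'
)
plain = strip_html(html)
print(plain)
```

Because the match runs from a < to the last > before the next <, a literal > in the text right after a tag would be swallowed too, so this is a quick clean-up for messy emails and not an HTML parser.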

Regular Expressions (part 1)

A while ago I was refactoring the #include’s of a C++ project and I needed to know which project files were including STL headers (in our case that meant the included files wouldn’t have the .h extension). So, I decided to write a regular expression to find these files:

\#include:b*\<.+[^\.h]\>

If you type this simple expression into the Visual Studio search box, you get a list of all the #include’s that don’t have the .h in the name of the file. Now I’ll break the regular expression apart and try to explain it step by step:

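  1. \#include – matches the literal text #include.
  2. :b* – matches any number of spaces or tabs (including none).
  3. \< – matches the < character.
  4. .+ – matches one or more characters.
  5. [^\.h] – matches a single character that is neither . nor h. This is how names ending in .h get rejected, although it also rejects any name whose last character is h, such as <cmath>.
  6. \> – matches the > character.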

Meaning of the characters in the expression:

  • \ – escape a character, the character after this symbol is treated as a normal character instead of a special character used in regular expressions.
  • :b – space or tab.
  • * – 0 or more times.
  • + – 1 or more times.
  • . – any character except the end of line.
  • [] – any single character from the set inside the []; a ^ right after the [ negates the set.

Note: this regular expression might not be compatible with other programs because it uses syntax specific to VS, such as the :b that matches a space or tab.
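
For readers outside of VS, here is a hedged sketch of how the expression might be ported to Python, replacing :b with the character class [ \t]. The names INCLUDE_RE and STRICT_RE are invented for this example. STRICT_RE is my own variation, not the expression from the article: it uses a lookbehind so that only names really ending in .h are rejected.

```python
import re

# Port of the VS expression: ":b" becomes "[ \t]", the rest is kept as-is.
INCLUDE_RE = re.compile(r"#include[ \t]*<.+[^.h]>")

# Assumed variant (not from the article): reject only names that end in ".h".
STRICT_RE = re.compile(r"#include[ \t]*<[^>]+(?<!\.h)>")

lines = [
    "#include <iostream>",
    "#include <stdio.h>",
    "#include   <vector>",
    "#include <cmath>",
    '#include "local.h"',
]

matches = [line for line in lines if INCLUDE_RE.search(line)]
strict_matches = [line for line in lines if STRICT_RE.search(line)]

print(matches)         # ['#include <iostream>', '#include   <vector>']
print(strict_matches)  # the two above plus '#include <cmath>'
```

The difference between the two lists shows the limitation of the [^\.h] step: <cmath> ends in h, so the original expression skips it.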

Another example: removing the leading characters (garbage) from lines of actual code:

1.          #include <iostream>
2.             using namespace std;
3.         int main()
4.           {
5.           cout << "Hello World!";
6.        return 0;
7.         }

I’m sure you have already come across something like this, and when you paste it into the editor it’s really a pain in the ass to remove all that garbage line by line. Here’s another expression that will help with this task:

^[^a-zA-Z_$/{}\#"'\+\-]+

Again, let’s go step by step:

  1. ^ – this means that we’ll start to match only at the beginning of a line.
  2. [^…]+ – matches one or more characters that are not in the set of characters that follows the ^.
  3. a-zA-Z_$/{}\#"'\+\- – the excluded set: the characters from a to z (same for uppercase letters) and the following characters: _, $, /, {, }, #, ", ', + and -.

This means that, starting at the beginning of each line, the expression matches everything up to (but not including) the first character from the excluded set. In VS, replace this expression with an empty string to remove the garbage.
Note: It’s quite possible that the regular expressions presented here will fail in some cases (especially the second one), because it’s really complicated to test all the possibilities, but in the general case these should work.
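
The same replacement can be sketched in Python. GARBAGE_RE and strip_leading_garbage are names made up for this example, and processing the text one line at a time is my own choice: a negated set also matches the newline character, so on a multi-line string a match could otherwise run over an empty line into the next one.

```python
import re

# The expression from the article, unchanged (triple quotes because it contains " and ').
GARBAGE_RE = re.compile(r"""^[^a-zA-Z_$/{}\#"'\+\-]+""")


def strip_leading_garbage(text: str) -> str:
    """Remove the leading garbage from every line (hypothetical helper)."""
    return "\n".join(GARBAGE_RE.sub("", line) for line in text.splitlines())


pasted = "\n".join([
    "1.          #include <iostream>",
    "2.             using namespace std;",
    "3.         int main()",
    "4.           {",
    '5.           cout << "Hello World!";',
    "6.        return 0;",
    "7.         }",
])

clean = strip_leading_garbage(pasted)
print(clean)
```

Note that the original indentation goes away together with the line numbers, and a line of code that begins with a character outside the excluded set, such as ( or a digit, would lose that character as well.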

I hope these two examples will make you see the power of regular expressions, or even be useful to you ;) If you have any comments about this article, or any problems with a regular expression, just let me know.