Wednesday, April 16, 2008

How to achieve 6000% performance loss using Regex

The System.Text.RegularExpressions namespace contains a RegexOptions enumeration, and one of its values deserves a close look: RegexOptions.Compiled. MSDN, humble and terse as usual, briefly informs us that this option compiles the expression to an assembly (no kidding!). And the example it gives treacherously shows you exactly how to create a regular expression with this option. It won't hurt much if you, like me, use regular expressions fearfully, as a doomsday weapon, but any mass operation could get you in trouble. Used the way the sample code uses it, this option can degrade component performance drastically: up to 6000% for just a few hundred operations (if you are interested, you can find the test details here).

Just do not use RegexOptions.Compiled at all. Definitely not in the instance's Match method. This code is evil:

Regex rex = new Regex(pattern, RegexOptions.Compiled);
Match m = rex.Match(input);

If you really, really want to compile the regular expression, use the static method instead:

Match m = Regex.Match(input, pattern, RegexOptions.Compiled);

Even then, keep the Regex cache size in mind (you can control it through the Regex.CacheSize property), and remember that by using the static method you give up control over the regular expression's lifetime.
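
As a rough sketch of the static-method approach with an enlarged cache (the CacheDemo class, the CountMatches helper, the pattern and the cache size of 50 are all made up for illustration, not taken from any real component):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheDemo
{
    // Hypothetical helper: counts the inputs that match the pattern,
    // going through the static Regex API on every call.
    public static int CountMatches(string[] inputs, string pattern)
    {
        int hits = 0;
        foreach (string input in inputs)
        {
            // Static call: the regex for (pattern, options) is taken from
            // the internal Regex cache when it is already there.
            if (Regex.IsMatch(input, pattern, RegexOptions.Compiled))
            {
                hits++;
            }
        }
        return hits;
    }

    public static void Main()
    {
        // The default cache size is 15 entries; 50 is an arbitrary value
        // picked for this sketch.
        Regex.CacheSize = 50;

        string[] inputs = { "a1", "b", "c22" };
        Console.WriteLine(CountMatches(inputs, @"\d+"));
    }
}
```

If the application cycles through more distinct patterns than the cache holds, entries get evicted and the compile cost comes back, so the cache size probably needs to be matched to the number of patterns actually in use.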

A noticeable advantage can be gained only if you do massive processing: at least a thousand iterations of the same medium-complexity pattern. With fewer iterations, the startup cost will eat up the performance gain.
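
To see where the break-even point lies on your own machine, something like the following could be used. It is only a sketch: the RegexBench class, the pattern, the input and the iteration counts are invented for illustration and are not the test referred to earlier in this post.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexBench
{
    // Hypothetical pattern and input, chosen only for illustration.
    const string Pattern = @"\b[A-Z][a-z]+\s\d{1,2},\s\d{4}\b";
    const string Input = "Posted on April 16, 2008 by the author";

    // Measures construction of one Regex instance plus `iterations` matches,
    // so the one-time "start" price is included in the result.
    public static TimeSpan Time(RegexOptions options, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Regex rex = new Regex(Pattern, options);
        for (int i = 0; i < iterations; i++)
        {
            rex.IsMatch(Input);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        foreach (int n in new[] { 1, 100, 1000, 100000 })
        {
            double interpreted = Time(RegexOptions.None, n).TotalMilliseconds;
            double compiled = Time(RegexOptions.Compiled, n).TotalMilliseconds;
            Console.WriteLine("{0,7} iterations: interpreted {1:F2} ms, compiled {2:F2} ms",
                n, interpreted, compiled);
        }
    }
}
```

The very first measurement also pays for JIT warm-up of the Regex classes themselves, so it is worth running the loop more than once before trusting the numbers.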

Another way to speed up a regular expression is the Regex.CompileToAssembly method, but it takes more steps, and the expression has to be recompiled whenever it changes.
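
The extra steps look roughly like this. The sketch assumes the .NET Framework, which is what this post is about; on newer runtimes CompileToAssembly is not supported, hence the try/catch. The names DigitsRegex and MyRegexes and the pattern are hypothetical.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public static class PrecompileRegexes
{
    // Returns true if the assembly was written, false if the current
    // runtime does not support CompileToAssembly.
    public static bool TryCompile()
    {
        // Hypothetical regex: type name "DigitsRegex" in namespace "MyRegexes".
        RegexCompilationInfo info = new RegexCompilationInfo(
            @"\d+", RegexOptions.None, "DigitsRegex", "MyRegexes", true);
        try
        {
            // On the .NET Framework this writes MyRegexes.dll
            // to the current directory.
            Regex.CompileToAssembly(new[] { info }, new AssemblyName("MyRegexes"));
            return true;
        }
        catch (PlatformNotSupportedException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(TryCompile()
            ? "Assembly written"
            : "CompileToAssembly is not supported on this runtime");
    }
}
```

The consuming project would then reference the generated assembly and create the regex with new MyRegexes.DigitsRegex(); changing the pattern means running this step again.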

Coding Horror and MSDN blogged about this quite a while ago. It doesn't seem that the .NET Framework SP fixed this problem as announced. Another MSDN article claims: "The 2.0 Regex class no longer caches parsed regular expressions created by Regex instance methods, it only caches regular expressions created by Regex static methods." Then obviously this behaviour shouldn't exist! I wasn't really determined, but I wandered through the Regex code, and apparently the static method creates an instance behind the scenes and both approaches follow the same path. So the issue could be with memory management or thread synchronization. Or it could be a setback from the dark side of the Force.

Oh, and I found this sample of highly maintainable code in the Regex class:

if ((options < RegexOptions.None) || ((((int)options) >> 10) != 0))
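
My reading of that line is that it rejects any value with bits set outside the ten lowest ones. A sketch of a more readable equivalent, where OptionsCheck, ValidBits and both method names are my own and not anything from the framework:

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsCheck
{
    // Assumption: ten option bits (0..9), which is what the shift by 10 implies.
    const int ValidBits = (1 << 10) - 1; // 0x3FF

    // The condition exactly as quoted from the Regex class.
    public static bool IsOutOfRangeOriginal(RegexOptions options)
    {
        return (options < RegexOptions.None) || ((((int)options) >> 10) != 0);
    }

    // Same test with a named mask: any bit outside the valid ones is an error.
    // Negative values have the sign bit set, so they are caught as well.
    public static bool IsOutOfRangeReadable(RegexOptions options)
    {
        return ((int)options & ~ValidBits) != 0;
    }

    public static void Main()
    {
        int[] samples = { -1, 0, 8, 1023, 1024, int.MinValue, int.MaxValue };
        foreach (int value in samples)
        {
            RegexOptions options = (RegexOptions)value;
            Console.WriteLine("{0}: original={1}, readable={2}",
                value, IsOutOfRangeOriginal(options), IsOutOfRangeReadable(options));
        }
    }
}
```

Since an arithmetic right shift keeps the sign, a negative value never shifts down to zero, so the first half of the original condition looks redundant as well.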


© 2008-2013 Michael Goldobin. All rights reserved