Monday, February 21, 2011

How do I write a .NET regular expression to match from the end of the line back?

I have the following line of text:

Reference=*\G{7B35DDAC-FFE2-4435-8A15-CF5C70F23459}#1.0#0#..\..\..\bin\App Components\AcmeFormEngine.dll#ACME Form Engine

and wish to grab the following as two separate capture groups:

AcmeFormEngine.dll
ACME Form Engine

Can anyone help?

From stackoverflow
  •     using System.Text.RegularExpressions;

        Regex regex = new Regex(
            @"\\(?<filename>[\w\.]+)\#(?<comment>[\w ]+)$",
            RegexOptions.IgnoreCase
            | RegexOptions.Compiled
        );
    
    Matthew Scharley : Does the hash really need escaping? What special meaning does it have?
    Bartek Szabat : Hash is the beginning of a comment
    Gishu : # seems to stand for comment :) I think
    Matthew Scharley : Silly .NET regexes. Fixed mine now.
    Gishu : this is broken if you have - or _ in the filename
  • Regex r = new Regex(@"\\(.+?)\#(.+?)$");
    

    Non-greedy multiplicities are great.

    '$': Match the end of the string.

    "\#(.+?)": Match everything back from the end of the string till the first '#' character and return that in a capture.

    "\\(.+?)": Same again, except with an escaped '\'.

    Gishu : this doesn't work. '\.' is a valid match
    Matthew Scharley : Should be fixed now. silly # comments.
    Joel Coehoorn : upvote because it's the shortest expression and you explained how/why it works
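
A hedged aside on the answer above: scanned left to right, the lazy pattern starts matching at the first backslash in the line, so both groups capture far more than intended. One way to get the "from the end back" behaviour the explanation describes is RegexOptions.RightToLeft, which makes .NET begin at the end of the input and scan leftward. A minimal sketch; the class and method names are made up for illustration and are not part of the original answer:

```csharp
using System;
using System.Text.RegularExpressions;

static class RightToLeftDemo
{
    // Same pattern as the answer above, written as a verbatim string.
    const string Pattern = @"\\(.+?)\#(.+?)$";

    // Hypothetical helper: returns { group 1, group 2 }, or null when nothing matches.
    public static string[] Capture(string line, RegexOptions options)
    {
        Match m = Regex.Match(line, Pattern, options);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }

    static void Main()
    {
        string line = @"Reference=*\G{7B35DDAC-FFE2-4435-8A15-CF5C70F23459}#1.0#0#..\..\..\bin\App Components\AcmeFormEngine.dll#ACME Form Engine";

        // Left to right: the match starts at the first backslash, so group 1
        // is the GUID part and group 2 swallows the rest of the line.
        string[] leftToRight = Capture(line, RegexOptions.None);
        Console.WriteLine(leftToRight[0]);

        // Right to left: the lazy groups grow backwards from the end of the line.
        string[] rightToLeft = Capture(line, RegexOptions.RightToLeft);
        Console.WriteLine(rightToLeft[0]); // AcmeFormEngine.dll
        Console.WriteLine(rightToLeft[1]); // ACME Form Engine
    }
}
```

With RightToLeft the pattern itself stays unchanged; only the scan direction differs, which is why the lazy quantifiers stop at the last '#' and the last '\' instead of the first ones.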
  • If you are certain of the string format, you can also solve this in a down-to-earth way, without regex: take everything after the last index of '\' and split it at '#'.

    Gishu : Agree. More readable than a regex in this specific scenario.
    tvanfosson : And more efficient since we only need to do character comparison and avoid the overhead of the state machine.
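
The non-regex approach described above could be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes the line always ends in "<file>#<description>" after its last backslash:

```csharp
using System;

static class ReferenceLine
{
    // Hypothetical helper: returns { file name, description } from the tail of the line.
    public static string[] SplitTail(string line)
    {
        // Everything after the last backslash. If there is no backslash,
        // LastIndexOf returns -1 and the whole line is used.
        string tail = line.Substring(line.LastIndexOf('\\') + 1);

        // Split at the first '#' only, so a '#' inside the description is kept.
        string[] parts = tail.Split(new[] { '#' }, 2);
        if (parts.Length != 2)
            throw new FormatException("Expected '<file>#<description>' after the last backslash.");
        return parts;
    }

    static void Main()
    {
        string line = @"Reference=*\G{7B35DDAC-FFE2-4435-8A15-CF5C70F23459}#1.0#0#..\..\..\bin\App Components\AcmeFormEngine.dll#ACME Form Engine";
        string[] parts = SplitTail(line);
        Console.WriteLine(parts[0]); // AcmeFormEngine.dll
        Console.WriteLine(parts[1]); // ACME Form Engine
    }
}
```

Limiting the split to two parts is a design choice of this sketch, not something the answer specifies; it only matters if the description itself can contain a '#'.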
  • I voted for tomalask's non-regex approach. However, if you HAD to do it with a regex, I think you need something like this:

    \\([^\\/?"<>|]+?)\#([^\\/?"<>|]+?)[\r\n]*$
    

    This allows characters such as - and _, which are valid in filenames. It consists of two identical groups (each excluding characters that are invalid in Win32 filenames), preceded by a backslash, delimited by a #, and anchored at the end of the line (the $). It assumes the second group is also a valid Win32 filename. I saw some ugly boxes at the end of the matched second group; the [\r\n]* keeps them out.

    e.g. F5C70F23459}#1.0#0#..\..\..\bin\App Components\Acme_Form-Engine.dll#ACME Form Engine
    group#1 => Acme_Form-Engine.dll
    group#2 => ACME Form Engine
    

    In short, this is arcane; avoid it if possible.
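
For anyone trying the last pattern from C#, here is a minimal sketch of how it might be written. Two things in it are assumptions rather than part of the answer: the class and method names are invented, and the "ugly boxes" are taken to be a trailing "\r\n". Note that inside a verbatim string each double quote in the character classes has to be doubled:

```csharp
using System;
using System.Text.RegularExpressions;

static class Win32NameDemo
{
    // The pattern from the answer above; "" is a single quote character
    // inside a C# verbatim string.
    static readonly Regex TailPattern = new Regex(
        @"\\([^\\/?""<>|]+?)\#([^\\/?""<>|]+?)[\r\n]*$");

    // Hypothetical helper: returns { group 1, group 2 }, or null when nothing matches.
    public static string[] Capture(string line)
    {
        Match m = TailPattern.Match(line);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }

    static void Main()
    {
        // The example line from the answer, with an assumed trailing line break.
        string line = @"F5C70F23459}#1.0#0#..\..\..\bin\App Components\Acme_Form-Engine.dll#ACME Form Engine" + "\r\n";

        string[] groups = Capture(line);
        Console.WriteLine(groups[0]); // Acme_Form-Engine.dll
        Console.WriteLine(groups[1]); // ACME Form Engine
    }
}
```

Because neither group may contain a backslash, a left-to-right scan can only succeed at the last backslash in the line, so no special scan direction is needed here.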
