Tuesday, April 5, 2011

Take multiple matches with regex separated by defined marks

Hello. I have a text from which I need to extract content that follows a defined pattern: the content between MARK1 and MARK2, and the content after MARK2. The marks can repeat, and I need to capture all of their occurrences. For the example below, I expect the following array:

text: "textA textB _MARK1_ textC _MARK2_ textD _MARK1_ textE textF _MARK2_ textG textH textI"

array(0): _MARK1_ textC _MARK2_ textD 
array(1): textC
array(2): textD
array(3): _MARK1_ textE textF _MARK2_ textG textH textI 
array(4): textE textF
array(5): textG textH textI
From Stack Overflow
  • I don't think you'll be able to achieve this with a single expression. Likely you'll need to break it down into an initial expression and then a loop to perform a 2nd expression match against each iteration of the first match.

  • Am I missing something or is this what you are looking for?

    /(_MARK1_ (.*?) _MARK2_ (.*?))*/
    

    I made some arbitrary assumptions about how you want to handle spaces, which I realize were probably only consistent to make your example case more readable.

  • That would be:

    /(_MARK1_(.*?)_MARK2_((?:(?!_MARK1_).)*))/g
    

    At least, it works on RegEx Coach on your test case.
    Of course, you need to iterate on each match.
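    Note it might not work on all flavors of regex: the negative lookahead (?!...) is available in Perl, PCRE, Python, Java, .NET and JavaScript, but POSIX-style engines such as grep -E and sed have no lookahead assertions.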

    Davi Kenji : Perfect. That's it.
    Sparr : good catch, excluding _MARK2__MARK1_, I didn't cover that case in my solution
  • I'm not sure whether you actually need the separating marks in your array. That part seems superfluous unless you have a specific spec for it. This solution assumes you don't really need that. Since you didn't specify a language, how about Perl?

    use strict;
    use warnings;
    use Data::Dumper;
    my $text = 'textA textB _MARK1_ textC _MARK2_ textD _MARK1_ textE textF _MARK2_ textG textH textI';
    # Capture each run of text that follows a mark, up to the next mark or the end of the string.
    my @results = $text =~ m/(?<=_MARK1_|_MARK2_)(.*?)(?=_MARK1_|_MARK2_|$)/g;
    print Dumper(@results);
    

    However, there's no reason to try the general case with regular expressions. Use a parser instead.
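
One way to read the "use a parser" suggestion is a small scanner that splits the text on the marks and then walks the pieces, with no lookarounds involved. The following is only a sketch in Python, which the thread does not mention; the function name extract_pairs is hypothetical, and trimming the surrounding whitespace and skipping unpaired marks are assumptions about the desired behavior.

```python
import re

MARK1, MARK2 = "_MARK1_", "_MARK2_"


def extract_pairs(text):
    """Return (between, after) tuples for each MARK1 ... MARK2 ... section."""
    # A capturing group in re.split keeps the delimiters in the result,
    # so the list alternates: text, mark, text, mark, ..., text.
    tokens = re.split(f"({re.escape(MARK1)}|{re.escape(MARK2)})", text)
    pairs = []
    i = 1  # odd indexes hold marks, even indexes hold the text around them
    while i + 2 < len(tokens):
        if tokens[i] == MARK1 and tokens[i + 2] == MARK2:
            # Assumption: surrounding whitespace is not wanted in the output.
            between = tokens[i + 1].strip()
            after = tokens[i + 3].strip()
            pairs.append((between, after))
            i += 4
        else:
            i += 2  # assumption: a stray or unpaired mark is skipped
    return pairs


if __name__ == "__main__":
    sample = ("textA textB _MARK1_ textC _MARK2_ textD "
              "_MARK1_ textE textF _MARK2_ textG textH textI")
    for between, after in extract_pairs(sample):
        print(repr(between), repr(after))
```

Because the marks are handled as tokens rather than inside one pattern, rules for malformed input (a _MARK2_ with no preceding _MARK1_, for instance) can be changed in the loop without touching a regular expression.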

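For comparison, the lookahead pattern from the third answer can be iterated to build the flat array shown in the question. This is a sketch in Python rather than any language used in the thread; the name flat_matches is hypothetical, the outer capturing group is dropped because group(0) already holds the whole match, and the re.DOTALL flag and the whitespace trimming are assumptions.

```python
import re

# Pattern from the third answer. The tempered dot ((?!_MARK1_).) consumes one
# character at a time and stops the trailing group before the next _MARK1_.
# Assumption: re.DOTALL is set so that a section may span several lines.
PATTERN = re.compile(r"_MARK1_(.*?)_MARK2_((?:(?!_MARK1_).)*)", re.DOTALL)


def flat_matches(text):
    """Return [whole, between, after, whole, between, after, ...]."""
    result = []
    for m in PATTERN.finditer(text):
        # Assumption: surrounding whitespace is trimmed, as in the question.
        result.extend([m.group(0).strip(), m.group(1).strip(), m.group(2).strip()])
    return result


if __name__ == "__main__":
    sample = ("textA textB _MARK1_ textC _MARK2_ textD "
              "_MARK1_ textE textF _MARK2_ textG textH textI")
    for index, value in enumerate(flat_matches(sample)):
        print(f"array({index}): {value}")
```

Each match contributes three consecutive entries, which is how the six-element array in the question comes out of two matches.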