Sunday, May 1, 2011

AS3 RegExp to match words with boundary-type characters in them

I want to match a list of words, which is easy enough when those words are truly words. For example, /\b (pop|push) \b/gsx, when run against the string

pop gave the door a push but it popped back

will match the words pop and push but not popped.
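
As a quick sanity check, the behaviour described above can be reproduced outside Flash. The sketch below uses Python's re module rather than AS3's RegExp (an assumption made purely so the example is easy to run): re.VERBOSE stands in for the x flag, findall for the g flag, and the s flag is dropped because the pattern contains no dot.

```python
import re

# Python stand-in for /\b (pop|push) \b/gx; re.VERBOSE ignores the spaces in the pattern.
WORDS = re.compile(r"\b (pop|push) \b", re.VERBOSE)

text = "pop gave the door a push but it popped back"
matches = WORDS.findall(text)
print(matches)  # ['pop', 'push']
```

"popped" is skipped because there is no word boundary between its first "pop" and the "p" that follows.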

I need similar functionality for words that contain characters that would normally qualify as word boundaries. So I need something like /\b (reverse!|push) \b/gsx which, when run against the string

push reverse! reverse!push

matches only reverse! and push, but not reverse!push. Obviously this regex isn't going to do that, so what do I need to use instead of \b to make my regex smart enough to handle these funky requirements?
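
To make the failure concrete, here is a rough sketch of what the naive pattern actually returns. It is written with Python's re module as a stand-in for AS3's RegExp (an assumption; both follow the usual \w-based definition of \b), and the variable names are illustrative only.

```python
import re

# Python stand-in for /\b (reverse!|push) \b/gx.
NAIVE = re.compile(r"\b (reverse!|push) \b", re.VERBOSE)

text = "push reverse! reverse!push"
# Record each matched word together with the offset where it starts.
found = [(m.group(1), m.start()) for m in NAIVE.finditer(text)]
print(found)  # [('push', 0), ('reverse!', 14), ('push', 22)]
```

The standalone reverse! at offset 5 is missed, because there is no boundary between "!" and the following space. Both halves of reverse!push are matched instead, because there is a boundary between "!" and "p".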

From stackoverflow
  • Your first problem is that you need three (possibly four) cases in your alternation, not two.

    • /\breverse!(?:\s|$)/ reverse! by itself
    • /\bpush\b/ push by itself
    • /\breverse!push\b/ together
    • /\bpushreverse!(?:\s|$)/ this is the possible case

    Your second problem is that a \b won't match between a "!" and whitespace or the end of the string, because "!" is not a \w. Here is what Perl 5 has to say about \b; you may want to consult your docs to see if they agree:

    A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". (Within character classes "\b" represents backspace rather than a word boundary, just as it normally does in any double-quoted string.)

    So, the regex that you need is something like

    / \b ( reverse!push | reverse! | push ) (?: \s | \b | $ )+ /gx;
    

    I left out the /s because there are no periods in this regex, so "treat as single line" makes no sense. If /s doesn't mean "treat as single line" in your engine, you should probably add it back. Also, you should read up on how your engine handles alternation. I know that in Perl 5, to get the right behaviour, you must arrange the items this way (otherwise reverse! would always win over reverse!push).

    Alan Moore : Read the question again, Chas; the OP *doesn't* want to match "reverse!push".
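
The objection in the comment above can be checked with a small sketch. It runs the answer's pattern through Python's re module instead of AS3 (an assumption made for easy testing), and the names FIRST and captured are illustrative only.

```python
import re

# The first answer's pattern, unchanged apart from using re.VERBOSE for the x flag.
FIRST = re.compile(
    r"\b ( reverse!push | reverse! | push ) (?: \s | \b | $ )+",
    re.VERBOSE,
)

text = "push reverse! reverse!push"
# Collect only the captured word, not the trailing whitespace the pattern consumes.
captured = [m.group(1) for m in FIRST.finditer(text)]
print(captured)  # ['push', 'reverse!', 'reverse!push']
```

The run-together token reverse!push is reported as a match, which is exactly what the question wants to avoid.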
  • At the end of a word, \b means "the previous character was a word character, and the next character (if there is a next character) is not a word character." You want to drop the first condition because there might be a non-word character at the end of the "word". That leaves you with a negative lookahead:

    /\b (reverse!|push) (?!\w)/gx
    

    I'm pretty sure AS3 regexes support lookahead.

    DL Redden : In addition to using (?!\w) as the trailing \b replacement I also used (?
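
A rough check of the lookahead version on the sample string from the question, again using Python's re module as a stand-in for AS3 (an assumption; the names are illustrative only):

```python
import re

# Python stand-in for /\b (reverse!|push) (?!\w)/gx.
LOOKAHEAD = re.compile(r"\b (reverse!|push) (?!\w)", re.VERBOSE)

text = "push reverse! reverse!push"
# Record each matched word together with the offset where it starts.
found = [(m.group(1), m.start()) for m in LOOKAHEAD.finditer(text)]
print(found)  # [('push', 0), ('reverse!', 5), ('push', 22)]
```

The standalone reverse! is now found and the reverse! inside reverse!push is rejected. The push at offset 22 still matches, though, because the leading \b is satisfied between "!" and "p", so the leading side of the pattern needs the same kind of loosening as the trailing side.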
  • You can replace \b by something equivalent, but less strict:

    /(?<=\s|^)(reverse!|push)(?=\s|$)/g
    

    This way the limiting factor of the \b (that it can only match before or after an actual \w word character) is removed.

    Now white space or the start/end of the string function as valid separators, and the inner expression can be easily built at run-time, from a list of search terms for example.
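
The idea of building the inner expression at run-time might look like the sketch below. It is written in Python rather than AS3, and build_matcher is a hypothetical helper, not part of any library. One deliberate swap: Python's re module rejects variable-width lookbehind such as (?<=\s|^), so the sketch uses (?<!\S) and (?!\S), which express the same "whitespace or string edge" condition.

```python
import re

def build_matcher(terms):
    """Compile a pattern matching any term delimited by whitespace or string edges."""
    # re.escape keeps characters such as "!" or "+" literal inside the alternation.
    inner = "|".join(re.escape(term) for term in terms)
    # (?<!\S) and (?!\S) are Python-friendly spellings of (?<=\s|^) and (?=\s|$).
    return re.compile(r"(?<!\S)(" + inner + r")(?!\S)")

matcher = build_matcher(["reverse!", "push"])
result = matcher.findall("push reverse! reverse!push")
print(result)  # ['push', 'reverse!']
```

On the sample string this returns only the two standalone tokens. Neither half of reverse!push matches, because each half is adjacent to a non-whitespace character.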
