Tuesday, May 3, 2011

Java vs JavaScript Regex problem

Hello,
I am having a problem with my regular expression: <a.*href=[\"'](.*?)[\"'].*>(.*?)</a>. It is, as you can probably tell, supposed to take all the links from a string of HTML and return the link target in group 1 and the link text in group 2. If I try it in JavaScript (using http://www.regextester.com/, with all the flags on), it works fine, but not in Java, where I use it like this:

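import java.util.regex.Matcher;
import java.util.regex.Pattern;

// htmlData is a String holding the HTML of the page
Pattern myPattern = Pattern.compile("<a.*href=[\"'](.*?)[\"'].*>(.*?)</a>", Pattern.CASE_INSENSITIVE);
Matcher match = myPattern.matcher(htmlData);
while (match.find()) {
    String linkText = match.group(2);    // text between <a ...> and </a>
    String linkTarget = match.group(1);  // value of the href attribute
}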

I don't get all the matches I expect. With regex tester, I get many more and it works exactly as it is supposed to, but with the Java version, I get only 1 or 2 links per page.
Sorry if this is obvious, but I am new to regular expressions.
Thanks,
Isaac Waller

Edit: I think something might be wrong with my regex. For example, given this excerpt from an Apache "Index of" page:

<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Bryan%20Adams%20-%20Here%20I%20Am.mp3">Bryan Adams - Here I Am.mp3</a></td><td align="right">27-Aug-2008 11:48  </td><td align="right">170K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Cars%20-%20Drive.mp3">Cars - Drive.mp3</a></td><td align="right">26-Aug-2008 19:04  </td><td align="right">149K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Cock%20Robin%20-%20When%20Your%20Heart%20Is%20Weak.mp3">Cock Robin - When Your Heart Is Weak.mp3</a></td><td align="right">26-Aug-2008 19:04  </td><td align="right">124K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Colbie%20Caillat%20-%20Bubbly.mp3">Colbie Caillat - Bubbly.mp3</a></td><td align="right">27-Aug-2008 11:49  </td><td align="right">215K</td></tr>

<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Colbie%20Caillat%20-%20The%20Little%20Things.mp3">Colbie Caillat - The Little Things.mp3</a></td><td align="right">27-Aug-2008 11:49  </td><td align="right">176K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Coldplay%20-%20Violet%20Hill.mp3">Coldplay - Violet Hill.mp3</a></td><td align="right">27-Aug-2008 11:49  </td><td align="right">136K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Corrs%20-%20Radio.mp3">Corrs - Radio.mp3</a></td><td align="right">26-Aug-2008 19:04  </td><td align="right">112K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Corrs%20-%20What%20Can%20I%20Do.mp3">Corrs - What Can I Do.mp3</a></td><td align="right">26-Aug-2008 19:04  </td><td align="right">146K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Counting%20Crows%20-%20Big%20Yellow%20Taxi.mp3">Counting Crows - Big Yellow Taxi.mp3</a></td><td align="right">26-Aug-2008 19:04  </td><td align="right">135K</td></tr>

<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Curtis%20Stigers%20-%20I%20Wonder%20Why.mp3">Curtis Stigers - I Wonder Why.mp3</a></td><td align="right">26-Aug-2008 19:03  </td><td align="right">213K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Cyndi%20Lauper%20-%20Time%20After%20Time.mp3">Cyndi Lauper - Time After Time.mp3</a></td><td align="right">26-Aug-2008 19:03  </td><td align="right">193K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="David%20Bowie%20-%20Absolute%20Beginners.mp3">David Bowie - Absolute Beginners.mp3</a></td><td align="right">26-Aug-2008 19:04  </td><td align="right">155K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Depeche%20Mode%20-%20Enjoy%20The%20Silence.mp3">Depeche Mode - Enjoy The Silence.mp3</a></td><td align="right">26-Aug-2008 19:03  </td><td align="right">230K</td></tr>
<tr><td valign="top"><img src="/icons/sound2.gif" alt="[SND]"></td><td><a href="Dido%20-%20White%20Flag.mp3">Dido - White Flag.mp3</a></td><td align="right">27-Aug-2008 11:48  </td><td align="right">158K</td></tr>

I should get:
1: Bryan%20Adams%20-%20Here%20I%20Am.mp3
2: Bryan Adams - Here I Am.mp3
and many more like that. With Regex tester, I get all the results I want. With Java, I get none.

From stackoverflow
  • You have to escape the backslash characters and the quotation marks:

    Pattern myPattern = Pattern.compile("<a.*href=[\\\"'](.*?)[\\\"'].*>(.*?)</a>", Pattern.CASE_INSENSITIVE);
    

    However, that might not be your real problem. The backslashes are not really needed in the pattern. There are some other possible issues with the pattern.

    You are using a greedy match before the href attribute, which means that it will match from the start of the first link on the line to the href attribute of the last link on the line. Make the match non-greedy by changing it from ".*" to ".*?". The same goes for the match after the href attribute: it has to be non-greedy, or it will match up to the end of the last link on the line.

    The . character does not match line breaks, so if there are line breaks in the link code or in the text in the link, the link will not be matched. You can use [\W\w] instead of . to match any character.

    So, removing the backslashes, making the matches non-greedy and allowing line breaks would make the pattern:

    Pattern myPattern = Pattern.compile("<a[\\W\\w]*?href=[\"'](.*?)[\"'][\\W\\w]*?>([\\W\\w]*?)</a>", Pattern.CASE_INSENSITIVE);
    

    Edit:
    I forgot to escape the backslashes in the [\W\w] codes in the string.

    Isaac Waller : I think you use 3 slashes instead of 2, but it did not work regardless. Sorry.
    Alan Moore : Isaac said he's using the DOTALL modifier (along with all of the other modifiers), so his dots are already matching newlines. Come to think of it, if he had left that out, he probably wouldn't have noticed anything wrong; all of the links in the sample text are separated from each other by at least one newline.
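
    The effect of greedy versus non-greedy quantifiers, and of the DOTALL flag mentioned in the comment above, can be checked with a small test program. This is only a sketch: the class name GreedyVsLazy, the helper targets, and the two input strings are made up for illustration and are not taken from the question.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class GreedyVsLazy {
        // The pattern from the question (greedy) and the same pattern with lazy quantifiers.
        static final String GREEDY = "<a.*href=[\"'](.*?)[\"'].*>(.*?)</a>";
        static final String LAZY = "<a.*?href=[\"'](.*?)[\"'].*?>(.*?)</a>";

        // Hypothetical helper: returns group 1 (the href value) of every match in html.
        static List<String> targets(String regex, int flags, String html) {
            List<String> found = new ArrayList<>();
            Matcher m = Pattern.compile(regex, flags).matcher(html);
            while (m.find()) {
                found.add(m.group(1));
            }
            return found;
        }

        public static void main(String[] args) {
            int ci = Pattern.CASE_INSENSITIVE;
            // Made-up inputs: the same two links on one line and on two lines.
            String oneLine = "<a href=\"a.mp3\">A</a> <a href=\"b.mp3\">B</a>";
            String twoLines = "<a href=\"a.mp3\">A</a>\n<a href=\"b.mp3\">B</a>";

            // Greedy: one match that swallows both links and reports the last href.
            System.out.println(targets(GREEDY, ci, oneLine));   // [b.mp3]
            // Lazy: one match per link.
            System.out.println(targets(LAZY, ci, oneLine));     // [a.mp3, b.mp3]
            // Without DOTALL the newline stops the greedy .* so each line matches alone.
            System.out.println(targets(GREEDY, ci, twoLines));  // [a.mp3, b.mp3]
            // With DOTALL the greedy .* crosses the newline and the links merge again.
            System.out.println(targets(GREEDY, ci | Pattern.DOTALL, twoLines)); // [b.mp3]
            // Lazy quantifiers stay correct with DOTALL switched on.
            System.out.println(targets(LAZY, ci | Pattern.DOTALL, twoLines));   // [a.mp3, b.mp3]
        }
    }
    ```

    In Java, Pattern.DOTALL is an alternative to writing [\W\w] when the dot should also match line breaks.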
  • Don't all of the full-stop/wildcard matches need to be ungreedy?

    <a.*?href=[\"'](.*?)[\"'].*?>(.*?)</a>
    

    I'm not a Java developer, so I don't know the escaping rules for patterns.
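
    As for the escaping rules: in a Java string literal, only the double quotes in this pattern need a backslash, because they would otherwise end the literal. The following is a rough sketch of the non-greedy pattern in Java, not a definitive solution. The class name LinkExtractor and the method extract are hypothetical, the DOTALL flag is an assumption carried over from the regex tester settings, and the two input rows are shortened from the sample in the question.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class LinkExtractor {
        // Regex text after Java unescapes the literal: <a.*?href=["'](.*?)["'].*?>(.*?)</a>
        static final Pattern LINK = Pattern.compile(
                "<a.*?href=[\"'](.*?)[\"'].*?>(.*?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        // Hypothetical helper: maps each link target (group 1) to its
        // link text (group 2), keeping document order.
        static Map<String, String> extract(String html) {
            Map<String, String> links = new LinkedHashMap<>();
            Matcher m = LINK.matcher(html);
            while (m.find()) {
                links.put(m.group(1), m.group(2));
            }
            return links;
        }

        public static void main(String[] args) {
            // Two rows shortened from the sample listing in the question.
            String html =
                  "<tr><td><a href=\"Cars%20-%20Drive.mp3\">Cars - Drive.mp3</a></td></tr>\n"
                + "<tr><td><a href=\"Corrs%20-%20Radio.mp3\">Corrs - Radio.mp3</a></td></tr>\n";
            extract(html).forEach((target, text) -> System.out.println(target + " -> " + text));

            // "[\"']" is the regex ["'] and "[\\\"']" is the regex [\"'].
            // A backslash before a non-alphabetic character is a literal escape,
            // so both forms accept a double quote.
            System.out.println(Pattern.matches("[\"']", "\""));    // true
            System.out.println(Pattern.matches("[\\\"']", "\""));  // true
        }
    }
    ```

    Since both character-class spellings behave the same, the extra backslashes in the literal change nothing about what the pattern matches.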
