Tuesday, March 1, 2011

How do I fix wrongly nested / unclosed HTML tags?

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

From stackoverflow
  • Run it through Tidy or one of its ported libraries.

    Try to code it by hand and you will want to gouge your eyes out.

  • Beautiful Soup works great for this.

    http://www.crummy.com/software/BeautifulSoup/

    Baishampayan Ghose : I couldn't find any relevant examples of using BS to achieve this. Can you point me to some?
    Matthew Trevor : It's in the very first section on parsing HTML, right at the start of the documentation...terribly hard to find: http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20HTML
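  • Using BeautifulSoup:

    # Beautiful Soup 4 (package bs4), using the stdlib-backed html.parser
    from bs4 import BeautifulSoup
    html = "<p><ul><li>Foo"
    soup = BeautifulSoup(html, "html.parser")
    # prettify() closes the open tags and indents one level per tag
    print(soup.prettify())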
    

    gets you

    <p>
     <ul>
      <li>
       Foo
      </li>
     </ul>
    </p>
    

    As far as I know, you can't stop prettify() from putting the <li> and </li> tags on separate lines from Foo.

    Using Tidy:

    import tidy
    html = "<p><ul><li>Foo"
    # show_body_only drops the <html>/<head>/<body> wrapper Tidy would add
    print(tidy.parseString(html, show_body_only=True))
    

    gets you

    <ul>
    <li>Foo</li>
    </ul>
    

    Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

    print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))
    

    comes out as

    <p></p>
    <ul>
    <li>Foo</li>
    </ul>
    

    Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

    Finally, Tidy can also do indenting:

    print(tidy.parseString(html, show_body_only=True, indent=True))
    

    gets you

    <ul>
      <li>Foo
      </li>
    </ul>
    

    All of these have their ups and downs, but hopefully one of them is close enough.

    some : The reason Tidy sees it as an empty element is that p elements are not allowed to contain ul elements.
    some : p elements can only contain inline elements like a, abbr, acronym, b, bdo, big, br, button, cite, code, del, dfn, em, i, img, input, ins, kbd, label, map, object, q, samp, script, select, small, span, strong, sub, sup, textarea, tt and var.
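
If an external library is not an option, the algorithm the question asks about can be sketched with a stack of open tags: push on every start tag, pop on every end tag, and close whatever is still open when the input ends. What follows is only a minimal sketch built on the standard-library html.parser module. TagCloser, close_tags and VOID_TAGS are names made up for this illustration and are not part of any library. It knows nothing about HTML content rules (such as p not being allowed to contain ul), it drops comments and doctypes, and it should not be treated as a security sanitizer.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit the input while tracking open tags on a stack (illustrative)."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the repaired document
        self.stack = []  # names of tags that are currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened after the matching start tag, innermost first.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def close_tags(html):
    """Return html with unclosed tags closed in correct nesting order."""
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Anything still on the stack was never closed; close innermost first.
    while parser.stack:
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)
```

Under these assumptions, close_tags("<p><ul><li>Foo") returns <p><ul><li>Foo</li></ul></p>, which is the result asked for in the question apart from whitespace. Wrongly nested input such as <b><i>x</b></i> comes back as <b><i>x</i></b>, because the stray </i> no longer has an opener on the stack.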
