Thursday, February 10, 2011

Stripping MS Word Tags Using Html Agility Pack

Hi Everyone,

I have a database with some text fields pasted from MS Word, and I'm having trouble stripping just the <div>, <font> and <span> tags while keeping their inner text.

I've tried using the Html Agility Pack (HAP), but I'm not heading in the right direction:

Public Function StripHtml(ByVal html As String, ByVal allowHarmlessTags As Boolean) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
    Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div|//font|//span")
    For Each node In invalidNodes
        node.ParentNode.RemoveChild(node, False)
    Next
    Return htmlDoc.DocumentNode.WriteTo()
End Function

This code selects the desired elements and removes them, but it discards their inner text along with the tags instead of keeping it.

Thanks in advance

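  • Well... I think I found a solution. Passing True as the second argument of RemoveChild (keepGrandChildren) keeps the removed node's children: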

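    ' Requires "Imports HtmlAgilityPack" at the top of the file
    Public Function StripHtml(ByVal html As String) As String
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)
        ' SelectNodes returns Nothing when no element matches the XPath
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div|//font|//span|//p")
        If invalidNodes IsNot Nothing Then
            For Each node As HtmlNode In invalidNodes
                ' True = keep the node's children (and their text) where the tag was
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
        Return htmlDoc.DocumentNode.WriteContentTo()
    End Function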
    

    I was almost there... :P

    From gjsduarte
