Regular Expression and Truncate HTML safely

Hi All,

I want to take an HTML string and truncate it to a set number of words (or characters) so I can show the start of it as a summary that links to the main article.

The main problem is that truncating inside one or more open elements leaves unclosed HTML tags in the summary.

The solution I have in mind is:

a) truncate the HTML to N words first (making sure we don't cut in the middle of a tag)
b) work through the opening tags in the truncated string, pushing each one onto a stack
c) for each closing tag, pop the stack and check that it matches
d) if any tags are still on the stack at the end, append their closing tags to the truncated string, innermost first, and the HTML should be good to go
e) you may know a much better way
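
Steps a) to d) can be folded into a single pass. The following is only a minimal sketch in C#, not a tested library: the names `HtmlSummary`, `TruncateHtml`, `VoidTags` and `Token` are made up for illustration, the list of void elements is deliberately partial, and the regular expression assumes reasonably tidy markup with no comments, scripts, or stray `<` characters in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlSummary
{
    // Elements that never take a closing tag, so they are not pushed on the stack.
    // Partial list, for illustration only.
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Matches either a whole tag (group 1: optional "/", group 2: tag name)
    // or a run of text between tags.
    private static readonly Regex Token =
        new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+");

    public static string TruncateHtml(string html, int wordCount)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // tags opened but not yet closed
        int remaining = wordCount;        // words we may still emit

        foreach (Match m in Token.Matches(html ?? String.Empty))
        {
            if (m.Groups[2].Success)
            {
                // A tag: copy it whole, so we never cut in the middle of one.
                string name = m.Groups[2].Value;
                bool closing = m.Groups[1].Value == "/";
                bool selfClosing = m.Value.EndsWith("/>", StringComparison.Ordinal)
                                   || VoidTags.Contains(name);

                if (closing)
                {
                    // Pop only when the closing tag matches the innermost open tag.
                    if (open.Count > 0 && String.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
                        open.Pop();
                }
                else if (!selfClosing)
                {
                    open.Push(name);
                }
                output.Append(m.Value);
                continue;
            }

            // A text run: count its words against the remaining budget.
            string[] words = m.Value.Split(new[] { ' ', '\t', '\r', '\n' },
                                           StringSplitOptions.RemoveEmptyEntries);
            if (words.Length <= remaining)
            {
                output.Append(m.Value);
                remaining -= words.Length;
                continue;
            }

            // Budget exceeded: keep what fits, mark the cut, and stop reading.
            if (Char.IsWhiteSpace(m.Value[0])) output.Append(' ');
            output.Append(String.Join(" ", words, 0, remaining)).Append("...");
            break;
        }

        // Close whatever is still open, innermost first.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');

        return output.ToString();
    }
}
```

For example, `HtmlSummary.TruncateHtml("<p>Hello <b>big wide</b> world</p>", 2)` returns `<p>Hello <b>big...</b></p>`. For untidy real-world markup, a proper HTML parser is likely to be more robust than a regular expression.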

So, has anyone done this before, or does anyone fancy having a go? I am looking at regular expressions, but it's slow going as a beginner.

Here's some existing code I use to truncate plain text by words; not sure whether it helps:

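    // Requires: using System;
    public static string TruncateWords(string text, int wordCount)
    {
        if (String.IsNullOrEmpty(text) || wordCount <= 0)
            return String.Empty;

        // Split on spaces, dropping empty entries caused by repeated spaces.
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Nothing to cut: return the text unchanged.
        if (words.Length <= wordCount)
            return text;

        // Keep the first wordCount words and mark the cut with an ellipsis.
        // (A loop with x <= wordCount would read one word too many and
        // run past the end of the array when the text is short.)
        return String.Join(" ", words, 0, wordCount) + "...";
    }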

Hope you can help, thanks,
Matt

Dennis Aries

HTML is closely related to XML (well-formed XHTML is XML), although most browsers tolerate missing closing tags and similar errors that an XML parser would reject.
http://www.codeproject.com/KB/cs/htmlparser.aspx#xx186545xx shows a way to parse HTML text (displaying it in a textbox). Using that, you can extract the title of the page and the start of the document, which you can use for your summary.

wickedw

ASKER

Thanks very much, will look into it, but in the meantime has anyone got an algorithm lying around for this?

Hairbrush

Hi, the approach I have taken for the same problem is to extract the InnerText of the root XML element. That gives you plain text, which you can then truncate with your existing algorithm.

The following VB.NET code can be used as an alternative to XPath to extract just the text:

    ' Requires: Imports System.Text.RegularExpressions
    Private Shared Function RemoveTags(ByVal Content As String) As String

        ' Remove anything that looks like an opening or closing tag.
        ' Regex.Replace returns the input unchanged when nothing matches,
        ' so no separate Match test is needed.
        Return Regex.Replace(Content, "<[a-zA-Z\/][^>]*>", "")

    End Function
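
Since the asker's code is C#, here is a rough C# sketch of the same strip-then-truncate idea. The names `PlainTextSummary` and `Summarize` are hypothetical. It reuses the regular expression from the VB.NET function above, but replaces each tag with a space rather than an empty string (an assumption on my part) so that words on either side of `</p><p>` do not run together. HTML entities such as `&amp;` are left undecoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainTextSummary
{
    // Same pattern as the VB.NET RemoveTags function.
    private static readonly Regex Tag = new Regex(@"<[a-zA-Z\/][^>]*>");

    public static string Summarize(string html, int wordCount)
    {
        if (String.IsNullOrEmpty(html) || wordCount <= 0)
            return String.Empty;

        // Swap each tag for a space, then split on any whitespace.
        string text = Tag.Replace(html, " ");
        string[] words = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                    StringSplitOptions.RemoveEmptyEntries);

        // Keep at most wordCount words; add an ellipsis only if something was cut.
        int take = Math.Min(wordCount, words.Length);
        string summary = String.Join(" ", words, 0, take);
        return words.Length > wordCount ? summary + "..." : summary;
    }
}
```

The result is plain text, so there are no tags left to close, at the cost of losing any formatting in the summary.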

ASKER CERTIFIED SOLUTION

wickedw
