wickedw
asked on
Regular Expression and Truncate HTML safely
Hi All,
I want to take an HTML string and truncate it to a set number of words (or even a character length) so I can show the start of it as a summary, which then links to the main article.
The main issue I have to solve is that we may well end up with unclosed HTML tags if we truncate inside one or more open elements!
The idea I have for a solution is to:
a) truncate the HTML to N words first (ensuring we don't truncate in the middle of a tag)
b) work through the opening HTML tags in this truncated string, pushing them onto a stack as I go
c) then work through the closing tags and ensure they match the ones on the stack as I pop them off
d) if any open tags are left on the stack after this, write their closing tags to the end of the truncated string, and the HTML should be good to go
e) you may know a much better way
So, does anyone fancy this, or has anyone done this before? I am looking at regular expressions, but it's very slow going as a beginner with them.
Here's some existing code I use to truncate words for basic text; not sure if this helps:
public static string TruncateWords(string text, int wordCount)
{
    if (string.IsNullOrEmpty(text)) return String.Empty;

    string[] words = text.Split(' ');
    // Clamp the count so we never read past the end of the array.
    int take = Math.Max(0, Math.Min(wordCount, words.Length));
    // Join exactly the first 'take' words, with no trailing space.
    string output = String.Join(" ", words, 0, take);
    // Add an ellipsis only when words were actually dropped.
    if (words.Length > take)
        output += "...";
    return output;
}
Hope you can help, thanks,
Matt
ASKER
Thanks very much, will look into it, but in the meantime has anyone got an algorithm lying around for this?
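Steps a) to d) from the question can be sketched in C# roughly as follows. This is a minimal sketch, not a tested library: `HtmlSummary`, `TruncateHtml`, and the `VoidTags` list are hypothetical names made up for illustration, and the regex tokenizer assumes reasonably well-formed markup (no `>` inside attribute values, no comments, scripts, or CDATA). For messy or untrusted input, a real HTML parser would be a safer base.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlSummary
{
    // Elements that never take a closing tag, so they are never pushed on the stack.
    // (Illustrative subset only, not the full list of void elements.)
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Matches either one whole tag (group "name" = element name) or one run of text.
    private static readonly Regex Token =
        new Regex(@"<(?<close>/)?(?<name>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+");

    public static string TruncateHtml(string html, int wordCount)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // currently open elements, innermost on top
        int remaining = Math.Max(0, wordCount);

        foreach (Match m in Token.Matches(html))
        {
            if (m.Groups["name"].Success)
            {
                string name = m.Groups["name"].Value;
                if (m.Groups["close"].Success)
                {
                    // Closing tag: keep it only if it matches the innermost open element.
                    if (open.Count == 0 || !open.Peek().Equals(name, StringComparison.OrdinalIgnoreCase))
                        continue; // stray or mismatched closing tag is dropped
                    open.Pop();
                }
                else if (!m.Value.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
                {
                    open.Push(name);
                }
                // Tags are copied whole, so we never cut in the middle of one.
                output.Append(m.Value);
                continue;
            }

            // Text run: count its words against the remaining budget.
            string[] words = m.Value.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length <= remaining)
            {
                output.Append(m.Value);
                remaining -= words.Length;
                continue;
            }

            // Budget exceeded: keep only the words that still fit, then stop.
            if (remaining > 0 && char.IsWhiteSpace(m.Value[0]))
                output.Append(' '); // preserve the gap after a preceding tag
            output.Append(string.Join(" ", words, 0, remaining)).Append("...");
            break;
        }

        // Step d: close whatever is still open, innermost first.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');

        return output.ToString();
    }
}
```

For example, `HtmlSummary.TruncateHtml("Hello <b>big wide</b> world", 2)` gives `Hello <b>big...</b>`. Because tags are copied as whole tokens and only text runs are counted, the cut can never land inside a tag, and the stack supplies the missing closing tags in one pass rather than in a separate step.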
Hi, the approach I have taken for the same problem is to extract the inner text of the root XML element, which gives you plain text that you can then truncate with your existing algorithm.
The following VB.NET code can be used as an alternative to XPath to extract just the text:
Private Shared Function RemoveTags(ByVal Content As String) As String
    ' Requires: Imports System.Text.RegularExpressions
    ' Remove anything between angle brackets that looks like an opening or closing tag.
    ' Regex.Replace returns the input unchanged when nothing matches,
    ' so no separate Match check is needed.
    Return Regex.Replace(Content, "<[a-zA-Z/][^>]*>", "")
End Function
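For anyone working in C# like the asker, the strip-then-truncate idea might look something like the sketch below. `PlainTextSummary` and `Summarize` are hypothetical names for illustration. Two choices here are assumptions that differ from the VB.NET code above: tags are replaced with a space rather than an empty string, so that text from adjacent blocks such as `<p>one</p><p>two</p>` does not run together (at the cost of splitting a word that has a tag in the middle of it), and entities such as `&amp;` are decoded before counting words.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextSummary
{
    public static string Summarize(string html, int wordCount)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Replace every opening or closing tag with a space.
        string text = Regex.Replace(html, @"<[a-zA-Z/][^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Split on any whitespace and drop the empty entries left by the tag replacement.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        int take = Math.Max(0, Math.Min(wordCount, words.Length));
        string summary = string.Join(" ", words, 0, take);

        // Add an ellipsis only when words were actually dropped.
        return words.Length > take ? summary + "..." : summary;
    }
}
```

For example, `PlainTextSummary.Summarize("<p>Hello <b>big</b> wide world</p>", 2)` gives `Hello big...`. The result is plain text with all formatting lost, so it would need to be HTML-encoded again (for instance with `WebUtility.HtmlEncode`) before being written back into a page.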
ASKER CERTIFIED SOLUTION
http://www.codeproject.com