
Regular Expression and Truncate HTML safely

wickedw asked
Last Modified: 2012-05-08
Hi All,

I want to take an HTML string and truncate it to a set number of words (or even a character length) so I can show the start of it as a summary, which then links to the main article.

The main problem to solve is that we may end up with unclosed HTML tags if we truncate inside one or more open elements!

The solution I have in mind is:

a) truncate the HTML to N words first (making sure we don't truncate in the middle of a tag!)
b) work through the opening HTML tags in this truncated string (push them on a stack as I go?)
c) then work through the closing tags, checking that each one matches the tag popped off the stack
d) if any open tags are left on the stack after this, write their closing tags to the end of the truncated string and the HTML should be good to go!
e) you may know a much better way
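
Steps a) to d) can be sketched in a single pass in C#. This is only an illustration under some assumptions: the names HtmlTruncator, TruncateHtml, Token and VoidTags are made up for the example, and the regular expression is a rough tokenizer, not a real HTML parser. It will miscount around comments, script blocks, and attribute values that contain ">", and a word that is split by a tag counts as two words.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlTruncator
{
    // Elements that never take a closing tag, so they are not pushed on the stack.
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "area", "base", "br", "col", "embed", "hr", "img", "input",
          "link", "meta", "source", "track", "wbr" };

    // Matches a whole tag (group "name" is set), a run of text, or a stray "<".
    private static readonly Regex Token = new Regex(
        @"<(?<close>/)?(?<name>[a-zA-Z][a-zA-Z0-9]*)[^>]*?(?<self>/)?>|[^<]+|<");

    public static string TruncateHtml(string html, int wordCount)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // currently open elements
        int words = 0;
        bool truncated = false;

        foreach (Match m in Token.Matches(html ?? string.Empty))
        {
            if (m.Groups["name"].Success)
            {
                // Tags are always copied whole, so we never cut inside one.
                output.Append(m.Value);
                string name = m.Groups["name"].Value;
                if (m.Groups["close"].Success)
                {
                    // Pop only when the closing tag matches the innermost open tag.
                    if (open.Count > 0 && string.Equals(
                            open.Peek(), name, StringComparison.OrdinalIgnoreCase))
                        open.Pop();
                }
                else if (!m.Groups["self"].Success && !VoidTags.Contains(name))
                {
                    open.Push(name);
                }
                continue;
            }

            // Text run: count words until the limit is reached.
            int keep = 0;
            foreach (Match w in Regex.Matches(m.Value, @"\S+"))
            {
                if (words >= wordCount) { truncated = true; break; }
                words++;
                keep = w.Index + w.Length;
            }
            if (truncated)
            {
                output.Append(m.Value.Substring(0, keep)).Append("...");
                break;
            }
            output.Append(m.Value);
        }

        // Close whatever is still open, innermost first.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append(">");
        return output.ToString();
    }
}
```

For example, TruncateHtml("<p>Hello <b>big wide</b> world</p>", 2) returns "<p>Hello <b>big...</b></p>". Because the stack is maintained while truncating, steps b) and c) do not need a second pass over the output. For untrusted or messy markup, a proper HTML parser would be a safer tokenizer than the regular expression used here.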

So, does anyone fancy this, or has anyone done it before? I am looking at regular expressions, but it's very slow going as a beginner with them.

Here's some existing code I use to truncate words in plain text, not sure if it helps:

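 public static string TruncateWords(string text, int wordCount)
    {
        if (string.IsNullOrEmpty(text) || wordCount <= 0)
            return String.Empty;

        // Split on spaces, dropping the empty entries that repeated spaces produce.
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Short enough already: return all the words without an ellipsis.
        if (words.Length <= wordCount)
            return String.Join(" ", words);

        // Keep exactly the first wordCount words and mark the cut.
        // (The earlier loop used x <= wordCount, which read one word too many
        // and ran past the end of the array when the text was short.)
        return String.Join(" ", words, 0, wordCount) + "...";
    }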

Hope you can help, thanks,
Matt

Dennis Aries, CEO @ Arkro IT

Commented:
A well-formed HTML file (XHTML) can be treated as a special form of XML file. Ordinary HTML often cannot, because most browsers tolerate missing closing tags and the like.
http://www.codeproject.com/KB/cs/htmlparser.aspx#xx186545xx shows a way to parse HTML text (displaying it in a textbox). Using that, you can extract the title of the page and the start of the document, which you can use for your summary.

Author

Commented:
Thanks very much, will look into it, but in the meantime has anyone got an algorithm lying around for this?



Hi, the approach I have taken for the same thing is to extract the inner text of the root XML tag. That gives you plain text, which you can then truncate with your existing algorithm.

The following VB.NET code can be used as an alternative to XPath to extract just the text:
    ' Requires: Imports System.Text.RegularExpressions
    Private Shared Function RemoveTags(ByVal Content As String) As String

        ' Remove anything that looks like an opening or closing tag.
        ' Regex.Replace returns the input unchanged when nothing matches,
        ' so no separate Match check is needed first.
        Return Regex.Replace(Content, "<[a-zA-Z\/][^>]*>", "")

    End Function
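
For comparison with the asker's C# code, here is a rough sketch of the same plain-text approach: strip the tags, then cut to N words. The names PlainSummary and Summarize are invented for the example. It replaces each tag with a space (an assumption on my part, so that "</p><p>" does not glue two words together) and decodes entities with System.Net.WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainSummary
{
    public static string Summarize(string html, int wordCount)
    {
        // Same tag pattern as RemoveTags above, but replaced with a space.
        string text = Regex.Replace(html ?? string.Empty, @"<[a-zA-Z/][^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Passing a null separator array splits on any whitespace.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        int count = Math.Max(0, wordCount);
        if (words.Length <= count)
            return string.Join(" ", words);

        return string.Join(" ", words, 0, count) + "...";
    }
}
```

For example, Summarize("<p>Hello <b>big</b> wide world</p>", 3) returns "Hello big wide...". The result is plain text with all formatting lost, so it should be HTML-encoded again (for instance with WebUtility.HtmlEncode) before being written into a page.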
