Need RegEx: shrink string without breaking HTML rules.

Posted on 2007-08-11
Last Modified: 2012-06-27
I am looking for a regex to shorten a string. The problem is that the string might contain HTML tags. I don't want to strip all HTML from the string; I just need to shorten it so that the result is still valid HTML.

Additionally, I don't want to cut in the middle of a word, so the cut needs to move back to the previous space or line break in the string.

Can someone help me with this or point to a page with matching regex?
Question by:Smoerble
    LVL 62

    Expert Comment

    by:Fernando Soto
    To your question, "I am looking for a regex to shrink a string. The problem is, the string might contain HTML tags. I don't want to kill all HTML from the string, I just need to shrink the string so, that the result is valid HTML."

    Can you give an example of a string and how it should be shortened? Where in the string do you want to delete characters?
    LVL 16

    Expert Comment

    I thought about this problem a few weeks ago, and decided that it was very messy territory.

    For a start, it's hard to work out the length of a string when that string contains HTML tags. You have to count the length of the string, then subtract the length of the tags to get the length that will display on the page. That's a programmatic process that I don't think can be done in regular expressions alone.
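A minimal sketch of the "count, then subtract the tags" idea, written in Python only because the thread does not settle on a language. The names `TAG` and `visible_length` are made up for this illustration, and the tag pattern is deliberately naive: it assumes no `>` appears inside attribute values, comments, or scripts.

```python
import html
import re

# Naive tag pattern: "<", anything that is not ">", then ">".
TAG = re.compile(r"<[^>]*>")

def visible_length(markup: str) -> int:
    """Approximate number of characters a browser would display."""
    text = TAG.sub("", markup)       # drop the tags
    text = html.unescape(text)       # "&amp;" counts as one character
    return len(text)

print(visible_length("<b>Hello</b> &amp; bye"))
```

The regex only finds the tags; the counting itself is done by the surrounding code, which matches the point made above.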

    In the end I just stripped out HTML to avoid the rather big task of working around possible corruption of HTML elements and HTML entities. (If you break an entity in half, you make XML invalid, and risk seeing HTML render incorrectly.)
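For readers who do not want to strip the HTML, one possible alternative is to walk the markup with a parser instead of a regex, so that tags and entities are never cut in half. The sketch below uses Python's standard `html.parser` module; the language choice and the names `Truncator`, `truncate_html`, and `VOID` are assumptions for illustration, not something proposed in the thread. It counts each entity as one visible character, backs up to the previous plain space when the limit falls inside a word, and closes any elements still open at the cut. Comments and declarations are simply dropped.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

class Truncator(HTMLParser):
    def __init__(self, limit):
        # convert_charrefs=False so entities arrive as separate events
        super().__init__(convert_charrefs=False)
        self.limit = limit        # visible characters still allowed
        self.out = []             # output fragments
        self.open_tags = []       # stack of currently open elements
        self.done = False         # True once the limit has been reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching element.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.limit:
            self.out.append(data)
            self.limit -= len(data)
            return
        cut = data[:self.limit]
        if not data[self.limit].isspace():
            # The limit falls inside a word: back up to the last space.
            cut = cut[:cut.rfind(" ") + 1]
        self.out.append(cut.rstrip())
        self.done = True

    def _entity(self, text):
        if self.done:
            return
        if self.limit >= 1:
            self.out.append(text)     # keep the entity whole
            self.limit -= 1
        else:
            self.done = True

    def handle_entityref(self, name):
        self._entity(f"&{name};")

    def handle_charref(self, name):
        self._entity(f"&#{name};")

def truncate_html(markup, limit):
    parser = Truncator(limit)
    parser.feed(markup)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing

print(truncate_html("<p>Hello <b>big wide</b> world</p>", 12))
```

This is a simplification: a word that spans several elements is not detected, and an element can end up empty if the cut removes all of its text.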

    Author Comment

    Hackney: good points, thanks.

    After discussing your input, we found an approach that is both logically and visually correct and that we want to implement. For this we need several regexes:

    1) count all characters inside HTML tags (including the < and >)
    2) count all characters OUTSIDE HTML tags
    3) find the full string from the <table> tag to the </table> tag and replace it with a completely new string
    4) From a starting point, find the NEXT sentence end (colon, semicolon, period, etc.).

    In addition, we need something more complex, for which I will open a separate question. I hope someone can help me with these four tasks.

    Author Comment

    No help on these tasks?
    LVL 16

    Expert Comment

    The problem is, you can't count characters with regular expressions. You'd have to be using a scripting language, such as Perl or PHP for that.
    LVL 84

    Accepted Solution

    If you can strip out the HTML, then counting the characters before and after will give you 1 and 2.
    you can count characters outside of <> with something like
    but you may need something more complicated for things like
     <IMG SRC = "foo.gif" ALT = "A > B">
    <!-- <A comment> -->
     <script>if (a<b && a>c)</script>
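A hedged sketch of a pattern that copes with the three awkward cases listed above, again in Python as an assumed language. The names `TOKEN` and `count_inside_outside` are hypothetical. It treats a whole comment and a whole script element (including the code inside it) as markup, which is an assumption about what should count as "inside".

```python
import re

TOKEN = re.compile(
    r"""
      <!--.*?-->                                  # comment, may contain < and >
    | <script\b.*?</script\s*>                    # script element with its code
    | <[a-zA-Z/!](?:"[^"]*"|'[^']*'|[^>"'])*>     # tag, quoted values may hold >
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

def count_inside_outside(markup):
    """Return (characters inside markup, characters outside markup)."""
    inside = sum(len(m.group()) for m in TOKEN.finditer(markup))
    return inside, len(markup) - inside

sample = 'x<IMG SRC = "foo.gif" ALT = "A > B">yz'
print(count_inside_outside(sample))
```

A bare `<` followed by a space, as in `1 < 2`, is not treated as a tag and therefore counts as outside text. Malformed markup (an unclosed quote or comment) will still confuse the pattern.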

    s#<table>.*?</table>#complete new string#s
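The same substitution can be sketched in Python (an assumed language; `TABLE` and `replace_tables` are illustrative names). Compared with the one-liner above, this variant also accepts attributes on the opening tag and ignores case. Like any non-greedy pattern, it stops at the first closing tag, so nested tables are not handled.

```python
import re

# DOTALL lets "." cross line breaks, like the trailing "s" flag in Perl.
TABLE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.DOTALL | re.IGNORECASE)

def replace_tables(markup, replacement):
    # A function is used as the replacement so that backslashes in
    # `replacement` are inserted literally instead of being interpreted.
    return TABLE.sub(lambda match: replacement, markup)

page = 'a<TABLE border="1">\n<tr><td>x</td></tr>\n</table>b'
print(replace_tables(page, "[table]"))
```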

    You may want some context around that so you don't match the . in numbers or abbreviations.
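One way to add such context, sketched in Python under stated assumptions: a punctuation mark only counts as a sentence end when it is followed by whitespace or the end of the text, which skips the dot in `3.14`, and a small abbreviation list is checked separately. The names `SENTENCE_END`, `ABBREVIATIONS`, and `next_sentence_end`, as well as the contents of the list, are hypothetical.

```python
import re

# Sentence punctuation followed by whitespace or the end of the text.
SENTENCE_END = re.compile(r"[.:;!?](?=\s|$)")

# Illustrative list only; a real one would depend on the content.
ABBREVIATIONS = ("e.g.", "i.e.", "Dr.", "Mr.")

def next_sentence_end(text, start=0):
    """Index just past the next sentence end at or after `start`, or -1."""
    for match in SENTENCE_END.finditer(text, start):
        if text[:match.end()].endswith(ABBREVIATIONS):
            continue                  # the dot belongs to an abbreviation
        return match.end()
    return -1

text = "Pi is 3.14 e.g. roughly. Next"
print(text[:next_sentence_end(text)])
```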
