Smoerble asked:

RegEx to shrink text with HTML tags

I have a string that contains a lot of different HTML tags. I want to shorten it to, let's say, a maximum of 400 characters. The issue I have is:
I need to make sure that I don't leave any HTML tags open. So let's say the text ends with
---
<a href=".....">whatever</a>
---
 and the cap would be in the middle of the word "whatever", then I can't have
---
<a href=".....">whatev...
---
at the end of the string.

So I need a regular expression that does the following:
1) find end of words where I want to do the cut.
2) If an HTML tag would stay open there, then do the cut BEFORE the last HTML tag.

Can anybody help?
Scripting Languages, Programming

ozo

>([^>]{0,400})[^<]*
ozo

>([^<]{0,400})[^<]*
abel

Ozo, your regex captures a string that starts after the ending bracket of a tag, takes up to 400 chars, and only makes sure there's no closing bracket in the capture. Then it throws the string away up to the next opening bracket.
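
To make that concrete, here is a minimal Python sketch that runs ozo's first pattern against a made-up sample string (the sample is an assumption, not from the thread). It only illustrates the behaviour described above; it is not a fix.

```python
import re

# ozo's first pattern, copied from the post above.
pattern = re.compile(r">([^>]{0,400})[^<]*")

# Hypothetical sample input, not taken from the thread.
html = '<p>Hello <a href="x">whatever</a> world</p>'

match = pattern.search(html)
captured = match.group(1)

# The capture stops at the first ">" it meets, which here is the one
# that ends the <a ...> start tag, so the result ends inside a tag.
print(captured)
```

For this sample the capture is `Hello <a href="x"`, i.e. it ends in the middle of a start tag, which is exactly the kind of crippled output the OP wants to avoid.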

The OP asks for no break inside the middle of a word and the text can have any number of tags, opening and closing, but must not break layout (i.e., must not end up with crippled tags).

This is hardly a trivial task. Consider <b>bold text with <img ...> images and <br> newlines</b>. The regex needs to understand that <br> and <img> do not need to be scanned for a matching ending tag, but <b> does. So, the regex can "break in the middle" of <br> and <img> but not "in the middle" of <b>...</b>.

That's at least my understanding of the question. I wonder what would happen if a single opening/closing tag pair contains more than 400 chars (e.g., a table or a div).

I'm sorry I am not posting a solution (yet) because I think it is not easily (or at all) possible with a simple regex. A parsing language like XSLT (available in PHP) might be an easier means to tame this beast :), but only when this is about XHTML and not just HTML.

Regards,
-- Abel --
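
For readers who want to try the parser route rather than a regex, the following is a rough Python sketch using the standard library's `html.parser`. It is not the accepted solution from this thread; the names `Truncator` and `truncate_html`, the `...` marker, and the choice to count only text characters (not markup) towards the limit are all assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (the <br> / <img> case).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Truncator(HTMLParser):
    """Copy HTML into self.out until `limit` text characters were emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0     # text characters emitted so far (tags are free)
        self.out = []      # output pieces, joined at the end
        self.stack = []    # (tag name, index of its start tag in self.out)
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append((tag, len(self.out) - 1))

    def handle_endtag(self, tag):
        if self.done:
            return
        # Only well-nested end tags are honoured; stray ones are dropped.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        room = self.limit - self.count
        if len(data) <= room:
            self.out.append(escape(data, quote=False))
            self.count += len(data)
            return
        cut = room
        if not data[room].isspace():
            # The cap falls inside a word: back up to the previous whitespace.
            while cut > 0 and not data[cut - 1].isspace():
                cut -= 1
        head = data[:cut].rstrip()
        if head:
            self.out.append(escape(head, quote=False))
        else:
            # Nothing fits: remove start tags that would otherwise stay empty,
            # i.e. cut BEFORE the last tag, as the question asks.
            while self.stack and self.stack[-1][1] == len(self.out) - 1:
                self.out.pop()
                self.stack.pop()
        self.out.append("...")
        self.done = True


def truncate_html(html, limit=400):
    parser = Truncator(limit)
    parser.feed(html)
    parser.close()
    # Close whatever is still open, innermost first.
    closing = "".join(f"</{tag}>" for tag, _ in reversed(parser.stack))
    return "".join(parser.out) + closing


sample = '<p>Hello <a href="x">whatever</a> world</p>'
print(truncate_html(sample, 12))   # cap lands inside "whatever"
```

With the cap inside "whatever", the sketch drops the `<a>` start tag and returns `<p>Hello ...</p>`. It does nothing clever about structural containers such as tables or lists, so the concern about very long tag pairs still stands.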
abel

Last time, then I give up...

This is hardly a trivial task. Consider 
 
<b>bold text with <img ...> images and <br> newlines</b>. 
 
The regex needs to understand that <br> and <img>
do not need to be scanned for a matching ending 
tag, but <b> does. So, the regex can "break in the middle"
 of <br> and <img> but not "in the middle" of <b>...</b>

Smoerble

ASKER

Hello abel,
thanks for your answer. You pointed out a problem I did not think of before: what happens if the whole string starts with a tag that ends at the end?

So I think the best solution would be to "kill" all unclosed HTML tags and leave the rest as is... still the question, how to do it?
abel

Can you elaborate a bit on the "kill" bit? What would that look like? Surely this may well end up as a very crippled bit of HTML (e.g., when a "tr" is killed but the "td" is not, or when the "html" is killed and the header tags are not, let alone lists and select boxes...).
abel

Btw: the bug with the RTE editor is being researched by the EE team.
abel

Something that didn't work out for you, I see, considering the layout you made :)
Smoerble

ASKER

@abel:
good point... I start to think that it might be better to leave only and  tags and strip the rest totally... will think about it a little.
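
The "keep a few tags, strip the rest" idea could look roughly like the Python sketch below. The tag names in the post above were lost to the formatter, so `ALLOWED = {"b", "i"}` is purely a placeholder assumption, as are the names `TagFilter` and `strip_tags`. Attributes on kept tags are dropped in this sketch to keep the output predictable.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the real tag names are unknown; "b" and "i" are placeholders.
ALLOWED = {"b", "i"}


class TagFilter(HTMLParser):
    """Keep text and whitelisted tags, drop every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")   # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text was unescaped by the parser, so escape it again on output.
        self.out.append(escape(data, quote=False))


def strip_tags(html):
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


stripped = strip_tags('<p>Hello <b>bold</b> <a href="x">link</a></p>')
print(stripped)
```

For the sample input this yields `Hello <b>bold</b> link`. Once only simple inline tags remain, shortening the text without leaving a tag open becomes a much smaller problem, though the result still has to be checked for unclosed whitelisted tags after the cut.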
ASKER CERTIFIED SOLUTION
abel