• Status: Solved
  • Priority: Medium
  • Security: Public

Need RegEx: shrink string without breaking HTML rules.

I am looking for a regex to shrink a string. The problem is that the string might contain HTML tags. I don't want to strip all HTML from the string; I just need to shrink the string so that the result is valid HTML.

Additionally, I would like not to cut in the middle of a word, so I need to shrink it "back" to the nearest preceding space or line break in the string.

Can someone help me with this or point me to a page with a matching regex?
Thanks
Asked by: Smoerble
 
Fernando Soto (Retired) Commented:
To your question, "I am looking for a regex to shrink a string. The problem is that the string might contain HTML tags. I don't want to strip all HTML from the string; I just need to shrink the string so that the result is valid HTML."

Can you give an example of a string and show how it should be shortened? Where in the string do you want to delete characters?
 
HackneyCab Commented:
I thought about this problem a few weeks ago, and decided that it was very messy territory.

For a start, it's hard to work out the length of a string when that string contains HTML tags. You have to count the length of the string, then subtract the length of the tags to get the length that will display on the page. That's a programmatic process that I don't think can be done in regular expressions alone.
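
A minimal sketch of that arithmetic, written in Python purely for illustration (no language is named in the question). The `TAG` pattern and the function name `tag_and_text_lengths` are assumptions made for this example, and the pattern is deliberately naive: it treats everything from `<` to the next `>` as a tag and counts an entity such as `&amp;` as five characters.

```python
import re
from typing import Tuple

# Naive tag pattern: "<", then anything up to the next ">".
# Assumption: no ">" appears inside attribute values, comments, or scripts.
TAG = re.compile(r"<[^>]*>")


def tag_and_text_lengths(html: str) -> Tuple[int, int]:
    """Return (characters inside tags, characters outside tags)."""
    inside = sum(len(m.group()) for m in TAG.finditer(html))
    # Everything that is not part of a tag is counted as displayed text.
    return inside, len(html) - inside


inside, outside = tag_and_text_lengths("<p>Hello <b>world</b></p>")
print(inside, outside)  # 14 11
```

The regex only finds the tags; the counting and the subtraction happen in the surrounding code, which is the "programmatic process" described above.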

In the end I just stripped out HTML to avoid the rather big task of working around possible corruption of HTML elements and HTML entities. (If you break an entity in half, you make XML invalid, and risk seeing HTML render incorrectly.)
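
For readers who do want to keep the markup, here is one possible sketch of the "big task", in Python and using the standard library's `html.parser` rather than a regex alone. It is not what anyone in this thread implemented. The class `Truncator`, the function `truncate_html`, and the `VOID_TAGS` list are hypothetical names for this example. It assumes reasonably well-formed input, copies tags through unchanged, counts each entity as one visible character so it is never split, backs up to the previous whitespace when the cut would land inside a word, and closes any tags still open at the cut.

```python
import re
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Truncator(HTMLParser):
    """Copy HTML to `out` until `limit` visible characters are used up."""

    def __init__(self, limit):
        # convert_charrefs=False reports entities separately, so they stay whole.
        super().__init__(convert_charrefs=False)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the innermost open tag is closed; stray end tags are dropped.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def _entity(self, text):
        if self.done:
            return
        if self.remaining >= 1:
            self.out.append(text)   # one visible character, emitted whole
            self.remaining -= 1
        else:
            self.done = True

    def handle_entityref(self, name):
        self._entity(f"&{name};")

    def handle_charref(self, name):
        self._entity(f"&#{name};")

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.remaining:
            self.out.append(data)
            self.remaining -= len(data)
            return
        cut = data[:self.remaining]
        if not data[self.remaining].isspace():
            # The cut lands inside a word: back up to the last whitespace.
            m = re.search(r"\s\S*\Z", cut)
            cut = cut[:m.start()] if m else ""
        self.out.append(cut.rstrip())
        self.done = True


def truncate_html(html, limit):
    parser = Truncator(limit)
    parser.feed(html)
    parser.close()
    # Close whatever was still open at the cut, innermost first.
    closing = "".join(f"</{t}>" for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = "<p>Hello <b>wonderful world</b> &amp; more</p>"
print(truncate_html(sample, 18))  # <p>Hello <b>wonderful</b></p>
```

Known gaps in this sketch: text inside `<script>` or `<style>` is counted as visible, and a cut right at the start of a text node can leave an empty element such as `<b></b>` behind.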
 
Smoerble (Author) Commented:
Hackney: good points, thanks.

After discussing your input, we found a logically and visually correct approach that we want to implement. For this we need several regexes:

1) count all characters inside HTML tags (including the < and >)
2) count all characters OUTSIDE HTML tags
3) find the full string from the <table> tag to the </table> tag and replace it with a completely new string
4) from a starting point, find the NEXT sentence end (colon, semicolon, period, etc.)

In addition, we need something more complex, for which I will open a separate question. I hope someone can help me with these 4 tasks.
 
Smoerble (Author) Commented:
No help on these tasks?
 
HackneyCab Commented:
The problem is that a regular expression on its own only matches text; it doesn't count characters. You'd have to use a scripting language, such as Perl or PHP, for that.
 
ozo Commented:
If you can strip out the HTML, then counting the characters before and after will give you 1 and 2.
You can count characters outside of <> with something like
$count=()=/(?:\G|>|^)([^<>])/g;
but you may need something more complicated for things like
 <IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
 <script>if (a<b && a>c)</script>
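
One way to cope with those cases is to let an HTML parser decide what counts as a tag instead of a single pattern. The sketch below is in Python and uses the standard library's `html.parser`, which is not what the Perl one-liner above does. `VisibleTextCounter` and `visible_length` are hypothetical names for this example. It assumes that comments and the contents of `<script>` and `<style>` should not be counted, and it counts an entity such as `&amp;` as one character.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count characters of text outside tags, comments, scripts, and styles."""

    def __init__(self):
        # convert_charrefs=True turns "&amp;" into "&" before handle_data sees it.
        super().__init__(convert_charrefs=True)
        self.count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.count += len(data)

    # Comments arrive in handle_comment, whose default does nothing,
    # so they are never counted.


def visible_length(html):
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.count


tricky = ('a<IMG SRC = "foo.gif" ALT = "A > B">b'
          '<!-- <A comment> -->c'
          '<script>if (a<b && a>c)</script>d')
print(visible_length(tricky))  # 4
```

The number of characters inside markup (task 1) is then `len(html) - visible_length(html)`, as long as the input contains no entities; with entities the two counts no longer add up to the raw length.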

s#<table>.*?</table>#complete new string#s;
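
The same substitution sketched in Python, for illustration only. `TABLE` and `replace_tables` are hypothetical names. Two things are added as assumptions beyond the one-liner above: the opening tag may carry attributes, and matching ignores case. Like the original, it is non-greedy, so with nested tables it stops at the first closing tag.

```python
import re

# "<table", optional attributes, then the shortest run up to "</table>".
# DOTALL lets "." cross line breaks, like the /s flag in the Perl version.
TABLE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def replace_tables(html, replacement):
    """Replace every <table>...</table> block with `replacement`."""
    # A function is used so backslashes in `replacement` are taken literally.
    return TABLE.sub(lambda match: replacement, html)


page = 'before<TABLE border="1">\n<tr><td>x</td></tr>\n</table>after'
print(replace_tables(page, "[table]"))  # before[table]after
```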

/\G.*?[:;.]/
You may want some context around that so you don't match the . in numbers or abbreviations.
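
A sketch of what such context could look like, in Python for illustration. `SENTENCE_END`, `ABBREVIATIONS`, and `next_sentence_end` are hypothetical names, and the abbreviation list is a tiny made-up sample, not a complete one. The assumption is that a sentence mark only counts when it is followed by whitespace or the end of the string, which already skips the dot in `3.14`, and when the word it ends is not a known abbreviation.

```python
import re

# A sentence mark counts only when followed by whitespace or end of string.
SENTENCE_END = re.compile(r"[:;.!?](?=\s|\Z)")

# Illustrative sample only; a real list would be much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Mr.", "Dr."}


def next_sentence_end(text, start=0):
    """Return the index just past the next sentence end, or -1 if none."""
    for match in SENTENCE_END.finditer(text, start):
        end = match.end()
        # The whitespace-delimited word that this mark terminates.
        word_start = max(text.rfind(" ", 0, end), text.rfind("\n", 0, end)) + 1
        if text[word_start:end] in ABBREVIATIONS:
            continue  # e.g. the final dot of "e.g."
        return end
    return -1


text = "Pi is about 3.14, e.g. in math. Next sentence."
cut = next_sentence_end(text)
print(text[:cut])  # Pi is about 3.14, e.g. in math.
```

Passing the previous result as `start` continues the search from there, which plays the role of `\G` in the pattern above.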