Strip HTML, preserving img tags and their parent paragraphs

MichaelEvangelista
I'm using ColdFusion to parse the content of a blog feed into a web page, truncating each entry after a specific number of characters.
I have a great function that turns all of the junk HTML into plain text, so my page doesn't break when I truncate the text in the middle of an open <div>, but then I lose the images.

I've pasted the function below.
What I'd like to do is modify this so that all fully closed <img> tags are left completely alone, and if the <img> is inside a fully closed <p>, that <p> is left alone too.

By fully closed, I mean that since I will be passing in a truncated block of HTML, the function should make sure any <img> and <p> tags being preserved have matching end tags, or else they should be stripped out anyway.

Thanks in advance. Here's the current function.
You'll see it replaces a list of specific tags with a new line in the text, and then removes all remaining <*> HTML tags completely.

I believe the regex we are looking to modify is in that last line:
   theText = REReplaceNoCase(theText,"<[^>]+>","","all");

<cfscript>
/**
 * strips html out of text, replaces paragraphs with line breaks, adds text versions of links
 * @param theHtml 	 HTML you wish to render to text. (Required)
 * @return Returns a string. 
 */
function htmlToText(theHtml)
{  // local variables are var-scoped so they don't leak into the page's variables scope
   var newP = chr(13) & chr(10) & '-' & chr(13) & chr(10);
   var theText = "";
   // turn <a href="url">label</a> into "label: url"; [^>]* keeps the match inside a single tag
   theText = REReplaceNoCase(arguments.theHtml,"<a\s[^>]*href=['""]?([^'"" >]+)['""]?[^>]*>([^<]+)</a>","\2: \1","all");
   // line breaks, headings and paragraphs all become the newP separator
   theText = REReplaceNoCase(theText,"<br\s*/?>",newP,"all");
   theText = REReplaceNoCase(theText,"<h[1-5][^>]*>",newP,"all");
   theText = REReplaceNoCase(theText,"<p(\s[^>]*)?>",newP,"all");
   // strip every remaining tag
   theText = REReplaceNoCase(theText,"<[^>]+>","","all");
   return theText;
}
</cfscript>
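
One possible starting point for that last line, as a minimal sketch rather than a full answer: a negative lookahead lets the final replace skip any tag whose name is img. The function name stripTagsKeepImg is made up for illustration, and the sketch does not check whether a surrounding <p> is fully closed.

```cfml
<cfscript>
/**
 * Sketch: strips every tag except <img>, leaving text and images in place.
 * stripTagsKeepImg is a hypothetical name, not part of the original function.
 */
function stripTagsKeepImg(theHtml)
{
   // (?!img\b) is a negative lookahead: "<...>" only matches when the tag name is not img
   return REReplaceNoCase(arguments.theHtml, "<(?!img\b)[^>]+>", "", "all");
}
</cfscript>
```

For example, stripTagsKeepImg('<p>Hi <img src="a.jpg" /> there</p>') should return Hi <img src="a.jpg" /> there.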

Do you have a sample blog entry or URL that shows the scenario you're stuck on? That would help me give the most accurate answer.
Also, it appears you're replacing <h..>, <p..> and break tags with [carriage return "-" carriage return]. Do you still want this functionality? How many characters are you truncating after, and what takes care of the truncating?

I'm assuming this is to provide a blog summary and a link to the specific blog entry. Do you care where the images are placed if they're left alone? You're going to lose some formatting ability with the images once all the other HTML is stripped.

Another option is to replace the HTML special characters with their escaped equivalents and store that. This way you preserve everything, and you are left with just determining where to truncate.
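
As a rough illustration of the escaping idea, ColdFusion's built-in HTMLEditFormat() converts the HTML special characters to entities (newer ColdFusion versions also offer encodeForHTML()). The variable names and the sample string below are placeholders. Note that cutting escaped text at a fixed length can still split an entity such as &lt; in half.

```cfml
<cfscript>
// entryHtml is a placeholder for one feed entry
entryHtml = '<p>Deck <b>photos</b></p>';
// HTMLEditFormat() turns <, >, & and double quotes into entities,
// so the browser shows the markup as literal text instead of rendering it
escaped = HTMLEditFormat(entryHtml);
writeOutput(escaped);
</cfscript>
```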

Author

Commented:
Yes, I fixed that function just now. I noticed I had put '-' in there for testing at one point.

Here is a sample block of content

==
<img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 337px; DISPLAY: block; HEIGHT: 330px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138022649906306" border="0" alt="" src="http://2.bp.blogspot.com/_n37agaDjeMg/SyZvV0nCRII/AAAAAAAAACY/NToEFqkKNdU/s320/Dek+Max+bid+011.jpg" /> Here we have a beautiful home with a deck in front overlooking downtown Salt Lake City and the valley as well as a more private deck out the back. The home owner had been experiencing water leaks for several years and never found a good solution. It seemed every year or two they were spending more money on the next best thing that came along. I could hear the frustration in their voices as the described the events. <br /><br />These decks were designed to be an addition to the usable space of the home, opening up the outdoors and providing a place to relax and enjoy. Rather they became a burden, a money trap and not even being used. <br /><br /><br /><p><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138055934317986" border="0" alt="" src="http://2.bp.blogspot.com/_n37agaDjeMg/SyZvXwmqFaI/AAAAAAAAAC4/IEAQiaWlouk/s320/Dek+Max+bid+008.jpg" /></p><p> </p><p> These pictures are from the last attempt at waterproofing the decks. Some type of liquid applied product was used and it looks to have some sand added to it for slip resistance. There are a couple of problems here that the home owner may have uncovered with some effort. A large portion of the decking is over living space (that’s why the leaking was costing so much) and according to building code the waterproofing needs to be 60mil thickness at a minimum. So the question that came to mind was “How do you measure the thickness of a liquid rolled on? When the liquid dry’s it becomes hard and since this is a wood surface it will have movement. Wood shrinks, expands and contracts, and is subject to pressure from the home settling. 
So when that happens what is the result on the applied product? If it has dried on the wood surface it will fail simply because it can not “move” with the wood. </p><p><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138045438434082" border="0" alt="" src="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvXJgPmyI/AAAAAAAAACw/g3tZ_p_RJ6Y/s320/Dek+Max+bid+015.jpg" /><br /><br />Installation from a trained professional also reduces silly mistakes likes these.<br />Notice the brick, the installer simply “painted” the lower brick and decided that was waterproof. The same was done for the bottom of the railing around both decks. This is an actual hole between the brick and the door, you can see where the “waterproofing has peeled away. </p><div><div><a href="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvW05QfeI/AAAAAAAAACo/jiczNLf32Z4/s1600-h/Dek+Max+bid+016.jpg"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138039906205154" border="0" alt="" src="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvW05QfeI/AAAAAAAAACo/jiczNLf32Z4/s320/Dek+Max+bid+016.jpg" /></a><br />There are plenty of products out there. Home owners need to be aware of the proper questions to ask and do some homework to be certain you will not end up with a situation like the one described above. You can see our T.I.P.S at <a href="http://www.dekmax,com/">www.dekmax,com</a> or at a minimum you should request a copy of the ICC ES report for the product you are considering. You can look yourself at <a href="http://www.icc-es.org/">www.icc-es.org</a> </div><div>Warranty’s are good and are a common question. Check the number of years they have been in business versus the number of warranty cycles they have been through. There is nothing wrong with asking about their history. How many recalls? Manufacturing defects? 
What is the quality assurance program?<br /><br />A home owner once asked me if they should just trust their contractor? Yes you should and you should ask any question you want because they should be able to answer it clearly in terms that you understand. They may be doing the work for you but it is still your project, your home and your money. Invest it don’t just spend it.</div><div><br /><br /><div><a href="http://3.bp.blogspot.com/_n37agaDjeMg/SyZvWQ1ZIcI/AAAAAAAAACg/uVHrdRgzpYs/s1600-h/Dek+Max+bid+017.jpg"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138030226317762" border="0" alt="" src="http://3.bp.blogspot.com/_n37agaDjeMg/SyZvWQ1ZIcI/AAAAAAAAACg/uVHrdRgzpYs/s320/Dek+Max+bid+017.jpg" /></a><br /><br /><br /><br /><div><br /><br /><br /><br /><br /><div></div></div></div></div></div><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8633731374262364956-4899084485021980302?l=utahdecks.blogspot.com' alt='' /></div>
==


Ideally, we'd strip this down so it is only text with carriage returns (like my current function does),
and once in a while we'd have

<img src="[image source]">  - no closing />, and no attributes other than src


As a bonus, I was thinking it would be nice to look for the parent <p> or <div> of that <img> and preserve its opening and closing tags IF both exist in the string; otherwise strip it out and leave the <img>.

The point of this last part: the blog editor sometimes puts their images inside a <p> or <div>, making it easy to format into a caption. This is not necessary but would be a nice touch.

I have one potential solution that would do what you're looking for, but I'm curious whether you still want to strip all tags. The method I'll post here takes the tags into account when truncating the body of content, so you won't have the issue of truncating in the middle of a tag anymore. Given that, do you still want to strip all tags?
See how this works out for you. It doesn't work by stripping everything out, but by breaking the content down into tags and text. You specify how many characters of text you want to see; once the parser reaches that limit it truncates everything after it, but still writes ending tags and self-terminating tags. This solves the issue of being truncated right in the middle of a tag, because the only thing truncated is the text. It also preserves your terminating tags, so images and text are still properly wrapped by their parent containers. Links are preserved too: if the text being truncated happens to be the link text, that text is cut short, but the anchor tags are still written properly so the link works. If the link is just around an image, it's not affected by truncation.

See the code below...
//The feed

<cfsavecontent variable="datum"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 337px; DISPLAY: block; HEIGHT: 330px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138022649906306" border="0" alt="" src="http://2.bp.blogspot.com/_n37agaDjeMg/SyZvV0nCRII/AAAAAAAAACY/NToEFqkKNdU/s320/Dek+Max+bid+011.jpg" /> Here we have a beautiful home with a deck in front overlooking downtown Salt Lake City and the valley as well as a more private deck out the back. The home owner had been experiencing water leaks for several years and never found a good solution. It seemed every year or two they were spending more money on the next best thing that came along. I could hear the frustration in their voices as the described the events. <br /><br />These decks were designed to be an addition to the usable space of the home, opening up the outdoors and providing a place to relax and enjoy. Rather they became a burden, a money trap and not even being used. <br /><br /><br /><p><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138055934317986" border="0" alt="" src="http://2.bp.blogspot.com/_n37agaDjeMg/SyZvXwmqFaI/AAAAAAAAAC4/IEAQiaWlouk/s320/Dek+Max+bid+008.jpg" /></p><p> </p><p> These pictures are from the last attempt at waterproofing the decks. Some type of liquid applied product was used and it looks to have some sand added to it for slip resistance. There are a couple of problems here that the home owner may have uncovered with some effort. A large portion of the decking is over living space (that’s why the leaking was costing so much) and according to building code the waterproofing needs to be 60mil thickness at a minimum. So the question that came to mind was “How do you measure the thickness of a liquid rolled on? When the liquid dry’s it becomes hard and since this is a wood surface it will have movement. 
Wood shrinks, expands and contracts, and is subject to pressure from the home settling. So when that happens what is the result on the applied product? If it has dried on the wood surface it will fail simply because it can not “move” with the wood. </p><p><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138045438434082" border="0" alt="" src="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvXJgPmyI/AAAAAAAAACw/g3tZ_p_RJ6Y/s320/Dek+Max+bid+015.jpg" /><br /><br />Installation from a trained professional also reduces silly mistakes likes these.<br />Notice the brick, the installer simply “painted” the lower brick and decided that was waterproof. The same was done for the bottom of the railing around both decks. This is an actual hole between the brick and the door, you can see where the “waterproofing has peeled away. </p><div><div><a href="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvW05QfeI/AAAAAAAAACo/jiczNLf32Z4/s1600-h/Dek+Max+bid+016.jpg"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138039906205154" border="0" alt="" src="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvW05QfeI/AAAAAAAAACo/jiczNLf32Z4/s320/Dek+Max+bid+016.jpg" /></a><br />There are plenty of products out there. Home owners need to be aware of the proper questions to ask and do some homework to be certain you will not end up with a situation like the one described above. You can see our T.I.P.S at <a href="http://www.dekmax,com/">www.dekmax,com</a> or at a minimum you should request a copy of the ICC ES report for the product you are considering. You can look yourself at <a href="http://www.icc-es.org/">www.icc-es.org</a> </div><div>Warranty’s are good and are a common question. Check the number of years they have been in business versus the number of warranty cycles they have been through. 
There is nothing wrong with asking about their history. How many recalls? Manufacturing defects? What is the quality assurance program?<br /><br />A home owner once asked me if they should just trust their contractor? Yes you should and you should ask any question you want because they should be able to answer it clearly in terms that you understand. They may be doing the work for you but it is still your project, your home and your money. Invest it don’t just spend it.</div><div><br /><br /><div><a href="http://3.bp.blogspot.com/_n37agaDjeMg/SyZvWQ1ZIcI/AAAAAAAAACg/uVHrdRgzpYs/s1600-h/Dek+Max+bid+017.jpg"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138030226317762" border="0" alt="" src="http://3.bp.blogspot.com/_n37agaDjeMg/SyZvWQ1ZIcI/AAAAAAAAACg/uVHrdRgzpYs/s320/Dek+Max+bid+017.jpg" /></a><br /><br /><br /><br /><div><br /><br /><br /><br /><br /><div></div></div></div></div></div><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8633731374262364956-4899084485021980302?l=utahdecks.blogspot.com' alt='' /></div></cfsavecontent>

//The test; 
//the numeric argument to truncateHTMLText() is the character length you wish to truncate to

<cfoutput>#truncateHTMLText(datum, 1301)#</cfoutput>

//The logic...

<cfscript>
function truncateHTMLText(context, contentLength){
 var i = 0;
 var maxLen = arguments.contentLength;
 var datum = parseDatum(arguments.context);
 var result = arrayNew(1);
 var limitCount = 0;
 var setBreak = false;
 var matchOpener = arrayNew(1); //stack of tag names opened before the cut
 var skipDepth = 0; //openers skipped after the cut; their closers must be skipped too
 for(i = 1; i lte arrayLen(datum); i = i + 1){
  if(not datum[i].isTag and not setBreak){
   if(limitCount + datum[i].len lte maxLen){
    arrayAppend(result, datum[i].context);
    limitCount = limitCount + datum[i].len;
   }
   else{
    //left() rejects a count of 0, so only call it when some room remains
    if(maxLen gt limitCount){
     arrayAppend(result, left(datum[i].context, maxLen - limitCount));
    }
    arrayAppend(result, "...");
    setBreak = true;
   }
  }
  else if(not setBreak){
   arrayAppend(result, datum[i].context);
   if(datum[i].isOpener){
    arrayAppend(matchOpener, datum[i].type);
   }
   //guard against a stray closing tag when nothing is left on the stack
   if(datum[i].isCloser and arrayLen(matchOpener)){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
  else if(not arrayLen(matchOpener)){
   break;
  }
  else if(datum[i].isTag and datum[i].isOpener){
   skipDepth = skipDepth + 1;
  }
  else if(datum[i].isTag and datum[i].isCloser and skipDepth gt 0){
   skipDepth = skipDepth - 1;
  }
  else if(datum[i].isTag and (datum[i].isCloser or datum[i].isSelfTerminating)){
   arrayAppend(result, datum[i].context);
   if(datum[i].isCloser){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
 }
 return arrayToList(result, "");
}

function parseDatum(context){
 var datum = arguments.context;
 var tags = arrayNew(1);
 var data = arrayNew(1);
 var idx = 1;
 var i = 0;
 var total = len(datum);
 var reg = "<[^>]+>";
 var result = structNew();
 var tagsLength = 0;
 var forwardPos = 0;
 var dataLength = 0;
 while(idx lte total and idx neq 0){
  result = reFindNoCase(reg, datum, idx, true);
  idx = result.pos[1] + result.len[1];
  if(idx neq 0){
   arrayAppend(tags, appendPosition(result.pos[1], result.len[1]));
  }
 }
 tagsLength = arrayLen(tags);
 if(tagsLength and tags[1].pos neq 1){
  arrayAppend(data, appendNode(1, tags[1].pos - 1, datum));
 }
 for(i = 1; i lte tagsLength; i = i + 1){
  arrayAppend(data, appendNode(tags[i].pos, tags[i].len, datum));
  if(i neq tagsLength and tags[i].pos + tags[i].len neq tags[i + 1].pos){
   forwardPos = tags[i].pos + tags[i].len;
   arrayAppend(data, appendNode(forwardPos, tags[i + 1].pos - forwardPos, datum));
  }
 }
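 dataLength = arrayLen(data);
 if(not dataLength and total){
  //no tags at all: the whole input is one text node
  arrayAppend(data, appendNode(1, total, datum));
 }
 else if(dataLength and data[dataLength].pos + data[dataLength].len lte total){
  //trailing text after the last tag: its length runs from forwardPos to the end of the input
  forwardPos = data[dataLength].pos + data[dataLength].len;
  arrayAppend(data, appendNode(forwardPos, total - forwardPos + 1, datum));
 }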
 return data; 
}

function appendNode(pos, len, context){
 var node = structnew();
 var typeReg = "<[\/]?([a-z]+)\s?";
 var typeResult = structNew();
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 node['context'] = mid(arguments.context, node.pos, node.len);
 node['isTag'] = reFind("<[^>]+>", node.context, 1, false) eq 1;
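 if(node.isTag){
  typeResult = reFindNoCase(typeReg, node.context, 1, true);
  node['type'] = "";
  if(arrayLen(typeResult.pos) gte 2 and typeResult.pos[2]){
   node['type'] = lCase(mid(node.context, typeResult.pos[2], typeResult.len[2]));
  }
  node['isCloser'] = left(node.context, 2) eq "</";
  //void elements (<br>, <img>, ...) and <!-- comments --> never take a closing tag, with or without "/>"
  node['isSelfTerminating'] = not node.isCloser and (right(node.context, 2) eq "/>" or left(node.context, 2) eq "<!" or listFind("br,img,hr,input,meta,link", node.type) gt 0);
  node['isOpener'] = not node.isCloser and not node.isSelfTerminating;
 }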
 return node; 
}

function appendPosition(pos, len){
 var node = structnew();
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 return node; 
}
</cfscript>


Author

Commented:
WOW. checking this out now... will report back... thx!

Author

Commented:
That is very slick.
At first glance doing exactly what we need. I am going to assign full points for this great answer but also ask a quick follow up Q if that's ok.
Thank you VERY much.

Author

Commented:
Excellent response to a detailed request!

Author

Commented:
I have attached the source of my customer's blog feed as rendered by your script.
The remaining things I would like to strip out would be

a) More than one <br /> OR <br> tag in a row (i.e. replace any number of consecutive <br /> tags with just one).

b) All attributes EXCEPT the src="" on img tags, so that no style/class/font/etc. attributes come through on any tag
(optionally, if the list of attributes to remove or allow were passed to the script as a variable, it would be usable in the maximum number of ways)

i.e. this

<span style="color: rgb(0, 0, 0);font-family:Arial;font-size:100%;"  ><div  style="font-family:georgia;"><span class="062374316-02102009">

would become simply

<span><div><span>


truncateHtmlText-Source.txt
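
For comparison, both requests can also be sketched with plain regular expressions, independent of truncateHTMLText(). cleanMarkup is a hypothetical helper name, and the patterns assume that attribute values never contain a > character.

```cfml
<cfscript>
/**
 * Sketch only: cleanMarkup is a hypothetical helper, separate from truncateHTMLText().
 */
function cleanMarkup(html)
{
   var out = arguments.html;
   // a) collapse any run of <br>, <br/> or <br /> (optionally separated by whitespace) into one
   out = REReplaceNoCase(out, "(<br\s*/?>\s*){2,}", "<br />", "all");
   // b) reduce every <img> to its quoted src attribute only
   out = REReplaceNoCase(out, "<img\b[^>]*?\ssrc\s*=\s*(""[^""]*""|'[^']*')[^>]*>", "<img src=\1>", "all");
   // b) drop the attributes from every other opening tag, keeping a trailing "/" if present
   out = REReplaceNoCase(out, "<(?!img\b)([a-z][a-z0-9]*)\s[^>]*?(/?)>", "<\1\2>", "all");
   return out;
}
</cfscript>
```

With this sketch, cleanMarkup('<span style="color: red;"><div style="font-family:georgia;"><span class="x">') should give <span><div><span>. A list of allowed attributes could be passed in as a second argument, but that is left out here to keep the example short.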
That would be fairly easy to do. What my code does is pull each tag into an array element and store specific information about each one (basically an array of structs), which is done by appendNode(). Modifying that just takes a little extra cleaning up while pushing the data to the array.

I'll make a few tweaks and re-post in a few...

Thanks and glad I could help!
Give this a try. I added a function called applyNodeRule(), which contains a switch statement to apply rules to nearly any tag or even text. Since each array element holds information about what it contains, we can use this function to clean up any unwanted stuff.
function truncateHTMLText(context, contentLength){
 var i = 0;
 var maxLen = arguments.contentLength;
 var datum = parseDatum(arguments.context);
 var result = arrayNew(1);
 var limitCount = 0;
 var setBreak = false;
 var matchOpener = arrayNew(1); //stack of tag names opened before the cut
 var skipDepth = 0; //openers skipped after the cut; their closers must be skipped too
 for(i = 1; i lte arrayLen(datum); i = i + 1){
  if(not datum[i].isTag and not setBreak){
   if(limitCount + datum[i].len lte maxLen){
    arrayAppend(result, datum[i].context);
    limitCount = limitCount + datum[i].len;
   }
   else{
    //left() rejects a count of 0, so only call it when some room remains
    if(maxLen gt limitCount){
     arrayAppend(result, left(datum[i].context, maxLen - limitCount));
    }
    arrayAppend(result, "...");
    setBreak = true;
   }
  }
  else if(not setBreak){
   arrayAppend(result, datum[i].context);
   if(datum[i].isOpener){
    arrayAppend(matchOpener, datum[i].type);
   }
   //guard against a stray closing tag when nothing is left on the stack
   if(datum[i].isCloser and arrayLen(matchOpener)){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
  else if(not arrayLen(matchOpener)){
   break;
  }
  else if(datum[i].isTag and datum[i].isOpener){
   skipDepth = skipDepth + 1;
  }
  else if(datum[i].isTag and datum[i].isCloser and skipDepth gt 0){
   skipDepth = skipDepth - 1;
  }
  else if(datum[i].isTag and (datum[i].isCloser or datum[i].isSelfTerminating)){
   arrayAppend(result, datum[i].context);
   if(datum[i].isCloser){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
 }
 return arrayToList(result, "");
}
//returns array
function parseDatum(context){
 var datum = arguments.context;
 var tags = arrayNew(1);
 var data = arrayNew(1);
 var idx = 1;
 var i = 0;
 var total = len(datum);
 var reg = "<[^>]+>";
 var result = structNew();
 var tagsLength = 0;
 var forwardPos = 0;
 var dataLength = 0;
 while(idx lte total and idx neq 0){
  result = reFindNoCase(reg, datum, idx, true);
  idx = result.pos[1] + result.len[1];
  if(idx neq 0){
   arrayAppend(tags, appendPosition(result.pos[1], result.len[1]));
  }
 }
 tagsLength = arrayLen(tags);
 if(tagsLength and tags[1].pos neq 1){
  arrayAppend(data, appendNode(1, tags[1].pos - 1, datum, data));
 }
 for(i = 1; i lte tagsLength; i = i + 1){
  arrayAppend(data, appendNode(tags[i].pos, tags[i].len, datum, data));
  if(i neq tagsLength and tags[i].pos + tags[i].len neq tags[i + 1].pos){
   forwardPos = tags[i].pos + tags[i].len;
   arrayAppend(data, appendNode(forwardPos, tags[i + 1].pos - forwardPos, datum, data));
  }
 }
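 dataLength = arrayLen(data);
 if(not dataLength and total){
  //no tags at all: the whole input is one text node
  arrayAppend(data, appendNode(1, total, datum, data));
 }
 else if(dataLength and data[dataLength].pos + data[dataLength].len lte total){
  //trailing text after the last tag: its length runs from forwardPos to the end of the input
  forwardPos = data[dataLength].pos + data[dataLength].len;
  arrayAppend(data, appendNode(forwardPos, total - forwardPos + 1, datum, data));
 }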
 return data; 
}
//returns struct
function appendNode(pos, len, context, build){
 var node = structnew();
 var typeReg = "<[\/]?([a-z]+)\s?";
 var typeResult = structNew();
 var buildLength = arrayLen(build);
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 node['context'] = mid(arguments.context, node.pos, node.len);
 node['isTag'] = reFind("<[^>]+>", node.context, 1, false) eq 1;
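 if(node.isTag){
  typeResult = reFindNoCase(typeReg, node.context, 1, true);
  node['type'] = "";
  if(arrayLen(typeResult.pos) gte 2 and typeResult.pos[2]){
   node['type'] = lCase(mid(node.context, typeResult.pos[2], typeResult.len[2]));
  }
  node['isCloser'] = left(node.context, 2) eq "</";
  //void elements (<br>, <img>, ...) and <!-- comments --> never take a closing tag, with or without "/>"
  node['isSelfTerminating'] = not node.isCloser and (right(node.context, 2) eq "/>" or left(node.context, 2) eq "<!" or listFind("br,img,hr,input,meta,link", node.type) gt 0);
  node['isOpener'] = not node.isCloser and not node.isSelfTerminating;
  //true when the node pushed just before this one is a tag of the same type
  node['isRepeated'] = false;
  if(buildLength and build[buildLength].isTag){
   node['isRepeated'] = build[buildLength].type eq node.type;
  }
 }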
 node = applyNodeRule(node);
 return node; 
}
//returns struct
function appendPosition(pos, len){
 var node = structnew();
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 return node; 
}
//cleans out unwanted data
function applyNodeRule(tagContext){
 var datum = arguments.tagContext;
 var context = datum.context;
 var newContext = "";
 var search = structNew();
 if(datum.isTag){
  switch(datum.type){
   case "img":
    //preserve only the src attribute; sub-expression 1 (pos[2]) captures its whole value
    search = reFindNoCase("<img[^>]*\ssrc\s*=\s*['""]([^'""]+)['""]", context, 1, true);
    //an <img> with no usable src is dropped, since newContext stays empty
    if(arrayLen(search.pos) gte 2 and search.pos[2]){
     newContext = "<img src=""" & mid(context, search.pos[2], search.len[2]) & """";
     if(right(context, 2) eq "/>"){
      newContext = newContext & "/";
     }
     newContext = newContext & ">";
    }
    break;
   case "br":
    //clear repeated <br> tags
	if(datum.isRepeated){
	 newContext = "";
	 break;
	}
	return datum;
    break;
   case "div":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<div>";
	}
    break;
   case "span":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<span>";
	}
    break;
   default:
    return datum;
	break;
  }
  datum.context = newContext;
 }
 return datum;
}


In addition to what you mentioned above, if you want to strip out all empty tags, i.e. <span></span> or even <span><span><span></span></span></span>, use this code.

I added one more function, stripEmptyTags(), to strip those out.
function truncateHTMLText(context, contentLength){
 var i = 0;
 var maxLen = arguments.contentLength;
 var datum = parseDatum(arguments.context);
 var result = arrayNew(1);
 var limitCount = 0;
 var setBreak = false;
 var matchOpener = arrayNew(1); //stack of tag names opened before the cut
 var skipDepth = 0; //openers skipped after the cut; their closers must be skipped too
 for(i = 1; i lte arrayLen(datum); i = i + 1){
  if(not datum[i].isTag and not setBreak){
   if(limitCount + datum[i].len lte maxLen){
    arrayAppend(result, datum[i].context);
    limitCount = limitCount + datum[i].len;
   }
   else{
    //left() rejects a count of 0, so only call it when some room remains
    if(maxLen gt limitCount){
     arrayAppend(result, left(datum[i].context, maxLen - limitCount));
    }
    arrayAppend(result, "...");
    setBreak = true;
   }
  }
  else if(not setBreak){
   arrayAppend(result, datum[i].context);
   if(datum[i].isOpener){
    arrayAppend(matchOpener, datum[i].type);
   }
   //guard against a stray closing tag when nothing is left on the stack
   if(datum[i].isCloser and arrayLen(matchOpener)){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
  else if(not arrayLen(matchOpener)){
   break;
  }
  else if(datum[i].isTag and datum[i].isOpener){
   skipDepth = skipDepth + 1;
  }
  else if(datum[i].isTag and datum[i].isCloser and skipDepth gt 0){
   skipDepth = skipDepth - 1;
  }
  else if(datum[i].isTag and (datum[i].isCloser or datum[i].isSelfTerminating)){
   arrayAppend(result, datum[i].context);
   if(datum[i].isCloser){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
 }
 return stripEmptyTags(arrayToList(result, ""));
}
//returns array
function parseDatum(context){
 var datum = arguments.context;
 var tags = arrayNew(1);
 var data = arrayNew(1);
 var idx = 1;
 var i = 0;
 var total = len(datum);
 var reg = "<[^>]+>";
 var result = structNew();
 var tagsLength = 0;
 var forwardPos = 0;
 var dataLength = 0;
 while(idx lte total and idx neq 0){
  result = reFindNoCase(reg, datum, idx, true);
  idx = result.pos[1] + result.len[1];
  if(idx neq 0){
   arrayAppend(tags, appendPosition(result.pos[1], result.len[1]));
  }
 }
 tagsLength = arrayLen(tags);
 if(tagsLength and tags[1].pos neq 1){
  arrayAppend(data, appendNode(1, tags[1].pos - 1, datum, data));
 }
 for(i = 1; i lte tagsLength; i = i + 1){
  arrayAppend(data, appendNode(tags[i].pos, tags[i].len, datum, data));
  if(i neq tagsLength and tags[i].pos + tags[i].len neq tags[i + 1].pos){
   forwardPos = tags[i].pos + tags[i].len;
   arrayAppend(data, appendNode(forwardPos, tags[i + 1].pos - forwardPos, datum, data));
  }
 }
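 dataLength = arrayLen(data);
 if(not dataLength and total){
  //no tags at all: the whole input is one text node
  arrayAppend(data, appendNode(1, total, datum, data));
 }
 else if(dataLength and data[dataLength].pos + data[dataLength].len lte total){
  //trailing text after the last tag: its length runs from forwardPos to the end of the input
  forwardPos = data[dataLength].pos + data[dataLength].len;
  arrayAppend(data, appendNode(forwardPos, total - forwardPos + 1, datum, data));
 }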
 return data; 
}
//returns struct
function appendNode(pos, len, context, build){
 var node = structnew();
 var typeReg = "<[\/]?([a-z]+)\s?";
 var typeResult = structNew();
 var buildLength = arrayLen(build);
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 node['context'] = mid(arguments.context, node.pos, node.len);
 node['isTag'] = reFind("<[^>]+>", node.context, 1, false) eq 1;
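 if(node.isTag){
  typeResult = reFindNoCase(typeReg, node.context, 1, true);
  node['type'] = "";
  if(arrayLen(typeResult.pos) gte 2 and typeResult.pos[2]){
   node['type'] = lCase(mid(node.context, typeResult.pos[2], typeResult.len[2]));
  }
  node['isCloser'] = left(node.context, 2) eq "</";
  //void elements (<br>, <img>, ...) and <!-- comments --> never take a closing tag, with or without "/>"
  node['isSelfTerminating'] = not node.isCloser and (right(node.context, 2) eq "/>" or left(node.context, 2) eq "<!" or listFind("br,img,hr,input,meta,link", node.type) gt 0);
  node['isOpener'] = not node.isCloser and not node.isSelfTerminating;
  //true when the node pushed just before this one is a tag of the same type
  node['isRepeated'] = false;
  if(buildLength and build[buildLength].isTag){
   node['isRepeated'] = build[buildLength].type eq node.type;
  }
 }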
 node = applyNodeRule(node);
 return node; 
}
//returns struct
function appendPosition(pos, len){
 var node = structnew();
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 return node; 
}
//cleans out unwanted data
function applyNodeRule(tagContext){
 var datum = arguments.tagContext;
 var context = datum.context;
 var newContext = "";
 var search = structNew();
 if(datum.isTag){
  switch(datum.type){
   case "img":
    //preserve only the src attribute; sub-expression 1 (pos[2]) captures its whole value
    search = reFindNoCase("<img[^>]*\ssrc\s*=\s*['""]([^'""]+)['""]", context, 1, true);
    //an <img> with no usable src is dropped, since newContext stays empty
    if(arrayLen(search.pos) gte 2 and search.pos[2]){
     newContext = "<img src=""" & mid(context, search.pos[2], search.len[2]) & """";
     if(right(context, 2) eq "/>"){
      newContext = newContext & "/";
     }
     newContext = newContext & ">";
    }
    break;
   case "br":
    //clear repeated <br> tags
	if(datum.isRepeated){
	 newContext = "";
	 break;
	}
	return datum;
    break;
   case "div":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<div>";
	}
    break;
   case "span":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<span>";
	}
    break;
   default:
    return datum;
	break;
  }
  datum.context = newContext;
 }
 return datum;
}
//remove all empty tags
function stripEmptyTags(context){
 var datum = arguments.context;
 var emptyTagRegex = "<(\w+)>(\s|&nbsp;)*</\1>";
 datum = reReplaceNoCase(datum, emptyTagRegex, "", "all");
 if(reFindNoCase(emptyTagRegex, datum, 1, false)){
  datum = stripEmptyTags(datum);
 }
 return datum;
}
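
A loop-based variant of the same idea, sketched here in case deep recursion is a concern. stripEmptyTagsLoop is a hypothetical name. The regex is the one from stripEmptyTags() above; it only matches tags without attributes, so it works best after the attributes have been removed.

```cfml
<cfscript>
function stripEmptyTagsLoop(html){
 var out = arguments.html;
 var emptyTagRegex = "<(\w+)>(\s|&nbsp;)*</\1>";
 //each pass removes the innermost empty pairs; repeat until none are left
 while(reFindNoCase(emptyTagRegex, out)){
  out = reReplaceNoCase(out, emptyTagRegex, "", "all");
 }
 return out;
}
</cfscript>
```

For example, stripEmptyTagsLoop("<div><span><span> </span></span>Text</div>") should return <div>Text</div> after two passes.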


Author

Commented:
thanks so much. It might take me a bit to get back to that project and mess with it again, but i will be back!

Author

Commented:
Excellent. Beautiful. 1000 points if i could. Thanks SO much!
