Strip HTML, preserving img tags and their parent paragraphs

I'm using ColdFusion to parse the content of a blog feed into a web page, truncating each entry after a specific number of characters.
I have a function that turns all of the junk HTML into plain text, so my page doesn't break when the text is truncated in the middle of an open <div>, but it also strips out the images.

I've pasted the function below.
What I'd like to do is modify it so that all fully closed <img> tags are left completely alone, and if the <img> is inside a fully closed <p>, that <p> is left alone too.

By fully closed, I mean that since I will be passing in a truncated block of HTML, the function should make sure any <img> and <p> tags being preserved have matching end tags, or else they should be stripped out anyway.

Thanks in advance. Here's the current function.
You'll see it replaces a list of specific tags with a new line in the text, and then strips all remaining <*> HTML completely.

I believe the regex we are looking to modify is in the last REReplaceNoCase call:
theText = REReplaceNoCase(theText,"<[^>]+>","","all");

<cfscript>
/**
 * strips html out of text, replaces paragraphs with line breaks, adds text versions of links
 * @param theHtml 	 HTML you wish to render to text. (Required)
 * @return Returns a string. 
 */
function htmlToText(theHtml)
{
   // separator inserted in place of line breaks, headings and paragraphs
   var newP = chr(13) & chr(10) & '-' & chr(13) & chr(10);
   // rewrite text links as "link text: url"; [^>]* keeps the match inside a single <a> tag
   var theText = REReplaceNoCase(theHTML,"<a [^>]*(href=['""]?)([^'"" ]+)['"" ][^>]*>([^<]+)</a>","\3: \2","all");
   theText = REReplaceNoCase(theText,"<br />",newP,"all");
   theText = REReplaceNoCase(theText,"<br>",newP,"all");
   theText = REReplaceNoCase(theText,"<h1[^>]*>",newP,"all");
   theText = REReplaceNoCase(theText,"<h2[^>]*>",newP,"all");
   theText = REReplaceNoCase(theText,"<h3[^>]*>",newP,"all");
   theText = REReplaceNoCase(theText,"<h4[^>]*>",newP,"all");
   theText = REReplaceNoCase(theText,"<h5[^>]*>",newP,"all");
   theText = REReplaceNoCase(theText,"<p[^>]*>",newP,"all");
   // strip every remaining tag
   theText = REReplaceNoCase(theText,"<[^>]+>","","all");
   return theText;
}
</cfscript>

trailblazzyr55

Do you have a sample blog entry or URL that shows the scenario you're stuck on? It would help in giving the most accurate answer.

trailblazzyr55

Also, it appears you're replacing <h..> and <p..> tags and break tags with [carriage return "-" carriage return]. Do you still want this functionality? How many characters are you truncating after, and what takes care of the truncating?

I'm assuming this is to provide a blog summary and a link to the specific blog entry. Do you care where the images are placed if they're left alone? You're going to lose some formatting ability with the images once all the other HTML is stripped.

Another option is to replace the HTML characters with their escaped equivalents and store that. This way you preserve all of the display markup and are left with just determining where to truncate.
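As a minimal sketch of the escaping idea, assuming the raw entry HTML sits in a variable (the names `entryBody` and `escapedBody` are hypothetical), ColdFusion's built-in htmlEditFormat() converts the markup characters to entities:

```cfscript
<cfscript>
// entryBody is a hypothetical variable holding one raw blog entry
entryBody = '<p>Deck photo: <img src="deck.jpg" /></p>';
// htmlEditFormat() replaces <, >, & and double quotes with their HTML entities
escapedBody = htmlEditFormat(entryBody);
writeOutput(escapedBody);
</cfscript>
```

The browser shows the escaped string as literal text, so a cut in the middle of a tag can no longer break the page layout.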

MichaelEvangelista

ASKER

Yes, I fixed that function just now; I noticed I had put '-' in there for testing at one point.

Here is a sample block of content

==
<img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 337px; DISPLAY: block; HEIGHT: 330px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138022649906306" border="0" alt="" src="http://2.bp.blogspot.com/_n37agaDjeMg/SyZvV0nCRII/AAAAAAAAACY/NToEFqkKNdU/s320/Dek+Max+bid+011.jpg" /> Here we have a beautiful home with a deck in front overlooking downtown Salt Lake City and the valley as well as a more private deck out the back. The home owner had been experiencing water leaks for several years and never found a good solution. It seemed every year or two they were spending more money on the next best thing that came along. I could hear the frustration in their voices as the described the events. <br /><br />These decks were designed to be an addition to the usable space of the home, opening up the outdoors and providing a place to relax and enjoy. Rather they became a burden, a money trap and not even being used. <br /><br /><br /><p><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138055934317986" border="0" alt="" src="http://2.bp.blogspot.com/_n37agaDjeMg/SyZvXwmqFaI/AAAAAAAAAC4/IEAQiaWlouk/s320/Dek+Max+bid+008.jpg" /></p><p> </p><p> These pictures are from the last attempt at waterproofing the decks. Some type of liquid applied product was used and it looks to have some sand added to it for slip resistance. There are a couple of problems here that the home owner may have uncovered with some effort. A large portion of the decking is over living space (that’s why the leaking was costing so much) and according to building code the waterproofing needs to be 60mil thickness at a minimum. So the question that came to mind was “How do you measure the thickness of a liquid rolled on? When the liquid dry’s it becomes hard and since this is a wood surface it will have movement. Wood shrinks, expands and contracts, and is subject to pressure from the home settling. 
So when that happens what is the result on the applied product? If it has dried on the wood surface it will fail simply because it can not “move” with the wood. </p><p><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138045438434082" border="0" alt="" src="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvXJgPmyI/AAAAAAAAACw/g3tZ_p_RJ6Y/s320/Dek+Max+bid+015.jpg" /><br /><br />Installation from a trained professional also reduces silly mistakes likes these.<br />Notice the brick, the installer simply “painted” the lower brick and decided that was waterproof. The same was done for the bottom of the railing around both decks. This is an actual hole between the brick and the door, you can see where the “waterproofing has peeled away. </p><div><div><a href="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvW05QfeI/AAAAAAAAACo/jiczNLf32Z4/s1600-h/Dek+Max+bid+016.jpg"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138039906205154" border="0" alt="" src="http://4.bp.blogspot.com/_n37agaDjeMg/SyZvW05QfeI/AAAAAAAAACo/jiczNLf32Z4/s320/Dek+Max+bid+016.jpg" /></a><br />There are plenty of products out there. Home owners need to be aware of the proper questions to ask and do some homework to be certain you will not end up with a situation like the one described above. You can see our T.I.P.S at <a href="http://www.dekmax,com/">www.dekmax,com</a> or at a minimum you should request a copy of the ICC ES report for the product you are considering. You can look yourself at <a href="http://www.icc-es.org/">www.icc-es.org</a> </div><div>Warranty’s are good and are a common question. Check the number of years they have been in business versus the number of warranty cycles they have been through. There is nothing wrong with asking about their history. How many recalls? Manufacturing defects? 
What is the quality assurance program?<br /><br />A home owner once asked me if they should just trust their contractor? Yes you should and you should ask any question you want because they should be able to answer it clearly in terms that you understand. They may be doing the work for you but it is still your project, your home and your money. Invest it don’t just spend it.</div><div><br /><br /><div><a href="http://3.bp.blogspot.com/_n37agaDjeMg/SyZvWQ1ZIcI/AAAAAAAAACg/uVHrdRgzpYs/s1600-h/Dek+Max+bid+017.jpg"><img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 240px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5415138030226317762" border="0" alt="" src="http://3.bp.blogspot.com/_n37agaDjeMg/SyZvWQ1ZIcI/AAAAAAAAACg/uVHrdRgzpYs/s320/Dek+Max+bid+017.jpg" /></a><br /><br /><br /><br /><div><br /><br /><br /><br /><br /><div></div></div></div></div></div><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8633731374262364956-4899084485021980302?l=utahdecks.blogspot.com' alt='' /></div>
==

Ideally, we'd strip this down so it is only text with carriage returns (like my current function does),
and once in a while we'd have

<img src="[image source]"> - no closing />, and no attributes other than src

As a bonus, I was thinking it would be nice to look for the parent <p> or <div> of that <img> and preserve its opening and closing tags IF both exist in the string; otherwise strip it out and leave the <img>.

The point of this last part: the blog editor sometimes puts images inside a <p> or <div>, making it easy to format them with a caption. This is not necessary but would be a nice touch.

trailblazzyr55

I have one potential solution here that would satisfy what you're looking to do, but I was curious if you still want to strip all tags? The method(s) I'll propose and post here will take into account the tags when truncating the body of content, so you won't have the issue of truncating in the middle of a tag anymore. So given that, do you still want to strip all tags?

ASKER CERTIFIED SOLUTION

trailblazzyr55

This solution is only available to members.

MichaelEvangelista

ASKER

WOW. checking this out now... will report back... thx!

MichaelEvangelista

ASKER

That is very slick.
At first glance it does exactly what we need. I am going to assign full points for this great answer, but I'd also like to ask a quick follow-up question if that's OK.
Thank you VERY much.

MichaelEvangelista

ASKER

Excellent response to a detailed request!

MichaelEvangelista

ASKER

I have attached the source of my customer's blog feed as rendered by your script.
The remaining things I would like to strip out are:

a) More than one <br /> or <br> tag in a row (i.e. replace any number of consecutive <br /> tags with just one).

b) All attributes EXCEPT src="" on img tags; no style/class/font/etc. attributes should come through on any other tag.
(Optionally, if the list of attributes to remove or allow were passed to the script as a variable, it would be usable in the maximum number of ways.)

i.e. this

<span style="color: rgb(0, 0, 0);font-family:Arial;font-size:100%;" ><div style="font-family:georgia;"><span class="062374316-02102009">

would become simply

<span><div><span>
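
A rough regex-only sketch of (a) and (b), limited to <span> and <div> and assuming attribute values never contain a ">" character. The names `sample` and `cleaned` are hypothetical:

```cfscript
<cfscript>
// sample is a hypothetical input string
sample = '<span style="color:red;" ><div style="font-family:georgia;"><span class="x">text<br /><br><br />more';
// (a) collapse any run of two or more <br> or <br /> tags into a single <br />
cleaned = reReplaceNoCase(sample, "(<br\s*/?>\s*){2,}", "<br />", "all");
// (b) drop every attribute from opening <span> and <div> tags
cleaned = reReplaceNoCase(cleaned, "<(span|div)\s[^>]*>", "<\1>", "all");
// cleaned is now: <span><div><span>text<br />more
writeOutput(htmlEditFormat(cleaned));
</cfscript>
```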

truncateHtmlText-Source.txt

trailblazzyr55

That would be fairly easy to do. What my code does is pull each tag into an array element and store specific information about each one, so the result is basically an array of structs, built by appendNode(). Modifying that just takes a little extra cleaning up while pushing the data to the array.
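
For illustration, one element of that array might look like the following. The field names are the ones appendNode() sets in the code posted later in this thread; the values are made up:

```cfscript
<cfscript>
// hand-built illustration of one array element for the tag <br />
node = structNew();
node['pos'] = 120;            // hypothetical start position in the source string
node['len'] = 6;              // length of "<br />"
node['context'] = "<br />";   // the raw text of the tag
node['isTag'] = true;
node['isSelfTerminating'] = true;
node['isCloser'] = false;
node['isOpener'] = false;
node['type'] = "br";
node['isRepeated'] = false;   // true when the previous element is a tag of the same type
</cfscript>
```

Text between tags gets an element too, with isTag set to false and only pos, len and context filled in.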

I'll make a few tweaks and re-post in a few...

Thanks and glad I could help!

trailblazzyr55

Give this a try. I added a function called applyNodeRule(), which contains a switch statement that can apply rules to nearly any tag, or even to text. Since each array element holds information about what it contains, we can use this function to make that info work for us in cleaning up any unwanted stuff.

function truncateHTMLText(context, contentLength){
 var i = 0;
 var maxLen = arguments.contentLength;
 var datum = parseDatum(arguments.context);
 var result = arrayNew(1);
 var limitCount = 0;
 var setBreak = false;
 var matchOpener = arrayNew(1); 
 for(i = 1; i lte arrayLen(datum); i = i + 1){
  if(not datum[i].isTag and limitCount lte maxLen and not setBreak){
   if(limitCount + datum[i].len lte maxLen){
    arrayAppend(result, datum[i].context);
	limitCount = limitCount + datum[i].len;
   }
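   else{
    //left() needs a positive count, so skip it when the limit has already been reached
    if(maxLen - limitCount gt 0){
     arrayAppend(result, left(datum[i].context, maxLen - limitCount));
    }
    arrayAppend(result, "...");
    setBreak = true;
   }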
  }
  else if(not setBreak){
   arrayAppend(result, datum[i].context);
   if(datum[i].isOpener){
    arrayAppend(matchOpener, datum[i].type);
   }
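   //only pop when something is open, so a stray closing tag cannot cause an error
   if(datum[i].isCloser and arrayLen(matchOpener)){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }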
  }
  else if(setBreak and datum[i].isTag and arrayLen(matchOpener) and (datum[i].isCloser or datum[i].isSelfTerminating)){
   arrayAppend(result, datum[i].context);
   if(datum[i].isCloser){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
  else if(not arrayLen(matchOpener)){
   break;
  }
 }
 return arrayToList(result, "");
}
//returns array
function parseDatum(context){
 var datum = arguments.context;
 var tags = arrayNew(1);
 var data = arrayNew(1);
 var idx = 1;
 var i = 0;
 var total = len(datum);
 var reg = "<[^>]+>";
 var result = structNew();
 var tagsLength = 0;
 var forwardPos = 0;
 var dataLength = 0;
 while(idx lte total and idx neq 0){
  result = reFindNoCase(reg, datum, idx, true);
  idx = result.pos[1] + result.len[1];
  if(idx neq 0){
   arrayAppend(tags, appendPosition(result.pos[1], result.len[1]));
  }
 }
 tagsLength = arrayLen(tags);
 if(tagsLength and tags[1].pos neq 1){
  arrayAppend(data, appendNode(1, tags[1].pos - 1, datum, data));
 }
 for(i = 1; i lte tagsLength; i = i + 1){
  arrayAppend(data, appendNode(tags[i].pos, tags[i].len, datum, data));
  if(i neq tagsLength and tags[i].pos + tags[i].len neq tags[i + 1].pos){
   forwardPos = tags[i].pos + tags[i].len;
   arrayAppend(data, appendNode(forwardPos, tags[i + 1].pos - forwardPos, datum, data));
  }
 }
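 dataLength = arrayLen(data);
 if(not dataLength){
  //no tags at all: treat the whole input as a single text node
  if(total){
   arrayAppend(data, appendNode(1, total, datum, data));
  }
 }
 else if(data[dataLength].pos + data[dataLength].len lte total){
  //trailing text after the last tag, running from forwardPos to the end of the input
  forwardPos = data[dataLength].pos + data[dataLength].len;
  arrayAppend(data, appendNode(forwardPos, total - forwardPos + 1, datum, data));
 }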
 return data; 
}
//returns struct
function appendNode(pos, len, context, build){
 var node = structnew();
 //tag name: a letter followed by optional letters or digits, so h1 to h6 are told apart
 var typeReg = "<[\/]?([a-z][a-z0-9]*)";
 var typeResult = structNew();
 var buildLength = arrayLen(build);
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 node['context'] = mid(arguments.context, node.pos, node.len);
 node['isTag'] = reFind("<[^>]+>", node.context, 1, false) eq 1;
 if(node.isTag){
  typeResult = reFindNoCase(typeReg, node.context, 1, true);
  //comments and doctypes have no tag name; give them an empty type instead of erroring
  node['type'] = "";
  if(arrayLen(typeResult.pos) gte 2 and typeResult.pos[2]){
   node['type'] = mid(node.context, typeResult.pos[2], typeResult.len[2]);
  }
  node['isCloser'] = left(node.context, 2) eq "</";
  //void elements (<br>, <img>, ...) never get a closing tag, so count them as self-terminating even without "/>"
  node['isSelfTerminating'] = right(node.context, 2) eq "/>" or listFindNoCase("br,hr,img,input,link,meta", node.type) gt 0;
  node['isOpener'] = len(node.type) gt 0 and not node.isCloser and not node.isSelfTerminating;
  node['isRepeated'] = false;
  if(buildLength and build[buildLength].isTag){
   node['isRepeated'] = build[buildLength].type eq node.type;
  }
 }
 node = applyNodeRule(node);
 return node; 
}
//returns struct
function appendPosition(pos, len){
 var node = structnew();
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 return node; 
}
//cleans out unwanted data
function applyNodeRule(tagContext){
 var datum = arguments.tagContext;
 var context = datum.context;
 var newContext = "";
 var search = structNew();
 if(datum.isTag){
  switch(datum.type){
   case "img":
    //keep only the src attribute; the single capture group is the full attribute value
    search = reFindNoCase("<img\s[^>]*src\s*=\s*['""]([^'""]+)['""][^>]*>", context, 1, true);
    if(arrayLen(search.pos) gte 2 and search.pos[2]){
     newContext = "<img src=""" & mid(context, search.pos[2], search.len[2]) & """";
     //keep the original closing style, "/>" or ">"
     if(right(context, 2) eq "/>"){
      newContext = newContext & " />";
     }
     else{
      newContext = newContext & ">";
     }
    }
    //an <img> without a usable src is dropped (newContext stays empty)
    break;
   case "br":
    //clear repeated <br> tags
	if(datum.isRepeated){
	 newContext = "";
	 break;
	}
	return datum;
    break;
   case "div":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<div>";
	}
    break;
   case "span":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<span>";
	}
    break;
   default:
    return datum;
	break;
  }
  datum.context = newContext;
 }
 return datum;
}
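
A minimal usage sketch, assuming the functions above are already defined on the page (for example in an included <cfscript> block). The variable names and the sample markup are hypothetical:

```cfscript
<cfscript>
// feedEntryHtml is a hypothetical variable; in practice it would hold one entry from the blog feed
feedEntryHtml = '<div class="post"><p style="color:red">Some <b>bold</b> text about decks.</p><br /><br /><img alt="" src="http://example.com/deck.jpg" /></div>';
// keep roughly the first 20 characters of visible text; tags do not count toward the limit
summary = truncateHTMLText(feedEntryHtml, 20);
writeOutput(summary);
</cfscript>
```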

trailblazzyr55

In addition to what you mentioned above, if you want to strip out all empty tags, e.g. <span></span> or even <span><span><span></span></span></span>, use the code below.

I added one more function, stripEmptyTags(), to strip those out.

function truncateHTMLText(context, contentLength){
 var i = 0;
 var maxLen = arguments.contentLength;
 var datum = parseDatum(arguments.context);
 var result = arrayNew(1);
 var limitCount = 0;
 var setBreak = false;
 var matchOpener = arrayNew(1); 
 for(i = 1; i lte arrayLen(datum); i = i + 1){
  if(not datum[i].isTag and limitCount lte maxLen and not setBreak){
   if(limitCount + datum[i].len lte maxLen){
    arrayAppend(result, datum[i].context);
	limitCount = limitCount + datum[i].len;
   }
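   else{
    //left() needs a positive count, so skip it when the limit has already been reached
    if(maxLen - limitCount gt 0){
     arrayAppend(result, left(datum[i].context, maxLen - limitCount));
    }
    arrayAppend(result, "...");
    setBreak = true;
   }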
  }
  else if(not setBreak){
   arrayAppend(result, datum[i].context);
   if(datum[i].isOpener){
    arrayAppend(matchOpener, datum[i].type);
   }
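   //only pop when something is open, so a stray closing tag cannot cause an error
   if(datum[i].isCloser and arrayLen(matchOpener)){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }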
  }
  else if(setBreak and datum[i].isTag and arrayLen(matchOpener) and (datum[i].isCloser or datum[i].isSelfTerminating)){
   arrayAppend(result, datum[i].context);
   if(datum[i].isCloser){
    arrayDeleteAt(matchOpener, arrayLen(matchOpener));
   }
  }
  else if(not arrayLen(matchOpener)){
   break;
  }
 }
 return stripEmptyTags(arrayToList(result, ""));
}
//returns array
function parseDatum(context){
 var datum = arguments.context;
 var tags = arrayNew(1);
 var data = arrayNew(1);
 var idx = 1;
 var i = 0;
 var total = len(datum);
 var reg = "<[^>]+>";
 var result = structNew();
 var tagsLength = 0;
 var forwardPos = 0;
 var dataLength = 0;
 while(idx lte total and idx neq 0){
  result = reFindNoCase(reg, datum, idx, true);
  idx = result.pos[1] + result.len[1];
  if(idx neq 0){
   arrayAppend(tags, appendPosition(result.pos[1], result.len[1]));
  }
 }
 tagsLength = arrayLen(tags);
 if(tagsLength and tags[1].pos neq 1){
  arrayAppend(data, appendNode(1, tags[1].pos - 1, datum, data));
 }
 for(i = 1; i lte tagsLength; i = i + 1){
  arrayAppend(data, appendNode(tags[i].pos, tags[i].len, datum, data));
  if(i neq tagsLength and tags[i].pos + tags[i].len neq tags[i + 1].pos){
   forwardPos = tags[i].pos + tags[i].len;
   arrayAppend(data, appendNode(forwardPos, tags[i + 1].pos - forwardPos, datum, data));
  }
 }
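 dataLength = arrayLen(data);
 if(not dataLength){
  //no tags at all: treat the whole input as a single text node
  if(total){
   arrayAppend(data, appendNode(1, total, datum, data));
  }
 }
 else if(data[dataLength].pos + data[dataLength].len lte total){
  //trailing text after the last tag, running from forwardPos to the end of the input
  forwardPos = data[dataLength].pos + data[dataLength].len;
  arrayAppend(data, appendNode(forwardPos, total - forwardPos + 1, datum, data));
 }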
 return data; 
}
//returns struct
function appendNode(pos, len, context, build){
 var node = structnew();
 //tag name: a letter followed by optional letters or digits, so h1 to h6 are told apart
 var typeReg = "<[\/]?([a-z][a-z0-9]*)";
 var typeResult = structNew();
 var buildLength = arrayLen(build);
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 node['context'] = mid(arguments.context, node.pos, node.len);
 node['isTag'] = reFind("<[^>]+>", node.context, 1, false) eq 1;
 if(node.isTag){
  typeResult = reFindNoCase(typeReg, node.context, 1, true);
  //comments and doctypes have no tag name; give them an empty type instead of erroring
  node['type'] = "";
  if(arrayLen(typeResult.pos) gte 2 and typeResult.pos[2]){
   node['type'] = mid(node.context, typeResult.pos[2], typeResult.len[2]);
  }
  node['isCloser'] = left(node.context, 2) eq "</";
  //void elements (<br>, <img>, ...) never get a closing tag, so count them as self-terminating even without "/>"
  node['isSelfTerminating'] = right(node.context, 2) eq "/>" or listFindNoCase("br,hr,img,input,link,meta", node.type) gt 0;
  node['isOpener'] = len(node.type) gt 0 and not node.isCloser and not node.isSelfTerminating;
  node['isRepeated'] = false;
  if(buildLength and build[buildLength].isTag){
   node['isRepeated'] = build[buildLength].type eq node.type;
  }
 }
 node = applyNodeRule(node);
 return node; 
}
//returns struct
function appendPosition(pos, len){
 var node = structnew();
 node['pos'] = arguments.pos;
 node['len'] = arguments.len;
 return node; 
}
//cleans out unwanted data
function applyNodeRule(tagContext){
 var datum = arguments.tagContext;
 var context = datum.context;
 var newContext = "";
 var search = structNew();
 if(datum.isTag){
  switch(datum.type){
   case "img":
    //keep only the src attribute; the single capture group is the full attribute value
    search = reFindNoCase("<img\s[^>]*src\s*=\s*['""]([^'""]+)['""][^>]*>", context, 1, true);
    if(arrayLen(search.pos) gte 2 and search.pos[2]){
     newContext = "<img src=""" & mid(context, search.pos[2], search.len[2]) & """";
     //keep the original closing style, "/>" or ">"
     if(right(context, 2) eq "/>"){
      newContext = newContext & " />";
     }
     else{
      newContext = newContext & ">";
     }
    }
    //an <img> without a usable src is dropped (newContext stays empty)
    break;
   case "br":
    //clear repeated <br> tags
	if(datum.isRepeated){
	 newContext = "";
	 break;
	}
	return datum;
    break;
   case "div":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<div>";
	}
    break;
   case "span":
	newContext = context;
	//remove all attributes
	if(datum.isOpener){
     newContext = "<span>";
	}
    break;
   default:
    return datum;
	break;
  }
  datum.context = newContext;
 }
 return datum;
}
//remove all empty tags
function stripEmptyTags(context){
 var datum = arguments.context;
 var emptyTagRegex = "<(\w+)>(\s|&nbsp;)*</\1>";
 datum = reReplaceNoCase(datum, emptyTagRegex, "", "all");
 if(reFindNoCase(emptyTagRegex, datum, 1, false)){
  datum = stripEmptyTags(datum);
 }
 return datum;
}
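
A small sketch of how stripEmptyTags() behaves on nested empty tags, assuming the function above is already defined. The variable `markup` is hypothetical:

```cfscript
<cfscript>
// hypothetical input with nested empty tags
markup = "<p>Hi</p><span><span></span></span><div>&nbsp;</div>";
// pass 1 removes the inner <span></span> and the <div>&nbsp;</div>;
// the recursive call then removes the outer <span></span> that was left behind
writeOutput(htmlEditFormat(stripEmptyTags(markup)));
// shows in the browser as: <p>Hi</p>
</cfscript>
```

Note that the pattern only matches opening tags without attributes, so an empty <span class="x"></span> is removed only after its attributes have been stripped, which applyNodeRule() already does for <span> and <div>.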

MichaelEvangelista

ASKER

thanks so much. It might take me a bit to get back to that project and mess with it again, but i will be back!

MichaelEvangelista

ASKER

Excellent. Beautiful. 1000 points if i could. Thanks SO much!