JavaScript/JQuery RegEx problem...

Hi,

I'm trying to use JavaScript/JQuery to apply a RegEx to a string, but I think I'm out of my league and/or I need more than RegEx alone.

The basic premise is that the string will be markup text - HTML, XML, or something with custom tags.

I want to remove ALL markup from the string - everything enclosed in < and >. However, I want to allow certain tags to remain. <b> and </b>, for instance, are ok. So are <i> and </i>, and a few others.

So, I know it's pretty simple to apply a RegEx that removes everything between < and >. But how do I create a library of tags I want it to ignore?

I would appreciate the answer in code so there's no ambiguity. For the RegEx, I was using this from RegEx Library:

</?(\w+)(\s*\w*\s*=\s*("[^"]*"|'[^']'|[^>]*))*|/?>

A little long, but it matches tags with or without attribute(s) enclosed in single or double quotes. If you know of a better one for this purpose, please use it.

I would show my code ... but it's a mess. I'm over-thinking it, and it's not working. I know there's a simpler way to do this.

One thought I had was to change < and > to [ and ] for all the tags I wanted to keep, then run the RegEx replace, and then change them back. HOWEVER, that would also change ordinary [ and ], possibly messing up the original text.

Any ideas?

Thanks!
Asked by: CAS-IT

mkrohn commented:
You are very close to the solution. I would replace the tags I want to keep with something like @[@ and @]@,
so you don't end up replacing the pre-existing [ and ].
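
A minimal sketch of that marker idea, assuming only attribute-free tags need to be kept and that the text never contains the literal markers @[@ or @]@. The function name stripTagsExcept is made up for illustration, and the tag names passed in are assumed to contain no regex special characters:

```javascript
// Sketch only: stripTagsExcept is a hypothetical helper name.
function stripTagsExcept(html, keep) {
  // Matches <b>, </b>, <i>, ... for the attribute-free tags listed in `keep`.
  var keepRe = new RegExp("<(\\/?(?:" + keep.join("|") + "))>", "gi");
  return html
    .replace(keepRe, "@[@$1@]@")         // 1. protect the tags to keep
    .replace(/<[^>]*>/g, "")             // 2. remove every remaining tag
    .replace(/@\[@(.*?)@\]@/g, "<$1>");  // 3. restore the protected tags
}

var cleaned = stripTagsExcept(
  '<div class="x"><b>Bold</b> [sic] <span>text</span></div>',
  ["b", "i"]
);
// cleaned === "<b>Bold</b> [sic] text"
```

Ordinary [ and ] in the text pass through untouched because only the full @[@ ... @]@ sequence is converted back.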

CAS-IT (author) commented:
Well, the other part is that I'm not very good with this, and my code isn't working.

So, I was hoping someone could throw down what they think the code should be ... get me close, and then I can take it from there.
Morphor commented:
I would just check the string if it's an allowed tag and if it isn't then replace it...

function tagReplace(yourtag) {
	var ignore = new RegExp("<\/(b|i|u)>"); // Tags you wish to keep
	var replaceIt = new RegExp("<([^>]+)>");
	if (yourtag.exec(ignore) == null) {
		return (yourtag.replace(replaceIt));
	}
}

Morphor commented:
Sorry, the function should be:
function tagReplace(yourtag) {
	var ignore = new RegExp("<\/?(b|i|u|body)>"); // Tags you wish to keep
	var replaceIt = new RegExp("([^<>]+)"); // Everything between < and >
	if (ignore.test(yourtag) == false) {
		return (yourtag.replace(replaceIt, ''));
	} else return yourtag;
}

How to use:
tagReplace('<i>'); // returns '<i>'
tagReplace('</body>'); // returns '</body>'
tagReplace('<html>'); // returns '<>'

CAS-IT (author) commented:
Yeah that's really close, but that works by passing the function a tag.

I'm going to be passing the function a big string of html.

How would I get your function to work on a long string of html rather than a single tag?
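
One way to extend a per-tag check to a whole string is to let String.prototype.replace call a function for every tag it finds. This is only a sketch: stripDisallowedTags and ALLOWED are hypothetical names, and kept tags pass through with whatever attributes they carry.

```javascript
// Sketch only: ALLOWED and stripDisallowedTags are hypothetical names.
var ALLOWED = ["b", "i", "u"];

function stripDisallowedTags(html) {
  // Match any opening or closing tag and capture its name;
  // the callback then decides tag by tag whether to keep it.
  return html.replace(/<\/?([a-z][a-z0-9]*)\b[^>]*>/gi, function (tag, name) {
    return ALLOWED.indexOf(name.toLowerCase()) !== -1 ? tag : "";
  });
}

var out = stripDisallowedTags(
  "<html><body><p>Some <b>bold</b> and <i>italic</i> text</p></body></html>"
);
// out === "Some <b>bold</b> and <i>italic</i> text"
```

The list of tags to keep lives in one array, so adding or removing a tag does not require touching the regular expression.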
Terry Woods (IT Guru) commented:
This seems to work for me. In my example, the list of tags you want to keep is (b|u|i|h1)
<script type="text/javascript">
  var re = /<\/?(?!(b|u|i|h1)\W)(\w+)(\s*\w+\s*=\s*("[^"]*"|'[^']*'|[^>]*)|\w+)*\s*\/?>/gi;
  var sourcestring = "source string to match with pattern";
  var replacementpattern = "";
  var result = sourcestring.replace(re, replacementpattern);
  alert("result = " + result);
</script>
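
The negative lookahead is what does the work here: (?!(b|u|i|h1)\W) refuses to match when the tag name is exactly one of the kept names. As a rough sketch, the same idea can be built from an array of tag names. buildStripRegex is a hypothetical helper, the pattern is a simplified variant rather than the exact expression above, and the names are assumed to contain no regex special characters:

```javascript
// Sketch only: buildStripRegex is a hypothetical helper name.
function buildStripRegex(keepTags) {
  // (?!(?:b|u|i|h1)\W) fails when the tag name is exactly a kept name,
  // so <b> and </b> survive while <br> or <body> still match and get removed.
  return new RegExp(
    "<\\/?(?!(?:" + keepTags.join("|") + ")\\W)\\w+[^>]*>",
    "gi"
  );
}

var re2 = buildStripRegex(["b", "u", "i", "h1"]);
var sample = '<div id="main"><h1>Title</h1><p>Some <b>bold</b> text<br></p></div>';
var stripped = sample.replace(re2, "");
// stripped === "<h1>Title</h1>Some <b>bold</b> text"
```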

CAS-IT (author) commented:
Sorry for the late reply - lost my password ;)

So, we did finally solve this, and here's what we did:
// A function that goes through the user's text and makes
// sure the right formatting is applied.
function FixContentText(thePanelID) {
	$(document).ready(function() {
		
		var thePanel = document.getElementById(String(thePanelID));


		var theOldHTML = thePanel.innerHTML;
		var theNewHTML = theOldHTML;
		
		// As kind of a pre-processing, many browsers use
		// <div> when it should be <p> So, I'm going to change
		// all <div> to <p>, and while that may add some
		// baggage to the page, it should help preserve spacing.
		// I'm only going to replace raw <div> tags. Anything
		// with attributes will not be coming over.
		theNewHTML = theNewHTML.replace(/<div>/gi, "<p>");
		theNewHTML = theNewHTML.replace(/<\/div>/gi, "</p>");

		// Ok, so, the first thing we're going to do is kill all
		// of the bad html. But, some of the html is good. This
		// may not be the best way, but we're going to change all
		// of the good html tags to something that won't get caught
		// by the filter, then run the filter, then change those
		// good fields back. The good fields are:
		// <b> and </b> and <strong> and </strong>
		// <i> and </i> and <em> and </em>
		// <a href=''> and </a> **** Still working on how to do this...
		// <ul> and </ul>
		// <li> and </li>
		// <ol> and </ol>
		// <p> and </p>
		// <br> and <br/> and <br />
		theNewHTML = theNewHTML.replace(/<b>/gi, "~~~b~~~");
		theNewHTML = theNewHTML.replace(/<\/b>/gi, "~~~/b~~~");
		theNewHTML = theNewHTML.replace(/<strong>/gi, "~~~strong~~~");
		theNewHTML = theNewHTML.replace(/<\/strong>/gi, "~~~/strong~~~");
		theNewHTML = theNewHTML.replace(/<i>/gi, "~~~i~~~");
		theNewHTML = theNewHTML.replace(/<\/i>/gi, "~~~/i~~~");
		theNewHTML = theNewHTML.replace(/<em>/gi, "~~~em~~~");
		theNewHTML = theNewHTML.replace(/<\/em>/gi, "~~~/em~~~");
		theNewHTML = theNewHTML.replace(/<ul>/gi, "~~~ul~~~");
		theNewHTML = theNewHTML.replace(/<\/ul>/gi, "~~~/ul~~~");
		theNewHTML = theNewHTML.replace(/<li>/gi, "~~~li~~~");
		theNewHTML = theNewHTML.replace(/<\/li>/gi, "~~~/li~~~");
		theNewHTML = theNewHTML.replace(/<ol>/gi, "~~~ol~~~");
		theNewHTML = theNewHTML.replace(/<\/ol>/gi, "~~~/ol~~~");
		theNewHTML = theNewHTML.replace(/<p>/gi, "~~~p~~~");
		theNewHTML = theNewHTML.replace(/<\/p>/gi, "~~~/p~~~");
		theNewHTML = theNewHTML.replace(/<br>/gi, "~~~br~~~");
		theNewHTML = theNewHTML.replace(/<br\/>/gi, "~~~br/~~~");
		theNewHTML = theNewHTML.replace(/<br \/>/gi, "~~~br /~~~");
		// Only protect real <a ...> tags, not <abbr>, <address>, etc.
		theNewHTML = theNewHTML.replace(/<a(?=[\s>])/gi, "~~~a~~~");
		theNewHTML = theNewHTML.replace(/<\/a>/gi, "~~~/a~~~");
		theNewHTML = theNewHTML.replace(/<h(1|2)(\s[^>]*)?>/gi, "~~~h4~~~");
		theNewHTML = theNewHTML.replace(/<\/h(1|2)>/gi, "~~~/h4~~~");
		theNewHTML = theNewHTML.replace(/<h(3|4|5|6)(\s[^>]*)?>/gi, "~~~h5~~~");
		theNewHTML = theNewHTML.replace(/<\/h(3|4|5|6)>/gi, "~~~/h5~~~");
		

		// Ok, now we run it through the HTML filter and take out
		// all of the HTML.
		theNewHTML = theNewHTML.replace(/<\/?[^>]+(>|$)/ig, "");

		// Now we work our way back, and re-replace all of the good tags
		// We will also do some substitutions here:
		// <b> becomes <strong>
		// <i> becomes <em>
		// <br> and <br/> become <br />
		theNewHTML = theNewHTML.replace(/~~~b~~~/gi, "<strong>");
		theNewHTML = theNewHTML.replace(/~~~\/b~~~/gi, "</strong>");
		theNewHTML = theNewHTML.replace(/~~~strong~~~/gi, "<strong>");
		theNewHTML = theNewHTML.replace(/~~~\/strong~~~/gi, "</strong>");
		theNewHTML = theNewHTML.replace(/~~~i~~~/gi, "<em>");
		theNewHTML = theNewHTML.replace(/~~~\/i~~~/gi, "</em>");
		theNewHTML = theNewHTML.replace(/~~~em~~~/gi, "<em>");
		theNewHTML = theNewHTML.replace(/~~~\/em~~~/gi, "</em>");
		theNewHTML = theNewHTML.replace(/~~~ul~~~/gi, "<ul>");
		theNewHTML = theNewHTML.replace(/~~~\/ul~~~/gi, "</ul>");
		theNewHTML = theNewHTML.replace(/~~~li~~~/gi, "<li>");
		theNewHTML = theNewHTML.replace(/~~~\/li~~~/gi, "</li>");
		theNewHTML = theNewHTML.replace(/~~~ol~~~/gi, "<ol>");
		theNewHTML = theNewHTML.replace(/~~~\/ol~~~/gi, "</ol>");
		theNewHTML = theNewHTML.replace(/~~~p~~~/gi, "<p>");
		theNewHTML = theNewHTML.replace(/~~~\/p~~~/gi, "</p>");
		theNewHTML = theNewHTML.replace(/~~~br~~~/gi, "<br />");
		theNewHTML = theNewHTML.replace(/~~~br\/~~~/gi, "<br />");
		theNewHTML = theNewHTML.replace(/~~~br \/~~~/gi, "<br />");
		theNewHTML = theNewHTML.replace(/~~~a~~~/gi, "<a");
		theNewHTML = theNewHTML.replace(/~~~\/a~~~/gi, "</a>");
		theNewHTML = theNewHTML.replace(/~~~h4~~~/gi, "<h4>");
		theNewHTML = theNewHTML.replace(/~~~\/h4~~~/gi, "</h4>");
		theNewHTML = theNewHTML.replace(/~~~h5~~~/gi, "<h5>");
		theNewHTML = theNewHTML.replace(/~~~\/h5~~~/gi, "</h5>");



		// Bad empty paragraph attempts by the various browsers.
		// Our goal is <p>&nbsp;</p>
		theNewHTML = theNewHTML.replace(/<div><br *\/><\/div>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<p><br *\/><\/p>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<p>&#160;<\/p>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<p> *<\/p>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<br *\/><br *\/>/gi, "<p>&nbsp;</p>");

		// Empty <a href=""></a> and <p></p> tags
		theNewHTML = theNewHTML.replace(/<a href=[^>]*> *<\/a>/gim, "");
		theNewHTML = theNewHTML.replace(/<p> *<\/p>/gim, "");

		// Repeating white space. Basically, a bunch of repeating
		// <p> tags that cause an extended white space.
		theNewHTML = theNewHTML.replace(/((<p>(&#160;|&nbsp;| *)<\/p>)[\s\r\n\f]*){2,}/gim, "<p>&nbsp;</p>");
	
		// Duplicates. Specifically <p> and </p>
		theNewHTML = theNewHTML.replace(/(\s*<p>\s*)+/gim, "<p>");
		theNewHTML = theNewHTML.replace(/(\s*<\/p>\s*)+/gim, "</p>");

		// HTML Source Formatting.
		// First, remove all linebreaks.
		theNewHTML = theNewHTML.replace(/(\r\n|\n|\r)/gim, "");
		
		// Next, remove excess whitespace
		theNewHTML = theNewHTML.replace(/ +/gim, " ");
		theNewHTML = theNewHTML.replace(/\s{2,}/gim, " ");
		
		// Then, we want to put back line breaks, but only on the 
		// things we want the line breaks on.
		theNewHTML = theNewHTML.replace(/<\/div>/gim, "</div>\r\n");
		theNewHTML = theNewHTML.replace(/<\/span>/gim, "</span>\r\n");
		theNewHTML = theNewHTML.replace(/<\/p>/gim, "</p>\r\n");
		theNewHTML = theNewHTML.replace(/<\/h3>/gim, "</h3>\r\n");
		theNewHTML = theNewHTML.replace(/<\/h4>/gim, "</h4>\r\n");
		theNewHTML = theNewHTML.replace(/<br\s+\/?>/gim, "<br />\r\n");
		theNewHTML = theNewHTML.replace(/<hr\s+\/>/gim, "<hr />\r\n");
		theNewHTML = theNewHTML.replace(/<ul>/gim, "<ul>\r\n");
		theNewHTML = theNewHTML.replace(/<\/ul>/gim, "</ul>\r\n");
		theNewHTML = theNewHTML.replace(/<\/li>/gim, "</li>\r\n");

		
		// Finally, re-apply this HTML to the panel. Or, if nothing
		// has changed, do nothing.
		if (theOldHTML !== theNewHTML) {
			// If the string doesn't contain a <p> tag anywhere,
			// then we want to enclose the entire thing with one.
			if (theNewHTML.indexOf("<p>") === -1) {
				theNewHTML = "<p>" + theNewHTML + "</p>";
			}
			thePanel.innerHTML = theNewHTML;
		}

		//alert(theOldHTML);
		//alert(theNewHTML);

	});
}
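
The long run of paired replace calls could probably be driven from a lookup table instead. The following is a rough sketch, not the code we actually run: TAG_MAP and normalizeTags are hypothetical names, the table covers only a subset of the tags above, and it handles attribute-free tags only (no <a href> or <br> handling).

```javascript
// Sketch only: TAG_MAP and normalizeTags are hypothetical names.
// Keys are the tags to keep; values are what each one is rewritten to.
var TAG_MAP = {
  b: "strong", strong: "strong", i: "em", em: "em",
  p: "p", ul: "ul", ol: "ol", li: "li", h1: "h4", h2: "h4"
};

function normalizeTags(html) {
  var out = html;
  // 1. Protect: <b> or </b> becomes ~~~strong~~~ or ~~~/strong~~~, and so on.
  Object.keys(TAG_MAP).forEach(function (tag) {
    var re = new RegExp("<(\\/?)" + tag + ">", "gi");
    out = out.replace(re, "~~~$1" + TAG_MAP[tag] + "~~~");
  });
  // 2. Strip everything that still looks like a tag.
  out = out.replace(/<\/?[^>]+(>|$)/g, "");
  // 3. Restore the protected tags.
  return out.replace(/~~~(\/?[a-z0-9]+)~~~/gi, "<$1>");
}

var normalized = normalizeTags(
  '<div><h1>Title</h1><span style="x"><b>Bold</b> and <i>italic</i></span></div>'
);
// normalized === "<h4>Title</h4><strong>Bold</strong> and <em>italic</em>"
```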

Question has a verified solution.
