Solved

JavaScript/JQuery RegEx problem...

Posted on 2011-10-03
Last Modified: 2012-05-12
Hi,

I'm trying to use JavaScript/JQuery to apply a RegEx to a string, but I think I'm out of my league and/or I need more than RegEx alone.

The basic premise is that the string will be markup text - HTML, XML, or something with custom tags.

I want to remove ALL markup from the string - everything enclosed in < and >. However, I want to allow certain tags to remain. <b> and </b>, for instance, are OK, as are <i> and </i>, and a few others.

So, I know it's pretty simple to apply a RegEx that removes everything between < and >. But how do I create a library of tags I want it to ignore?

I would appreciate the answer in code so there's no ambiguity. For the RegEx, I was using this from RegEx Library:

</?(\w+)(\s*\w*\s*=\s*("[^"]*"|'[^']'|[^>]*))*|/?>

A little long, but it matches tags with or without attribute(s) enclosed in single or double quotes. If you know of a better one for this purpose, please use it.

I would show my code ... but it's a mess. I'm over-thinking it, and it's not working. I know there's a simpler way to do this.

One thought I had was to change < and > to [ and ] for all the tags I wanted to keep, then run the RegEx replace, and then change them back. HOWEVER, that would also change ordinary [ and ], possibly messing up the original text.

Any ideas?

Thanks!
Question by:CAS-IT

Accepted Solution

by:
mkrohn earned 664 total points
ID: 36903670
You are very close to the solution. I would replace the tags I want to keep with something like @[@ and @]@, so you don't end up replacing the preexisting [ and ].
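A minimal sketch of that placeholder idea, assuming only bare tags (no attributes) need to be kept. The function name stripExcept and the keepTags parameter are made up for illustration, and the sketch also assumes the text never contains the literal markers @[@ or @]@:

```javascript
// Hypothetical helper: stripExcept and keepTags are illustrative names,
// not part of jQuery or any library.
function stripExcept(html, keepTags) {
  var names = keepTags.join("|");
  // Step 1: protect bare allowed tags (<b>, </b>, <br />, ...) by
  // swapping their angle brackets for the @[@ and @]@ markers.
  var allowed = new RegExp("<(\\/?(?:" + names + ")\\s*\\/?)>", "gi");
  var text = html.replace(allowed, "@[@$1@]@");
  // Step 2: remove every tag that is still wrapped in < and >.
  text = text.replace(/<[^>]*>/g, "");
  // Step 3: turn the markers back into angle brackets.
  return text.replace(/@\[@/g, "<").replace(/@\]@/g, ">");
}

var sample = '<div><b>Hi</b> <span class="x">there</span> [ok]<br /></div>';
var cleaned = stripExcept(sample, ["b", "i", "br"]);
// cleaned is '<b>Hi</b> there [ok]<br />'
```

Ordinary [ and ] in the text pass through untouched because only the three-character markers are converted back. An allowed tag that carries attributes, such as <b class="x">, is not protected by this sketch and gets stripped.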


Author Comment

by:CAS-IT
ID: 36903690
Well, the other part is that I'm not very good with this, and my code isn't working.

So, I was hoping someone could throw down what they think the code should be ... get me close, and then I can take it from there.

Assisted Solution

by:Morphor
Morphor earned 668 total points
ID: 36904101
I would just check whether the string is an allowed tag, and replace it if it isn't...

function tagReplace(yourtag) {
	var ignore = new RegExp("<\/(b|i|u)>"); // Tags you wish to keep
	var replaceIt = new RegExp("<([^>]+)>");
	if (yourtag.exec(ignore) == null) {
		return (yourtag.replace(replaceIt));
	}
}


Expert Comment

by:Morphor
ID: 36904192
Sorry, the function should be:
function tagReplace(yourtag) {
	var ignore = new RegExp("<\/?(b|i|u|body)>"); // Tags you wish to keep
	var replaceIt = new RegExp("([^<>]+)");
	if (ignore.test(yourtag) == false) {
		return (yourtag.replace(replaceIt, ''));
	}
	return yourtag;
}



How to use:
tagReplace('<i>'); // returns '<i>'
tagReplace('</body>'); // returns '</body>'
tagReplace('<html>'); // returns '<>'



Author Comment

by:CAS-IT
ID: 36904877
Yeah that's really close, but that works by passing the function a tag.

I'm going to be passing the function a big string of html.

How would I get your function to work on a long string of html rather than a single tag?

Assisted Solution

by:Terry Woods
Terry Woods earned 668 total points
ID: 36906165
This seems to work for me. In my example, the list of tags you want to keep is (b|u|i|h1)
<script type="text/javascript">
  var re = /<\/?(?!(b|u|i|h1)\W)(\w+)(\s*\w+\s*=\s*("[^"]*"|'[^']*'|[^>]*)|\w+)*\s*\/?>/gi;
  var sourcestring = "source string to match with pattern";
  var replacementpattern = "";
  var result = sourcestring.replace(re, replacementpattern);
  alert("result = " + result);
</script>
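An alternative to one large pattern, sketched here under the assumption that attribute values never contain a > character: match every tag with a simple regex and decide in a replace callback whether to keep it. The names ALLOWED and stripDisallowedTags are hypothetical.

```javascript
// Hypothetical allowlist: tag names that should survive.
var ALLOWED = { b: true, u: true, i: true, h1: true };

function stripDisallowedTags(html) {
  // Match any opening or closing tag and capture its name.
  return html.replace(/<\/?([a-z][a-z0-9]*)\b[^>]*>/gi, function (tag, name) {
    // Keep the tag unchanged (attributes included) if it is allowed,
    // otherwise replace it with nothing.
    return ALLOWED.hasOwnProperty(name.toLowerCase()) ? tag : "";
  });
}

var input = '<h1 class="t">Title</h1><p>Some <b>bold</b> and <span>plain</span></p>';
var output = stripDisallowedTags(input);
// output is '<h1 class="t">Title</h1>Some <b>bold</b> and plain'
```

Adding a tag to the list is then a one-word change to ALLOWED rather than an edit to the regex. Comments, doctype declarations, and anything else that does not start with a tag name are left alone by this sketch.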


Author Comment

by:CAS-IT
ID: 37001702
Sorry for the late reply - lost my password ;)

So, we did finally solve this, and here's what we did:
// A function that goes through the user's text and makes
// sure the right formatting is applied.
function FixContentText(thePanelID) {
	$(document).ready(function() {
		
		var thePanel = document.getElementById(String(thePanelID));


		var theOldHTML = thePanel.innerHTML;
		var theNewHTML = theOldHTML;
		
		// As a pre-processing step: many browsers use <div>
		// where <p> is wanted, so I'm going to change all
		// <div> to <p>. That may add some baggage to the
		// page, but it should help preserve spacing.
		// I'm only going to replace raw <div> tags. Anything
		// with attributes will not be coming over.
		theNewHTML = theNewHTML.replace(/<div\s*>/gi, "<p>");
		theNewHTML = theNewHTML.replace(/<\/div>/gi, "</p>");

		// Ok, so, the first thing we're going to do is kill all
		// of the bad html. But, some of the html is good. This
		// may not be the best way, but we're going to change all
		// of the good html tags to something that won't get caught
		// by the filter, then run the filter, then change those
		// good fields back. The good fields are:
		// <b> and </b> and <strong> and </strong>
		// <i> and </i> and <em> and </em>
		// <a href=''> and </a> **** Still working on how to do this...
		// <ul> and </ul>
		// <li> and </li>
		// <ol> and </ol>
		// <p> and </p>
		// <br> and <br/> and <br />
		theNewHTML = theNewHTML.replace(/<b>/gi, "~~~b~~~");
		theNewHTML = theNewHTML.replace(/<\/b>/gi, "~~~/b~~~");
		theNewHTML = theNewHTML.replace(/<strong>/gi, "~~~strong~~~");
		theNewHTML = theNewHTML.replace(/<\/strong>/gi, "~~~/strong~~~");
		theNewHTML = theNewHTML.replace(/<i>/gi, "~~~i~~~");
		theNewHTML = theNewHTML.replace(/<\/i>/gi, "~~~/i~~~");
		theNewHTML = theNewHTML.replace(/<em>/gi, "~~~em~~~");
		theNewHTML = theNewHTML.replace(/<\/em>/gi, "~~~/em~~~");
		theNewHTML = theNewHTML.replace(/<ul>/gi, "~~~ul~~~");
		theNewHTML = theNewHTML.replace(/<\/ul>/gi, "~~~/ul~~~");
		theNewHTML = theNewHTML.replace(/<li>/gi, "~~~li~~~");
		theNewHTML = theNewHTML.replace(/<\/li>/gi, "~~~/li~~~");
		theNewHTML = theNewHTML.replace(/<ol>/gi, "~~~ol~~~");
		theNewHTML = theNewHTML.replace(/<\/ol>/gi, "~~~/ol~~~");
		theNewHTML = theNewHTML.replace(/<p>/gi, "~~~p~~~");
		theNewHTML = theNewHTML.replace(/<\/p>/gi, "~~~/p~~~");
		theNewHTML = theNewHTML.replace(/<br>/gi, "~~~br~~~");
		theNewHTML = theNewHTML.replace(/<br\/>/gi, "~~~br/~~~");
		theNewHTML = theNewHTML.replace(/<br \/>/gi, "~~~br /~~~");
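		// \b keeps <a ...> from also matching <abbr>, <address>, etc.
		theNewHTML = theNewHTML.replace(/<a\b/gi, "~~~a~~~");
		theNewHTML = theNewHTML.replace(/<\/a>/gi, "~~~/a~~~");
		// Consume the whole heading tag (attributes included) so that
		// no stray ">" is left behind when the tag is restored.
		theNewHTML = theNewHTML.replace(/<h(1|2)\b[^>]*>/gi, "~~~h4~~~");
		theNewHTML = theNewHTML.replace(/<\/h(1|2)>/gi, "~~~/h4~~~");
		theNewHTML = theNewHTML.replace(/<h(3|4|5|6)\b[^>]*>/gi, "~~~h5~~~");
		theNewHTML = theNewHTML.replace(/<\/h(3|4|5|6)>/gi, "~~~/h5~~~");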
		

		// Ok, now we run it through the HTML filter and take out
		// all of the HTML.
		theNewHTML = theNewHTML.replace(/<\/?[^>]+(>|$)/ig, "");

		// Now we work our way back, and re-replace all of the good tags
		// We will also do some substitutions here:
		// <b> becomes <strong>
		// <i> becomes <em>
		// <br> and <br/> become <br />
		theNewHTML = theNewHTML.replace(/~~~b~~~/gi, "<strong>");
		theNewHTML = theNewHTML.replace(/~~~\/b~~~/gi, "</strong>");
		theNewHTML = theNewHTML.replace(/~~~strong~~~/gi, "<strong>");
		theNewHTML = theNewHTML.replace(/~~~\/strong~~~/gi, "</strong>");
		theNewHTML = theNewHTML.replace(/~~~i~~~/gi, "<em>");
		theNewHTML = theNewHTML.replace(/~~~\/i~~~/gi, "</em>");
		theNewHTML = theNewHTML.replace(/~~~em~~~/gi, "<em>");
		theNewHTML = theNewHTML.replace(/~~~\/em~~~/gi, "</em>");
		theNewHTML = theNewHTML.replace(/~~~ul~~~/gi, "<ul>");
		theNewHTML = theNewHTML.replace(/~~~\/ul~~~/gi, "</ul>");
		theNewHTML = theNewHTML.replace(/~~~li~~~/gi, "<li>");
		theNewHTML = theNewHTML.replace(/~~~\/li~~~/gi, "</li>");
		theNewHTML = theNewHTML.replace(/~~~ol~~~/gi, "<ol>");
		theNewHTML = theNewHTML.replace(/~~~\/ol~~~/gi, "</ol>");
		theNewHTML = theNewHTML.replace(/~~~p~~~/gi, "<p>");
		theNewHTML = theNewHTML.replace(/~~~\/p~~~/gi, "</p>");
		theNewHTML = theNewHTML.replace(/~~~br~~~/gi, "<br />");
		theNewHTML = theNewHTML.replace(/~~~br\/~~~/gi, "<br />");
		theNewHTML = theNewHTML.replace(/~~~br \/~~~/gi, "<br />");
		theNewHTML = theNewHTML.replace(/~~~a~~~/gi, "<a");
		theNewHTML = theNewHTML.replace(/~~~\/a~~~/gi, "</a>");
		theNewHTML = theNewHTML.replace(/~~~h4~~~/gi, "<h4>");
		theNewHTML = theNewHTML.replace(/~~~\/h4~~~/gi, "</h4>");
		theNewHTML = theNewHTML.replace(/~~~h5~~~/gi, "<h5>");
		theNewHTML = theNewHTML.replace(/~~~\/h5~~~/gi, "</h5>");



		// Bad empty paragraph attempts by the various browsers.
		// Our goal is <p>&nbsp;</p>
		theNewHTML = theNewHTML.replace(/<div><br *\/><\/div>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<p><br *\/><\/p>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<p>&#160;<\/p>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<p> *<\/p>/gi, "<p>&nbsp;</p>");
		theNewHTML = theNewHTML.replace(/<br *\/><br *\/>/gi, "<p>&nbsp;</p>");

		// Empty <a href=""></a> and <p></p> tags
		theNewHTML = theNewHTML.replace(/<a href=(""|'')> *<\/a>/gim, "");
		theNewHTML = theNewHTML.replace(/<p> *<\/p>/gim, "");

		// Repeating white space. Basically, a bunch of repeating
		// <p> tags that cause an extended white space.
		theNewHTML = theNewHTML.replace(/((<p>(&#160;|&nbsp;| *)<\/p>)[\s\r\n\f]*){2,}/gim, "<p>&nbsp;</p>");
	
		// Duplicates. Specifically <p> and </p>
		theNewHTML = theNewHTML.replace(/(\s*<p>\s*)+/gim, "<p>");
		theNewHTML = theNewHTML.replace(/(\s*<\/p>\s*)+/gim, "</p>");

		// HTML Source Formatting.
		// First, remove all linebreaks.
		theNewHTML = theNewHTML.replace(/(\r\n|\n|\r)/gim, "");
		
		// Next, remove excess whitespace
		theNewHTML = theNewHTML.replace(/ +/gim, " ");
		theNewHTML = theNewHTML.replace(/\s{2,}/gim, " ");
		
		// Then, we want to put back line breaks, but only on the 
		// things we want the line breaks on.
		theNewHTML = theNewHTML.replace(/<\/div>/gim, "</div>\r\n");
		theNewHTML = theNewHTML.replace(/<\/span>/gim, "</span>\r\n");
		theNewHTML = theNewHTML.replace(/<\/p>/gim, "</p>\r\n");
		theNewHTML = theNewHTML.replace(/<\/h3>/gim, "</h3>\r\n");
		theNewHTML = theNewHTML.replace(/<\/h4>/gim, "</h4>\r\n");
		theNewHTML = theNewHTML.replace(/<br\s+\/?>/gim, "<br />\r\n");
		theNewHTML = theNewHTML.replace(/<hr\s+\/>/gim, "<hr />\r\n");
		theNewHTML = theNewHTML.replace(/<ul>/gim, "<ul>\r\n");
		theNewHTML = theNewHTML.replace(/<\/ul>/gim, "</ul>\r\n");
		theNewHTML = theNewHTML.replace(/<\/li>/gim, "</li>\r\n");

		
		// Finally, re-apply this HTML to the panel. Or, if nothing 
		// has changed, do nothing.
		if (theOldHTML == theNewHTML ) { }
		else {
			// If the string doesn't begin with a <p> tag,
			// then we want to enclose the entire thing with one.
			if (theNewHTML.substring(0,3) != "<p>" && theNewHTML.indexOf("<p>") == -1) {
				theNewHTML = "<p>" + String(theNewHTML) + "</p>";
			}
			thePanel.innerHTML = theNewHTML;
		}

		//alert(theOldHTML);
		//alert(theNewHTML);

	});
}
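The protect, strip, and restore passes above could also be collapsed into a single pass driven by a lookup table. This is only a sketch, not the code we actually ran: TAG_MAP and normalizeTags are hypothetical names, and unlike the function above it also normalizes tags that carry attributes instead of stripping them.

```javascript
// Hypothetical table: source tag name -> tag name to emit.
var TAG_MAP = {
  b: "strong", strong: "strong", i: "em", em: "em",
  ul: "ul", ol: "ol", li: "li", p: "p", div: "p",
  h1: "h4", h2: "h4", h3: "h5", h4: "h5", h5: "h5", h6: "h5"
};

function normalizeTags(html) {
  return html.replace(/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi, function (tag, slash, name) {
    name = name.toLowerCase();
    if (name === "br") { return "<br />"; }  // <br>, <br/> and <br /> all become <br />
    if (name === "a") { return tag; }        // keep links as written, href included
    // Emit the mapped tag without attributes, or drop the tag entirely.
    var mapped = TAG_MAP.hasOwnProperty(name) ? TAG_MAP[name] : null;
    return mapped ? "<" + slash + mapped + ">" : "";
  });
}

var raw = '<div><b>Hi</b><br><h1 class="x">T</h1><span>s</span><a href="u">l</a></div>';
var tidy = normalizeTags(raw);
// tidy is '<p><strong>Hi</strong><br /><h4>T</h4>s<a href="u">l</a></p>'
```

Because nothing is swapped to ~~~ markers and back, text that happens to contain ~~~b~~~ is no longer at risk. Keeping <a> tags verbatim means any attribute on a link survives, so a stricter version would rebuild the tag from the href alone.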
