JavaScript/JQuery RegEx problem...

Posted on 2011-10-03
Last Modified: 2012-05-12

I'm trying to use JavaScript/JQuery to apply a RegEx to a string, but I think I'm out of my league and/or I need more than RegEx alone.

The basic premise is that the string will be markup text - HTML, XML, or something with custom tags.

I want to remove ALL markup from the string - everything enclosed in < and >. However, I want to allow certain tags to remain. <b> and </b>, for instance, are OK, as are <i> and </i> and a few others.

So, I know it's pretty simple to apply a RegEx that removes everything between < and >. But how do I create a library of tags I want it to ignore?

I would appreciate the answer in code so there's no ambiguity. For the RegEx, I was using this from RegEx Library:


A little long, but it matches tags with or without attribute(s) enclosed in single or double quotes. If you know of a better one for this purpose, please use it.

I would show my code ... but it's a mess. I'm over-thinking it, and it's not working. I know there's a simpler way to do this.

One thought I had was to change < and > to [ and ] for all the tags I wanted to keep, then run the RegEx replace, and then change them back. HOWEVER, that would also change ordinary [ and ], possibly messing up the original text.

Any ideas?

Question by:CAS-IT
    LVL 3

    Accepted Solution

    You are very close to the solution. I would replace the brackets of the tags I want to keep with something like @[@ and @]@,
    so you don't end up converting the pre-existing [ and ] in the text when you change them back.
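A minimal sketch of that placeholder idea, in case it helps. The function name `stripTagsExcept`, its parameters, and the exact token format are assumptions made for illustration; it only keeps bare tags such as <b> and </b>, and it assumes the tag names contain nothing but letters and digits.

```javascript
// Hypothetical helper (not from the thread): keeps bare allowed tags, strips the rest.
function stripTagsExcept(html, allowedTags) {
  var names = allowedTags.join("|"); // e.g. "b|i|u"
  // Step 1: hide allowed bare tags such as <b> or </b> behind @[@...@]@ tokens.
  var keep = new RegExp("<(\\/?(?:" + names + "))>", "gi");
  var text = html.replace(keep, "@[@$1@]@");
  // Step 2: remove every remaining tag.
  text = text.replace(/<[^>]*>/g, "");
  // Step 3: turn the tokens back into tags (only tokens naming an allowed tag).
  var restore = new RegExp("@\\[@(\\/?(?:" + names + "))@\\]@", "gi");
  return text.replace(restore, "<$1>");
}

// Ordinary [ and ] in the text are left alone:
// stripTagsExcept('<div><b>Hi</b> [x] <span class="y">there</span></div>', ["b", "i"])
// returns '<b>Hi</b> [x] there'
```

Allowed tags that carry attributes (for example <b class="x">) are removed by step 2 in this sketch, because step 1 only matches the bare form.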


    Author Comment

    Well, the other part is that I'm not very good with this, and my code isn't working.

    So, I was hoping someone could throw down what they think the code should be ... get me close, and then I can take it from there.
    LVL 3

    Assisted Solution

    I would just check whether the string is an allowed tag and, if it isn't, replace it...

    function tagReplace(yourtag) {
    	var ignore = new RegExp("<\/(b|i|u)>"); // Tags you wish to keep
    	var replaceIt = new RegExp("<([^>]+)>");
    	if (yourtag.exec(ignore) == null) {
    		return (yourtag.replace(replaceIt));
    	}
    }


    LVL 3

    Expert Comment

    Sorry, the function should be:
    function tagReplace(yourtag) {
    	var ignore = new RegExp("<\/?(b|i|u|body)>"); // Tags you wish to keep
    	var replaceIt = new RegExp("([^<>]+)"); // The text between < and >
    	if (ignore.test(yourtag) == false) {
    		return (yourtag.replace(replaceIt, ''));
    	} else return yourtag;
    }


    How to use:
    tagReplace('<i>'); // returns '<i>'
    tagReplace('</body>'); // returns '</body>'
    tagReplace('<html>'); // returns '<>'



    Author Comment

    Yeah that's really close, but that works by passing the function a tag.

    I'm going to be passing the function a big string of html.

    How would I get your function to work on a long string of html rather than a single tag?
    LVL 34

    Assisted Solution

    by:Terry Woods
    This seems to work for me. In my example, the list of tags you want to keep is (b|u|i|h1)
    <script type="text/javascript">
      // The negative lookahead (?!(b|u|i|h1)\W) skips the tags you want to keep.
      var re = /<\/?(?!(b|u|i|h1)\W)(\w+)(\s*\w+\s*=\s*("[^"]*"|'[^']*'|[^>]*)|\w+)*\s*\/?>/gi;
      var sourcestring = "source string to match with pattern";
      var replacementpattern = "";
      var result = sourcestring.replace(re, replacementpattern);
      alert("result = " + result);
    </script>

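Another way to get a "library" of tags to keep is to match every tag once and decide per tag in a replace callback. This is only a sketch: `filterTags` and its parameters are hypothetical names, and it assumes no attribute value contains a > character.

```javascript
// Hypothetical helper (not from the thread): keeps tags whose name is in `allowed`.
function filterTags(html, allowed) {
  // Build a lookup table of the allowed tag names.
  var allow = {};
  for (var i = 0; i < allowed.length; i++) {
    allow[allowed[i].toLowerCase()] = true;
  }
  // Matches an opening or closing element tag and captures its name.
  var anyTag = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
  return html.replace(anyTag, function (tag, name) {
    // Keep the tag untouched if its name is allowed, otherwise drop it.
    return allow[name.toLowerCase()] === true ? tag : "";
  });
}

// filterTags('<p class="x">A <b>bold</b> and <i>italic</i> <a href="#">link</a></p>', ["b", "i"])
// returns 'A <b>bold</b> and <i>italic</i> link'
```

Kept tags keep their attributes as written, and things that do not start with a tag name (comments, for example) are not touched by this pattern.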


    Author Comment

    Sorry for the late reply - lost my password ;)

    So, we did finally solve this, and here's what we did:
    // A function that goes through the user's text and makes
    // sure the right formatting is applied.
    function FixContentText(thePanelID) {
    	$(document).ready(function() {
    		var thePanel = document.getElementById(String(thePanelID));
    		var theOldHTML = thePanel.innerHTML;
    		var theNewHTML = theOldHTML;
    		// As a kind of pre-processing: many browsers use
    		// <div> where it should be <p>. So, I'm going to change
    		// all <div> to <p>, and while that may add some
    		// baggage to the page, it should help preserve spacing.
    		// I'm only going to replace raw <div> tags. Anything
    		// with attributes will not be coming over.
    		theNewHTML = theNewHTML.replace(/<div>/gi, "<p>");
    		theNewHTML = theNewHTML.replace(/<\/div>/gi, "</p>");
    		// Ok, so, the first thing we're going to do is kill all
    		// of the bad html. But, some of the html is good. This
    		// may not be the best way, but we're going to change all
    		// of the good html tags to something that won't get caught
    		// by the filter, then run the filter, then change those
    		// good tags back. The good tags are:
    		// <b> and </b> and <strong> and </strong>
    		// <i> and </i> and <em> and </em>
    		// <a href=''> and </a> **** Still working on how to do this...
    		// <ul> and </ul>
    		// <li> and </li>
    		// <ol> and </ol>
    		// <p> and </p>
    		// <br> and <br/> and <br />
    		theNewHTML = theNewHTML.replace(/<b>/gi, "~~~b~~~");
    		theNewHTML = theNewHTML.replace(/<\/b>/gi, "~~~/b~~~");
    		theNewHTML = theNewHTML.replace(/<strong>/gi, "~~~strong~~~");
    		theNewHTML = theNewHTML.replace(/<\/strong>/gi, "~~~/strong~~~");
    		theNewHTML = theNewHTML.replace(/<i>/gi, "~~~i~~~");
    		theNewHTML = theNewHTML.replace(/<\/i>/gi, "~~~/i~~~");
    		theNewHTML = theNewHTML.replace(/<em>/gi, "~~~em~~~");
    		theNewHTML = theNewHTML.replace(/<\/em>/gi, "~~~/em~~~");
    		theNewHTML = theNewHTML.replace(/<ul>/gi, "~~~ul~~~");
    		theNewHTML = theNewHTML.replace(/<\/ul>/gi, "~~~/ul~~~");
    		theNewHTML = theNewHTML.replace(/<li>/gi, "~~~li~~~");
    		theNewHTML = theNewHTML.replace(/<\/li>/gi, "~~~/li~~~");
    		theNewHTML = theNewHTML.replace(/<ol>/gi, "~~~ol~~~");
    		theNewHTML = theNewHTML.replace(/<\/ol>/gi, "~~~/ol~~~");
    		theNewHTML = theNewHTML.replace(/<p>/gi, "~~~p~~~");
    		theNewHTML = theNewHTML.replace(/<\/p>/gi, "~~~/p~~~");
    		theNewHTML = theNewHTML.replace(/<br>/gi, "~~~br~~~");
    		theNewHTML = theNewHTML.replace(/<br\/>/gi, "~~~br/~~~");
    		theNewHTML = theNewHTML.replace(/<br \/>/gi, "~~~br /~~~");
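    		// Only the "<a" prefix is swapped, so the link's attributes pass through the filter.
    		theNewHTML = theNewHTML.replace(/<a(?=[\s>])/gi, "~~~a~~~");
    		theNewHTML = theNewHTML.replace(/<\/a>/gi, "~~~/a~~~");
    		// Match the whole opening heading tag, attributes included, and emit a bare placeholder.
    		theNewHTML = theNewHTML.replace(/<h(1|2)\b[^>]*>/gi, "~~~h4~~~");
    		theNewHTML = theNewHTML.replace(/<\/h(1|2)>/gi, "~~~/h4~~~");
    		theNewHTML = theNewHTML.replace(/<h(3|4|5|6)\b[^>]*>/gi, "~~~h5~~~");
    		theNewHTML = theNewHTML.replace(/<\/h(3|4|5|6)>/gi, "~~~/h5~~~");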
    		// Ok, now we run it through the HTML filter and take out
    		// all of the HTML.
    		theNewHTML = theNewHTML.replace(/<\/?[^>]+(>|$)/ig, "");
    		// Now we work our way back, and re-replace all of the good tags
    		// We will also do some substitutions here:
    		// <b> becomes <strong>
    		// <i> becomes <em>
    		// <br> and <br/> become <br />
    		theNewHTML = theNewHTML.replace(/~~~b~~~/gi, "<strong>");
    		theNewHTML = theNewHTML.replace(/~~~\/b~~~/gi, "</strong>");
    		theNewHTML = theNewHTML.replace(/~~~strong~~~/gi, "<strong>");
    		theNewHTML = theNewHTML.replace(/~~~\/strong~~~/gi, "</strong>");
    		theNewHTML = theNewHTML.replace(/~~~i~~~/gi, "<em>");
    		theNewHTML = theNewHTML.replace(/~~~\/i~~~/gi, "</em>");
    		theNewHTML = theNewHTML.replace(/~~~em~~~/gi, "<em>");
    		theNewHTML = theNewHTML.replace(/~~~\/em~~~/gi, "</em>");
    		theNewHTML = theNewHTML.replace(/~~~ul~~~/gi, "<ul>");
    		theNewHTML = theNewHTML.replace(/~~~\/ul~~~/gi, "</ul>");
    		theNewHTML = theNewHTML.replace(/~~~li~~~/gi, "<li>");
    		theNewHTML = theNewHTML.replace(/~~~\/li~~~/gi, "</li>");
    		theNewHTML = theNewHTML.replace(/~~~ol~~~/gi, "<ol>");
    		theNewHTML = theNewHTML.replace(/~~~\/ol~~~/gi, "</ol>");
    		theNewHTML = theNewHTML.replace(/~~~p~~~/gi, "<p>");
    		theNewHTML = theNewHTML.replace(/~~~\/p~~~/gi, "</p>");
    		theNewHTML = theNewHTML.replace(/~~~br~~~/gi, "<br />");
    		theNewHTML = theNewHTML.replace(/~~~br\/~~~/gi, "<br />");
    		theNewHTML = theNewHTML.replace(/~~~br \/~~~/gi, "<br />");
    		theNewHTML = theNewHTML.replace(/~~~a~~~/gi, "<a");
    		theNewHTML = theNewHTML.replace(/~~~\/a~~~/gi, "</a>");
    		theNewHTML = theNewHTML.replace(/~~~h4~~~/gi, "<h4>");
    		theNewHTML = theNewHTML.replace(/~~~\/h4~~~/gi, "</h4>");
    		theNewHTML = theNewHTML.replace(/~~~h5~~~/gi, "<h5>");
    		theNewHTML = theNewHTML.replace(/~~~\/h5~~~/gi, "</h5>");
    		// Bad empty paragraph attempts by the various browsers.
    		// Our goal is <p>&nbsp;</p>
    		theNewHTML = theNewHTML.replace(/<div><br *\/><\/div>/gi, "<p>&nbsp;</p>");
    		theNewHTML = theNewHTML.replace(/<p><br *\/><\/p>/gi, "<p>&nbsp;</p>");
    		theNewHTML = theNewHTML.replace(/<p>&#160;<\/p>/gi, "<p>&nbsp;</p>");
    		theNewHTML = theNewHTML.replace(/<p> *<\/p>/gi, "<p>&nbsp;</p>");
    		theNewHTML = theNewHTML.replace(/<br *\/><br *\/>/gi, "<p>&nbsp;</p>");
    		// Empty <a href=""></a> and <p></p> tags
    		theNewHTML = theNewHTML.replace(/<a href=.> *<\/a>/gim, "");
    		theNewHTML = theNewHTML.replace(/<p> *<\/p>/gim, "");
    		// Repeating white space. Basically, a bunch of repeating
    		// <p> tags that cause an extended white space.
    		theNewHTML = theNewHTML.replace(/((<p>(&#160;|&nbsp;| *)<\/p>)[\s\r\n\f]*){2,}/gim, "<p>&nbsp;</p>");
    		// Duplicates. Specifically <p> and </p>
    		theNewHTML = theNewHTML.replace(/(\s*<p>\s*)+/gim, "<p>");
    		theNewHTML = theNewHTML.replace(/(\s*<\/p>\s*)+/gim, "</p>");
    		// HTML Source Formatting.
    		// First, remove all linebreaks.
    		theNewHTML = theNewHTML.replace(/(\r\n|\n|\r)/gim, "");
    		// Next, remove excess whitespace
    		theNewHTML = theNewHTML.replace(/ +/gim, " ");
    		theNewHTML = theNewHTML.replace(/\s{2,}/gim, " ");
    		// Then, we want to put back line breaks, but only on the 
    		// things we want the line breaks on.
    		theNewHTML = theNewHTML.replace(/<\/div>/gim, "</div>\r\n");
    		theNewHTML = theNewHTML.replace(/<\/span>/gim, "</span>\r\n");
    		theNewHTML = theNewHTML.replace(/<\/p>/gim, "</p>\r\n");
    		theNewHTML = theNewHTML.replace(/<\/h3>/gim, "</h3>\r\n");
    		theNewHTML = theNewHTML.replace(/<\/h4>/gim, "</h4>\r\n");
    		theNewHTML = theNewHTML.replace(/<br\s+\/?>/gim, "<br />\r\n");
    		theNewHTML = theNewHTML.replace(/<hr\s+\/>/gim, "<hr />\r\n");
    		theNewHTML = theNewHTML.replace(/<ul>/gim, "<ul>\r\n");
    		theNewHTML = theNewHTML.replace(/<\/ul>/gim, "</ul>\r\n");
    		theNewHTML = theNewHTML.replace(/<\/li>/gim, "</li>\r\n");
    		// Finally, re-apply this HTML to the panel. Or, if nothing 
    		// has changed, do nothing.
    		if (theOldHTML != theNewHTML) {
    			// If the string contains no <p> tag at all,
    			// then we want to enclose the entire thing with one.
    			if (theNewHTML.indexOf("<p>") == -1) {
    				theNewHTML = "<p>" + theNewHTML + "</p>";
    			}
    			thePanel.innerHTML = theNewHTML;
    		}
    	});
    }

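For comparison, the keep-and-rename part of the function above (<b> to <strong>, <i> to <em>, <h1>/<h2> to <h4>, and so on) could also be driven by one lookup table and a single replace call. This is a hedged sketch, not the code we actually ran: `TAG_MAP` and `normalizeTags` are made-up names, every kept tag is emitted without attributes, and <a> is left out of the table because it needs its href.

```javascript
// Hypothetical table (not from the thread): source tag name -> tag name to emit.
var TAG_MAP = {
  b: "strong", strong: "strong",
  i: "em", em: "em",
  ul: "ul", ol: "ol", li: "li", p: "p",
  h1: "h4", h2: "h4", h3: "h5", h4: "h5", h5: "h5", h6: "h5"
};

function normalizeTags(html) {
  // Capture the optional closing slash and the tag name of every element tag.
  return html.replace(/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi, function (tag, slash, name) {
    var key = name.toLowerCase();
    // Every form of line break becomes <br />.
    if (key === "br") return "<br />";
    // Tags that are not in the table are removed.
    if (!Object.prototype.hasOwnProperty.call(TAG_MAP, key)) return "";
    // Kept tags are rebuilt without attributes, using the mapped name.
    return "<" + slash + TAG_MAP[key] + ">";
  });
}

// normalizeTags('<h1 class="t">Title</h1><div><b>x</b><br><i>y</i></div>')
// returns '<h4>Title</h4><strong>x</strong><br /><em>y</em>'
```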

