Javascript Regex: do not replace a keyword within CSS/JS/link as well as in HTML comment, inside of a tag e.g. as an attribute or in value of the attribute, iframe

Hello,

I have this regex, which currently works for CSS/JS/links, i.e. it does not replace a keyword that is inside a <style>, <script>, or <a> element.

var re = new RegExp("\\b" + keyword + "\\b(?!(?:[\\s\\S](?!<(?:a|script|style)\\s))*</(?:a|script|style)>)","i");

The next step is to improve it so that it also doesn't replace the keyword inside an HTML comment, inside a tag (e.g. as an attribute name or in an attribute value), or inside an iframe.

e.g. <div title="test KEYWORD" KEYWORD="true"><div>


Any ideas?

Asked by: svetoslavm
 
Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
May I suggest that you try the Html AgilityPack instead? (http://htmlagilitypack.codeplex.com/)

It allows you to load the HTML DOM as if it were XML and you can then apply simple XSLT or go through each individual node. It gives you a lot more power in that it allows you to select which attributes and which tag contents you want to replace. You can load up the document using HtmlDocument.Load, then you can go through each HtmlNode in the document and every child of these HtmlNodes recursively. Any changes made to the tag contents can be saved back to a file or string variable.

As to the expression you have so far, I don't really understand how you came to this. I assume with some trial and error.

You haven't specified exactly what cases you do and which cases you don't want to include in this regex. Let me try to summarize:

Match a keyword in:
- Any tag content (text)

Except:
- in HTML comments <!-- ... -->
- in tag attributes (so inside any tags)
- in a <LINK> tag (both attributes and contents)
- in a <SCRIPT> tag (both attributes and contents)
- in a <STYLE> tag (both attributes and contents)

This can be done using regex up to a certain level, should you really want to. If you do, please make sure you use Regex.Escape on the keyword text in question to ensure it doesn't include any regex special characters.
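
Regex.Escape is a .NET method. If the pattern ends up being built in JavaScript instead, as in the original question, a small helper can play the same role. This is only a minimal sketch; escapeRegExp and buildKeywordRegExp are made-up names for illustration, not part of any library:

```javascript
// Hypothetical helper: put a backslash in front of every character that has
// a special meaning inside a regular expression.
function escapeRegExp(text) {
  return String(text).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build the whole-word, case-insensitive keyword
// pattern from the question, with the keyword escaped first.
function buildKeywordRegExp(keyword) {
  return new RegExp("\\b" + escapeRegExp(keyword) + "\\b", "i");
}
```

Without the escaping step, a keyword such as "a.b" would also match "aXb", because the dot would be read as "any character".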

The problem lies with tag nesting. A script tag can contain < and > as well, for example, which makes it very hard to positively identify text in a script tag. Your expression will not handle certain cases correctly, for example where tags are nested, and especially once you include comments (which can change the apparent order of opening and closing tags).

the expression I'd use would be something like:

Not inside any tag (catches all cases where KEYWORD is contained in attributes):
(?<!<[^>]*)\bKEYWORD\b(?![^<]*>)

Not inside LINK, SCRIPT, A, STYLE tags:
(?<!<(?<tag>script|link|style|a)[^>]*>.*)\bKEYWORD(?!.*</\k<tag>)

Not inside a comment tag:
(?<!<!--[^>]*>.*)\bKEYWORD(?!.*-->)

I wouldn't combine them into one expression, as it is usually faster to run separate expressions and it's much easier to maintain this way.

svetoslavm (Author) commented:
Thanks for the comment. This pack seems too big for my app.
If I were to combine those regexes, aren't they going to match different occurrences of the same word?

Example:
word1, word5, word9, word2, word1

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
What do you mean by "match different occurrences of the same word"? Is that not what you want?

What do you mean by "too big for your app"? The assembly isn't large at all and the syntax is very simple. You could even install the library into the GAC and be done with it. It's the simplest and most robust solution to your problem.

svetoslavm (Author) commented:
My goal is to have a snippet of code (the regex is part of it) that is as small as possible... it'll be installed on many sites.

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
Installing (and copy pasting) code isn't the best way to put functionality into many websites. You could put it in a separate assembly and just drop that in the bin folder. That way you could call one method from all these websites to do the work for you. In my opinion, that is much more manageable.

But if you really want to go the regex route, you should be aware of the serious issues you will have with certain HTML files that contain comments or nested tags. You will not be able to solve these using Regex.

This is the code for the HTML Agility Pack. It doesn't come much easier to read than this:

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class Program
    {
        // Keyword -> replacement text.
        private static IDictionary<string, string> replacements = new Dictionary<string, string>();

        // Elements whose contents are skipped entirely. Tag names are compared
        // case-insensitively so the check does not depend on how the parser reports them.
        private static readonly HashSet<string> skippedElements =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "script", "link", "style" };

        private static string htmlText = @"
    <html>
        <body>
            <div>
                <h1>keyword text one</h1>
                <div>keyword test</div>
                Keyword test
            </div>
        </body>
    </html>
";

        static void Main(string[] args)
        {
            replacements.Add("keyword", "someother");

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(htmlText);

            ReplaceKeywords(doc.DocumentNode);
            string text = doc.DocumentNode.OuterHtml;
            Console.WriteLine(text);
        }

        public static void ReplaceKeywords(HtmlAgilityPack.HtmlNode node)
        {
            // Never touch comments.
            if (node.NodeType == HtmlAgilityPack.HtmlNodeType.Comment)
                return;

            // Skip A, SCRIPT, LINK and STYLE elements together with everything inside them.
            if (node.NodeType == HtmlAgilityPack.HtmlNodeType.Element && skippedElements.Contains(node.Name))
                return;

            // Only text nodes are rewritten, so attributes are never affected.
            if (node.NodeType == HtmlAgilityPack.HtmlNodeType.Text)
            {
                node.InnerHtml = ReplaceKeywords(node.InnerText);
            }

            if (node.HasChildNodes)
            {
                foreach (var childnode in node.ChildNodes)
                {
                    ReplaceKeywords(childnode);
                }
            }
        }

        private static string ReplaceKeywords(string p)
        {
            foreach (var item in replacements)
            {
                // Escape the keyword so it is matched literally, as a whole word.
                string expression = String.Format(@"\b{0}\b", Regex.Escape(item.Key));
                p = Regex.Replace(p, expression, item.Value, RegexOptions.IgnoreCase);
            }
            return p;
        }
    }


svetoslavm (Author) commented:
Thanks again for the comment.

"Installing (and copy pasting) code isn't the best way to put functionality into many websites"
For that specific project it is.

Also, can you check the keyword regex? It doesn't seem to work for me.
I am getting some JS errors about invalid qualifier.
(?<!<(?<tag>script|link|style|a)[^>]*>.*)\bKEYWORD(?!.*</\k<tag>)

I have attached HTML code that tests your regex. I've tested it with XRegExp too; still nothing.
I used the Firefox Web Developer add-on => View Generated Source.

By the way, I managed to come up with an algorithm to skip the keyword when it is in comments/attributes.

Here are some stats for the Html Agility Pack. It will probably require Windows hosting (?), or will only work in Microsoft browsers.

http://htmlagilitypack.codeplex.com/releases/view/44954
- Html Agility Pack 1.4.0 Binaries: application, 127K, uploaded May 7 2010, 47876 downloads
- Documentation: documentation, 873K, uploaded May 7 2010, 20841 downloads
- HAP Explorer: application, 155K, uploaded May 7 2010, 8703 downloads
- Html Agility Pack 1.4.0 Source: source code, 127K, uploaded May 7 2010, 9571 downloads

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<body>
	<script type='text/javascript'>
		//<![CDATA[
		word1 = 'test';
		word2 = 'test';
		word3 = 'test';
		word4 = 'test';
		//]]>
	</script>    
	<style type="text/css">
        .word1{ display:block }
        .word2{ display:block }
        .word3{ display:block }
        .word4{ display:block }
	</style>

	word1 hello word2 <br/>
	word3 hello word4 <br/>
	
<script type="text/javascript">//<![CDATA[
	function replace(func) {	
		var buff = document.body.innerHTML;
		
        if (func == 1) {
            var re1 = new RegExp("(?!<(?<tag>script|link|style|a)[^>]*>.*)\bword1(?!.*</\k<tag>)", "i"); // invalid qualifier
            var re2 = new RegExp("(?!<(?<tag>script|link|style|a)[^>]*>.*)\bword2(?!.*</\k<tag>)", "i"); // invalid qualifier
        
            buff = buff.replace(re1, '___word1-replaced__');
            buff = buff.replace(re2, '___word2-replaced__');
        } else {
            var re1x = XRegExp("(?!<(?<tag>script|link|style|a)[^>]*>.*)\bword3(?!.*</\k<tag>)", "i");
            var re2x = XRegExp("(?!<(?<tag>script|link|style|a)[^>]*>.*)\bword4(?!.*</\k<tag>)", "i");
            
            buff = buff.replace(re1x, '___word1-replaced__');
            buff = buff.replace(re2x, '___word2-replaced__');
        }
        
		document.body.innerHTML = buff;
	}

    // http://xregexp.com/api/
    //XRegExp 1.5.0 <xregexp.com> MIT License
    var XRegExp;if(XRegExp){throw Error("can't load XRegExp twice in the same frame")}(function(){XRegExp=function(w,r){var q=[],u=XRegExp.OUTSIDE_CLASS,x=0,p,s,v,t,y;if(XRegExp.isRegExp(w)){if(r!==undefined){throw TypeError("can't supply flags when constructing one RegExp from another")}return j(w)}if(g){throw Error("can't call the XRegExp constructor within token definition functions")}r=r||"";p={hasNamedCapture:false,captureNames:[],hasFlag:function(z){return r.indexOf(z)>-1},setFlag:function(z){r+=z}};while(x<w.length){s=o(w,x,u,p);if(s){q.push(s.output);x+=(s.match[0].length||1)}else{if(v=m.exec.call(i[u],w.slice(x))){q.push(v[0]);x+=v[0].length}else{t=w.charAt(x);if(t==="["){u=XRegExp.INSIDE_CLASS}else{if(t==="]"){u=XRegExp.OUTSIDE_CLASS}}q.push(t);x++}}}y=RegExp(q.join(""),m.replace.call(r,h,""));y._xregexp={source:w,captureNames:p.hasNamedCapture?p.captureNames:null};return y};XRegExp.version="1.5.0";XRegExp.INSIDE_CLASS=1;XRegExp.OUTSIDE_CLASS=2;var c=/\$(?:(\d\d?|[$&`'])|{([$\w]+)})/g,h=/[^gimy]+|([\s\S])(?=[\s\S]*\1)/g,n=/^(?:[?*+]|{\d+(?:,\d*)?})\??/,g=false,k=[],m={exec:RegExp.prototype.exec,test:RegExp.prototype.test,match:String.prototype.match,replace:String.prototype.replace,split:String.prototype.split},a=m.exec.call(/()??/,"")[1]===undefined,e=function(){var p=/^/g;m.test.call(p,"");return !p.lastIndex}(),f=function(){var p=/x/g;m.replace.call("x",p,"");return !p.lastIndex}(),b=RegExp.prototype.sticky!==undefined,i={};i[XRegExp.INSIDE_CLASS]=/^(?:\\(?:[0-3][0-7]{0,2}|[4-7][0-7]?|x[\dA-Fa-f]{2}|u[\dA-Fa-f]{4}|c[A-Za-z]|[\s\S]))/;i[XRegExp.OUTSIDE_CLASS]=/^(?:\\(?:0(?:[0-3][0-7]{0,2}|[4-7][0-7]?)?|[1-9]\d*|x[\dA-Fa-f]{2}|u[\dA-Fa-f]{4}|c[A-Za-z]|[\s\S])|\(\?[:=!]|[?*+]\?|{\d+(?:,\d*)?}\??)/;XRegExp.addToken=function(s,r,q,p){k.push({pattern:j(s,"g"+(b?"y":"")),handler:r,scope:q||XRegExp.OUTSIDE_CLASS,trigger:p||null})};XRegExp.cache=function(r,p){var q=r+"/"+(p||"");return 
XRegExp.cache[q]||(XRegExp.cache[q]=XRegExp(r,p))};XRegExp.copyAsGlobal=function(p){return j(p,"g")};XRegExp.escape=function(p){return p.replace(/[-[\]{}()*+?.,\\^$|#\s]/g,"\\$&")};XRegExp.execAt=function(s,r,t,q){r=j(r,"g"+((q&&b)?"y":""));r.lastIndex=t=t||0;var p=r.exec(s);if(q){return(p&&p.index===t)?p:null}else{return p}};XRegExp.freezeTokens=function(){XRegExp.addToken=function(){throw Error("can't run addToken after freezeTokens")}};XRegExp.isRegExp=function(p){return Object.prototype.toString.call(p)==="[object RegExp]"};XRegExp.iterate=function(u,p,v,s){var t=j(p,"g"),r=-1,q;while(q=t.exec(u)){v.call(s,q,++r,u,t);if(t.lastIndex===q.index){t.lastIndex++}}if(p.global){p.lastIndex=0}};XRegExp.matchChain=function(q,p){return function r(s,x){var v=p[x].regex?p[x]:{regex:p[x]},u=j(v.regex,"g"),w=[],t;for(t=0;t<s.length;t++){XRegExp.iterate(s[t],u,function(y){w.push(v.backref?(y[v.backref]||""):y[0])})}return((x===p.length-1)||!w.length)?w:r(w,x+1)}([q],0)};RegExp.prototype.apply=function(q,p){return this.exec(p[0])};RegExp.prototype.call=function(p,q){return this.exec(q)};RegExp.prototype.exec=function(t){var r=m.exec.apply(this,arguments),q,p;if(r){if(!a&&r.length>1&&l(r,"")>-1){p=RegExp(this.source,m.replace.call(d(this),"g",""));m.replace.call(t.slice(r.index),p,function(){for(var u=1;u<arguments.length-2;u++){if(arguments[u]===undefined){r[u]=undefined}}})}if(this._xregexp&&this._xregexp.captureNames){for(var s=1;s<r.length;s++){q=this._xregexp.captureNames[s-1];if(q){r[q]=r[s]}}}if(!e&&this.global&&!r[0].length&&(this.lastIndex>r.index)){this.lastIndex--}}return r};if(!e){RegExp.prototype.test=function(q){var p=m.exec.call(this,q);if(p&&this.global&&!p[0].length&&(this.lastIndex>p.index)){this.lastIndex--}return !!p}}String.prototype.match=function(q){if(!XRegExp.isRegExp(q)){q=RegExp(q)}if(q.global){var p=m.match.apply(this,arguments);q.lastIndex=0;return p}return q.exec(this)};String.prototype.replace=function(r,s){var 
t=XRegExp.isRegExp(r),q,p,u;if(t&&typeof s.valueOf()==="string"&&s.indexOf("${")===-1&&f){return m.replace.apply(this,arguments)}if(!t){r=r+""}else{if(r._xregexp){q=r._xregexp.captureNames}}if(typeof s==="function"){p=m.replace.call(this,r,function(){if(q){arguments[0]=new String(arguments[0]);for(var v=0;v<q.length;v++){if(q[v]){arguments[0][q[v]]=arguments[v+1]}}}if(t&&r.global){r.lastIndex=arguments[arguments.length-2]+arguments[0].length}return s.apply(null,arguments)})}else{u=this+"";p=m.replace.call(u,r,function(){var v=arguments;return m.replace.call(s,c,function(x,w,A){if(w){switch(w){case"$":return"$";case"&":return v[0];case"`":return v[v.length-1].slice(0,v[v.length-2]);case"'":return v[v.length-1].slice(v[v.length-2]+v[0].length);default:var y="";w=+w;if(!w){return x}while(w>v.length-3){y=String.prototype.slice.call(w,-1)+y;w=Math.floor(w/10)}return(w?v[w]||"":"$")+y}}else{var z=+A;if(z<=v.length-3){return v[z]}z=q?l(q,A):-1;return z>-1?v[z+1]:x}})})}if(t&&r.global){r.lastIndex=0}return p};String.prototype.split=function(u,p){if(!XRegExp.isRegExp(u)){return m.split.apply(this,arguments)}var w=this+"",r=[],v=0,t,q;if(p===undefined||+p<0){p=Infinity}else{p=Math.floor(+p);if(!p){return[]}}u=XRegExp.copyAsGlobal(u);while(t=u.exec(w)){if(u.lastIndex>v){r.push(w.slice(v,t.index));if(t.length>1&&t.index<w.length){Array.prototype.push.apply(r,t.slice(1))}q=t[0].length;v=u.lastIndex;if(r.length>=p){break}}if(u.lastIndex===t.index){u.lastIndex++}}if(v===w.length){if(!m.test.call(u,"")||q){r.push("")}}else{r.push(w.slice(v))}return r.length>p?r.slice(0,p):r};function j(r,q){if(!XRegExp.isRegExp(r)){throw TypeError("type RegExp expected")}var p=r._xregexp;r=XRegExp(r.source,d(r)+(q||""));if(p){r._xregexp={source:p.source,captureNames:p.captureNames?p.captureNames.slice(0):null}}return r}function d(p){return(p.global?"g":"")+(p.ignoreCase?"i":"")+(p.multiline?"m":"")+(p.extended?"x":"")+(p.sticky?"y":"")}function o(v,u,w,p){var 
r=k.length,y,s,x;g=true;try{while(r--){x=k[r];if((w&x.scope)&&(!x.trigger||x.trigger.call(p))){x.pattern.lastIndex=u;s=x.pattern.exec(v);if(s&&s.index===u){y={output:x.handler.call(p,s,w),match:s};break}}}}catch(q){throw q}finally{g=false}return y}function l(s,q,r){if(Array.prototype.indexOf){return s.indexOf(q,r)}for(var p=r||0;p<s.length;p++){if(s[p]===q){return p}}return -1}XRegExp.addToken(/\(\?#[^)]*\)/,function(p){return m.test.call(n,p.input.slice(p.index+p[0].length))?"":"(?:)"});XRegExp.addToken(/\((?!\?)/,function(){this.captureNames.push(null);return"("});XRegExp.addToken(/\(\?<([$\w]+)>/,function(p){this.captureNames.push(p[1]);this.hasNamedCapture=true;return"("});XRegExp.addToken(/\\k<([\w$]+)>/,function(q){var p=l(this.captureNames,q[1]);return p>-1?"\\"+(p+1)+(isNaN(q.input.charAt(q.index+q[0].length))?"":"(?:)"):q[0]});XRegExp.addToken(/\[\^?]/,function(p){return p[0]==="[]"?"\\b\\B":"[\\s\\S]"});XRegExp.addToken(/^\(\?([imsx]+)\)/,function(p){this.setFlag(p[1]);return""});XRegExp.addToken(/(?:\s+|#.*)+/,function(p){return m.test.call(n,p.input.slice(p.index+p[0].length))?"":"(?:)"},XRegExp.OUTSIDE_CLASS,function(){return this.hasFlag("x")});XRegExp.addToken(/\./,function(){return"[\\s\\S]"},XRegExp.OUTSIDE_CLASS,function(){return this.hasFlag("s")})})();
	
	//]]>
</script>

<button onclick="replace(1);">Replace 1 (RegExp)</button>
<button onclick="replace(2);">Replace 2 (XRegExp)</button>
</body>
</html>

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
I was under the assumption that you were running this inside an ASP.NET application. But is this regular expression being executed in the browser? Sorry about that, then. Disregard my previous replies. The main issue still applies: regex isn't the best solution for your problem, but the HTML Agility Pack isn't going to help you here.

JavaScript, as of ECMAScript 5, doesn't support lookbehinds (the (?<! construct), let alone variable-length ones. It also doesn't understand named groups (the (?<name> construct) or named backreferences (\k<name>). So you can't use my expressions directly in JavaScript.
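
For a script that has to stay regex-only, one lookbehind-free workaround is to let a single pattern match either a region to skip or the keyword, and to decide in a replace callback which of the two was found. This is not one of the expressions above, only a sketch of a different technique; replaceOutsideMarkup and escapeRegExp are hypothetical names, and the pattern assumes attribute values never contain a ">" character:

```javascript
// Hypothetical helper: escape RegExp metacharacters in the keyword.
function escapeRegExp(text) {
  return String(text).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace whole-word occurrences of `keyword` in an HTML string, leaving
// comments, <script>/<style>/<a> blocks and the inside of any tag untouched.
function replaceOutsideMarkup(html, keyword, replacement) {
  var skip =
    "<!--[\\s\\S]*?-->" +                                 // HTML comments
    "|<(script|style|a)\\b[^>]*>[\\s\\S]*?</\\1\\s*>" +   // whole blocks
    "|<[^>]*>";                                           // any other tag, attributes included
  var re = new RegExp(skip + "|(\\b" + escapeRegExp(keyword) + "\\b)", "gi");
  return html.replace(re, function (match, tagName, word) {
    // `word` is only set when the keyword alternative matched;
    // every skipped region is copied back unchanged.
    return word ? replacement : match;
  });
}
```

Because the skip alternatives come first, a keyword inside a comment, a tag, or one of the listed blocks is consumed as part of that region before the keyword alternative gets a chance to see it. The usual caveats about parsing HTML with regex still apply to unusual markup.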

The fun fact, however, is that document.body can be iterated in much the same way as with the HTML Agility Pack. So instead of searching and replacing directly in innerHTML, you can go through each element and handle it separately. I have to head home in a minute. I'll try to cook up some code for you when I get there.

svetoslavm (Author) commented:
Thanks ToAoM.
I can't wait to see your delicious RegExp dish :D

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:

svetoslavm (Author) commented:
Hi ToAoM, that's an interesting approach.
I think I am going to need the keyword regex still because when I iterate over P,SPAN,DIV tags there could be scripts added inside them.

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
Yes, that would be the trick, but while going through them you don't have to take into account that the keyword might be in some attribute or in a comment.

svetoslavm (Author) commented:
So do you think you can revise your keyword regex so it doesn't match inside scripts/styles etc.?
I have an idea of how to skip comments and keywords found in attribute names or values.

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
Each element has a childNodes collection, and each node has a nodeType property. You should be able to use those to find the text nodes and change only their text:
https://developer.mozilla.org/En/DOM/Node.nodeType

You should be able to store the new text in the nodeValue property.

That way you can simply use "\\b"+keyword+"\\b" as your regexp.
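
A rough sketch of that walk, for illustration only. The function name replaceInTextNodes and the list of skipped elements are assumptions; the properties it relies on (nodeType, nodeName, nodeValue, childNodes) are the standard DOM ones:

```javascript
// Numeric node types as defined by the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE).
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Elements whose contents are left alone (an assumed list; adjust as needed).
var SKIPPED = { SCRIPT: true, STYLE: true, A: true, IFRAME: true };

// Hypothetical helper: recursively rewrite text nodes below `node`.
function replaceInTextNodes(node, re, replacement) {
  if (node.nodeType === TEXT_NODE) {
    // Only the text itself changes; tags and attributes are never touched.
    node.nodeValue = node.nodeValue.replace(re, replacement);
    return;
  }
  // Comments and other non-element nodes are ignored.
  if (node.nodeType !== ELEMENT_NODE) return;
  if (SKIPPED[String(node.nodeName).toUpperCase()] === true) return;
  for (var i = 0; i < node.childNodes.length; i++) {
    replaceInTextNodes(node.childNodes[i], re, replacement);
  }
}
```

In a page this could be called as replaceInTextNodes(document.body, /\bword1\b/gi, "replacement"). Since attributes and comments are not text nodes, they need no special handling in the regex.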

svetoslavm (Author) commented:
Is it going to be a cross-browser thing?

Jesse Houwing (Scrum Trainer | Microsoft MVP | ALM Ranger | Consultant) commented:
IE has a childNodes collection as well:
http://msdn.microsoft.com/en-us/library/ms537445(v=VS.85).aspx