JavaScript function that strips JavaScript code from HTML

Hi,

I need a JavaScript script that can extract, on the fly, the text content of the web page in which it is located.

When the script reads document.body.innerHTML, I get text with tags:

html = document.body.innerHTML;

No problem, I found a way to strip all tags:

content = html.replace(/<[^>]*>/g, "");

This works fine, except that the content of any scripts in the page is not removed, so I see the code in the returned content!
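
To make the problem concrete, here is a minimal sketch (the sample markup is invented for illustration) showing that stripping tags alone leaves the body of a script element behind:

```javascript
// Hypothetical sample fragment; not taken from the original question.
var sampleHtml = '<p>Hello</p><script type="text/javascript">var x = 1;</script><p>World</p>';

// The tag-stripping regex from the question: removes anything between < and >.
var stripped = sampleHtml.replace(/<[^>]*>/g, "");

// The tags are gone, but the script source "var x = 1;" is still in the text.
console.log(stripped);
```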

So I guess that before stripping all tags I need to remove any JavaScript code embedded in the page.

I did several experiments with regular expressions, but I did not find a way to strip JavaScript code from HTML.

Do you know a way?

Thanks in advance.
cybertonicAsked:

LakioCommented:
can you show me the returned content?
cybertonicAuthor Commented:
I am not sure I understand your question.

Imagine you have a web page: ANY

And that you want to get the text content of the web page with javascript.

You can get it with:

html = document.body.innerHTML;

The problem is that if this page has a script, then the content of html will include:
....<script language="javascript"> .... </script> ....

I want a regular expression able to remove the JavaScript code in the HTML.

LakioCommented:
content = html.replace(/<script[^<]*<[^>]*script>/gi, '')

This works OK, but if there are both < and > inside the script, it will not work.
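
A small sketch of that limitation, using invented inputs: the pattern removes a simple script block, but leaves the block untouched once the script body contains a < followed later by a >.

```javascript
// The pattern from the post above.
var scriptPattern = /<script[^<]*<[^>]*script>/gi;

// Hypothetical inputs for illustration.
var simple = 'a<script>var x = 1;</script>b';
var withComparisons = 'a<script>if (x < 2 && y > 1) { z = 1; }</script>b';

// Simple case: the whole script block is removed.
var simpleResult = simple.replace(scriptPattern, '');

// With "<" and then ">" inside the script, the pattern finds no match,
// so the input comes back unchanged.
var comparisonResult = withComparisons.replace(scriptPattern, '');

console.log(simpleResult);
console.log(comparisonResult === withComparisons);
```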

cybertonicAuthor Commented:
As you may imagine I need something that works ALWAYS with no exceptions.
mvan01Commented:
Based on this:
content = html.replace(/<script[^<]*<[^>]*script>/gi, '')

could this work?
content = html.replace(/<script*//script>/gi, '')

or this?
content = html.replace(/<script*\/script>/gi, '')

If not, how does one specify to look for a '/' within a RegExp?  Must be a way, no?

Peace and joy.  mvan
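
On the question of matching a '/': inside a regex literal it is escaped as \/. In the attempts above, the * only repeats the preceding t, so it does not mean "anything in between". A minimal sketch, with an invented sample string, of one way to span the script body including line breaks:

```javascript
// Hypothetical sample; the script body spans several lines.
var html = 'before<script type="text/javascript">\nvar a = 1;\n</script>after';

// "\/" escapes the slash inside the regex literal.
// [\s\S] matches any character, line breaks included; "*?" is lazy, so each
// match stops at the first closing tag instead of the last one.
var scriptBlock = /<script[\s\S]*?<\/script>/gi;

var withoutScripts = html.replace(scriptBlock, '');
console.log(withoutScripts);

// For comparison: in this pattern "*" applies only to the letter "t",
// so an ordinary script block is not matched at all.
var unchanged = '<script>x</script>'.replace(/<script*\/script>/gi, '');
console.log(unchanged);
```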
cybertonicAuthor Commented:
I do not know whether any of the above works, because for each possibility you list I get:

Active Server Pages error 'ASP 0138'
Nested Script Block
A script block cannot be placed inside another script block.

The page that needs to execute this script is an ASP page.

So no progress.
scrathcyboyCommented:
YOU simply cannot do this from the ASP language.  It stops you editing script tags, because Microsoft has this bad idea that they "own" javascript, and they don't.  If you don't use ASP, you can do it, just by deleting everything between the <script> ..... and ....... </script> tags.  However, be aware that almost all modern pages put most of their functionality in the javascript tags, including links to other pages.  If you wipe all this out, you can easily wipe out the structure of the site, as well as some of its content.

There is no simple "spider" to extract ONLY the text of web pages,  they are MUCH too complex for this.
cybertonicAuthor Commented:
To avoid an ASP page error when using the keyword <script within an ASP page, you write something like:

... + "<" + "SCRIPT" + ....

So I guess there is a way to rewrite the regexp above so that <script does not appear directly but is written as a concatenation of strings, no?

I really do not know anything about regexp but I guess that's possible.
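
A minimal sketch of that idea, assuming the ASP error is triggered by the literal tag text as described above: the pattern is assembled with the RegExp constructor from string pieces, so the opening tag never appears in one piece in the source. The variable names and the sample input are hypothetical.

```javascript
// Build the tag text from pieces so it never appears whole in the page source.
var openTag = '<' + 'script';
var closeTag = '</' + 'script>';

// In a string passed to RegExp, "/" needs no escaping, but each backslash
// must be doubled ("\\s" in the string becomes \s in the pattern).
var scriptBlock = new RegExp(openTag + '[\\s\\S]*?' + closeTag, 'gi');

// Hypothetical sample input, also assembled from the pieces.
var sample = 'x' + openTag + '>var a = 1;' + closeTag + 'y';
var cleaned = sample.replace(scriptBlock, '');

console.log(cleaned);
```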

...

As for your remark about "... modern pages ...", it does not affect what I need this function for.
bubbledragonCommented:
regexp cannot handle multi-line in javascript.

html = document.body.innerHTML;

var OScriptTags = new Array;
for (i=0;i<html.length;i++) {
      i = ((html.toLowerCase()).indexOf('<script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i);
      OScriptTags[OScriptTags.length] = i;
}

var CScriptTags = new Array;
for (i=0;i<html.length;i++) {
      i = ((html.toLowerCase()).indexOf('/script>', i) == -1)?html.length:(html.toLowerCase()).indexOf('/script>', i)+8;
      CScriptTags[CScriptTags.length] = i;
}

for (i=OScriptTags.length-2;i>=0;i--) {
      html = html.substring(0, OScriptTags[i])+html.substring(CScriptTags[i], html.length);
}
cybertonicAuthor Commented:
This line:
 i = ((html.toLowerCase()).indexOf('<script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i);
has been replaced by:
i = ((html.toLowerCase()).indexOf('<' + 'script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<' +  'script', i);

To avoid nested script ASP error on this page.
....

The script above, when inserted within a web page, returns:

Text after the <body> tag and all the javascript starting at:

', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i); ....
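
One plausible reason for this result (an assumption, not confirmed in the thread) is that the index scan also finds tag-like text inside the script's own string literals, so the collected open and close positions no longer pair up. A small sketch with an invented page fragment:

```javascript
// Hypothetical page: the script's own source contains the text "/script>"
// inside a string literal.
var html = '<p>Text</p><script>var k = s.indexOf("/script>", 0);</script><p>More</p>';

// Collect the end position of every "/script>" occurrence, in the same
// spirit as the posted loop.
var closePositions = [];
var from = 0;
while (true) {
    var at = html.indexOf('/script>', from);
    if (at === -1) break;
    closePositions.push(at + 8);
    from = at + 8;
}

// Two "closing" positions are found although the page has a single script element.
console.log(closePositions.length);
```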

cybertonicAuthor Commented:

I do not know why, but I absolutely wanted to have this done with a regexp.

In fact, the last post simply opened my eyes and I think I know how to proceed. I will post the solution here when it is fully tested.

...

I thank all the kind people here who tried to help me.

I also apologize, as I have been a bit sharp with some guys here and with Experts Exchange.
I thought that when you paid the subscription fee to Experts Exchange, the experts had an obligation to respond and to be efficient.
I now understand that it is simply a friendly community of developers who help each other.
So again, THANKS for the time you spent on my problem. I promise to post my solution very soon so it can help you if you need something similar.
scrathcyboyCommented:
Idea is that you give "points" to best answer, or split points among best answerS, to close question.  These points mean nothing in terms of value, just "status" on expert exchange, however you want to interpret that.
cybertonicAuthor Commented:
Here you have it:

function GetTextContent(html)
{
     p1 = html.indexOf("<" + "script");
     while(p1 > 0)
     {
          p2 = html.indexOf("</" + "script>", p1+7);
          p = html.indexOf("<" + "script>", p1+7);
          if(p2 > 0)
          {
               if(p < 0 || p > p2)
               {
                    html = html.substring(0, p1) + html.substring(p2+9, html.length);
               }
               else
               {
                    html = html.substring(0, p) + html.substring(p2+9, html.length);                    
               }
          }
          else
          {
               html = html.substring(0, p1);
          }
          p1 = html.indexOf("<" + "script");
     }

     return html.replace(/<(.|\n)*?>/g, "");
}


To get the body text content (with no JavaScript):

html = GetTextContent(document.body.innerHTML.toLowerCase());

This removes scripts placed one after another, as well as possibly nested scripts.
Finally, it removes all the other tags.
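
Note that lowercasing the whole page before stripping also lowercases the returned text. If the original casing matters, one option is to search a lowercased copy but cut from the original string. The sketch below uses a hypothetical function name, assumes that lowercasing does not change the string length, and does not try to handle the nested case:

```javascript
// Hypothetical helper: removes script blocks case-insensitively while
// keeping the original casing of the remaining markup.
function stripScriptsKeepCase(html) {
    var lower = html.toLowerCase(); // used only for searching
    var result = '';
    var pos = 0;
    while (true) {
        var start = lower.indexOf('<script', pos);
        if (start === -1) {
            result += html.substring(pos); // no more scripts: keep the rest
            break;
        }
        result += html.substring(pos, start); // keep text before the script
        var end = lower.indexOf('</script', start);
        if (end === -1) break; // unterminated script: drop the rest
        var close = lower.indexOf('>', end);
        if (close === -1) break;
        pos = close + 1; // continue after the closing tag
    }
    return result;
}

var kept = stripScriptsKeepCase('<P>Hello</P><SCRIPT>var A = 1;</SCRIPT><p>World</p>');
var text = kept.replace(/<[^>]*>/g, '');
console.log(kept);
console.log(text);
```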
cybertonicAuthor Commented:
This is the first time I have done this. I am not sure I clicked the right button to give points to everyone; sorry.
scrathcyboyCommented:
I will submit request for you to split points among all answers, glad you solved it.
scrathcyboyCommented:
Cybertronic -- here is what I told the moderators, hope this is OK with you.  I see your own solution is smarter than anything I would have done, therefore I cannot accept points on this, you should get them back, perhaps?  Congrats on some good coding.

"The asker of this question inadvertently gave me all the points.  Please undo this for him.  Since he gave such an elegant solution to his own question, you should return the points to him.  Also, since he is not getting the level of help he expected from Expert Exchange, you might open a dialog with him to refund all or part of his payments.  This is between you and him, I just want to correct this question, thanks.

http://www.experts-exchange.com/Web/Web_Languages/JavaScript/Q_21792150.html"

scrathcyboyCommented:
Note to moderators -- looking at others contributions here, suggest you split points among them, leave me out of any.  Thanks.
scrathcyboyCommented:
Cybertronic --

As a caution about the complexity of tags: a lot of people are now putting /> to end a tag, sometimes before or after the script name.  E.g. for ending, I have variously seen </Script\> as well as </\script>, which mean different things.  If your routine does not work right, it might be one of these.  So plan for the backslash, you will see it surface a lot.  Good luck again, and Bye !
cybertonicAuthor Commented:
Good find: with some DTDs it's possible to have \>

However, I have never seen </ beginning a tag, and I am not sure it can be valid.

Here is the revised script, which takes care of this possible \> and also of the case where there are spaces before the >, as in </script       >


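function GetTextContent(html)
{
      // Declare the locals so they do not leak into the global scope.
      var p, p1, p2, p3;

      p1 = html.indexOf("<" + "script");
      while(p1 >= 0)
      {
            // p2: next closing tag; p: next opening tag (nested or following script).
            p2 = html.indexOf("</" + "script", p1+7);
            p = html.indexOf("<" + "script", p1+7);
            if(p2 >= 0)
            {
                  // Skip past "</script" and anything else before the closing ">".
                  p3 = p2+8;
                  
                  while(p3 < html.length && html.charAt(p3) != ">") p3++;
                  
                  if(p < 0 || p > p2)
                  {
                        // No opening tag before the close: remove the whole block.
                        html = html.substring(0, p1) + html.substring(p3+1, html.length);
                  }
                  else
                  {
                        // Another opening tag comes first: remove the inner block.
                        html = html.substring(0, p) + html.substring(p3+1, html.length);
                  }
            }
            else
            {
                  // Unterminated script: drop everything from the opening tag.
                  html = html.substring(0, p1);
            }
            p1 = html.indexOf("<" + "script");
      }

      // Finally strip all the remaining tags.
      return html.replace(/<(.|\n)*?>/g, "");
}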
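
For comparison, roughly the same edge cases can be sketched with regular expressions. This is an alternative illustration, not the function above: the name stripScriptsRegex is hypothetical, the matching is case-insensitive, and nested scripts are not treated specially.

```javascript
// Hypothetical alternative: tolerate "</script   >" and "</script\>",
// and drop an unterminated script through to the end of the input.
function stripScriptsRegex(html) {
    return html
        // A script block whose closing tag may carry extra characters before ">".
        .replace(/<script\b[\s\S]*?<\/script[^>]*>/gi, '')
        // An opening tag with no closing tag: remove everything after it.
        .replace(/<script\b[\s\S]*$/i, '')
        // Finally strip all remaining tags.
        .replace(/<[^>]*>/g, '');
}

var spaced = stripScriptsRegex('<p>A</p><script>x = 1;</script   ><p>B</p>');
var unterminated = stripScriptsRegex('<p>A</p><script>x = 1;');
console.log(spaced);
console.log(unterminated);
```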
scrathcyboyCommented:
You should become an 'expert' here.  You get to ask questions for free if you answer enough questions to get 2000 points per month.  With your ability, that probably means only one question, with full points, per month.  This is easy for you, no?  I think you would be an excellent addition to the site, just my opinion.
rrzCommented:
>You should become 'expert' here.    
I agree.
You should understand that some of us self-proclaimed  "experts" only give help so that hopefully we will receive help when we need it.  rrz
cybertonicAuthor Commented:
Thanks for the invitation. I do not think I have more development skills than the people here, and I certainly have no pretensions to being an expert.
Except when I am about to launch a new service, I generally no longer develop, mainly because I have no time.
I suppose that, as for most of you, the saving on the annual membership cannot be a motivation when the amount is so small.
What I can do is thank the community for the response to my question by helping another member who has a problem for which I have a solution, and make this a rule.
Thanks again for the kind words.
LakioCommented:
I got 2 ideas,

1. <a href="#" onclick="while(document.getElementsByTagName('script').length) {document.getElementsByTagName('script')[0].removeNode(true);}
alert(document.getElementsByTagName('html')[0].innerHTML);return false">Delete all scripts</a>

2. <TEXTAREA onclick="this.value=document.body.innerHTML.replace(/<\/script>/gi,'¨').replace(/<script[^¨]*¨/gi, 'noscript!')" rows=24 cols=80></TEXTAREA>
cybertonicAuthor Commented:
Greetings Lakio,

Point 1 is of no interest for my needs; I cannot have the javascript deleted.

Point 2 simply works like a charm and is exactly what I was looking for!

I tested it with complicated samples, always with success.

So the function that strips javascript and tags is now:

function GetTextContent(html)
{
      html = html.replace(/<\/script>/gi,'¨').replace(/<script[^¨]*¨/gi, '');
      return html.replace(/<(.|\n)*?>/g, "");
}

Example of call:
alert(GetTextContent(document.body.innerHTML));

The advantage of regular expressions is that they are compiled, while the javascript is interpreted, so the process should be fast, and that's what I need.

<removed by GranMod>

Thanks and again my congrats for the good job.

LakioCommented:
Yes cybertonic, it works... unless "¨" is inside the script. But this next code does "more" :P

I'm using "·" now. I have never seen it before, so I guess it's not often found in HTML, but this code first looks for it and, if it finds a "·", changes it to ".".

<body onload="document.write(GetTextContent(document.body.innerHTML));">
<script type="text/javascript">
function GetTextContent(html){
return html.replace(/·/gi,'.').replace(/<\/script>/gi,'·').replace(/<script[^·]*·/gi, '').replace(/<(.|\n)*?>/g, '');
}
</script>

A better idea is to put some rare characters in an array and search for them; when one is not found, use that character in the regular expression.
-But I bet there's a better/easier way.
(Excuse: I have been very busy and regular expressions are new to me :P)
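
A minimal sketch of that "find an unused character" idea. The function names and the candidate list are hypothetical choices for illustration; any rarely used characters that are not special inside a regex character class would do.

```javascript
// Hypothetical helper: returns the first candidate character that does not
// occur in the text, or null when every candidate is already present.
function pickSentinel(text, candidates) {
    for (var i = 0; i < candidates.length; i++) {
        if (text.indexOf(candidates[i]) === -1) {
            return candidates[i];
        }
    }
    return null;
}

// Hypothetical variant of the posted function that chooses its marker at run time.
function getTextContentWithSentinel(html) {
    // Candidate list is an arbitrary assumption.
    var sentinel = pickSentinel(html, ['\u00b7', '\u00a8', '\u00ff', '\u0001']);
    if (sentinel === null) {
        throw new Error('no unused sentinel character available');
    }
    // Mark each closing script tag, then delete from each opening tag up to the mark.
    var closing = /<\/script>/gi;
    var block = new RegExp('<script[^' + sentinel + ']*' + sentinel, 'gi');
    return html.replace(closing, sentinel).replace(block, '').replace(/<[^>]*>/g, '');
}

// The page text already contains "\u00b7", so the next candidate is used
// and the original character survives in the output.
var sentinelResult = getTextContentWithSentinel('<p>A \u00b7 B</p><script>var s = "x";</script><p>C</p>');
console.log(sentinelResult);
```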
LakioCommented:
With PHP it's so easy: http://php.net/manual/en/function.strip-tags.php

It feels silly that I don't see a good way of doing this with regular expressions.
cybertonicAuthor Commented:
Below is the function, revised with 2 improvements.

a) The use of character code 255 (\xff), a character that is very rarely used in HTML web pages.

b) I replaced script with [s]cript; otherwise, if the code is inserted within an ASP page, the ASP interpreter raises an error: "nested script..."

function GetTextContent(html)
{
      return html.replace(/\xff/gi,'.').replace(/<\/[s]cript>/gi,'\xff').replace(/<[s]cript[^\xff]*\xff/gi, '').replace(/<(.|\n)*?>/g, '');
};

P.S.:

Here the problem is not PHP or ASP. I needed the code in javascript simply because it's part of a larger piece of code that will be distributed to web publishers.
The javascript code needs to extract the page's text content and send it to our servers to categorize the web page on the fly.

LakioCommented:
You can still use PHP strip_tags with Ajax, but I understand.

If ASCII 255 never occurs in the HTML, you don't need .replace(/\xff/gi,'.')

http://computing-dictionary.thefreedictionary.com/hex%20chart
cybertonicAuthor Commented:
Ajax?

No, for 2 main reasons:

- Performance.
- Cross-domain requests are not allowed.
LakioCommented:
again : "but I understand"