cybertonic

asked on

JavaScript function that strips JavaScript code from HTML

Hi,

I need a JavaScript script that can extract, on the fly, the text content of the web page in which it is placed.

When the script reads document.body.innerHTML, I get the text with its tags:

html = document.body.innerHTML;

No problem, I found a way to strip all the tags:

content = html.replace(/<[^>]*>/g, "");

This works fine, except that the content of any scripts in the page is not removed, so I see their code in the returned content!
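
To make the problem concrete, here is a small sketch. The sample markup and the name stripTagsOnly are made up for illustration. It shows that the tag-stripping regex removes the script tags themselves but keeps the code between them:

```javascript
// Hypothetical sample page body, used only for illustration.
// Run this from a .js file: inline in a page, the literal closing
// script tag inside the string would end the script block early.
var sampleHtml = '<p>Hello</p><script type="text/javascript">var a = 1;</script><p>World</p>';

// Removes every tag, but not the text between an opening and a closing tag.
function stripTagsOnly(html) {
  return html.replace(/<[^>]*>/g, "");
}

var stripped = stripTagsOnly(sampleHtml);
console.log(stripped); // the script source "var a = 1;" is still part of the result
```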

So I guess that before stripping the tags I need to remove any JavaScript code embedded in the page.

I ran several experiments with regular expressions, but I could not find a way to strip JavaScript code from the HTML.

Do you know a way?

Thanks in advance.
SOLUTION
Lakio
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
cybertonic

ASKER

I am not sure I understand your question.

Imagine you have a web page, ANY web page.

And that you want to get the text content of the web page with JavaScript.

You can get it with:

html = document.body.innerHTML;

The problem is that if the page contains a script, then the content of html will include:
....<script language="javascript"> .... </script> ....

I want a regular expression that can remove the JavaScript code from the HTML.

content = html.replace(/<script[^<]*<[^>]*script>/gi, '')

works OK, but if there are < and > characters inside the script it will not work.
As you may imagine, I need something that ALWAYS works, with no exceptions.
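
One way to tolerate < and > inside the script body is to match the opening tag and then lazily match any characters up to the first closing tag. The sketch below is only an assumption about what could fit this need, not a guarantee for every page; stripScripts and the sample string are names made up here.

```javascript
// Hypothetical helper: removes <script ...>...</script> blocks, content included.
// [\s\S]*? lazily matches any characters (newlines too) up to the first closing tag,
// so "<" and ">" inside the script body do not stop the match.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

var sample = 'before<SCRIPT language="javascript">if (a < b && c > d) { run(); }</SCRIPT>after';

// New pattern: the whole block is removed.
var withLazyPattern = stripScripts(sample);
console.log(withLazyPattern);

// Pattern from the post above: it finds no match here, so nothing is removed.
var withOriginalPattern = sample.replace(/<script[^<]*<[^>]*script>/gi, "");
console.log(withOriginalPattern);
```

A known limit of this sketch: an attribute value in the opening tag that itself contains ">" would end the [^>]* part too early.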
SOLUTION
I do not know whether anything above works, because for each possibility you list I get:

Active Server Pages error 'ASP 0138'
Nested Script Block
A script block cannot be placed inside another script block.

The page that needs to execute this script is an ASP page.

So no progress.
SOLUTION
When you want to avoid an ASP page error caused by the keyword <script inside an ASP page, you write something like:

... + "<S" + "CRIPT" + ....

So I guess there is a way to rewrite the regexp above so that <script does not appear directly but is written as a concatenation of strings, no?

I really do not know anything about regexps, but I guess that is possible.
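
One possible way is to build the pattern from concatenated strings with the RegExp constructor, so that the literal text of the tag never appears in the source. This is a minimal sketch, assuming the ASP parser only reacts to the literal tag text; scriptBlockPattern and removeScriptBlocks are names invented for this example.

```javascript
// The tag name is split so that "<" + "script" is never written as one literal.
// Backslashes are doubled because the pattern is given as a string.
var scriptBlockPattern = new RegExp(
  "<" + "script\\b[^>]*>[\\s\\S]*?<" + "/script\\s*>",
  "gi"
);

// Removes every script block matched by the pattern above.
function removeScriptBlocks(html) {
  return html.replace(scriptBlockPattern, "");
}

// The test string is also built by concatenation, for the same reason.
var testHtml = "A<" + "script>x = 1 < 2;<" + "/script>B";
var cleaned = removeScriptBlocks(testHtml);
console.log(cleaned);
```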

...

Now, regarding your remark about "... modern pages ...": this does not affect in any way what I need this function for.
SOLUTION
This line:
i = ((html.toLowerCase()).indexOf('<script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i);
has been replaced by:
i = ((html.toLowerCase()).indexOf('<' + 'script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<' + 'script', i);

to avoid the nested script ASP error on this page.
....

The script above, when inserted into a web page, returns:

the text after the <body> tag and all the JavaScript starting at:

', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i); ....


I do not know why, but I absolutely wanted to have this done with a regexp.

In fact the last post simply opened my eyes, and I think I know how to proceed. I will post the solution here when it is fully tested.

...

I thank all the kind people here who tried to help me.

I also apologize, as I have been a bit sharp with some people here and with Experts Exchange.
I thought that when you paid the subscription fee to Experts Exchange, the experts were obliged to respond and to perform.
I now understand that it is simply a friendly community of developers who help each other.
So again, THANKS for the time you spent on my problem. I promise to post my solution very soon so that it can help you if you need something similar.
The idea is that you give "points" to the best answer, or split points among the best answers, to close the question.  These points mean nothing in terms of value, just "status" on Experts Exchange, however you want to interpret that.
Here you have it:

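function GetTextContent(html)
{
     var p1, p2, p;

     // "<" + "script" is split to avoid the ASP nested-script error
     p1 = html.indexOf("<" + "script");
     while(p1 >= 0)
     {
          p2 = html.indexOf("</" + "script>", p1+7);   // closing tag
          p = html.indexOf("<" + "script>", p1+7);     // next opening tag, if any
          if(p2 >= 0)
          {
               if(p < 0 || p > p2)
               {
                    // No nested script: remove the whole block
                    html = html.substring(0, p1) + html.substring(p2+9, html.length);
               }
               else
               {
                    // Nested script: remove the inner part first, the outer tag is handled on the next pass
                    html = html.substring(0, p) + html.substring(p2+9, html.length);
               }
          }
          else
          {
               // No closing tag: drop everything from the opening tag on
               html = html.substring(0, p1);
          }
          p1 = html.indexOf("<" + "script");
     }

     // Remove all remaining tags
     return html.replace(/<(.|\n)*?>/g, "");
}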


To get the body text content (with no JavaScript):

html = GetTextContent(document.body.innerHTML.toLowerCase());

It removes scripts placed one after another and also possible nested scripts.
Finally, it removes all the other tags.
This is the first time I have done this. I am not sure I clicked the right button to give points to everyone; sorry.
I will submit a request to split the points among all answers. Glad you solved it.
ASKER CERTIFIED SOLUTION
Note to moderators -- looking at others' contributions here, I suggest you split the points among them and leave me out.  Thanks.
Cybertonic --

As a caution about the complexity of tags: a lot of people now put /> at the end of a tag, sometimes before or after the script name.  For example, for the closing tag I have variously seen </Script\> as well as </\script>, which mean different things.  If your routine does not work right at some point, it might be because of one of these.  So plan for the backslash; you will see it surface a lot.  Good luck again, and bye!
Good find: with some DTDs it is possible to have \>

Now, I have never seen </ beginning a tag, and I am not sure it can be valid.

Here is the revised script, which takes care of this possible \> and also of the case where there are spaces before the >, as in </script       >


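function GetTextContent(html)
{
      var p1, p2, p3, p;

      // "<" + "script" is split to avoid the ASP nested-script error
      p1 = html.indexOf("<" + "script");
      while(p1 >= 0)
      {
            p2 = html.indexOf("</" + "script", p1+7);   // closing tag
            p = html.indexOf("<" + "script", p1+7);     // next opening tag, if any
            if(p2 >= 0)
            {
                  // Move p3 to the ">" that ends the closing tag,
                  // so that "</script   >" and "</script\>" are handled too
                  p3 = p2+8;
                  while(p3 < html.length && html.charAt(p3) != ">") p3++;

                  if(p < 0 || p > p2)
                  {
                        // No nested script: remove the whole block
                        html = html.substring(0, p1) + html.substring(p3+1, html.length);
                  }
                  else
                  {
                        // Nested script: remove the inner part first
                        html = html.substring(0, p) + html.substring(p3+1, html.length);
                  }
            }
            else
            {
                  // No closing tag: drop everything from the opening tag on
                  html = html.substring(0, p1);
            }
            p1 = html.indexOf("<" + "script");
      }

      // Remove all remaining tags
      return html.replace(/<(.|\n)*?>/g, "");
}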
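
One side effect of passing document.body.innerHTML.toLowerCase() to the function, as in the earlier call, is that the returned text is lowercased too. A possible variation, sketched below under the assumption that the original letter case should be kept, searches a lowercase copy for the tags but cuts the pieces out of the original string. stripScriptsKeepCase is a hypothetical name, and nested script tags are not treated specially in this sketch.

```javascript
// Sketch only. It assumes toLowerCase() does not change the string length,
// which holds for ASCII but not for a few characters such as U+0130.
function stripScriptsKeepCase(html) {
  var lower = html.toLowerCase();   // used only for searching
  var openTag = "<" + "script";     // split to avoid the ASP nested-script error
  var closeTag = "</" + "script";
  var result = "";
  var pos = 0;

  while (true) {
    var start = lower.indexOf(openTag, pos);
    if (start < 0) {
      result += html.substring(pos);        // no more scripts: keep the rest
      break;
    }
    result += html.substring(pos, start);   // keep the text before the script

    var close = lower.indexOf(closeTag, start + openTag.length);
    if (close < 0) {
      break;                                // unterminated script: drop the rest
    }
    // Skip to the ">" that ends the closing tag, allowing "</script   >".
    var end = lower.indexOf(">", close + closeTag.length);
    if (end < 0) {
      break;
    }
    pos = end + 1;
  }

  // Finally remove all remaining tags.
  return result.replace(/<[^>]*>/g, "");
}

var keepCaseResult = stripScriptsKeepCase(
  '<P>Hello <B>World</B></P><SCRIPT>var x = "<b>";</SCRIPT   ><p>Bye</p>'
);
console.log(keepCaseResult);
```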
You should become an 'expert' here.  You get to ask questions for free if you answer enough questions to earn 2000 points per month.  With your ability, that probably means only one question, with full points, per month.  This is easy for you, no?  I think you would be an excellent addition to the site, just my opinion.
rrz
>You should become 'expert' here.    
I agree.
You should understand that some of us self-proclaimed "experts" only give help in the hope that we will receive help when we need it.  rrz
Thanks for the invitation, but I do not think I have more development skills than the people here, and I certainly have no pretension of being an expert.
Except when I am about to launch a new service, I generally no longer develop, mainly because I have no time.
I suppose that, as for most of you, the annual membership saving cannot be a motivation, since the amount is negligible.
What I can do is thank the community for the responses to my question by helping another member who has a problem for which I have a solution, and make this a rule.
Thanks again for the kind words.
I have 2 ideas:

1. <a href="#" onclick="while(document.getElementsByTagName('script').length) {document.getElementsByTagName('script')[0].removeNode(true);}
alert(document.getElementsByTagName('html')[0].innerHTML);return false">Delete all scripts</a>

2. <TEXTAREA onclick="this.value=document.body.innerHTML.replace(/<\/script>/gi,'¨').replace(/<script[^¨]*¨/gi, 'noscript!')" rows=24 cols=80></TEXTAREA>
Greetings Lakio,

Point 1 is of no interest for my needs: I cannot have the JavaScript deleted from the page.

Point 2 simply works like a charm, and it is exactly what I was looking for!

I tested it with complicated samples, always with success.

So the function that strips JavaScript and tags is now:

function GetTextContent(html)
{
      html = html.replace(/<\/script>/gi,'¨').replace(/<script[^¨]*¨/gi, '');
      return html.replace(/<(.|\n)*?>/g, "");
}

Example of call:
alert(GetTextContent(document.body.innerHTML));

The advantage of regular expressions is that they are compiled, while the JavaScript is interpreted, so the process should be fast, and that is what I need.

<removed by GranMod>

Thanks, and again my congratulations on the good job.

Yes cybertonic, it works, unless "¨" appears inside the script. The next code handles "more" cases :P

I am using "·" now. I have never seen it before, so I guess it is not often found inside HTML. This code first looks for it, and if it finds a "·" it changes it to ".".

<body onload="document.write(GetTextContent(document.body.innerHTML));">
<script type="text/javascript">
function GetTextContent(html){
return html.replace(/·/gi,'.').replace(/<\/script>/gi,'·').replace(/<script[^·]*·/gi, '').replace(/<(.|\n)*?>/g, '');
}
</script>

A better idea is to put some rare characters in an array and have the code look for them; when one is not found in the page, use that character in the regular expression.
-But I bet there is a better/easier way.
(Excuse: I have been very busy, and regular expressions are still new to me :P)
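
The idea of trying several rare characters could be sketched roughly as follows. The candidate list and the names pickSentinel and getTextContentWithSentinel are assumptions made for this example, not something taken from the posts above.

```javascript
// Returns the first candidate character that does not occur in html, or null.
function pickSentinel(html, candidates) {
  for (var i = 0; i < candidates.length; i++) {
    if (html.indexOf(candidates[i]) === -1) {
      return candidates[i];
    }
  }
  return null;
}

function getTextContentWithSentinel(html) {
  // Candidate characters chosen arbitrarily for illustration.
  var mark = pickSentinel(html, ["\u00B7", "\u00A8", "\u00FF", "\u0001"]);
  if (mark === null) {
    throw new Error("no unused sentinel character found");
  }
  // Patterns are built from strings so the tag name stays split in the source.
  var closing = new RegExp("<" + "/script\\s*>", "gi");
  var block = new RegExp("<" + "script[^" + mark + "]*" + mark, "gi");
  return html
    .replace(closing, mark)      // mark the end of every script block
    .replace(block, "")          // drop from the opening tag up to the mark
    .replace(/<[^>]*>/g, "");    // strip the remaining tags
}

// The page text already contains "\u00B7", so the next candidate is used instead.
var sentinelResult = getTextContentWithSentinel(
  "<p>caf\u00B7</p><" + "script>if (a < b) go();<" + "/script><i>ok</i>"
);
console.log(sentinelResult);
```

Because the marker is checked against the page first, no character of the page text has to be rewritten before the scripts are removed.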
With PHP it is so easy: http://php.net/manual/en/function.strip-tags.php

It feels silly that I do not see a good way of doing this with a regular expression.
Below is the function, revised with 2 improvements.

a) It uses the character with code 255 (\xff, "ÿ"), which is very rarely used in HTML pages.

b) I replaced script with [s]cript; otherwise, if the code is inserted in an ASP page, the ASP interpreter raises an error: "nested script..."

function GetTextContent(html)
{
      return html.replace(/\xff/gi,'.').replace(/<\/[s]cript>/gi,'\xff').replace(/<[s]cript[^\xff]*\xff/gi, '').replace(/<(.|\n)*?>/g, '');
};

P.S:

Here the problem is not PHP or ASP. I needed the code in JavaScript simply because it is part of a larger script that will be distributed to web publishers.
The JavaScript code needs to extract the page's text content and send it to our servers to categorize the web page on the fly.

You can still use PHP strip_tags through Ajax, but I understand.

If character 255 never appears in the HTML, you do not need .replace(/\xff/gi,'.')

http://computing-dictionary.thefreedictionary.com/hex%20chart
Ajax?

No, for 2 main reasons:

- Performance.
- Cross-domain requests are not allowed.
Again: "but I understand"