
JavaScript function that strips JavaScript code from HTML

Hi,

I need a JavaScript snippet that extracts, on the fly, the text content of the web page in which it is embedded.

When the script reads document.body.innerHTML, I get the text together with the tags:

html = document.body.innerHTML;

No problem, I found a way to strip all tags:

content = html.replace(/<[^>]*>/g, "");

This works fine, except that the contents of any scripts in the page are not removed, so I see their code in the returned content!
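
To make the problem concrete, here is a minimal sketch showing that removing tags alone leaves the script source behind. The sample markup and the name stripTags are made up for illustration.

```javascript
// Hypothetical sample page body: one paragraph and one inline script.
var sampleHtml = '<p>Hello</p><script type="text/javascript">var x = 1;</script>';

// The tag-stripping approach from the question: delete anything shaped like <...>.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// The tags are gone, but the script body survives as plain text.
console.log(stripTags(sampleHtml)); // Hellovar x = 1;
```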

So I guess that before stripping all tags I need to remove any JavaScript code embedded in the page.

I ran several experiments with regular expressions, but I did not find a way to strip JavaScript code from HTML.

Do you know a way?

Thanks in advance.
Asked by: cybertonic
 
Lakio commented:
Can you show me the returned content?

cybertonic (Author) commented:
I am not sure I understand your question.

Imagine you have a web page (ANY page).

And that you want to get the text content of that page with JavaScript.

You can get it with:

html = document.body.innerHTML;

The problem is that if the page contains a script, then the content of html will include:
....<script language="javascript"> .... </script> ....

I want a regular expression that removes the JavaScript code from the HTML.

Lakio commented:
content = html.replace(/<script[^<]*<[^>]*script>/gi, '')

This works, but it will fail if there are < and > characters inside the script.
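
The limitation can be reproduced with a short sketch. The sample string is made up, and the lazy pattern is only an assumed alternative, not something proposed in the thread; it is a sketch rather than a full HTML parser.

```javascript
// Hypothetical sample: the script body contains both "<" and ">".
var sample = '<p>Hi</p><script>if (a < b && b > c) run();</script><p>Bye</p>';

// The pattern proposed above.
var originalPattern = /<script[^<]*<[^>]*script>/gi;

// Assumed alternative: match lazily up to the first closing tag,
// using [\s\S] so that newlines inside the script are covered too.
var lazyPattern = /<script\b[\s\S]*?<\/script\s*>/gi;

var afterOriginal = sample.replace(originalPattern, "");
var afterLazy = sample.replace(lazyPattern, "");

console.log(afterOriginal === sample); // true: nothing was removed
console.log(afterLazy);                // <p>Hi</p><p>Bye</p>
```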
 
cybertonic (Author) commented:
As you may imagine, I need something that works ALWAYS, with no exceptions.

mvan01 commented:
Based on this:
content = html.replace(/<script[^<]*<[^>]*script>/gi, '')

could this work?
content = html.replace(/<script*//script>/gi, '')

or this?
content = html.replace(/<script*\/script>/gi, '')

If not, how does one specify to look for a '/' within a RegExp?  Must be a way, no?
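
On the question of matching a '/': inside a regex literal the slash is escaped as \/, while the RegExp constructor takes an ordinary string and needs no escaping for it. The sketch below also shows that in the second attempt above the * applies only to the preceding "t", so that pattern does not match a real script block. The variable names are made up for illustration.

```javascript
// A forward slash inside a regex literal is written as \/ .
var closingTag = /<\/script>/i;

// The RegExp constructor takes a string, so the slash needs no escaping there.
var closingTagFromString = new RegExp("</script>", "i");

// In /<script*\/script>/ the * means "zero or more t characters",
// so the pattern matches "<script/script>" but not a real script block.
var literalStar = /<script*\/script>/i;

console.log(closingTag.test("</SCRIPT>"));                // true
console.log(closingTagFromString.test("</script>"));      // true
console.log(literalStar.test("<script>x = 1;</script>")); // false
console.log(literalStar.test("<script/script>"));         // true
```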

Peace and joy.  mvan

cybertonic (Author) commented:
I do not know whether any of the above works, because for each variant you list I get:

Active Server Pages error 'ASP 0138'
Nested Script Block
A script block cannot be placed inside another script block.

The page that needs to execute this script is an ASP page.

So no progress yet.

scrathcyboy commented:
You simply cannot do this from ASP. It stops you from editing script tags, because Microsoft has this bad idea that they "own" JavaScript, and they don't. If you don't use ASP, you can do it just by deleting everything between the <script> ..... and ....... </script> tags. However, be aware that almost all modern pages put most of their functionality in script tags, including links to other pages. If you wipe all this out, you can easily wipe out the structure of the site, as well as some of its content.

There is no simple "spider" to extract ONLY the text of web pages; they are MUCH too complex for this.

cybertonic (Author) commented:
When you want to avoid an ASP error caused by the keyword <script appearing within an ASP page, you write something like:

... + "<S" + "CRIPT" + ....

So I guess there is a way to rewrite the regexp above so that <script does not appear literally but is built by concatenating strings, no?

I really do not know much about regexps, but I guess that's possible.
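
One way this could look is to build the pattern with the RegExp constructor from concatenated strings. This is only a sketch: it assumes, as described above, that splitting the keyword is enough to keep ASP from seeing a script tag, and the variable names and sample input are made up.

```javascript
// Build the pattern from pieces so that the page source never contains
// the literal text of an opening script tag. The pattern is the one
// suggested earlier in the thread; only the way it is written changes.
var scriptBlock = new RegExp("<" + "script[^<]*<[^>]*" + "script>", "gi");

// Hypothetical sample input, also written without the literal keyword.
var sampleHtml = "<p>a</p><" + "script>var x = 1;</" + "script>";

console.log(sampleHtml.replace(scriptBlock, "")); // <p>a</p>
```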

...

As for your remark about "... modern pages ...", it has no bearing on what I need this function for.

bubbledragon commented:
A regexp does not easily handle multi-line input in JavaScript (the dot does not match newlines), so here is a version that uses indexOf instead:

html = document.body.innerHTML;

var OScriptTags = new Array;
for (i=0;i<html.length;i++) {
      i = ((html.toLowerCase()).indexOf('<script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i);
      OScriptTags[OScriptTags.length] = i;
}

var CScriptTags = new Array;
for (i=0;i<html.length;i++) {
      i = ((html.toLowerCase()).indexOf('/script>', i) == -1)?html.length:(html.toLowerCase()).indexOf('/script>', i)+8;
      CScriptTags[CScriptTags.length] = i;
}

for (i=OScriptTags.length-2;i>=0;i--) {
      html = html.substring(0, OScriptTags[i])+html.substring(CScriptTags[i], html.length);
}
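
A note on the multi-line concern: in JavaScript regular expressions the dot does not match line terminators, but a character class such as [\s\S] (or a negated class like [^<]) does. A minimal sketch with a made-up sample string:

```javascript
// Hypothetical sample: a script block that spans several lines.
var multiLine = "<script>\nvar x = 1;\nvar y = 2;\n</script>after";

// The dot stops at each newline, so this pattern cannot span the block.
var dotPattern = /<script>.*<\/script>/i;

// [\s\S] matches any character, including newlines.
var anyCharPattern = /<script>[\s\S]*?<\/script>/i;

console.log(dotPattern.test(multiLine));            // false
console.log(multiLine.replace(anyCharPattern, "")); // after
```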

cybertonic (Author) commented:
I replaced this line:
 i = ((html.toLowerCase()).indexOf('<script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i);
with:
i = ((html.toLowerCase()).indexOf('<' + 'script', i) == -1)?html.length:(html.toLowerCase()).indexOf('<' +  'script', i);

to avoid the nested-script ASP error on this page.
....

When inserted into a web page, the script above returns:

the text after the <body> tag, plus all the JavaScript starting at:

', i) == -1)?html.length:(html.toLowerCase()).indexOf('<script', i); ....

cybertonic (Author) commented:

I do not know why, but I absolutely wanted to do this with a regexp.

In fact, the last post simply opened my eyes, and I think I know how to proceed. I will post the solution here once it is fully tested.

...

I thank all the kind people here who tried to help me.

I also apologize, as I have been a bit sharp with some people here and toward Experts Exchange.
I thought that when you paid the subscription fee to Experts Exchange, the experts were obliged to respond and to perform.
I now understand that it is simply a friendly community of developers who help each other.
So again, THANKS for the time you spent on my problem. I promise to post my solution very soon so it can help you if you need something similar.

scrathcyboy commented:
The idea is that you give "points" to the best answer, or split points among the best answers, to close the question. These points have no real value; they are just "status" on Experts Exchange, however you want to interpret that.

cybertonic (Author) commented:
Here it is:

function GetTextContent(html)
{
     p1 = html.indexOf("<" + "script");
     while(p1 > 0)
     {
          p2 = html.indexOf("</" + "script>", p1+7);
          p = html.indexOf("<" + "script>", p1+7);
          if(p2 > 0)
          {
               if(p < 0 || p > p2)
               {
                    html = html.substring(0, p1) + html.substring(p2+9, html.length);
               }
               else
               {
                    html = html.substring(0, p) + html.substring(p2+9, html.length);                    
               }
          }
          else
          {
               html = html.substring(0, p1);
          }
          p1 = html.indexOf("<" + "script");
     }

     return html.replace(/<(.|\n)*?>/g, "");
}


To get the body text content (with no JavaScript):

html = GetTextContent(document.body.innerHTML.toLowerCase());

It removes scripts placed one after another as well as nested scripts.
Finally, it removes all the other tags.

cybertonic (Author) commented:
This is my first time doing this. I am not sure I clicked the right button to give points to everyone; sorry.

scrathcyboy commented:
I will submit a request for you to split the points among all answers. Glad you solved it.

scrathcyboy commented:
Cybertonic -- here is what I told the moderators; I hope this is OK with you. I see your own solution is smarter than anything I would have done, so I cannot accept points on this. Perhaps you should get them back? Congrats on some good coding.

"The asker of this question inadvertently gave me all the points.  Please undo this for him.  Since he gave such an elegant solution to his own question, you should return the points to him.  Also, since he is not getting the level of help he expected from Expert Exchange, you might open a dialog with him to refund all or part of his payments.  This is between you and him, I just want to correct this question, thanks.

http://www.experts-exchange.com/Web/Web_Languages/JavaScript/Q_21792150.html"

scrathcyboy commented:
Note to moderators -- looking at others' contributions here, I suggest you split the points among them and leave me out. Thanks.

scrathcyboy commented:
Cybertonic --

As a caution about the complexity of tags: a lot of people are now putting /> to end a tag, sometimes before or after the script name. For example, for the ending I have variously seen </Script\> as well as </\script>, which mean different things. If your routine does not work right, it might be because of one of these. So plan for the backslash; you will see it surface a lot. Good luck again, and bye!

cybertonic (Author) commented:
Good find: with some DTDs it is possible to have \>

That said, I have never seen </ beginning a tag like that, and I am not sure it can be valid.

Here is the revised script, which handles this possible \> and also the case where there is whitespace before the >, as in </script       >


function GetTextContent(html)
{
      p1 = html.indexOf("<" + "script");
      while(p1 >= 0)
      {
            p2 = html.indexOf("</" + "script", p1+7);
            p = html.indexOf("<" + "script", p1+7);
            if(p2 >= 0)
            {
                  p3 = p2+8;
                  
                  while(p3 < html.length && html.charAt(p3) != ">") p3++;
                  
                  if(p < 0 || p > p2)
                  {
                        html = html.substring(0, p1) + html.substring(p3+1, html.length);
                  }
                  else
                  {
                        html = html.substring(0, p) + html.substring(p3+1, html.length);                        
                  }
            }
            else
            {
                  html = html.substring(0, p1);
            }
            p1 = html.indexOf("<" + "script");
      }

      return html.replace(/<(.|\n)*?>/g, "");
}

scrathcyboy commented:
You should become an 'expert' here. You get to ask questions for free if you answer enough questions to earn 2000 points per month. With your ability, that probably means only one full-points question per month. This is easy for you, no? I think you would be an excellent addition to the site; just my opinion.

rrz commented:
>You should become 'expert' here.
I agree.
You should understand that some of us self-proclaimed "experts" only give help in the hope that we will receive help when we need it.  rrz

cybertonic (Author) commented:
Thanks for the invitation, but I do not think I have more development skills than the people here, and I certainly have no pretension to be an expert.
Except when I am about to launch a new service, I generally no longer develop, mainly because I have no time.
I suppose that, as for most of you, the saving on the annual membership cannot be a motivation, since the amount is negligible.
What I can do is thank the community for the response to my question by helping another member who has a problem for which I have a solution, and make this a rule.
Thanks again for the kind words.

Lakio commented:
I have two ideas:

1. <a href="#" onclick="while(document.getElementsByTagName('script').length) {document.getElementsByTagName('script')[0].removeNode(true);}
alert(document.getElementsByTagName('html')[0].innerHTML);return false">Delete all scripts</a>

2. <TEXTAREA onclick="this.value=document.body.innerHTML.replace(/<\/script>/gi,'¨').replace(/<script[^¨]*¨/gi, 'noscript!')" rows=24 cols=80></TEXTAREA>

cybertonic (Author) commented:
Greetings Lakio,

Point 1 does not suit my needs: I cannot have the JavaScript deleted from the page.

Point 2 simply works like a charm, and it is exactly what I was looking for!

I tested it with complicated samples, always with success.

So the function that strips JavaScript and tags is now:

function GetTextContent(html)
{
      html = html.replace(/<\/script>/gi,'¨').replace(/<script[^¨]*¨/gi, '');
      return html.replace(/<(.|\n)*?>/g, "");
}

Example of call:
alert(GetTextContent(document.body.innerHTML));

The advantage of regular expressions is that they are compiled while the JavaScript is interpreted, so the process should be fast, and that is what I need.

<removed by GranMod>

Thanks, and again my congratulations on the good job.

Lakio commented:
Yes, cybertonic, it works, unless "¨" appears inside the script. This next code does "more" :P

I am using "·" now. I have never seen it before, so I guess it rarely appears in HTML, but this code first looks for it and changes any "·" it finds to ".".

<body onload="document.write(GetTextContent(document.body.innerHTML));">
<script type="text/javascript">
function GetTextContent(html){
return html.replace(/·/gi,'.').replace(/<\/script>/gi,'·').replace(/<script[^·]*·/gi, '').replace(/<(.|\n)*?>/g, '');
}
</script>

A better idea is to put some rare characters in an array and search the page for each of them; the first one that is not found can then be used in the regular expression.
- But I bet there's a better/easier way.
(Excuse: I have been very busy, and regular expressions are still new to me :P)
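
The rare-character idea could be sketched roughly as follows. The helper names pickSentinel and getTextContentWithSentinel, and the list of candidate characters, are hypothetical; the three replace steps follow the approach already shown in the thread.

```javascript
// Return the first candidate character that does not occur in the text,
// or null if every candidate is already in use.
function pickSentinel(text, candidates) {
  for (var i = 0; i < candidates.length; i++) {
    if (text.indexOf(candidates[i]) === -1) {
      return candidates[i];
    }
  }
  return null;
}

function getTextContentWithSentinel(html) {
  // Hypothetical candidate list; none of these is special inside [...].
  var sentinel = pickSentinel(html, ["\u00b7", "\u00a8", "\u00ff", "\u0001"]);
  if (sentinel === null) {
    throw new Error("No unused sentinel character available");
  }
  // Matches from an opening script tag up to the first sentinel.
  var blockPattern = new RegExp("<script[^" + sentinel + "]*" + sentinel, "gi");
  return html
    .replace(/<\/script>/gi, sentinel) // mark the end of each script block
    .replace(blockPattern, "")         // drop each block up to its marker
    .replace(/<(.|\n)*?>/g, "");       // strip the remaining tags
}
```

Because the sentinel is chosen from characters absent from the page, no existing character has to be rewritten before the scripts are removed.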

Lakio commented:
With PHP it's so easy: http://php.net/manual/en/function.strip-tags.php

It feels silly that I don't see a good way of doing this with a regular expression.

cybertonic (Author) commented:
Below is the function, revised with 2 improvements.

a) It uses the character ASCII 255 (\xff), which is not normally used in HTML pages.

b) I replaced script with [s]cript; otherwise, if the code is inserted in an ASP page, the ASP interpreter raises an error: "nested script..."

function GetTextContent(html)
{
      return html.replace(/\xff/gi,'.').replace(/<\/[s]cript>/gi,'\xff').replace(/<[s]cript[^\xff]*\xff/gi, '').replace(/<(.|\n)*?>/g, '');
};
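
A quick way to exercise the function above outside a browser is to pass it a string directly. The sample markup below is made up, and GetTextContentLoose is a hypothetical variant, not part of the posted solution. The sketch also shows one limitation of this version: a closing tag written with whitespace, such as </script >, is not recognized as the end of the block, so the script text leaks through.

```javascript
// The function as posted above.
function GetTextContent(html)
{
      return html.replace(/\xff/gi,'.').replace(/<\/[s]cript>/gi,'\xff').replace(/<[s]cript[^\xff]*\xff/gi, '').replace(/<(.|\n)*?>/g, '');
}

// Hypothetical variant: \s* also accepts whitespace before the closing ">".
function GetTextContentLoose(html)
{
      return html.replace(/\xff/gi,'.').replace(/<\/[s]cript\s*>/gi,'\xff').replace(/<[s]cript[^\xff]*\xff/gi, '').replace(/<(.|\n)*?>/g, '');
}

var page = '<h1>Title</h1>\n<script type="text/javascript">\nif (a < b && b > c) { run(); }\n</script>\n<p>Body text</p>';
var spaced = '<p>x</p><script>var y = 2;</script ><p>z</p>';

console.log(JSON.stringify(GetTextContent(page)));  // "Title\n\nBody text"
console.log(GetTextContent(spaced));                // xvar y = 2;z
console.log(GetTextContentLoose(spaced));           // xz
```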

P.S.:

The problem here is not PHP or ASP. I needed the code in JavaScript simply because it is part of a larger script that will be distributed to web publishers.
The JavaScript code needs to extract the text content of the page and send it to our servers, which categorize the web page on the fly.

Lakio commented:
You can still use PHP strip_tags via Ajax, but I understand.

If ASCII 255 never appears in the HTML, you don't need .replace(/\xff/gi,'.')

http://computing-dictionary.thefreedictionary.com/hex%20chart

cybertonic (Author) commented:
Ajax?

Not an option, for 2 main reasons:

- Performance.
- Cross-domain requests are not allowed.

Lakio commented:
Again: "but I understand".