Solved

Methods of Crawling JavaScript Links

Posted on 2004-09-23
Last Modified: 2013-12-16
I am writing a PHP program that fetches the contents of web pages, crawls all the links in them, and stores the links in a MySQL database.

Crawling normal HTML links (i.e. <a href="">) has worked well. JavaScript links are more problematic, however, because they come in many variations. For example, the link values might be embedded in a <select> tag and then passed as arguments to a JavaScript function that eventually generates the link.

Is there any method to crawl JavaScript links using PHP or any other program or software?

Thanks!
Question by:lcyandy
8 Comments
 
LVL 9

Expert Comment

by:riyasjef
ID: 12140837
Hi
try this

<html>
<head>
<script>
function getLinks()
{
      var links=document.anchors;
      strLinks="";
      for(i=0;i<links.length;i++)
      {
            if(strLinks=="")
                  strLinks=links[i].href;
            else      
                  strLinks+=","+links[i].href;
      }      
      
      document.forms[0].hdnLinks.value=strLinks;
}
</script>
</head>
<body>
<form method=post onsubmit="return getLinks()">
<a href="link1">
<a href="link2">
<a href="link3">
<input type="hidden" name=hdnLinks>
<input type="submit" value="submit">

</form>

 

Author Comment

by:lcyandy
ID: 12144580
I've tried your code, but I don't quite understand what it does.
Can you explain it to me?
 
LVL 9

Accepted Solution

by:
riyasjef earned 63 total points
ID: 12146790
Sorry, there is a change in the code:

<html>
<head>
<script>
// Collect the href of every link on the page into the hidden form field.
function getLinks()
{
     // document.links holds every <a> and <area> element that has an href;
     // document.anchors only holds <a> elements with a name attribute.
     var links=document.links;
     alert(links.length);
     var strLinks="";
     for(var i=0;i<links.length;i++)
     {
          if(strLinks=="")
               strLinks=links[i].href;
          else
               strLinks+=","+links[i].href;
     }

     document.forms[0].hdnLinks.value=strLinks;
     alert(document.forms[0].hdnLinks.value);
     return true; // let the form submit with the collected links
}
</script>
</head>
<body>
<form method="post" onsubmit="return getLinks()">
<a id="id1" href="link1">link1</a>
<a id="id2" href="link2">link2</a>
<a id="id3" href="link3">link3</a>
<input type="hidden" name="hdnLinks">
<input type="submit" value="submit">

</form>
</body>
</html>

The getLinks() function collects all the links in the document and puts them in a hidden field. You can read that hidden field from PHP to get the links, separated by commas.

Riyasjef



 
LVL 36

Expert Comment

by:Zyloch
ID: 12149194
Riyasjef's code is great for finding HTML links, but if you want JavaScript links, there's no clear way to do it. Perhaps you can look for anything that begins with http:// and assume it's a link? You can find them all with PHP's preg_match_all.
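To illustrate the "anything that begins with http://" idea, here is a minimal sketch in JavaScript, the language of the snippets above. The function name findAbsoluteUrls, the sample markup and the example.com addresses are made up for the illustration; the character class is only a rough guess at where a URL ends, and a similar pattern should be usable with preg_match_all on the PHP side.

```javascript
// Hypothetical helper: collect absolute http(s) URLs from raw page source.
function findAbsoluteUrls(source) {
  // Match http:// or https:// up to the first whitespace, quote,
  // angle bracket or parenthesis (a rough approximation of a URL's end).
  const pattern = /https?:\/\/[^\s"'<>()]+/g;
  const matches = source.match(pattern) || [];
  // Drop duplicates while keeping first-seen order.
  return [...new Set(matches)];
}

// Made-up sample: link values hidden in a <select> and in a script variable.
const sample = `
<select onchange="go(this.value)">
  <option value="http://example.com/a.html">A</option>
  <option value="http://example.com/b.html">B</option>
</select>
<script>var home = 'http://example.com/a.html';</script>
`;

console.log(findAbsoluteUrls(sample));
```

This only catches URLs that appear in full in the source. Relative paths and links assembled by string concatenation at run time will be missed.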
 

Author Comment

by:lcyandy
ID: 12149687
First, thanks Riyasjef for the dedicated help.

Really?! There's no reliable method to crawl JavaScript links?
Does anyone know how Google does it?
 
LVL 36

Assisted Solution

by:Zyloch
Zyloch earned 62 total points
ID: 12151034
Google has said they can follow simplified JavaScript links. I assume they mean following something like:

window.location.href="somewherenew.html" and window.open("somewherenew"), among other common ways of navigating.

However, it can only handle simple JavaScript links, as there are just too many variations. You could of course also find each http:// in the document, assume it's a link since most of the time it is, use PHP's @fopen to test whether it exists, and if it does, add it to the link list.
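As a rough sketch of what following "simplified" links could look like, the snippet below scans script source for the two patterns mentioned above and resolves each target against the page's address. The function name findScriptNavigations, the sample script and the example.com base URL are assumptions made for the illustration, not anything Google has documented.

```javascript
// Hypothetical helper: pull the targets of simple JavaScript navigations
// out of script source and resolve them against the page URL.
function findScriptNavigations(source, baseUrl) {
  const patterns = [
    // window.location.href = "target" (also location = "target")
    /(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']/g,
    // window.open("target", ...)
    /window\.open\(\s*["']([^"']+)["']/g,
  ];
  const found = [];
  for (const pattern of patterns) {
    for (const match of source.matchAll(pattern)) {
      // new URL(relative, base) turns "somewherenew.html" into an absolute URL.
      found.push(new URL(match[1], baseUrl).href);
    }
  }
  return found;
}

// Made-up sample script using the two forms discussed above.
const script = `
  function go() { window.location.href = "somewherenew.html"; }
  function pop() { window.open('somewherenew'); }
`;

console.log(findScriptNavigations(script, "http://example.com/dir/page.html"));
```

Only string literals are recognised here. A target built from variables or concatenation would need the script to actually be executed, which is why this approach covers only the simple cases.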