Solved

Web Crawler dealing with Javascript

Posted on 2006-05-26
Last Modified: 2010-07-29
My issue comes in 2 parts. The most immediate need I have is a targeted crawl of one particular site (as part of a demo application). So, I don't envision it to be too difficult. My second, less pressing concern, is an extensible solution that can be applied to most websites.

First, let me say that I've looked through a lot of web crawlers, and many of them handle dynamic content poorly or not at all. I realize it's a tough problem, so you don't need to solve this harder part of the issue to get full points. Just point me in the right direction (though if you have a solution, I'd be grateful).

I am trying to gather data from the following site:

http://icasualties.org/oif/IED.aspx

Notice the bar graph in the center of the page. When you click on one of the bars in the graph, a javascript postback is called on the page and a table of names, dates, etc. is appended to the end of the page and looks like this:

27-Oct-2003 Sergeant Aubrey D. Bell Baghdad Hostile - hostile fire - IED attack
28-Oct-2003 Specialist Isaac Campoy Balad (near) - Salah ad Din Hostile - hostile fire - IED attack
06-Oct-2003 Specialist Spencer Timothy Karol Al Haswah - Babil Hostile - hostile fire - IED attack

I need a script that will automatically go through each javascript call and parse out the associated data tables. This data should then be saved into one big html file. I am thinking that Java is best suited for this task though I don't really care what language is used. I am not too familiar with Java or web crawling and my time is limited. I have a lot of other stuff on my plate. I am hoping someone familiar with the libraries and functions needed for this type of thing can help me out.

I will award full points to anyone that can provide me a script targeted at the above site and general advice about a more generic approach. I need the targeted crawler by the end of next week for a demo. Thank you.
Question by:kh7x
12 Comments
 
LVL 35

Expert Comment

by:mrichmon
ID: 16773320
First question.  Do you own that site?  If not do you have permission to grab that information?  If not you are asking us to help you with something illegal.
 

Author Comment

by:kh7x
ID: 16773329
I do not own the site but I have full permission to do with it as I will. I work for a government contractor. The site is for demo purposes.
 

Author Comment

by:kh7x
ID: 16773345
Or more specifically, we are allowed to use the information on the site for our demo purposes and none of it is classified.
 
LVL 10

Expert Comment

by:ClickCentric
ID: 16774385
Being a government contractor alone doesn't give you permission to scrape the site.  And it doesn't seem like a government owned, operated or approved site to begin with.  And, if you had permission from the site owner, you'd know how the output was generated and how to get to it.  
 

Author Comment

by:kh7x
ID: 16775035
I didn't realize this was going to be that much of a problem. And no, it is not a government operated site. I doubt I would receive permission to scrape a fed site.

I know how the output was generated. If I knew "how to get to it", I wouldn't be here. I'm involved mostly with algorithm design in machine learning. This isn't my strong point.

If it will allay your concerns, all I really need an answer to, is how to get data from dynamically generated content on the same page. The Javascript calls a form submission. A new page isn't loaded. The data is appended to the same page. I can't get data from a new URL. That's my problem. How do I deal with forms? Don't even answer in context of that particular page if you're so worried about it.
 

Author Comment

by:kh7x
ID: 16775042
Basically, is there a way to get dynamically generated data off the same URL. I'm sure there is because I'm passing some parameters (2 of them in fact). I just don't know how to use that knowledge. It's probably something incredibly simple (like adding it to the end of the URL?).
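
For readers stuck on the same point: an ASP.NET JavaScript postback is normally an ordinary form POST back to the same URL. The page's `__doPostBack` helper fills two hidden fields, `__EVENTTARGET` and `__EVENTARGUMENT`, and submits the form together with the other hidden fields such as `__VIEWSTATE`. The sketch below only builds that POST body from a page's HTML. The control name `Chart1`, the argument value, and the sample HTML are hypothetical and would have to be read from the real page source.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(page_html, event_target, event_argument):
    """Return a form-encoded body that mimics a __doPostBack submission."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    # These two fields are what __doPostBack sets before submitting the form.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


if __name__ == "__main__":
    # Hypothetical page fragment; a real page carries a much longer __VIEWSTATE.
    sample = (
        '<form method="post">'
        '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
        '<input type="text" name="q" value="ignored" />'
        "</form>"
    )
    print(build_postback_body(sample, "Chart1", "2-2004"))
```

The resulting string could then be sent with `urllib.request.urlopen(url, data=body.encode("ascii"))`, which issues a POST. Whether a given site accepts such a request depends on its own validation settings, so treat this as a starting point rather than a guaranteed recipe.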
 

Author Comment

by:kh7x
ID: 16775120
Never mind. I solved the issue. It was as ridiculously simple as I thought. I guess I was too stressed to think it through clearly. Consider the question closed. I don't suppose there's a way to delete it? Sorry if I phrased it badly. Frankly, I could have painstakingly clicked on each link and gathered the data manually, but something about that seemed repulsive to me.
 
LVL 44

Expert Comment

by:scrathcyboy
ID: 16778747
"I've looked through a lot of web crawlers and a lot of them handle dynamic content poorly or not at all"

Web crawlers are SPECIFICALLY DESIGNED to AVOID dynamic content.  They are NOT interested in it, just simply NOT INTERESTED.  When you grasp that fact, you are ready to optimize for crawlers.

Moreover, the job of a crawler is to crawl CONTENT -- specifically, words that can be indexed.  It sees a web page as a set of text words stripped of essentially all other stuff except OBVIOUS LINKS -- and it loves links that send it here and there, and BACK again -- never forget the BACK again.  That is crucial to indexing.  Take your website, and strip everything out of it but the text and links NOT embedded in javascript.  Strip out all flash, all pictures, all DHTML and all other fancy stuff you want the USER to experience -- now you are seeing the page the way a crawler sees it.  Comprend?
 
LVL 33

Expert Comment

by:shalomc
ID: 16782170
kh7x,
please post back at least the outline of the method you used to solve the problem.

 
LVL 44

Expert Comment

by:scrathcyboy
ID: 16782325
It is acceptable to split points among the people who got closest to the answer.  See the split Points link right where you  reply...  Thanks
 

Author Comment

by:kh7x
ID: 16792861
Well, the answer is simply a workaround for that particular page. The data exists in 2 places. I simply parsed out what I needed from pages that actually took a parameter: http://icasualties.org/oif/prdDetails.aspx?hndRef=2-2004
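
A minimal sketch of that workaround, for anyone who wants to reproduce it: build the parameterized URL, pull the table rows out of the returned HTML, and merge everything into one HTML table. Only the URL pattern and the value `2-2004` come from this thread. The helper names, the assumption that the data sits in plain `<tr>`/`<td>` markup without nested tables, and the column split in the sample row are illustrative guesses.

```python
from html import escape
from html.parser import HTMLParser

BASE_URL = "http://icasualties.org/oif/prdDetails.aspx"


def detail_url(hnd_ref):
    """Build the parameterized URL, e.g. hnd_ref='2-2004'."""
    return f"{BASE_URL}?hndRef={hnd_ref}"


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(page_html):
    parser = TableRowParser()
    parser.feed(page_html)
    return parser.rows


def rows_to_html(rows):
    """Render all collected rows as one HTML table."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative fragment; the real page layout may differ.
    sample = (
        "<table><tr><td>27-Oct-2003</td><td>Sergeant Aubrey D. Bell</td>"
        "<td>Baghdad</td><td>Hostile - hostile fire - IED attack</td></tr></table>"
    )
    print(detail_url("2-2004"))
    print(rows_to_html(extract_rows(sample)))
```

In a real run each page would be fetched first, for example with `urllib.request.urlopen(detail_url(ref)).read()`, decoded with whatever encoding the server reports, and the rows from every page appended to one list before calling `rows_to_html`. Which `hndRef` values exist is not stated here, so that list has to come from the site itself.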

I also found a great resource for a more generic approach. Mozilla has something called "Rhino", which is a JavaScript interpreter written in Java. That shows lots of promise. Beyond that, people just need to follow Ajax when designing dynamic content and it'll be much easier on the next generation of crawlers. And please, no one tell me what crawlers can't do. That's precisely what I'm trying to change.
 

Accepted Solution

by:
PashaMod earned 0 total points
ID: 16823902
Closed, 500 points refunded.
PashaMod
Community Support Moderator

Question has a verified solution.
