Web Crawler dealing with Javascript

My issue comes in 2 parts. The most immediate need I have is a targeted crawl of one particular site (as part of a demo application). So, I don't envision it to be too difficult. My second, less pressing concern, is an extensible solution that can be applied to most websites.

First, let me say that I've looked through a lot of web crawlers, and many of them handle dynamic content poorly or not at all. I realize it's a tough problem, so you don't need to solve this harder part of the issue to get full points. Just point me in the right direction (though if you've got a solution, I'd be grateful).

I am trying to gather data from the following site:

http://icasualties.org/oif/IED.aspx

Notice the bar graph in the center of the page. When you click on one of the bars in the graph, a JavaScript postback is triggered on the page, and a table of names, dates, etc. is appended to the end of the page. It looks like this:

27-Oct-2003 Sergeant Aubrey D. Bell Baghdad Hostile - hostile fire - IED attack
28-Oct-2003 Specialist Isaac Campoy Balad (near) - Salah ad Din Hostile - hostile fire - IED attack
06-Oct-2003 Specialist Spencer Timothy Karol Al Haswah - Babil Hostile - hostile fire - IED attack

I need a script that will automatically go through each javascript call and parse out the associated data tables. This data should then be saved into one big html file. I am thinking that Java is best suited for this task though I don't really care what language is used. I am not too familiar with Java or web crawling and my time is limited. I have a lot of other stuff on my plate. I am hoping someone familiar with the libraries and functions needed for this type of thing can help me out.

I will award full points to anyone that can provide me a script targeted at the above site and general advice about a more generic approach. I need the targeted crawler by the end of next week for a demo. Thank you.

mrichmon

First question. Do you own that site? If not do you have permission to grab that information? If not you are asking us to help you with something illegal.

kh7x

ASKER

I do not own the site but I have full permission to do with it as I will. I work for a government contractor. The site is for demo purposes.

kh7x

ASKER

Or more specifically, we are allowed to use the information on the site for our demo purposes and none of it is classified.

ClickCentric

Being a government contractor alone doesn't give you permission to scrape the site. And it doesn't seem like a government owned, operated or approved site to begin with. And, if you had permission from the site owner, you'd know how the output was generated and how to get to it.

kh7x

ASKER

I didn't realize this was going to be that much of a problem. And no, it is not a government operated site. I doubt I would receive permission to scrape a fed site.

I know how the output was generated. If I knew "how to get to it", I wouldn't be here. I'm involved mostly with algorithm design in machine learning. This isn't my strong point.

If it will allay your concerns, all I really need an answer to is how to get data from dynamically generated content on the same page. The JavaScript triggers a form submission. A new page isn't loaded. The data is appended to the same page, so I can't get it from a new URL. That's my problem. How do I deal with forms? Don't even answer in the context of that particular page if you're so worried about it.

kh7x

ASKER

Basically, is there a way to get dynamically generated data off the same URL? I'm sure there is, because I'm passing some parameters (2 of them, in fact). I just don't know how to use that knowledge. It's probably something incredibly simple (like adding it to the end of the URL?).
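For readers with the same question about forms: a rough sketch of how such a postback could be replayed from a script, assuming the page is a standard ASP.NET WebForms page. On those pages the JavaScript call sets two hidden fields, `__EVENTTARGET` and `__EVENTARGUMENT`, and submits the form together with the other hidden fields such as `__VIEWSTATE`. The sample HTML, the control name `chart1`, and the argument `bar:3` below are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(page_html, event_target, event_argument):
    """Return a form-encoded POST body that imitates a WebForms postback."""
    collector = HiddenFieldCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    # The two parameters the JavaScript postback call passes along.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical page fragment; a real page would be fetched first.
SAMPLE_PAGE = """
<form method="post" action="IED.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
</form>
"""

body = build_postback_body(SAMPLE_PAGE, "chart1", "bar:3")
print(body)
```

The resulting string would then be sent as the body of a POST request to the same URL (for example with `urllib.request.Request(url, data=body.encode())`), and the response would contain the page with the extra table appended.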

kh7x

ASKER

Never mind. I solved the issue. It was as ridiculously simple as I thought. I guess I was too stressed to think it through clearly. Consider the question closed. I don't suppose there's a way to delete it? Sorry if I phrased it badly. Frankly, I could have painstakingly clicked on each link and gotten the data myself, manually, but something about that seemed repulsive to me.

scrathcyboy

"I've looked through a lot of web crawlers and a lot of them handle dynamic content poorly or not at all"

Web crawlers are SPECIFICALLY DESIGNED to AVOID dynamic content. They are NOT interested in it, just simply NOT INTERESTED. When you grasp that fact, you are ready to optimize for crawlers.

Moreover, the job of a crawler is to crawl CONTENT -- specifically, words that can be indexed. It sees a web page as a set of text words stripped of essentially all other stuff except OBVIOUS LINKS -- and it loves links that send it here and there, and BACK again -- never forget the BACK again. That is crucial to indexing. Take your website, and strip everything out of it but the text and links NOT embedded in javascript. Strip out all flash, all pictures, all DHTML and all other fancy stuff you want the USER to experience -- now you are seeing the page the way a crawler sees it. Comprend?

Shalom Carmel

kh7x,
please post back at least the outline of the method you used to solve the problem.

scrathcyboy

It is acceptable to split points among the people who got closest to the answer. See the split Points link right where you reply... Thanks

kh7x

ASKER

Well, the answer is simply a workaround for that particular page. The data exists in 2 places, so I simply parsed out what I needed from pages that actually take a URL parameter: http://icasualties.org/oif/prdDetails.aspx?hndRef=2-2004
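A minimal sketch of that workaround, for anyone who lands here later. It assumes that `hndRef` is a month-year pair (a guess based on the single example `2-2004`) and that the details page holds the data in an ordinary HTML table. The sample markup and its column split are invented for illustration; only the row values come from the question.

```python
from html.parser import HTMLParser

BASE_URL = "http://icasualties.org/oif/prdDetails.aspx"


def detail_urls(start_year, end_year):
    """Build one details URL per month, assuming hndRef means month-year."""
    return [
        f"{BASE_URL}?hndRef={month}-{year}"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]


class TableRowParser(HTMLParser):
    """Collects the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Rank</th><th>Name</th></tr>
  <tr><td>27-Oct-2003</td><td>Sergeant</td><td>Aubrey D. Bell</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

In a real run, each URL from `detail_urls` would be fetched (for example with `urllib.request.urlopen`) and its HTML fed to a fresh `TableRowParser`, with the collected rows written out to a single file.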

I also found a great resource for a more generic approach. Mozilla has something called "Rhino", which is a JavaScript interpreter written in Java. That shows lots of promise. Beyond that, people just need to follow Ajax when designing dynamic content and it'll be much easier on the next generation of crawlers. And please, no one tell me what crawlers can't do. That's precisely what I'm trying to change.

ASKER CERTIFIED SOLUTION

PashaMod
