CarlosScheidecker

Javascript pagination with HTTPUNIT

Hello,

I am trying to crawl a paginated table where the pagination links are Javascript links.

With HttpUnit I can click JavaScript links using the code below.

The page has a table that is paginated through JavaScript calls. This works well with HttpUnit, but the page also has other links that are not JavaScript links, and I do not want to click those.

// Enable JavaScript and keep going when a page script fails.
HttpUnitOptions.setScriptingEnabled(true);
HttpUnitOptions.setExceptionsThrownOnScriptError(false);
HttpUnitOptions.setJavaScriptOptimizationLevel(9);

WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://someurl.com.pt/pagina.jsp?OAFunc=PON_ABSTRACT_PAGE");
WebResponse resp = wc.getResponse(req);

WebLink[] links = resp.getLinks();

// Print and click every link on the page, JavaScript or not.
for (WebLink link : links) {
      System.out.println(link.getURLString() + " " + link.getText());
      WebResponse respAux = link.click();
}

Here are my questions:

1) How do you determine from the link object that a link is a JavaScript link and not an ordinary link? Those links have # in the href, but they also have an onclick event. How do I check for that on the object?

2) Since each link returns the content of a frame, clicking another link produces new content that I need to crawl. How can I guarantee that the same content is not crawled again? I was thinking about adding the content to a data structure and passing that data structure through a recursive call, so that the same link is not crawled twice. Any ideas?

for_yan

By JavaScript links, do you mean links like these?
<a href="#" onclick="myJsFunc();">Link</a>
<a href="javascript:void(0)" onclick="myJsFunc();">Link</a>


Wouldn't that be reflected in link.getURLString()?

I guess they should start with "#" or "javascript":

String url = link.getURLString();
if (url.startsWith("#") || url.startsWith("javascript")) {
      respAux = link.click();
}

You can probably see it in your printout



CarlosScheidecker

ASKER

The problem with the JavaScript links is that getURLString() always produces the same value. I need to inspect the object to see where the JavaScript function information goes. getURLString() always prints http://someurl.com.pt/pagina.jsp?OAFunc=PON_ABSTRACT_PAGE

Well, print out link.getParameterNames() and link.getParameterValues() for all kinds of links on your page and try to see a pattern by which you can distinguish them.

Also try the link.asText() method.
It is deprecated but may give useful information.

After analyzing this output, perhaps you'll see a way to distinguish them.

The parameter methods only output the parameters in the ? part of the URL. That prints PON_ABSTRACT_PAGE as the value for OAFunc.

But I guess resp.getText() returns the whole HTML page as it is.
So you can parse it looking for links and identify the JavaScript links that way.
I'm not sure how to perform the click() in that case, though I believe there should be a way.

If I do this: WebLink[] links = resp.getLinks();

It returns all the links as objects, but it does not tell me which ones are JS links.

If I call click() on each link, it clicks the link and performs the action whether or not there is an onclick handler, as the source shows:

    public WebResponse click() throws IOException, SAXException {
      if (handleEvent("onclick")) {
          ((HTMLElementImpl) getNode()).doClickAction();
      }
      return getCurrentFrameContents();
    }

So with no onclick event, I just get the current frame content.
What I actually need is to find out whether the link has an onClick, onMouseDown, or onMouseUp handler.
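
One way to approach that check, as a minimal sketch: look up the event attributes on the anchor and treat the link as a JavaScript link when any of them is non-blank, or when the href is "#" or a javascript: URL. The class JsLinkDetector and its methods are hypothetical names, not part of HttpUnit, and the attribute lookup is passed in as a function so the sketch does not assume any particular HttpUnit call.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper, not part of HttpUnit.
final class JsLinkDetector {
    // Event attributes mentioned in the question.
    private static final String[] EVENT_ATTRIBUTES = {"onclick", "onmousedown", "onmouseup"};

    private JsLinkDetector() {}

    // True when at least one event attribute has a non-blank value.
    // attributeLookup maps an attribute name to its value (null or "" when absent).
    static boolean hasScriptHandler(Function<String, String> attributeLookup) {
        for (String name : EVENT_ATTRIBUTES) {
            String value = attributeLookup.apply(name);
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // True when the href does not navigate on its own: "#..." or "javascript:...".
    static boolean isScriptHref(String href) {
        if (href == null) {
            return false;
        }
        String h = href.trim().toLowerCase(Locale.ROOT);
        return h.startsWith("#") || h.startsWith("javascript:");
    }
}
```

With HttpUnit this might be called as JsLinkDetector.hasScriptHandler(link::getAttribute), assuming the WebLink class in the version you use exposes getAttribute(String); that is worth confirming against its Javadoc before relying on it.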

I'm wondering whether we can be sure that the links in the
WebLink[] links = resp.getLinks();
array are in the same order as they appear on the page.

If so, we can parse resp.getText() and determine that way
which links have onclick or mouseover handlers and which do not.

But does the length of the links[] array correspond to the total number of links, including JavaScript links?
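
The parsing idea could be sketched roughly as follows, assuming the page text from resp.getText() is available as a String. AnchorScanner is a hypothetical helper that uses a simple regular expression, not a real HTML parser, so it will misread tags whose attribute values contain a ">" character. It returns one flag per opening anchor tag, in document order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans raw HTML for opening <a ...> tags in document order.
final class AnchorScanner {
    // Matches an opening <a> tag and captures its attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches one of the event attribute names followed by '='.
    private static final Pattern HANDLER =
            Pattern.compile("\\bon(click|mousedown|mouseup|mouseover)\\s*=",
                    Pattern.CASE_INSENSITIVE);

    private AnchorScanner() {}

    // Returns, for each anchor in document order, whether it declares an event handler.
    static List<Boolean> handlerFlags(String html) {
        List<Boolean> flags = new ArrayList<Boolean>();
        Matcher anchors = ANCHOR.matcher(html);
        while (anchors.find()) {
            flags.add(HANDLER.matcher(anchors.group(1)).find());
        }
        return flags;
    }
}
```

The flags only line up with the links[] array if both contain the same anchors in the same order, which is exactly the open question above. Comparing flags.size() with links.length before trusting the indexes would be a sensible first check.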
ASKER CERTIFIED SOLUTION
CarlosScheidecker

But once you determine which element has onClick(), can you then use the onClick() method?

When you say you want to keep track of them and not check them again, do you mean within one run, or do you want it persisted for the next run? Or do you run this crawl continuously?
 


I need to run it continuously. When you call click() on a link, it follows the link if it is a normal link, or it executes the onclick handler, as per the code:

public WebResponse click() throws IOException, SAXException {
      if (handleEvent("onclick")) {
          ((HTMLElementImpl) getNode()).doClickAction();
      }
      return getCurrentFrameContents();
    }

I think I know how to make sure I do not visit the same page twice. I will call getText() on the response, hash it, and store the hash in a map. Once I have visited every page, I will return from the recursive call.
Yes, that makes sense, though you'll need to refresh it at some point or you may run out of memory.
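
A rough sketch of that idea, with the memory concern handled by a size cap: SeenPages is a hypothetical class that stores a SHA-256 hash of each page's text and drops the oldest hash once the cap is exceeded. The choice of SHA-256 and of oldest-first eviction are assumptions made for illustration, not something from the thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: remembers hashes of page text, evicting the oldest past a cap.
final class SeenPages {
    private final Set<String> hashes;

    SeenPages(final int maxEntries) {
        // An insertion-ordered LinkedHashMap drops its eldest entry once the cap is exceeded.
        this.hashes = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        });
    }

    // Returns true the first time this text is seen, false for a remembered repeat.
    boolean markIfNew(String pageText) {
        return hashes.add(sha256Hex(pageText));
    }

    // Hex-encoded SHA-256 digest of the text.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

In the crawler this might be used as if (seen.markIfNew(resp.getText())) { ... crawl the page ... }. Note that pages containing timestamps or session tokens will hash differently on every fetch, so the text may need normalizing first.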
I will set up a limit for how deep the crawler can go.
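
A depth limit combined with a visited set could look something like the sketch below. DepthLimitedCrawl is a hypothetical class; keyOf and linksOf are placeholder functions standing in for "hash of the page text" and "pages reached by clicking each link", so nothing here calls HttpUnit directly.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: depth-limited recursive traversal with a visited set.
final class DepthLimitedCrawl {
    private DepthLimitedCrawl() {}

    // Visits page and everything reachable from it within depthLeft further clicks.
    // keyOf identifies a page's content; linksOf returns the pages its links lead to.
    static <P> void crawl(P page, int depthLeft, Set<String> seen,
                          Function<P, String> keyOf, Function<P, List<P>> linksOf) {
        if (depthLeft < 0 || !seen.add(keyOf.apply(page))) {
            return; // too deep, or this content was already crawled
        }
        for (P next : linksOf.apply(page)) {
            crawl(next, depthLeft - 1, seen, keyOf, linksOf);
        }
    }
}
```

One consequence of this depth-first design is that a page first reached near the depth limit is marked as seen and will not be expanded again if a shorter path to it turns up later. A breadth-first queue would avoid that, at the cost of a little more bookkeeping.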

I have engineered a crawler for normal links, it is multi-threaded and has a limit as well. Works great.
Good!
I was able to find the solution myself by writing the piece of code I have included here.