CarlosScheidecker
asked:

Javascript pagination with HTTPUNIT

Hello,

I am trying to crawl a paginated table where the pagination links are JavaScript links.

With HttpUnit I can click JavaScript links using the code below.

The page paginates the table through JavaScript calls, and that works fine with HttpUnit; however, the page also contains other links that are not JavaScript links, and I do not want to click those.

import com.meterware.httpunit.*;

HttpUnitOptions.setScriptingEnabled(true);
HttpUnitOptions.setExceptionsThrownOnScriptError(false);
HttpUnitOptions.setJavaScriptOptimizationLevel(9);

WebConversation wc = new WebConversation();
WebRequest req = new GetMethodWebRequest("http://someurl.com.pt/pagina.jsp?OAFunc=PON_ABSTRACT_PAGE");
WebResponse resp = wc.getResponse(req);
WebResponse respAux;

WebLink[] links = resp.getLinks();

for (WebLink link : links) {
    System.out.println(link.getURLString() + " " + link.getText());
    respAux = link.click();
}

Here are my questions:

1) Using the WebLink object, how do I determine that a link is a JavaScript link rather than an ordinary link? Those links have "#" in the href but carry an onclick event. How do I detect that from the object?

2) Since each link returns the content of a frame, clicking another link loads new content that I also need to crawl. How can I guarantee the same content is not crawled twice? I was thinking about adding the content to a data structure, then making a recursive call that passes the data structure along so the same link is not crawled twice. Any ideas?

Java

for_yan

By JavaScript links, do you mean this kind of link:
<a href="#" onclick="myJsFunc();">Link</a>
<a href="javascript:void(0)" onclick="myJsFunc();">Link</a>



Wouldn't that be reflected in link.getURLString()?

I would guess those URLs start with "#" or "javascript":

if (link.getURLString().startsWith("#") || link.getURLString().startsWith("javascript")) {
    respAux = link.click();
}

You can probably see it in your printout.



CarlosScheidecker

ASKER
The problem with the JavaScript links is that getURLString() always produces the same value. I need to inspect the object to see where the JavaScript function information is kept; getURLString() will always print http://someurl.com.pt/pagina.jsp?OAFunc=PON_ABSTRACT_PAGE
for_yan

Well, print out link.getParameterNames() and link.getParameterValues() for all the kinds of links on your page and try to see a pattern you can use to distinguish them.

Also try the link.asText() method. It is deprecated but may give useful information.

After analyzing that output, perhaps you'll see a way to distinguish them.
CarlosScheidecker

ASKER
The parameter methods only output the parameters in the "?" part. That just prints PON_ABSTRACT_PAGE as the value of OAFunc.
for_yan

But I guess resp.getText() returns the whole HTML page as-is, so you could parse it, look for the links, and identify the JavaScript links that way. I'm not sure how to accomplish the click() in that case, though I believe there should be a way.
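
For instance, instead of scanning the raw text, resp.getDOM() hands you the parsed page as a w3c DOM, so you can check the anchors' attributes directly. A minimal sketch (the event attributes checked are just the common ones; adjust for your page):

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Record, in document order, which anchors carry a scripting attribute.
Document dom = resp.getDOM();
NodeList anchors = dom.getElementsByTagName("a");
if (anchors.getLength() == 0) {
    anchors = dom.getElementsByTagName("A"); // some parsers upper-case tag names
}
boolean[] isJsLink = new boolean[anchors.getLength()];
for (int i = 0; i < anchors.getLength(); i++) {
    Element a = (Element) anchors.item(i);
    // getAttribute() returns "" when the attribute is absent
    isJsLink[i] = a.getAttribute("onclick").length() > 0
               || a.getAttribute("onmousedown").length() > 0
               || a.getAttribute("onmouseup").length() > 0;
}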
CarlosScheidecker

ASKER
If I do this: WebLink[] links = resp.getLinks();

it returns all the links as objects, but it does not tell me which ones are JavaScript links.

If I call click() on each link, it clicks it and performs the action whether or not there is an onclick handler, as per the source:

public WebResponse click() throws IOException, SAXException {
    if (handleEvent("onclick")) {
        ((HTMLElementImpl) getNode()).doClickAction();
    }
    return getCurrentFrameContents();
}

So with no onclick event it just returns the current frame contents.
CarlosScheidecker

ASKER
What I actually need is to find out whether a link has an onclick, onmousedown, or onmouseup event.
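A sketch of one way to test for that, assuming your HttpUnit version lets you reach the link's DOM node (getNode() is public on HTMLElement in some releases, protected in others; if it is not accessible in yours, parsing resp.getDOM() as sketched above is the fallback):

import com.meterware.httpunit.WebLink;
import org.w3c.dom.Element;

// Hypothetical helper (the name is illustrative): true if the anchor
// carries any of the usual event-handler attributes.
static boolean hasScriptHandler(WebLink link) {
    Element a = (Element) link.getNode();
    return a.getAttribute("onclick").length() > 0
        || a.getAttribute("onmousedown").length() > 0
        || a.getAttribute("onmouseup").length() > 0;
}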

for_yan

I'm wondering whether we have any guarantee that the links in the

WebLink[] links = resp.getLinks();

array are ordered sequentially as they appear on the page. If so, we can parse resp.getText() and determine that way which ones have an onclick or onmouseover and which do not.
for_yan

But does the length of the links[] array correspond to the total number of links, including the JavaScript ones?
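
I don't know of a documented ordering guarantee, but ParsedHTML collects elements while walking the document tree, so in practice links[i] should line up with the i-th anchor in the page, and getLinks() should include the JavaScript ones since they are still <a> elements with an href. One caveat: an <a> that has an onclick but no href at all may not show up as a WebLink, which would break the alignment. Under those assumptions, the two pieces combine like this:

// isJsLink[] computed from resp.getDOM() as sketched above;
// assumes getLinks() returns the anchors in document order.
WebLink[] links = resp.getLinks();
for (int i = 0; i < links.length && i < isJsLink.length; i++) {
    if (isJsLink[i]) {
        respAux = links[i].click(); // follow only the JavaScript links
    }
}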
ASKER CERTIFIED SOLUTION
CarlosScheidecker

for_yan

But once you determine which element has an onclick, can you then trigger it with the click() method?

When you say you want to keep them and not check them again, do you mean within one run, or do you want it persisted for the next run? Or do you run this crawl continuously all the time?


CarlosScheidecker

ASKER
I need to run it continuously. When you call the link's click() method, it clicks a normal link or executes the onclick handler, as per the code:

public WebResponse click() throws IOException, SAXException {
    if (handleEvent("onclick")) {
        ((HTMLElementImpl) getNode()).doClickAction();
    }
    return getCurrentFrameContents();
}

I think I know how to make sure I do not visit the same page twice: call getText() on the response, hash it, and put the hash in a map. Once I have visited them all, I return from the recursive call.
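
A sketch of that scheme (class and constant names are illustrative; MD5 via java.security.MessageDigest is one convenient way to key the set):

import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebResponse;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

class DedupCrawler {
    private static final int MAX_DEPTH = 5; // cap on recursion depth
    private final Set<String> visited = new HashSet<String>();

    void crawl(WebResponse resp, int depth) throws Exception {
        if (depth > MAX_DEPTH) return;
        if (!visited.add(md5(resp.getText()))) return; // add() is false on a repeat
        for (WebLink link : resp.getLinks()) {
            crawl(link.click(), depth + 1);
        }
    }

    private static String md5(String text) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}

A Set is enough here since only membership matters; a map would only be needed if you also wanted to keep the page text around.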
for_yan

Yes, that makes sense, though you'll need to refresh it at some point or you may run out of memory.
CarlosScheidecker

ASKER
I will set a limit on how deep the crawler can go.

I have already engineered a crawler for normal links; it is multi-threaded and has a limit as well. It works great.
for_yan

Good!
CarlosScheidecker

ASKER
I was able to find the solution myself by writing the piece of code I have included here.