  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 402

Urgent !! Client side redirection.

Hi,

I want to fetch the HTML for a URL that uses a client-side redirect. For example, http://www.uhcl.edu redirects on the client side using a meta tag like

<META HTTP-EQUIV="REFRESH" CONTENT="0;URL=some relative path here">  to

http://prtl.uhcl.edu/portal/page?_pageid=328,1,328_216933&_dad=portal&_schema=PORTALP

I also tried some projects from sourceforge.net, such as httpunit and jspider, but without success.

Can someone give me code that gets the HTML data when there is a client-side redirect?
sumantedla
Asked:
sumantedla
 
aozarovCommented:
Did you try Jakarta HttpClient? -> http://jakarta.apache.org/commons/httpclient/
 
aozarovCommented:
I think that with either library you have to specify whether you want the client to follow redirects. (I guess in your case it does not.)
 
CEHJCommented:
What about just parsing that url and opening it?
 
aozarovCommented:
To get the value in httpunit: response.getMetaTagContent("http-equiv","REFRESH")
 
CEHJCommented:
You may well be able to do that with java.net.URLConnection in fact. Can you give us an example?
 
CEHJCommented:
Sorry - you did ;-)
 
CEHJCommented:
e.g. this gets it for me:


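    // Needs java.io.BufferedReader, java.io.InputStreamReader, java.net.URL,
    // java.util.regex.Matcher and java.util.regex.Pattern
    Pattern p = Pattern.compile("<META http-equiv.+URL\\s*=\\s*(.+?)\"*>", Pattern.CASE_INSENSITIVE);
    BufferedReader in = new BufferedReader(new InputStreamReader(new URL("http://www.uhcl.edu/").openStream()));
    try {
        String buffer = null;
        while ((buffer = in.readLine()) != null) {
            // Cheap pre-check before running the regex on the line
            if (buffer.toLowerCase().indexOf("http-equiv") > -1) {
                Matcher m = p.matcher(buffer);
                if (m.find()) {
                    String redirectUrl = m.group(1);
                    System.out.println(redirectUrl);
                }
            }
        }
    } finally {
        in.close(); // close the stream even if reading fails
    }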
 
CEHJCommented:
Of course you can break out of that loop if you want to when it's found ;-)
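
For readers who want a somewhat stricter pattern than the one above, here is a minimal sketch. It works on a string instead of a live URL so it can be run offline. The class and method names (MetaRefreshRegex, findRefreshTarget) are made up for illustration, and the pattern assumes that http-equiv appears before content inside the tag, so it is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefreshRegex {

    // Matches <meta http-equiv="refresh" ... url=TARGET and captures TARGET.
    // Assumption: http-equiv comes before content inside the tag.
    private static final Pattern REFRESH = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*?url\\s*=\\s*[\"']?([^\"'>\\s]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the refresh target found in the HTML, or null if there is none. */
    static String findRefreshTarget(String html) {
        Matcher m = REFRESH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<HEAD><META http-equiv=\"REFRESH\" "
                    + "content=\"0;URL=/pls/portal/portalp.home\"></HEAD>";
        System.out.println(findRefreshTarget(html));
    }
}
```

Because the match runs over the whole page text rather than line by line, it also copes with a tag that is split across lines.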
 
aozarovCommented:
As I said before,
if you are already using httpunit then that can be done for you by simply calling: response.getMetaTagContent("http-equiv","REFRESH")
 
CEHJCommented:
>> As I said before ...

Yes - there's no need to repeat it ;-) If, e.g.  sumantedla is only using HttpUnit to do this, she/he might like to do without specialised libraries
 
sumantedlaAuthor Commented:
It's working the way CEHJ specified. But I think the regular expression can be improved.

And I am not comfortable using HttpUnit. Can you (aozarov) explain it in detail?
 
CEHJCommented:
>>But I think the regular expression can be improved

Possibly - I didn't exactly labour over it in great detail, but it works for me ;-)
 
aozarovCommented:
If using httpunit:
    WebConversation wc = new WebConversation();
    WebResponse   resp = wc.getResponse( "http://www.uhcl.edu/" );
    String[] metaContent = resp.getMetaTagContent("http-equiv","REFRESH") ;
    for (int i = 0; i < metaContent.length; i++)
            System.out.println(metaContent[i]);
 
sumantedlaAuthor Commented:

I am trying to extract the meta tag attribute data (when there is a client-side redirect) using an EditorKit. It works when I try HTML.Tag.A but not HTML.Tag.META. Is there a bug in the getAttributes() method?

urlContent  is

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en,us">
<HEAD>  
<META http-equiv="REFRESH" content="0;URL=/pls/portal/portalp.home"></HEAD><body></BODY></HTML>

---------------------------------------
String redirectURL = null;
try
{
    Reader reader = new StringReader(urlContent);
    // here urlContent contains the html code of any webpage
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.META);
    while (it.isValid())
    {
        AttributeSet attrs = it.getAttributes();
        String httpEquiv = (String) attrs.getAttribute(HTML.Attribute.HTTPEQUIV);
        String content = (String) attrs.getAttribute(HTML.Attribute.CONTENT);
        if ("REFRESH".equalsIgnoreCase(httpEquiv) && content != null)
        {
            String[] strings = content.split(";");
            String timeAttr = strings[0].trim();
            String urlAttr = strings[1].replaceAll(" ", "");
            System.out.println("time => " + timeAttr);
            System.out.println("urlAttr => " + urlAttr);
            if ("0".equals(timeAttr) && urlAttr.toLowerCase().indexOf("url=") == 0)
            {
                redirectURL = urlAttr.substring(4);
                break;
            }
        }
        it.next();
    }
}
catch (Exception e)
{
    e.printStackTrace();
}
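
For anyone who wants to stay with the classes that ship with the JDK, one alternative that is often used for tags in the document head is the callback-based parser: ParserDelegator reports empty tags such as META to HTMLEditorKit.ParserCallback.handleSimpleTag. The sketch below is only an illustration of that approach. The class name MetaRefreshCallback and the method findRefreshTarget are made up, and the splitting of the content value assumes the usual "delay;URL=target" form.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class MetaRefreshCallback extends HTMLEditorKit.ParserCallback {

    private String redirectUrl; // first refresh target seen, or null

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (!HTML.Tag.META.equals(tag) || redirectUrl != null) {
            return;
        }
        Object httpEquiv = attrs.getAttribute(HTML.Attribute.HTTPEQUIV);
        Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
        if (httpEquiv == null || content == null
                || !"refresh".equalsIgnoreCase(httpEquiv.toString())) {
            return;
        }
        // Assumed form of the content value: "0;URL=/some/path"
        String[] parts = content.toString().split(";", 2);
        if (parts.length == 2) {
            String target = parts[1].trim();
            if (target.toLowerCase().startsWith("url=")) {
                redirectUrl = target.substring(4).trim();
            }
        }
    }

    /** Parses the HTML and returns the meta refresh target, or null. */
    static String findRefreshTarget(String html) throws IOException {
        MetaRefreshCallback callback = new MetaRefreshCallback();
        // true = ignore any charset directive in the page
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.redirectUrl;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><HEAD><META http-equiv=\"REFRESH\" "
                    + "content=\"0;URL=/pls/portal/portalp.home\"></HEAD><body></BODY></HTML>";
        System.out.println(findRefreshTarget(html));
    }
}
```

Unlike the iterator version, this does not build an HTMLDocument at all, so it only sees what the parser reports while reading.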

 
aozarovCommented:
import com.meterware.httpunit.*;

public class T
{
    public static void main(String st[]) throws Exception
    {
        WebConversation wc = new WebConversation();
        WebResponse resp = wc.getResponse("http://www.uhcl.edu/");
        String[] metaContent = resp.getMetaTagContent("http-equiv", "REFRESH");
        for (int i = 0; i < metaContent.length; i++)
            System.out.println(metaContent[i]);
    }
}

C:\lib\httpunit-1.6>java -classpath lib\httpunit.jar;jars\js.jar;jars\nekohtml.jar;jars\xercesImpl.jar;. T
0; URL=/pls/portal/portalp.home
 
aozarovCommented:
sumantedla,
Regarding EditorKit, someone asked that question before (not long ago, was it you?) and I think
it was never answered.
From above you can see how easy it is to do it using httpunit ;-)
 
CEHJCommented:
I've only ever had disappointing results from EditorKit html
 
aozarovCommented:
Actually there is an even more elegant way:
import com.meterware.httpunit.*;

public class T
{
    public static void main(String st[]) throws Exception
    {
        WebConversation wc = new WebConversation();
        WebResponse resp = wc.getResponse("http://www.uhcl.edu/");
        String[] metaContent = resp.getMetaTagContent("http-equiv", "REFRESH");
        // you can use this to send another request (based on the refresh value)
        WebRequest refreshRequest = resp.getRefreshRequest();
        System.out.println(refreshRequest.getURL());
    }
}

C:\lib\httpunit-1.6>java -classpath lib\httpunit.jar;jars\js.jar;jars\nekohtml.jar;jars\xercesImpl.jar;. T
http://www.uhcl.edu/pls/portal/portalp.home
 
CEHJCommented:
You'd certainly expect elegance if it takes four separate libraries to do it ;-)
 
sumantedlaAuthor Commented:
Can we extract the links(urls) from a webpage using the Httpunit software??
 
sumantedlaAuthor Commented:
I mean, can it detect redirection on both the server side and the client side and still get the links?
 
aozarovCommented:
It is basically just httpunit.jar (the rest are used by it).
There is no need for js.jar (only if you need to parse JavaScript).
xerces.jar is probably already part of the project (it is very common).
And does it really matter?
Should this make you code everything from scratch ;-)
 
aozarovCommented:
>> Can we extract the links(urls) from a webpage using the Httpunit software??
Sure, it is a very powerful library (much stronger than HTMLKit).
You can extract everything, including JavaScript (and even invoke it).
 
aozarovCommented:
>> I mean can it detect the redirection both on server side and on client side and still able to get the links.
What do you mean by "server side"?

see API: http://www.httpunit.org/doc/cookbook.html
cookbook: http://www.httpunit.org/doc/cookbook.html
 
sumantedlaAuthor Commented:
Yeah, it is very good. I tried to extract the links in a webpage and it worked, but I am getting relative links. Is there any direct support in httpunit for getting absolute links, or do we have to do it ourselves by constructing a URL object?
 
 
sumantedlaAuthor Commented:
After making some dirty additions to your code, I wrote the following. I am getting relative links. How do I get absolute links?

import com.meterware.httpunit.*;

public class Test
{
    public static void main(String st[]) throws Exception
    {
        WebConversation wc = new WebConversation();
        WebConversation wc1 = new WebConversation();
        WebResponse resp = wc.getResponse("http://www.uhcl.edu");
        WebRequest refreshRequest = resp.getRefreshRequest();
        WebResponse resp1 = wc1.getResponse(refreshRequest.getURL().toString());
        WebLink w[] = resp1.getLinks();
        for (int i = 0; i < w.length; i++)
            System.out.println(w[i].getURLString());
    }
}
 
aozarovCommented:
change
System.out.println(w[i].getURLString());
to
System.out.println(w[i].getRequest().getURL());
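
If you would rather resolve the relative links yourself, as the question suggested, the JDK can do it without HttpUnit. The sketch below uses java.net.URI.resolve. The class and method names (LinkResolver, toAbsolute) are made up for illustration, and it assumes both strings are syntactically valid URIs, since URI.create throws IllegalArgumentException for things like unescaped spaces.

```java
import java.net.URI;

class LinkResolver {

    /** Resolves href against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // A root-relative link, like the refresh target in this thread
        System.out.println(toAbsolute("http://www.uhcl.edu/", "/pls/portal/portalp.home"));
        // A link relative to the current directory
        System.out.println(toAbsolute("http://example.com/a/b.html", "c.html"));
        // An absolute link is returned unchanged
        System.out.println(toAbsolute("http://example.com/a/b.html", "http://other.example/x"));
    }
}
```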
 
sumantedlaAuthor Commented:
Hi,

I am using httpunit. When I execute the following code I get an exception. How can I avoid it?
I ran it as
>java Test http://abc.com
----------------------------
com.meterware.httpunit.ScriptException: Script 'function nsearch() {
      //alert('inside search function')
      if (!document.seekdark.rq['0'].checked || document.seekdark.rq['1'].checked) {
      //      alert('keyword')
            document.seekdark.action = "http://abc.go.com/keyword/lookup";
      } else {
      //      alert('search')
            document.seekdark.action = "http://search.abc.go.com/abctv/query.html";
      }
}
----------------------------------------------------------------
import com.meterware.httpunit.*;

public class Test
{
    public static void main(String args[])
    {
        if (args.length != 1)
        {
            System.out.println("Usage: java Test URL");
            System.exit(0);
        }
        try
        {
            WebConversation wc = new WebConversation();
            WebResponse resp = wc.getResponse(args[0]);

            // If the page has a client-side (meta refresh) redirect, follow it first
            WebRequest refreshRequest = resp.getRefreshRequest();
            if (refreshRequest != null)
            {
                WebConversation wc1 = new WebConversation();
                resp = wc1.getResponse(refreshRequest.getURL().toString());
            }

            // Print only the absolute http(s) links of the final page
            WebLink w[] = resp.getLinks();
            for (int i = 0; i < w.length; i++)
            {
                String url = w[i].getRequest().getURL().toString();
                if (url.startsWith("http"))
                    System.out.println(url);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
 
aozarovCommented:
That page (actually one of its included js files) contains a parsing error (reported by Mozilla Rhino).
You can disable JavaScript parsing/evaluation by setting this flag:
HttpUnitOptions.setScriptingEnabled(false);

e.g:
public static void main(String args[]) throws Exception
{
HttpUnitOptions.setScriptingEnabled(false);
...
 
sumantedlaAuthor Commented:
When I removed the js.jar file from the classpath, it was working. Which is the better way to do it??
Should I set the option as above or remove the js.jar file from classpath?

Is there any way to improve the performance? I have to retrieve links from at least 5000 webpages, and it is becoming very slow.

Is there any solution for this??
 
aozarovCommented:
>> When I removed the js.jar file from the classpath, it was working. Which is the better way to do it??
The option I just suggested should be better: if you remove the jar you get an ugly warning message, and you would have to put it back if you need scripting later on.

>> 5000 webpages. It is becoming very slow.
You can disable more features that you don't need (see HttpUnitOptions).
Try increasing your memory settings (and make sure you process results as you go instead of piling them up).
Use threads or a thread pool to collect concurrently.
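
As a rough sketch of the thread pool suggestion, the example below fans a list of URLs out over a fixed number of worker threads and gathers the results. The LinkFetcher interface is a hypothetical hook, not part of HttpUnit: in a real run it would wrap the HttpUnit calls shown earlier in this thread, probably with one WebConversation per task, since nothing here establishes that a shared one is thread-safe. The fake fetcher in main only exists so the sketch runs without a network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentLinkCollector {

    /** Hypothetical hook: fetches one page and returns the links found on it. */
    interface LinkFetcher {
        List<String> fetchLinks(String url) throws Exception;
    }

    /** Fetches all URLs on a fixed-size pool; results keep the order of urls. */
    static List<String> collect(List<String> urls, LinkFetcher fetcher, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<List<String>> task = () -> fetcher.fetchLinks(url);
                pending.add(pool.submit(task));
            }
            List<String> links = new ArrayList<>();
            for (Future<List<String>> result : pending) {
                try {
                    links.addAll(result.get());
                } catch (ExecutionException e) {
                    // One bad page should not stop the whole crawl
                    System.err.println("Fetch failed: " + e.getCause());
                }
            }
            return links;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake fetcher so the sketch runs offline
        LinkFetcher fake = url -> Arrays.asList(url + "/a", url + "/b");
        List<String> urls = Arrays.asList("http://one.example", "http://two.example");
        System.out.println(collect(urls, fake, 2));
    }
}
```

For thousands of pages you would probably hand each result to its consumer as it completes rather than keeping one big list, in line with the advice above about not piling results up.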
 
sumantedlaAuthor Commented:
When I try my program above on a URL like

 http://store.sun.com

the response from that site is slow.

I used

System.setProperty("sun.net.client.defaultConnectTimeout", "9000");
System.setProperty("sun.net.client.defaultReadTimeout", "9000");

but it made no difference. The program blocks forever. I searched HttpUnitOptions for a property to set this, but found nothing.

Is there any way to make the program return when the response is slow?

PS: I think I am asking too many questions, which are definitely worth more than 500 points. Should I ask in a new post?
 
aozarovCommented:
>>but of no use.The program is blocking for ever.
HttpUnit uses URLConnection internally, and therefore those settings should affect it.
 
I found that the problem is not related to the sockets but is rather a bug in httpunit, which didn't expect an invalid entry.
I fixed that bug locally and reported it to the httpunit team (see: https://sourceforge.net/tracker/index.php?func=detail&aid=1197526&group_id=6550&atid=106550 )
You can fix it by modifying the file HttpUnitUtils and replacing "continue" with "break" in the replaceEntities method (line 257).
The file should be located at httpunit-1.6\src\com\meterware\httpunit
Then use the ant build.xml to rebuild the jar (using the default task; just call ant with no arguments).

>> ps: I think I am asking too many questions which definitely worth more that 500 points. Should I ask in a new post.
I have no problem with that but I can't speak for others.
 
sumantedlaAuthor Commented:


I modified the file.

I don't have the ant tool installed on my machine (I don't have privileges). Is there any other way to do that?

The directory path is like this:

E:\httpunit-1.6\src\com\meterware\httpunit

I am not even able to compile the file. Which paths need to be set?

Thanks.
 
aozarovCommented:
>> I dont have the ant tool installed in my machine(I dont have priveleges).
Ant doesn't require any special installation; just download it from http://www.axint.net/apache/ant/binaries/apache-ant-1.6.3-bin.zip
Unzip it anywhere.
Open a command shell.
Go to E:\httpunit-1.6\
set ant_home=the_path_to_the_main_ant_folder (e.g. ANT_HOME=E:\apache-ant-1.6.3)
set PATH=%PATH%;%ant_home%\bin
ant (build.xml is already in E:\httpunit-1.6\ so just calling ant should be sufficient)

Can't you do those steps?
 
sumantedlaAuthor Commented:
I did that.

I am using Java 1.5, and when I called "ant" at the command prompt,

some source files were using "enum" as an identifier, and the compiler complained that it is a keyword.

Is it possible to do it like javac -source 1.4 Myprog.java

Thanks.
 
sumantedlaAuthor Commented:
I think we need to change the source attribute of javac in build.xml.

Is that right?
 
aozarovCommented:
Yes, that should work for you.
In build.xml, add the attribute source="1.4" to all the <javac ...> elements,
then run
ant clean
and
ant
Ignore the warnings.
 
sumantedlaAuthor Commented:
I did the same.
I got 12 warnings. And after that

jar:

BUILD FAILED
E:\httpunit-1.6\build.xml:186: E:\httpunit-1.6\META-INF not found.


What to do??

Sorry for asking too many questions.
 
aozarovCommented:
This is a different issue.
Just create the directory E:\httpunit-1.6\META-INF (add a META-INF folder to E:\httpunit-1.6)
 
sumantedlaAuthor Commented:
I created a META-INF directory in  E:\httpunit-1.6\

And fortunately the build was successful.
 
aozarovCommented:
Good :-)
 
sumantedlaAuthor Commented:
Thank God!!!!!!!!


It is working now. I just want to close this thread. 500 points are not really sufficient.

How can I ask you questions on httpunit, if i have more??

I think I can ask here in another new post, right??

Thanks for your help.
 
aozarovCommented:
>> I think I can ask here in another new post, right??
Right, but don't bother :-)

I am glad that it is working for you :-)
 
sumantedlaAuthor Commented:
Last question.

System.setProperty("sun.net.client.defaultConnectTimeout", "9000");
System.setProperty("sun.net.client.defaultReadTimeout", "9000");


Will these options work for timeout??
 
aozarovCommented:
Yes, they should, as httpunit uses URLConnection (but this is relevant only for network timeouts, not for parsing and document handling).
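
Since the socket properties only cover the network part, one general way to stop waiting on a slow page, whatever the cause, is to run the fetch on a worker thread and give up after a deadline. The sketch below shows that idea with java.util.concurrent; the names TimedCall and callWithTimeout are made up for illustration. Note the limitation: cancelling only interrupts the worker, so a task stuck in a tight loop (like the entity bug discussed earlier) may keep running in the background. The worker is a daemon thread so that it at least cannot keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedCall {

    /** Runs task on a worker thread; returns fallback if it fails or exceeds millis. */
    static <T> T callWithTimeout(Callable<T> task, long millis, T fallback)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "timed-call");
            worker.setDaemon(true); // a stuck task must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a tight loop may ignore it
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String fast = callWithTimeout(() -> "done", 1000, "timed out");
        String slow = callWithTimeout(() -> {
            Thread.sleep(5000); // stands in for a slow page fetch
            return "done";
        }, 200, "timed out");
        System.out.println(fast + " / " + slow);
    }
}
```

In the link extractor from this thread, the Callable would wrap the call that fetches and parses one page, and the fallback could be an empty result so the run moves on to the next URL.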
 
sumantedlaAuthor Commented:
thanks.
 
aozarovCommented:
NP. :-)
