• Status: Solved

Problem with my code

Hi,

The following is my code to extract links from a web page.

It does not work when the requested page redirects to another URL.

For example, the program throws an exception when the input URL is http://rediff.com.

I am working with JDK 1.4.

Why is it not working? What other changes would make the program more efficient?

Thanks.

import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.html.*;
import javax.swing.text.*;


class Out  // gets the html links
{

    public String[] getLinks(String uriStr) { // uriStr is an url

        List result = new ArrayList();
        try
        {
            URL locator = new URL(uriStr);
            URLConnection connection = (HttpURLConnection)locator.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)");
            connection.connect();
            // if the content type is not html return null
            // purpose of doing this is, when some not html content like a pdf file is requested
            // program blocks for ever
            if (!("text/html".equals((connection.getContentType()))))
                return null;

            Reader rd = new InputStreamReader(connection.getInputStream());
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(rd, doc, 0);
            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid())
            {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null)
                {
                    URL temp = new URL(locator, link);
                    result.add(temp.toString());
                }// END OF IF
                it.next();
            }// END OF WHILE
        // ALL THE CATCH STATEMENTS ARE FOR DEBUGGING
        } catch (FileNotFoundException e) {
            System.out.println("In Out.java" + uriStr);
            e.printStackTrace();
        } catch (MalformedURLException e) {
            System.out.println("In Out.java");
            e.printStackTrace();
        } catch (BadLocationException e) {
            System.out.println("In Out.java");
            e.printStackTrace();
        } catch (IOException e) {
            System.out.println("In Out.java");
            e.printStackTrace();
        } catch (NullPointerException e) {
            System.out.println("In Out.java");
            e.printStackTrace();
        } catch (Exception e) {
            System.out.println("In Out.java");
            e.printStackTrace();
        }
        // RETURN THE SET OF LINKS IN A STRING ARRAY
        return (String[])result.toArray(new String[result.size()]);
    }

    public static void main(String[] args)
    {
        try
        {
            if (args.length != 1)
            {
                System.out.println(" Usage : java Out Url");
                System.exit(0);
            }
            Out o = new Out();
            String links[] = o.getLinks(args[0]);
            if (links == null)
            {
                System.out.println("The Content Type is Not Html");
                return;
            }
            System.out.println(links.length);
            for (int i = 0; i < links.length; i++)
                System.out.println(links[i]);
        } catch (Exception e)
        {
            System.out.println(" In out Main" + e);
            e.printStackTrace();
        }
    }

}
sumantedla
Asked:
sudhakar_koundinya Commented:
Also, URLs that redirect to other URLs will not work with either version of the code.

For example, http://www.rediff.com will not work because it redirects to http://in.rediff.com/index.html when you access it from any type of HTTP client.

Regards,
Sudhakar
 
sumantedla (Author) Commented:
Hi,
For http://in.rediff.com/index.html, I got:
java.net.MalformedURLException: unknown protocol: javascript
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at Out.getLinks(Out.java:41)
        at Out.main(Out.java:80)
Why is this not working?
How can I make my program work when there is a redirection?
 
sumantedla (Author) Commented:
I even tried:

HttpURLConnection connection = (HttpURLConnection)locator.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)");

connection.setFollowRedirects(true);

But it is still not working.
 
sudhakar_koundinya Commented:
For your javascript exception, here is the code:

    String link = (String)s.getAttribute(HTML.Attribute.HREF);

    if (link != null && link.toLowerCase().indexOf("javascript:")==-1)
    {
        URL temp = new URL(locator,link);
        result.add(temp.toString());
    }

The javascript: check is needed because an anchor tag can call a JavaScript function instead of pointing to a page, for example <a href="javascript:openwin()">open</a>. java.net.URL has no handler for the javascript protocol, so such an href has to be skipped before the URL is constructed.
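
A possible variation on the check above, offered only as a sketch: instead of looking for the string "javascript:", resolve the href first and keep it only if the result is an http or https URL. The class name LinkResolver and the method resolve are hypothetical names made up for this example; they are not part of the code posted in this thread.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not from the original post.
class LinkResolver {

    // Resolves href against base. Returns the absolute URL as a string,
    // or null when the href is missing, cannot be parsed by java.net.URL
    // (e.g. javascript:), or does not resolve to http/https (e.g. mailto:).
    static String resolve(URL base, String href) {
        if (href == null) {
            return null;
        }
        String trimmed = href.trim();
        if (trimmed.length() == 0) {
            return null;
        }
        try {
            URL resolved = new URL(base, trimmed);
            String protocol = resolved.getProtocol();
            if ("http".equalsIgnoreCase(protocol) || "https".equalsIgnoreCase(protocol)) {
                return resolved.toString();
            }
            return null;
        } catch (MalformedURLException e) {
            // Unknown protocol such as javascript: ends up here.
            return null;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://in.rediff.com/index.html");
        System.out.println(resolve(base, "news/today.html"));
        System.out.println(resolve(base, "javascript:openwin()"));
    }
}
```

Inside the while loop, the call would be something like String link = LinkResolver.resolve(locator, href); followed by a null check before adding to the result list. This way one bad link does not abort the whole page.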
 
sudhakar_koundinya Commented:
This works at my end:
connection.setFollowRedirects(true);

Check it once again.
 
sumantedla (Author) Commented:
connection.setFollowRedirects(true);

Is this dependent on the JDK version? I am using JDK 1.4.
 
sudhakar_koundinya Commented:
I am using JDK 1.4 as well :-)
 
sudhakar_koundinya Commented:
Both http://www.rediff.com and http://rediff.com work now at my end using the property setting below:
connection.setFollowRedirects(true);
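
For reference, HttpURLConnection.setFollowRedirects is a static method that sets the class-wide default, while setInstanceFollowRedirects is the per-connection setter. If automatic redirects still do not behave as expected (for instance, HttpURLConnection does not automatically follow a redirect that switches between http and https), the redirect can be followed by hand. The following is only a rough sketch of that idea: the class name RedirectFollower, the limit of 5 hops and the shortened User-Agent string are assumptions made for this example, not something from the code in this thread.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not from the original post.
class RedirectFollower {

    static final int MAX_REDIRECTS = 5; // arbitrary limit chosen for this sketch

    // True for the HTTP status codes that ask the client to go elsewhere.
    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM  // 301
            || status == HttpURLConnection.HTTP_MOVED_TEMP  // 302
            || status == HttpURLConnection.HTTP_SEE_OTHER   // 303
            || status == 307
            || status == 308;
    }

    // Opens urlStr and follows Location headers manually, up to MAX_REDIRECTS hops.
    static HttpURLConnection open(String urlStr) throws IOException {
        URL current = new URL(urlStr);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            String protocol = current.getProtocol();
            if (!"http".equals(protocol) && !"https".equals(protocol)) {
                throw new IOException("Not an HTTP(S) URL: " + current);
            }
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // handle redirects ourselves
            conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible)");
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn; // caller reads getContentType()/getInputStream()
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            // Location may be relative, so resolve it against the current URL.
            current = new URL(current, location);
        }
        throw new IOException("Too many redirects starting at " + urlStr);
    }
}
```

With something like this, the final address is available from getURL() on the returned connection, and that final address is the one to use as the base when resolving relative links on the page.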
 
sudhakar_koundinya Commented:
This is the code that I have tested; it is working fine at my end.


import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.html.*;
import javax.swing.text.*;


class Out
{

    public String[] getLinks(String uriStr) {

        List result = new ArrayList();
        try
        {
            URL locator = new URL(uriStr);
            HttpURLConnection connection = (HttpURLConnection)locator.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)");

            // setFollowRedirects is a static method; it sets the class-wide
            // default rather than a per-connection flag.
            connection.setFollowRedirects(true);
            connection.connect();

            // Return null for anything that is not reported as text/html
            if (!("text/html".equals((connection.getContentType()))))
                return null;

            Reader rd = new InputStreamReader(connection.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(rd, doc, 0);

            // Walk over all the A elements in the document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);

            while (it.isValid())
            {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
                // Debug output: print every HREF that is found
                System.err.println("HREF: " + s.getAttribute(HTML.Attribute.HREF));
                String link = (String)s.getAttribute(HTML.Attribute.HREF);

                // Skip javascript: links, which java.net.URL cannot parse
                if (link != null && link.toLowerCase().indexOf("javascript:")==-1)
                {
                    URL temp = new URL(locator, link);
                    result.add(temp.toString());
                }
                it.next();
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
        return (String[])result.toArray(new String[result.size()]);

    }

    public static void main(String[] args)
    {
        try
        {
            if (args.length != 1)
            {
                System.out.println(" Usage : java Out Url");
                System.exit(0);
            }
            Out o = new Out();
            String links[] = o.getLinks(args[0]);
            if (links == null)
            {
                System.out.println("The Content Type is Not Html");
                return;
            }
            System.out.println(links.length);
            for (int i = 0; i < links.length; i++)
                System.out.println(links[i]);
        } catch (Exception e)
        {
            System.out.println(" In out Main" + e);
            e.printStackTrace();
        }
    }

}
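
One more thing that may be worth checking in the code above: getContentType() returns the value of the Content-Type header, and many servers send it with a parameter, for example "text/html; charset=ISO-8859-1". An exact equals("text/html") rejects such a page and getLinks returns null. A more lenient test could look like the sketch below; ContentTypes and isHtml are hypothetical names used only for this illustration.

```java
// Hypothetical helper, not from the original post.
class ContentTypes {

    // True when the media type part of the header is text/html,
    // ignoring case and any parameters after ';' such as charset.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // server sent no Content-Type header
        }
        String mediaType = contentType;
        int semicolon = contentType.indexOf(';');
        if (semicolon >= 0) {
            mediaType = contentType.substring(0, semicolon);
        }
        return "text/html".equalsIgnoreCase(mediaType.trim());
    }
}
```

The check in getLinks would then read if (!ContentTypes.isHtml(connection.getContentType())) return null; which still keeps PDF files and other non-HTML content out.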
 
sumantedla (Author) Commented:
Yeah,

It's working fine. But try it with http://isauhcl.org.

When I tried it with the old version it did not work, but with the new version it works.

What might be the reason?

Anyway, thanks for your help.