Solved

url spidering question

Posted on 2004-08-29
290 Views
Last Modified: 2010-03-31
What java classes should I use to spider the internet (opening a url, following a set number of links, and analyzing the text?)
0
Question by:ew0kian
17 Comments
 
LVL 35

Accepted Solution

by:
girionis earned 300 total points
ID: 11926697
Use the HTML package that comes with the JDK to get all the links of a web page:
http://javaalmanac.com/egs/javax.swing.text.html/GetLinks.html

and then use the java.net classes to read the text from a link:

http://javaalmanac.com/egs/java.net/ReadFromURL.html?l=rel
0
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11927368
Points to girionis.. you could then use JDBC to store retrieved data :o) (although probably not the BEST choice :P)
=>  http://javaalmanac.com/cgi-bin/search/find.pl?words=JDBC

You would probably also need to use the  java.util  package (e.g. StringTokenizer) to tokenize the text, etc..  :o)

gL,
[r.D]
0
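To illustrate the java.util suggestion above, here is a minimal sketch of tokenizing page text with StringTokenizer (the `Tokenize` class name and the delimiter set are illustrative choices, not anything from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Tokenize {

    // Split page text into lower-cased word tokens, treating whitespace
    // and common punctuation as delimiters (which are discarded).
    public static List tokens(String text) {
        List result = new ArrayList();
        StringTokenizer st = new StringTokenizer(text, " \t\n\r.,;:!?\"()<>");
        while (st.hasMoreTokens()) {
            result.add(st.nextToken().toLowerCase());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [hello, spider, world]
        System.out.println(tokens("Hello, spider world!"));
    }
}
```

String.split or a regex would work just as well; StringTokenizer is simply the classic java.util answer for code of this vintage.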
 

Author Comment

by:ew0kian
ID: 11958709
And how can I stop websites from banning me for spidering?
0
 

Author Comment

by:ew0kian
ID: 11959989
i have increased the point value to 200, can someone please tell me how to spider the internet without getting into trouble with websites?
0
 
LVL 35

Expert Comment

by:girionis
ID: 11960875
> i have increased the point value to 200, can someone please tell me how to spider the internet without getting into trouble with websites?

What exactly do you mean here? I do not think anyone has ever got in trouble for spidering. The owner of the site cannot easily tell whether someone is spidering or not. As far as the server is concerned the request is a normal HTTP request that retrieves data from the server and returns it to the user.
0
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11967965
Agreed.
Note though: just don't spider sites that request not to be spidered. (A site specifies this with a robots.txt file and/or robots meta tags; the link below has more info on these. :o) )

You may find here a touch useful:  http://www.robotstxt.org/wc/faq.html#use

:o)
[r.D]
0
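To illustrate the robots.txt convention linked above, here is a deliberately naive sketch of checking a path against the Disallow rules (the `RobotsCheck` name is made up, and a real crawler should also honor the per-User-agent sections, which this ignores). The robots.txt text itself could be fetched from http://host/robots.txt with the same URL-reading code already discussed:

```java
public class RobotsCheck {

    // Naive check: a path is disallowed if any "Disallow:" rule in the
    // robots.txt text is a non-empty prefix of it. Per-User-agent
    // sections are ignored in this sketch.
    public static boolean allowed(String robotsTxt, String path) {
        String[] lines = robotsTxt.split("\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String rule = line.substring("disallow:".length()).trim();
                if (rule.length() > 0 && path.startsWith(rule)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(allowed(robots, "/private/page.html")); // prints false
        System.out.println(allowed(robots, "/index.html"));        // prints true
    }
}
```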
 

Author Comment

by:ew0kian
ID: 11980725
ok i used your advice, and used that code to get the links. On some sites I can get it to output the links fine, but on others it doesn't work. For example, when I do google.com it ends with javax.swing.text.ChangedCharSetException

here's the code:

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

public class test {

    public static void main(String[] args){

            String[] output = getLinks("http://www.google.com");
            for(int i=0; i<output.length; i++){
                System.out.println(output[i]);

            }

    }


    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }




}
0
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11980760
Try changing:

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

to

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
            kit.read(rd, doc, 0);


(Thanks to TimYates for giving me that advice once... and it worked for me. :o) )

gL,
[r.D]
0
 

Author Comment

by:ew0kian
ID: 11981184
wow it worked! thanks!  what exactly does that line of code do?
0
 
LVL 35

Expert Comment

by:girionis
ID: 11981404
0
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11983635
>"wow it worked! thanks!  what exactly does that line of code do?"

Basically, the HTML page tries to change the charset halfway through (or after it has started, at any rate), so the document throws this exception... unless you tell it not to ;-)

(Thanks to Tim, AGAIN, for giving me that explanation when I previously had a similar problem :o) )

gL,
[r.D]
0
 

Author Comment

by:ew0kian
ID: 11988068
I've increased the point value to 300.

Once it finds the links and then reads the HTML from each link line by line, I want it to use that loop to get every link in that document and save it to a text file. Could someone post the code to do that? Here's what I have so far (all it does is print out the text in the found HTML document line by line):

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

public class test {

    public static void main(String[] args){
            String site = "http://www.google.com";
            String[] output = getLinks(site);
            for(int i=0; i<output.length; i++){
                String temp = output[i];


                try {
                    // Create a URL for the desired page
                    URL url2 = new URL(temp);

                    // Read all the text returned by the server
                    BufferedReader in = new BufferedReader(new InputStreamReader(url2.openStream()));
                    String str;
                    while ((str = in.readLine()) != null) {
                        System.out.println(str);
                    }
                    in.close();
                } catch (MalformedURLException e) {
                    System.out.println(e);
                } catch (IOException e) {
                    System.out.println(e);
                }


            }


    }


    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));     // suggested by experts-exchange
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }




}
0
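For the "save it to a text file" part of the question above, one minimal sketch using java.io.PrintWriter (the `SaveLinks` class and the file name are illustrative assumptions; it would be called with the array returned by getLinks):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class SaveLinks {

    // Write each found link on its own line to the named file,
    // returning true on success and false on an I/O error.
    public static boolean save(String[] links, String fileName) {
        try {
            PrintWriter out = new PrintWriter(new FileWriter(fileName));
            for (int i = 0; i < links.length; i++) {
                out.println(links[i]);
            }
            out.close();
            return true;
        } catch (IOException e) {
            System.out.println(e);
            return false;
        }
    }

    public static void main(String[] args) {
        String[] links = { "http://www.google.com", "http://www.example.com" };
        System.out.println(save(links, "links.txt")); // prints true if the file was written
    }
}
```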
 

Author Comment

by:ew0kian
ID: 11988113
Also curious: is there a way to, instead of using the for loop to iterate through each link and then read the text from that link, somehow have the program open up several processes so as to simultaneously crunch through the links? Thanks a lot.
0
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11989163
Yes. Use threads :)

Try doing this:

Create a class that is passed a URL as an argument and runs as a new thread. When this class finds a load of new URLs, it creates a new thread to process each of them.

However, you would then have the problem of your program running up to hundreds of different threads/processes at once, VERY quickly!! This is going to decrease your system's performance A LOT!

I would recommend creating some sort of 'queue' system instead:
keep a queue containing ALL the URLs to be processed. Then have, for example, 5 threads, each taking a different URL from that queue (and removing it from the queue) and processing it. Then, for every URL each thread finds, it adds it to the back of the queue! :D

I'm heading off back to school now (it's my lunch hour), but after school, I shall post some code to demonstrate this if you'd like! :D

gL,
[r.D]
0
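The queue system described above can be sketched roughly like this, using a synchronized LinkedList with wait/notifyAll (the `CrawlQueue` name is made up, and JDK 5's java.util.concurrent classes would do the same job with less code):

```java
import java.util.LinkedList;

public class CrawlQueue {
    private final LinkedList queue = new LinkedList();
    private boolean closed = false;

    // Workers block in take() until a URL arrives; add() wakes them up.
    public synchronized void add(String url) {
        queue.addLast(url);
        notifyAll();
    }

    // Returns the next URL, or null once the queue is closed and drained.
    public synchronized String take() {
        while (queue.isEmpty() && !closed) {
            try {
                wait();
            } catch (InterruptedException e) {
                return null;
            }
        }
        return queue.isEmpty() ? null : (String) queue.removeFirst();
    }

    // Call when crawling is finished so blocked workers can exit.
    public synchronized void close() {
        closed = true;
        notifyAll();
    }

    public static void main(String[] args) {
        final CrawlQueue q = new CrawlQueue();
        q.add("http://www.google.com");
        // One worker thread; a real crawler would start, say, 5 of these.
        Thread worker = new Thread() {
            public void run() {
                String url;
                while ((url = q.take()) != null) {
                    System.out.println("processing " + url);
                    // ...fetch the page here, then q.add(...) each link found...
                }
            }
        };
        worker.start();
        q.close();
    }
}
```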
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11989174
btw: if you'd like to give threads a go, check out www.javaalmanac.com and search for "threads".

:o)
0
 

Author Comment

by:ew0kian
ID: 12007926
some example code would be awesome thanks!
0
 
LVL 35

Expert Comment

by:girionis
ID: 12010772
You won't find better examples than in the following link: http://java.sun.com/docs/books/tutorial/essential/threads/
0
