Solved

url spidering question

Posted on 2004-08-29
Last Modified: 2010-03-31
Which Java classes should I use to spider the internet (opening a URL, following a set number of links, and analyzing the text)?
Question by:ew0kian
 
LVL 35

Accepted Solution

by:
girionis earned 300 total points
ID: 11926697
Use the HTML package that comes with the JDK (javax.swing.text.html) to get all the links of a web page:
http://javaalmanac.com/egs/javax.swing.text.html/GetLinks.html

and then use the java.net classes to read the text from a link:

http://javaalmanac.com/egs/java.net/ReadFromURL.html?l=rel
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11927368
Points to girionis. You could then use JDBC to store the retrieved data :o) (although probably not the BEST choice :P)
=>  http://javaalmanac.com/cgi-bin/search/find.pl?words=JDBC

You would probably also need classes from the java.util package (for example StringTokenizer) to tokenize the text, etc.  :o)

gL,
[r.D]
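
To illustrate the tokenizing step mentioned above, here is a minimal sketch using java.util.StringTokenizer. The class name TextTokens, the method name words, and the delimiter set are assumptions made for this example; none of them come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TextTokens {

    // Splits page text into lower-cased words. Whitespace and a handful of
    // punctuation characters act as delimiters (an assumed, simplistic set).
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, " \t\r\n.,;:!?()\"");
        while (st.hasMoreTokens()) {
            result.add(st.nextToken().toLowerCase());
        }
        return result;
    }
}
```

Calling TextTokens.words(pageText) on the text read from a link gives a word list that can then be counted or searched. Real pages would also need their HTML markup stripped first, which this sketch does not attempt.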
 

Author Comment

by:ew0kian
ID: 11958709
And how can I keep websites from banning me for spidering?

Author Comment

by:ew0kian
ID: 11959989
I have increased the point value to 200. Can someone please tell me how to spider the internet without getting in trouble with websites?
 
LVL 35

Expert Comment

by:girionis
ID: 11960875
> I have increased the point value to 200. Can someone please tell me how to spider the internet without getting in trouble with websites?

What exactly do you mean here? I do not think anyone ever got in trouble for spidering. The owner of the site generally cannot tell from a single request whether it comes from a spider or a browser. As far as the server is concerned, the request is a normal HTTP request that retrieves data from the server and returns it to the client.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11967965
Agreed.
Note though: just don't spider sites that request not to be. (For a site to specify that it should not be spidered, it publishes a robots.txt file or puts a robots META tag in its pages; the link below has info on these. :o) )

You may find this a touch useful:  http://www.robotstxt.org/wc/faq.html#use

:o)
[r.D]
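
As a rough sketch of honoring robots.txt: the class below parses the text of a robots.txt file and answers whether a given path may be fetched. RobotsRules and isAllowed are hypothetical names chosen for this example. Only groups addressed to all crawlers ("User-agent: *") and plain prefix Disallow rules are handled, which is a simplification of what the format allows.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsRules {

    private final List<String> disallowed = new ArrayList<>();

    // Parses robots.txt content, keeping the Disallow prefixes that appear
    // in groups applying to every crawler ("User-agent: *").
    RobotsRules(String robotsTxt) {
        boolean appliesToUs = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                          // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group
                appliesToUs = (lastWasAgent && appliesToUs) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    // Returns false if the path starts with any collected Disallow prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The idea would be to download http://host/robots.txt once per host with the java.net classes, build a RobotsRules from its text, and skip any link whose path is not allowed. Pausing between requests to the same host is another common courtesy, though it is not shown here.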
 

Author Comment

by:ew0kian
ID: 11980725
OK, I used your advice and used that code to get the links. On some sites it outputs the links fine, but on others it doesn't work. For example, with google.com it ends with javax.swing.text.ChangedCharSetException.

here's the code:

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args) {
        String[] output = getLinks("http://www.google.com");
        for (int i = 0; i < output.length; i++) {
            System.out.println(output[i]);
        }
    }

    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList<String> result = new ArrayList<String>();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();

                String link = (String) s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return result.toArray(new String[result.size()]);
    }
}
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11980760
Try changing:

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

to

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(rd, doc, 0);


(Thanks to TimYates, for giving me that advice once.. and it worked for me. :o)

gL,
[r.D]
 

Author Comment

by:ew0kian
ID: 11981184
wow it worked! thanks!  what exactly does that line of code do?
 
LVL 35

Expert Comment

by:girionis
ID: 11981404
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11983635
>"wow it worked! thanks!  what exactly does that line of code do?"

Basically, the HTML page declares a charset in a META tag after the parser has already started reading it with a different one, so the parser throws this exception... unless you tell it to ignore the directive ;-)

(Thanks to Tim, AGAIN, for giving me that explanation when I previously had a similar problem :o) )

gL,
[r.D]
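
A complementary idea, offered only as a sketch: when the server names a charset in its Content-Type header, the reader can be created with that charset from the start instead of the platform default. The helper below extracts it from a header value such as "text/html; charset=UTF-8". ContentTypeCharset and fromContentType are hypothetical names, and the ISO-8859-1 fallback is an assumption made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value, or
    // ISO-8859-1 (assumed fallback) when none is given or it is unknown.
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase().startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                                   .replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

In getLinks this could be used as new InputStreamReader(conn.getInputStream(), ContentTypeCharset.fromContentType(conn.getContentType())). The IgnoreCharsetDirective property is still needed, because the page itself may carry a META charset declaration.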
 

Author Comment

by:ew0kian
ID: 11988068
I've increased the point value to 300.

Once it finds the links and reads the HTML from each link line by line, I want it to use that loop to get every link in that document and save it to a text file. Could someone post the code to do that? Here's what I have so far (all it does is print out the text of each linked HTML document line by line):

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args) {
        String site = "http://www.google.com";
        String[] output = getLinks(site);
        for (int i = 0; i < output.length; i++) {
            String temp = output[i];

            try {
                // Create a URL for the desired page. Relative links such as
                // "/about" are resolved against the page they were found on;
                // new URL(temp) alone would reject them as malformed.
                URL url2 = new URL(new URL(site), temp);

                // Read all the text returned by the server
                BufferedReader in = new BufferedReader(new InputStreamReader(url2.openStream()));
                String str;
                while ((str = in.readLine()) != null) {
                    System.out.println(str);
                }
                in.close();
            } catch (MalformedURLException e) {
                System.out.println(e);
            } catch (IOException e) {
                System.out.println(e);
            }
        }
    }

    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList<String> result = new ArrayList<String>();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);     // suggested by experts-exchange
            kit.read(rd, doc, 0);
            rd.close();

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet) it.getAttributes();

                String link = (String) s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return result.toArray(new String[result.size()]);
    }
}
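
For the request above (saving every found link to a text file), one possible sketch follows. It turns the hrefs returned by getLinks into absolute http/https URLs and writes them one per line. LinkFileWriter, resolveAll, and save are hypothetical names; skipping unparsable hrefs and non-web schemes such as mailto: is a choice made for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LinkFileWriter {

    // Resolves each href against the page it came from, so relative links
    // such as "/about" become absolute. Hrefs that do not parse as URIs or
    // that are not http/https are skipped.
    static List<String> resolveAll(String pageUrl, String[] hrefs) {
        URI base = URI.create(pageUrl);
        if (base.getPath() == null || base.getPath().isEmpty()) {
            // "http://host" has an empty path; give it "/" so that
            // relative references resolve below the root
            base = base.resolve("/");
        }
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            try {
                URI resolved = base.resolve(href.trim());
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    absolute.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // href is not a valid URI (for example, it contains a space)
            }
        }
        return absolute;
    }

    // Writes the links to the given file, one per line, as UTF-8.
    static void save(List<String> links, Path file) throws IOException {
        Files.write(file, links, StandardCharsets.UTF_8);
    }
}
```

Inside the existing loop this would look something like LinkFileWriter.save(LinkFileWriter.resolveAll(temp, getLinks(temp)), Paths.get("links.txt")), with java.nio.file.Paths imported. Note that save overwrites the file on each call, so collecting links from several pages into one list before saving is probably what is wanted.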
 

Author Comment

by:ew0kian
ID: 11988113
Also curious: is there a way, instead of using the for loop to iterate through each link and then read the text from that link, to somehow have the program open up several processes so as to crunch through the links simultaneously? Thanks a lot.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11989163
Yes. Use threads :)

Try doing this:

Create a class that is passed a URL as an argument and runs as a new thread. Once this class finds a load of new URLs, it creates a new thread to process each of them.

However, you would then have the problem of your program running up to hundreds of different threads at once, VERY quickly!! This is going to decrease your system's performance a LOT!

I would recommend creating some sort of 'queue' system:
a shared queue (rather than a fixed-size String array) containing ALL the URLs to be processed. Then have, for example, 5 threads each taking a different URL from that queue (and removing it from the queue) and processing it. Every URL a thread then finds gets added to the back of the queue! :D

I'm heading off back to school now (it's my lunch hour), but after school, I shall post some code to demonstrate this if you'd like! :D

gL,
[r.D]
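
The queue idea above can be sketched roughly as follows. Instead of a hand-written queue, this version leans on a fixed-size thread pool from java.util.concurrent, whose internal work queue plays the role of the URL queue; that is a swapped-in technique, not the exact design described. CrawlQueueSketch, crawl, and the getLinks parameter are hypothetical names, and maxPages is an assumed safety limit so the crawl cannot grow without bound.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

class CrawlQueueSketch {

    // Crawls from 'start' using a fixed number of worker threads and returns
    // the URLs that were scheduled. 'getLinks' stands in for whatever fetches
    // a page and returns the links found on it.
    static Set<String> crawl(String start, int workers, int maxPages,
                             Function<String, List<String>> getLinks) {
        Set<String> seen = new HashSet<>();            // guarded by synchronized (seen)
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        Phaser pending = new Phaser(1);                // one party for the calling thread
        schedule(start, maxPages, getLinks, seen, pool, pending);
        pending.arriveAndAwaitAdvance();               // blocks until no task is outstanding
        pool.shutdown();
        return seen;
    }

    private static void schedule(String url, int maxPages,
                                 Function<String, List<String>> getLinks,
                                 Set<String> seen, ExecutorService pool, Phaser pending) {
        synchronized (seen) {
            // Skip URLs already scheduled, and stop growing at the page limit
            if (seen.size() >= maxPages || !seen.add(url)) {
                return;
            }
        }
        pending.register();                            // count this task as outstanding
        pool.execute(() -> {
            try {
                for (String link : getLinks.apply(url)) {
                    schedule(link, maxPages, getLinks, seen, pool, pending);
                }
            } finally {
                pending.arriveAndDeregister();         // this task is done
            }
        });
    }
}
```

With the code from this thread, a call might look like CrawlQueueSketch.crawl("http://www.google.com", 5, 100, u -> Arrays.asList(test.getLinks(u))), assuming links have been made absolute first. The 'seen' set is what keeps two threads from processing the same URL twice.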
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11989174
btw: if you'd like to give using threads a go, check out: www.javaalmanac.com, then search for "threads".. And check out the results.

:o)
 

Author Comment

by:ew0kian
ID: 12007926
some example code would be awesome thanks!
 
LVL 35

Expert Comment

by:girionis
ID: 12010772
You won't find better examples than in the following link: http://java.sun.com/docs/books/tutorial/essential/threads/

Question has a verified solution.
