Solved

url spidering question

Posted on 2004-08-29
17
288 Views
Last Modified: 2010-03-31
What java classes should I use to spider the internet (opening a url, following a set number of links, and analyzing the text?)
Question by:ew0kian
17 Comments
 
LVL 35

Accepted Solution

by:
girionis earned 300 total points
ID: 11926697
Use the HTML package that comes with the JDK to get all the links of a web page:
http://javaalmanac.com/egs/javax.swing.text.html/GetLinks.html

and then use the java.net classes to read the text from a link:

http://javaalmanac.com/egs/java.net/ReadFromURL.html?l=rel
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11927368
Points to girionis. You could then use JDBC to store the retrieved data :o) (although probably not the BEST choice :P)
=>  http://javaalmanac.com/cgi-bin/search/find.pl?words=JDBC

You would probably also need the java.util.StringTokenizer class to tokenize the text, etc.  :o)
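For example, a quick sketch of tokenizing page text with StringTokenizer (the class name and delimiter set here are just my own choices):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Break page text into words, treating whitespace and common
    // punctuation as delimiters.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text, " \t\n\r.,;:!?\"()");
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, spider world!"));  // [Hello, spider, world]
    }
}
```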

gL,
[r.D]
 

Author Comment

by:ew0kian
ID: 11958709
And how can I keep websites from banning me for spidering?

 

Author Comment

by:ew0kian
ID: 11959989
I have increased the point value to 200. Can someone please tell me how to spider the internet without getting in trouble with websites?
 
LVL 35

Expert Comment

by:girionis
ID: 11960875
> i have increased the point value to 200, can someone please tell me how to spider the internet without getting in trouble from websites?

What exactly do you mean here? I do not think anyone has ever got in trouble for spidering. The owner of the site cannot tell whether someone is spidering or not; as far as the server is concerned, the request is a normal HTTP request that retrieves data from the server and displays it to the user.
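That said, if you want to be upfront about being a robot, you can set a User-Agent header on the connection so site owners can see who is crawling them. A small sketch (the agent string is made up; note that openConnection() does not touch the network, so nothing is fetched until you read the stream):

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class PoliteFetch {
    // Opens a connection that identifies the spider.  openConnection()
    // is lazy; no request is sent until getInputStream() is called.
    static URLConnection openWithAgent(String urlStr, String agent) throws IOException {
        URLConnection conn = new URL(urlStr).openConnection();
        conn.setRequestProperty("User-Agent", agent);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URLConnection conn = openWithAgent("http://www.example.com/",
                "MySpider/0.1 (contact: admin@example.com)");
        System.out.println(conn.getRequestProperty("User-Agent"));
        // conn.getInputStream() would then fetch the page as usual
    }
}
```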
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11967965
Agreed.
Note though: just don't spider sites that request not to be spidered. (A site specifies this with a robots.txt file or particular meta tags; the link below has info on these. :o) )

You may find here a touch useful:  http://www.robotstxt.org/wc/faq.html#use
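To go with that link, here's a very simplified sketch of checking a robots.txt body against a path before fetching it (the method and the sample rules here are purely illustrative; real robots.txt parsing has more cases):

```java
public class RobotsCheck {
    // Very simplified: returns true if the "User-agent: *" section of a
    // robots.txt body disallows the given path.  Real parsers handle
    // multiple agents, comments, Allow lines, etc.
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean forAll = false;
        String[] lines = robotsTxt.split("\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                forAll = line.substring(11).trim().equals("*");
            } else if (forAll && lower.startsWith("disallow:")) {
                String rule = line.substring(9).trim();
                if (rule.length() > 0 && path.startsWith(rule)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String sample = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowed(sample, "/private/page.html")); // true
        System.out.println(isDisallowed(sample, "/public/page.html"));  // false
    }
}
```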

:o)
[r.D]
 

Author Comment

by:ew0kian
ID: 11980725
OK, I used your advice and used that code to get the links. On some sites it outputs the links fine, but on others it doesn't work. For example, on google.com it ends with javax.swing.text.ChangedCharSetException.

here's the code:

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args){

            String[] output = getLinks("http://www.google.com");
            for(int i=0; i<output.length; i++){
                System.out.println(output[i]);

            }

    }


    // This method takes a URI, which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }




}
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11980760
Try changing:

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

to

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
            kit.read(rd, doc, 0);


(Thanks to TimYates for giving me that advice once... it worked for me. :o) )

gL,
[r.D]
 

Author Comment

by:ew0kian
ID: 11981184
Wow, it worked! Thanks! What exactly does that line of code do?
 
LVL 35

Expert Comment

by:girionis
ID: 11981404
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11983635
>"wow it worked! thanks!  what exactly does that line of code do?"

Basically, the HTML page tries to change its charset halfway through (or after parsing has started, at any rate), so the document throws this exception... unless you tell it not to ;-)

(Thanks to Tim, AGAIN, for giving me that explanation when I previously had a similar problem :o) )

gL,
[r.D]
 

Author Comment

by:ew0kian
ID: 11988068
I've increased the point value to 300.

Once it finds the links and then reads the HTML from each link line by line, I want it to use that loop to get every link in that document and save it to a text file. Could someone post the code to do that? Here's what I have so far (all it does is print out the text of each found HTML document line by line):

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args){
            String site = "http://www.google.com";
            String[] output = getLinks(site);
            for(int i=0; i<output.length; i++){
                String temp = output[i];


                try {
                    // Create a URL for the desired page
                    URL url2 = new URL(temp);

                    // Read all the text returned by the server
                    BufferedReader in = new BufferedReader(new InputStreamReader(url2.openStream()));
                    String str;
                    while ((str = in.readLine()) != null) {
                        System.out.println(str);
                    }
                    in.close();
                } catch (MalformedURLException e) {
                    System.out.println(e);
                } catch (IOException e) {
                    System.out.println(e);
                }


            }


    }


    // This method takes a URI, which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));     // suggested by experts-exchange
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }




}
 

Author Comment

by:ew0kian
ID: 11988113
Also curious: is there a way to, instead of using the for loop to iterate through each link and then read the text from that link, somehow have the program open up several processes so as to simultaneously crunch through the links? Thanks a lot.
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11989163
Yes. Use threads :)

Try doing this:

Create a class that is passed a URL as an argument and runs as a new thread. Once this class finds a load of new URLs, it creates a new thread to process each of them.

However, you would then have the problem of your program running up to hundreds of different threads/processes at once, VERY quickly!! This is going to decrease your system's performance A LOT!

I would recommend creating some sort of 'queue' system:
a String list containing ALL the URLs to be processed. Then have, for example, 5 threads, each taking a different URL from that queue (removing it from the queue) and processing it. Then, for every URL each thread finds, it adds it to the back of the queue! :D
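For instance, that queue idea could be sketched roughly like this (every name here is made up, and the "pages" are faked so it runs without a network; a real spider would call something like the getLinks() method above instead):

```java
import java.util.LinkedList;

public class MiniCrawler {
    private final LinkedList queue = new LinkedList();
    private int processed = 0;
    private final int limit;

    MiniCrawler(String seed, int limit) {
        this.limit = limit;
        queue.add(seed);
    }

    private synchronized void add(String url) {
        queue.addLast(url);
        notifyAll();                      // wake a waiting worker
    }

    // Hands out the next URL, or null once 'limit' URLs have been taken.
    private synchronized String take() throws InterruptedException {
        while (queue.isEmpty() && processed < limit) {
            wait();
        }
        if (processed >= limit) {
            return null;
        }
        processed++;
        if (processed == limit) {
            notifyAll();                  // let the other workers exit too
        }
        return (String) queue.removeFirst();
    }

    // A real spider would fetch the page here and extract its links;
    // this fake version just invents two child links per "page".
    private void process(String url) {
        add(url + "/a");
        add(url + "/b");
    }

    // Starts 'workers' threads, waits for them all, returns URLs processed.
    public int crawl(int workers) throws InterruptedException {
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        String url;
                        while ((url = take()) != null) {
                            process(url);
                        }
                    } catch (InterruptedException e) {
                        // just exit
                    }
                }
            });
            pool[i].start();
        }
        for (int i = 0; i < workers; i++) {
            pool[i].join();
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        MiniCrawler c = new MiniCrawler("http://example.com", 10);
        System.out.println("processed " + c.crawl(5) + " urls");  // processed 10 urls
    }
}
```

The point of the synchronized take()/add() pair is that the workers share one queue safely, and the limit check gives you a clean way to stop the crawl.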

I'm heading off back to school now (it's my lunch hour), but after school, I shall post some code to demonstrate this if you'd like! :D

gL,
[r.D]
 
LVL 9

Expert Comment

by:DrWarezz
ID: 11989174
btw: if you'd like to give threads a go, check out www.javaalmanac.com, search for "threads", and look through the results.

:o)
 

Author Comment

by:ew0kian
ID: 12007926
some example code would be awesome thanks!
 
LVL 35

Expert Comment

by:girionis
ID: 12010772
You won't find better examples than in the following link: http://java.sun.com/docs/books/tutorial/essential/threads/