  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 307

url spidering question

What Java classes should I use to spider the internet (opening a URL, following a set number of links, and analyzing the text)?
Asked by ew0kian
1 Solution
 
girionisCommented:
Use the html package that comes with JDK to get all the links of a web page:
http://javaalmanac.com/egs/javax.swing.text.html/GetLinks.html

and then use the java.net classes to read the text from a link:

http://javaalmanac.com/egs/java.net/ReadFromURL.html?l=rel
 
DrWarezzCommented:
Points to girionis.. you could then use JDBC to store retrieved data :o) (although probably not the BEST choice :P)
=>  http://javaalmanac.com/cgi-bin/search/find.pl?words=JDBC

You would probably also need to use classes from the  java.util  package (e.g. StringTokenizer) to tokenize the text, etc..  :o)

gL,
[r.D]
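
As a quick illustration of the tokenizing suggestion above, here is a minimal sketch using java.util.StringTokenizer; the sample text and the delimiter set are just placeholders, not anything prescribed in the thread:

import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        String text = "Spidering the web, one page at a time.";

        // Break the text into words on whitespace and basic punctuation
        StringTokenizer st = new StringTokenizer(text, " \t\n\r,.!?");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}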
 
ew0kianAuthor Commented:
And how can I keep websites from banning me for spidering?
 
ew0kianAuthor Commented:
I have increased the point value to 200. Can someone please tell me how to spider the internet without getting in trouble with websites?
 
girionisCommented:
> I have increased the point value to 200. Can someone please tell me how to spider the internet without getting in trouble with websites?

What exactly do you mean here? I do not think anyone has ever got in trouble for spidering. The owner of the site cannot tell whether someone is spidering or not. As far as the server is concerned, the request is a normal HTTP request that retrieves data from the server and displays it to the user.
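
To make that point concrete, here is a minimal sketch of what such a request looks like from Java: an ordinary HTTP GET through URLConnection. The User-Agent header is optional and is an assumption added here, not something suggested in the comment above; it simply lets a site owner see in their logs who is crawling them:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FetchPage {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.google.com");
        URLConnection conn = url.openConnection();

        // Optional: identify the crawler in the request headers
        conn.setRequestProperty("User-Agent", "MySpider/0.1");

        // Read the response exactly as a browser would
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}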
 
DrWarezzCommented:
Agreed.
Note though: just don't spider sites that request not to be..  (For a site to specify that it doesn't want to be spidered, it uses some particular conventions, such as a robots.txt file or robots meta tags ... I'm sure someone will send you a link with info on these. :o) )

You may find this a touch useful:  http://www.robotstxt.org/wc/faq.html#use

:o)
[r.D]
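
Following up on the robots.txt convention linked above, here is a minimal sketch of checking it before crawling a path. It is deliberately simplified (it ignores User-agent sections and wildcards, and the class and method names are made up for illustration), so treat it as a starting point rather than a full parser:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;

public class RobotsCheck {

    // Returns true if the given path does not appear under any Disallow: rule.
    public static boolean isAllowed(String host, String path) {
        ArrayList disallowed = new ArrayList();
        try {
            URL robots = new URL("http://" + host + "/robots.txt");
            BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String rule = line.substring("disallow:".length()).trim();
                    if (rule.length() > 0) {
                        disallowed.add(rule);
                    }
                }
            }
            in.close();
        } catch (IOException e) {
            // No robots.txt (or unreadable): assume crawling is allowed
            return true;
        }
        for (int i = 0; i < disallowed.size(); i++) {
            if (path.startsWith((String) disallowed.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("www.google.com", "/search"));
    }
}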
 
ew0kianAuthor Commented:
OK, I used your advice and used that code to get the links.  On some sites I can get it to output the links fine, but on others it doesn't work.  For example, when I do google.com it ends with javax.swing.text.ChangedCharSetException

here's the code:

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args) {
        String[] output = getLinks("http://www.google.com");
        for (int i = 0; i < output.length; i++) {
            System.out.println(output[i]);
        }
    }


    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }
}
 
DrWarezzCommented:
Try changing:

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

to

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
            kit.read(rd, doc, 0);


(Thanks to TimYates for giving me that advice once.. and it worked for me. :o) )

gL,
[r.D]
 
ew0kianAuthor Commented:
Wow, it worked! Thanks!  What exactly does that line of code do?
 
DrWarezzCommented:
>"wow it worked! thanks!  what exactly does that line of code do?"

Basically, the HTML page tries to change the charset halfway through (or after it has started, at any rate), so the document throws this exception...  unless you tell it not to ;-)

(Thanks to Tim, AGAIN, for giving me that explanation when I previously had a similar problem :o) )

gL,
[r.D]
 
ew0kianAuthor Commented:
I've increased the point value to 300.

Once it finds the links and then reads the HTML from each link line by line, I want it to use that loop to get every link in that document and save it to a text file.  Could someone post the code to do that?  Here's what I have so far (all it does is print out the text in the found HTML document line by line):

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args) {
        String site = "http://www.google.com";
        String[] output = getLinks(site);
        for (int i = 0; i < output.length; i++) {
            String temp = output[i];

            try {
                // Create a URL for the desired page, resolving relative links
                // (e.g. "/about.html") against the site we started from
                URL url2 = new URL(new URL(site), temp);

                // Read all the text returned by the server
                BufferedReader in = new BufferedReader(new InputStreamReader(url2.openStream()));
                String str;
                while ((str = in.readLine()) != null) {
                    System.out.println(str);
                }
                in.close();
            } catch (MalformedURLException e) {
                System.out.println(e);
            } catch (IOException e) {
                System.out.println(e);
            }
        }
    }


    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));     // suggested by experts-exchange
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }
}
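
Since the file-saving part never got posted in the thread, here is a minimal sketch of that step. It assumes the getLinks method from the test class above, resolves relative links against the start page, and writes everything it finds to links.txt; the file name, the single level of link-following, and the lack of duplicate filtering are all assumptions made for brevity:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;

public class SaveLinks {
    public static void main(String[] args) throws IOException {
        String site = "http://www.google.com";

        // links.txt collects every link found on every page we visit
        PrintWriter out = new PrintWriter(new FileWriter("links.txt"));

        // Links on the start page
        String[] firstLevel = test.getLinks(site);
        for (int i = 0; i < firstLevel.length; i++) {
            try {
                // Resolve possibly-relative links against the start page
                String absolute = new URL(new URL(site), firstLevel[i]).toString();
                out.println(absolute);

                // Links on each page the start page points to
                String[] secondLevel = test.getLinks(absolute);
                for (int j = 0; j < secondLevel.length; j++) {
                    out.println(secondLevel[j]);
                }
            } catch (MalformedURLException e) {
                System.out.println("Skipping " + firstLevel[i] + ": " + e);
            }
        }
        out.close();
    }
}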
 
ew0kianAuthor Commented:
Also, I'm curious: instead of using the for loop to iterate through each link and then read the text from that link... is there a way to somehow have the program open up several processes so as to simultaneously crunch through the links? Thanks a lot.
 
DrWarezzCommented:
Yes. Use threads :)

Try doing this:

Create a class that is passed a URL as an argument and run as a new thread. Once this class finds a load of new URLs, it creates a new thread to process each of them.

However, you would then have the problem of your program running up to hundreds of different threads/processes at once, VERY quickly!! This is going to decrease your system's performance A LOT!

I would recommend creating some sort of 'queue' system:
Keep a list (or String array) containing ALL the URLs to be processed. Then have, for example, 5 threads each taking a different URL from that queue (removing it from the queue) and processing it. Then, for every URL each thread finds, it adds it to the back of the queue! :D

I'm heading off back to school now (it's my lunch hour), but after school I shall post some code to demonstrate this if you'd like! :D

gL,
[r.D]
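
Here is a minimal sketch of the queue idea described above: a shared list of URLs, a fixed number of worker threads pulling from the front, and each worker pushing any links it finds onto the back. The worker count of 5, the reuse of getLinks from the test class above, and the absence of a 'visited' set, politeness delays, or a proper stop condition are all assumptions made for brevity:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedList;

public class CrawlQueue {

    private final LinkedList queue = new LinkedList();   // URLs waiting to be crawled

    public CrawlQueue(String startUrl) {
        queue.add(startUrl);
    }

    // Take the next URL off the front of the queue, or null if it is empty
    public synchronized String next() {
        return queue.isEmpty() ? null : (String) queue.removeFirst();
    }

    // Add a newly discovered URL to the back of the queue
    public synchronized void add(String url) {
        queue.addLast(url);
    }

    public static void main(String[] args) {
        final CrawlQueue crawlQueue = new CrawlQueue("http://www.google.com");

        // Five workers, as in the suggestion above
        for (int i = 0; i < 5; i++) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    String url;
                    // Note: a worker stops as soon as it sees an empty queue;
                    // a real crawler would wait for work and track visited URLs.
                    while ((url = crawlQueue.next()) != null) {
                        // getLinks is the method from the test class posted earlier
                        String[] links = test.getLinks(url);
                        for (int j = 0; j < links.length; j++) {
                            try {
                                // Resolve relative links against the page they came from
                                crawlQueue.add(new URL(new URL(url), links[j]).toString());
                            } catch (MalformedURLException e) {
                                // Skip links that cannot be made absolute
                            }
                        }
                    }
                }
            });
            worker.start();
        }
    }
}

The next() and add() methods are synchronized so that several worker threads can share the same LinkedList without corrupting it.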
 
DrWarezzCommented:
btw: if you'd like to give threads a go, check out www.javaalmanac.com and search for "threads".

:o)
 
ew0kianAuthor Commented:
Some example code would be awesome, thanks!
 
girionisCommented:
You won't find better examples than in the following link: http://java.sun.com/docs/books/tutorial/essential/threads/