Solved

url spidering question

Posted on 2004-08-29
279 Views
Last Modified: 2010-03-31
What java classes should I use to spider the internet (opening a url, following a set number of links, and analyzing the text?)
Question by:ew0kian
17 Comments
 
LVL 35

Accepted Solution

by:
girionis earned 300 total points
Use the html package that comes with JDK to get all the links of a web page:
http://javaalmanac.com/egs/javax.swing.text.html/GetLinks.html

and then use the java.net classes to read the text from a link:

http://javaalmanac.com/egs/java.net/ReadFromURL.html?l=rel
LVL 9

Expert Comment

by:DrWarezz
Points to girionis. You could then use JDBC to store the retrieved data :o) (although that's probably not the BEST choice :P)
=>  http://javaalmanac.com/cgi-bin/search/find.pl?words=JDBC

You would probably also want the classes in the java.util package (e.g. StringTokenizer) to tokenize the text, etc. :o)
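For instance, a minimal sketch using java.util.StringTokenizer to break page text into word tokens (the sample text and class name here are just for illustration):

```java
import java.util.StringTokenizer;

// Minimal sketch: split a chunk of page text into word tokens
// using java.util.StringTokenizer from the standard JDK.
public class TokenizeDemo {
    public static void main(String[] args) {
        String text = "Spider the internet, follow links, analyze the text.";
        // The second argument lists the delimiter characters
        StringTokenizer st = new StringTokenizer(text, " \t\n\r,.!?");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}
```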

gL,
[r.D]

Author Comment

by:ew0kian
And how can I keep websites from banning me for spidering?

Author Comment

by:ew0kian
I have increased the point value to 200. Can someone please tell me how to spider the internet without getting in trouble with websites?
LVL 35

Expert Comment

by:girionis
> i have increased the point value to 200, can someone please tell me how to spider the internet without getting in trouble from websites?

What exactly do you mean here? I do not think anyone ever got in trouble for spidering. The owner of the site cannot easily tell whether someone is spidering or not. As far as the server is concerned, the request is a normal HTTP request that retrieves data from the server and returns it to the user.
LVL 9

Expert Comment

by:DrWarezz
Agreed.
Note though: just don't spider sites that request not to be. (A site specifies this with a robots.txt file and/or robots meta tags, rather than "HTTP tags" ... I'm sure someone will send you a link with info on these. :o) )

You may find this a touch useful:  http://www.robotstxt.org/wc/faq.html#use
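To give the flavour, here's a rough sketch of pulling the Disallow paths out of a robots.txt body and checking a URL path against them. This is not a complete implementation (it ignores per-User-agent sections and Allow lines, which a polite crawler should also honor), and the class and method names are made up for illustration:

```java
import java.util.ArrayList;

// Rough sketch: collect the Disallow paths from a robots.txt body and
// check a candidate path against them. Ignores per-User-agent sections
// and Allow lines -- a real crawler should handle both.
public class RobotsCheck {

    public static ArrayList getDisallowed(String robotsTxt) {
        ArrayList disallowed = new ArrayList();
        String[] lines = robotsTxt.split("\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (path.length() > 0) {
                    disallowed.add(path);
                }
            }
        }
        return disallowed;
    }

    // A path is off-limits if it starts with any Disallow prefix
    public static boolean isAllowed(String path, ArrayList disallowed) {
        for (int i = 0; i < disallowed.size(); i++) {
            if (path.startsWith((String) disallowed.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /search\nDisallow: /private/\n";
        ArrayList d = getDisallowed(robots);
        System.out.println(isAllowed("/search?q=java", d)); // false
        System.out.println(isAllowed("/index.html", d));    // true
    }
}
```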

:o)
[r.D]

Author Comment

by:ew0kian
OK, I used your advice and that code to get the links. On some sites I can get it to output the links fine, but on others it doesn't work. For example, when I do google.com it ends with javax.swing.text.ChangedCharSetException

here's the code:

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args){

            String[] output = getLinks("http://www.google.com");
            for(int i=0; i<output.length; i++){
                System.out.println(output[i]);

            }

    }


    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }




}
LVL 9

Expert Comment

by:DrWarezz
Try changing:

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            kit.read(rd, doc, 0);

to

            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
            kit.read(rd, doc, 0);


(Thanks to TimYates for giving me that advice once... it worked for me. :o) )

gL,
[r.D]

Author Comment

by:ew0kian
Wow, it worked! Thanks! What exactly does that line of code do?
LVL 9

Expert Comment

by:DrWarezz
>"wow it worked! thanks!  what exactly does that line of code do?"

Basically, the HTML page tries to change charset halfway through (or after parsing has started, at any rate), so the document throws this exception... unless you tell it not to ;-)

(Thanks to Tim, AGAIN, for giving me that explanation when I previously had a similar problem :o) )

gL,
[r.D]

Author Comment

by:ew0kian
I've increased the point value to 300.

Once it finds the links, and then reads the HTML from each link line by line, I want it to use that loop to get every link in that document and save it to a text file. Could someone post the code to do that? Here's what I have so far (all it does is print out the text of each found HTML document line by line):

import javax.swing.text.html.*;
import javax.swing.text.EditorKit;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.io.*;
import java.net.*;
import java.util.ArrayList;

/**
 * Created by IntelliJ IDEA.
 * User: ewok
 * Date: Sep 1, 2004
 * Time: 10:34:44 PM
 * To change this template use File | Settings | File Templates.
 */
public class test {

    public static void main(String[] args){
            String site = "http://www.google.com";
            String[] output = getLinks(site);
            for(int i=0; i<output.length; i++){
                String temp = output[i];


                try {
                    // Create a URL for the desired page
                    URL url2 = new URL(temp);

                    // Read all the text returned by the server
                    BufferedReader in = new BufferedReader(new InputStreamReader(url2.openStream()));
                    String str;
                    while ((str = in.readLine()) != null) {
                        System.out.println(str);
                    }
                    in.close();
                } catch (MalformedURLException e) {
                    System.out.println(e);
                } catch (IOException e) {
                    System.out.println(e);
                }


            }


    }


    // This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html) and returns all HREF links in the document.
    public static String[] getLinks(String uriStr) {
        ArrayList result = new ArrayList();

        try {
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", new Boolean(true));     // suggested by experts-exchange
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();

                String link = (String)s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    result.add(link);
                }
                it.next();
            }
        } catch (MalformedURLException e) {
            System.out.println(e);
        } catch (URISyntaxException e) {
            System.out.println(e);
        } catch (BadLocationException e) {
            System.out.println(e);
        } catch (IOException e) {
            System.out.println(e);
        }

        // Return all found links
        return (String[])result.toArray(new String[result.size()]);
    }




}
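For the save-to-a-text-file part, a minimal sketch with java.io.PrintWriter (the file name and class name here are made up; it just writes one link per line):

```java
import java.io.*;

// Minimal sketch: write an array of links to a text file, one per line.
public class SaveLinks {
    public static void saveLinks(String[] links, String fileName) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(fileName));
        for (int i = 0; i < links.length; i++) {
            out.println(links[i]);
        }
        out.close();
    }

    public static void main(String[] args) throws IOException {
        String[] links = { "http://www.example.com/a", "http://www.example.com/b" };
        saveLinks(links, "links.txt");
    }
}
```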

Author Comment

by:ew0kian
Also curious: is there a way to, instead of using the for loop to iterate through each link and then read the text from that link... somehow have the program open up several processes so as to simultaneously crunch through the links? Thanks a lot.
LVL 9

Expert Comment

by:DrWarezz
Yes. Use threads :)

Try doing this:

Create a class that is passed a URL as an argument and run as a new thread. Once this class finds a load of new URLs, it creates a new thread to process each one.

However, you would then have the problem of your program spinning up hundreds of different threads/processes at once, VERY quickly!! This is going to decrease your system's performance A LOT!

I would recommend creating some sort of 'queue' system: a list containing ALL the URLs to be processed. Then have, for example, 5 threads, each taking a different URL from the queue (removing it from the queue) and processing it. Then, for every URL each thread finds, it adds it to the back of the queue! :D

I'm heading off back to school now (it's my lunch hour), but after school, I shall post some code to demonstrate this if you'd like! :D

gL,
[r.D]
LVL 9

Expert Comment

by:DrWarezz
BTW: if you'd like to give threads a go, check out www.javaalmanac.com, search for "threads", and check out the results.

:o)

Author Comment

by:ew0kian
Some example code would be awesome, thanks!
LVL 35

Expert Comment

by:girionis
You won't find better examples than in the following link: http://java.sun.com/docs/books/tutorial/essential/threads/