Solved

How to: Read Headers and Grab HTML???

Posted on 2004-10-24
Last Modified: 2011-09-20
Hello,
I am trying to create a Java application that will retrieve HTML source (from HTML files uploaded by staff members) from a slow server. There are approx. 500 of these files, and they are rather long because they store a lot of information. I need to (a) download each file into a string (each file into one element of a string array) as fast as possible (I know there are several ways to do this, but I need the fastest method), and (b) read the headers of all the files before I begin downloading so that I can show the download progress in %.

Also, if possible, I need to figure out a way to store the downloaded content in a Microsoft Access database.

The data is stored in the following manner: Firstname - Lastname - Date - Phone Number.
I can parse the content easily, but I need a very fast and effective method for storing the data for faster access in the future. The files are updated every week, so I need to get the date of the first item in the database and compare it to the date in each line of the updated page so that I don't download unnecessary content.

I know this is a long question, so I have set the points as high as I could. Also, since there are so many parts, I will most likely be distributing the points.

Thanks for your help, pop.
0
Question by:popnfresh86
    8 Comments
     
    LVL 15

    Expert Comment

    by:Javatm
    You can read the file itself after it has been downloaded and display it in a JTextComponent like this:


        JTextPane x = new JTextPane();

        try
        {
            File fileName = new File("Sample.htm");

            BufferedReader br = new BufferedReader(new FileReader(fileName));
            String text = null;
            StringBuffer allText = new StringBuffer();

            // Read the file line by line and collect it in one buffer
            while ((text = br.readLine()) != null)
            {
                allText.append(text + "\r\n");
            }
            // Show the collected text in the pane declared above
            x.setText(allText.toString());
            br.close();
        }
        catch (IOException e)
        {
            JOptionPane.showMessageDialog(this, "Error opening file . . .",
                "Warning . . .", JOptionPane.ERROR_MESSAGE);
        }

    Now you can use JDBC to store the Firstname - Lastname - Date - Phone Number data in the Access database. Our help might be limited because
    we cannot give you the whole answer; you have to learn it. I'll give you several links to tutorials.
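As a rough illustration of the JDBC suggestion, here is a minimal sketch that parses one record line and inserts many records in a single batch inside one transaction, which is usually much faster than committing row by row. The class name `RecordStore`, the table `people`, and its column names are all hypothetical, and the sketch assumes the fields are separated by " - " (space, hyphen, space). How the `Connection` to the Access database is obtained depends on the driver in use and is left out.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class RecordStore {

    // Splits "Firstname - Lastname - Date - Phone Number" into four trimmed fields.
    // Assumption: fields are separated by " - ", so hyphens inside a date
    // or phone number (no surrounding spaces) are left alone.
    static String[] parseLine(String line) {
        String[] parts = line.trim().split("\\s+-\\s+", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return parts;
    }

    // Inserts all lines as one batch in one transaction.
    // The table "people" and its columns are hypothetical names.
    static void insertAll(Connection conn, List<String> lines) throws SQLException {
        String sql = "INSERT INTO people (first_name, last_name, entry_date, phone) "
                   + "VALUES (?, ?, ?, ?)";
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String line : lines) {
                String[] fields = parseLine(line);
                for (int i = 0; i < fields.length; i++) {
                    ps.setString(i + 1, fields[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The date is stored as text here only to keep the sketch short; a real table would more likely use a date column so that the weekly "is this line newer?" comparison can be done in SQL.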
    0
     
    LVL 15

    Expert Comment

    by:Javatm
    0
     
    LVL 2

    Expert Comment

    by:tdisessa
    Check out this package:

    http://www.matuschek.net/software/jobo/index.html

    It is a web-spider package, but it contains a couple of classes that will help you out.

    Basically, all you would need is:

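    import net.matuschek.http.*;

    public class getPage {

       public static void main (String [] urls) throws Exception
       {
          HttpTool oHTool = new HttpTool();
          // Fetch the document with a GET request and no extra parameters
          // (check the JoBo documentation for the exact argument types)
          HttpDoc oHDoc = oHTool.retrieveDocument(urls[0],
                                    HttpConstants.GET,
                                    "");

          // The Content-length header gives the size; getContents() gives the body
          String sContentLength = oHDoc.getHeader("Content-length").getValue();
          String sContents = new String (oHDoc.getContents());
       }
    }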

    Though, this might not help you completely, because the oHDoc.getContents() might
    not return until the complete contents are retrieved.

    0
     

    Author Comment

    by:popnfresh86
    Thanks for the quick responses, guys. I'm going to check out your suggestions and see what I come up with.

    Let me show you what I am currently using...

    I have the following imports:
    import java.awt.*;
    import java.io.*;
    import java.net.*;

    This method retrieves the source.
        public static String copy (String urlstring)
        {
            DataInputStream dis = null;
            ByteArrayOutputStream fos = null;
            try
            {
                URL url = new URL (urlstring);
                URLConnection connection = url.openConnection ();
                dis = new DataInputStream (connection.getInputStream ());
                fos = new ByteArrayOutputStream ();
                while (true)
                {
                    fos.write (dis.readByte ());
                }
            }
            catch (EOFException eofe)
            {
                try
                {
                    dis.close ();
                    fos.close ();
                }
                catch (IOException ioe)
                {
                    System.out.println ("IOException: " + ioe.getMessage ());
                }
            }
            catch (MalformedURLException murle)
            {
                System.out.println ("MalformedURLException: " + murle.getMessage ());
            }
            catch (IOException ioe)
            {
                System.out.println ("IOException: " + ioe.getMessage ());
            }
            return fos.toString ();
        }

    This method retrieves the site size.

        public static int size (String urlstring)
        {
            int size = 0;
            try
            {
                URL url = new URL (urlstring);
                HttpURLConnection connection = (HttpURLConnection) url.openConnection ();
                size = connection.getContentLength ();
            }
            catch (MalformedURLException murle)
            {
                System.out.println ("MalformedURLException: " + murle.getMessage ());
            }
            catch (IOException ioe)
            {
                System.out.println ("IOException: " + ioe.getMessage ());
            }
            return size;

        }

    The problem is that these methods are very slow with the files I am retrieving. I would list the files but because they are confidential, I can't.

    I implemented a timer to show how long it takes to get the output of the copy method and size method.

    The size method takes 20269 milliseconds for a given file and outputs the size as 644062 bytes.
    The copy method takes 32967 milliseconds for a given file and outputs the source of the file.
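A note on the size method above: `HttpURLConnection` sends a GET request by default, so asking for the content length still requests the whole document, and the connection is never closed. One option that may be cheaper on a slow server is a HEAD request, which asks for the headers only. The following is a minimal sketch of that idea, not a measured fix; the class name `HeaderProbe` and the helper `totalKnown` are hypothetical, and it assumes Java 7 or later for `getContentLengthLong()`.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class HeaderProbe {

    // Requests headers only and returns the Content-Length,
    // or -1 if the server does not report one.
    static long contentLength(String urlString) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(urlString).openConnection();
        try {
            conn.setRequestMethod("HEAD");
            return conn.getContentLengthLong();
        } finally {
            conn.disconnect();
        }
    }

    // Sums the sizes that are known, skipping entries reported as -1,
    // to get a total for an overall progress percentage.
    static long totalKnown(long[] sizes) {
        long total = 0;
        for (long size : sizes) {
            if (size >= 0) {
                total += size;
            }
        }
        return total;
    }
}
```

Whether this helps depends on why the server is slow: if most of the 20 seconds is spent before the first byte of the response arrives, a HEAD request will wait just as long.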

    I don't know if this is as fast as possible, but if it is, then I'm in trouble: I have 500+ files, and 500 x 33 seconds is 4 hours and 35 minutes. I can't wait that long.

    I know that using readByte is faster than readLine because I tried my copy method with both, and readByte takes much less time than readLine.

    Anyway, I'm not looking for a "how to" but rather a "how fast" answer to my question.

    Thanks again for the posts. I will accept an answer(s) shortly.
    0
     
    LVL 86

    Expert Comment

    by:CEHJ
    Why do you have a separate size method? The fastest way would be to download it directly
    0
     
    LVL 86

    Accepted Solution

    by:CEHJ
    I've answered my own question about the size thing. You can speed things up by getting rid of the DataInputStream (it isn't needed) and reading into a larger buffer instead of one byte at a time:

    public static String copy (String urlstring) throws Exception
    {
          final int BUF_SIZE = 1 << 10 << 3; // 8KiB buffer
          int bytesRead = -1;
          byte[] buffer = new byte[BUF_SIZE];
          InputStream in = null;
          // Created up front so it is never null in the finally block or at the return
          ByteArrayOutputStream fos = new ByteArrayOutputStream ();
          try
          {
                URL url = new URL (urlstring);
                URLConnection connection = url.openConnection();
                // -1 if the server does not report a content length
                int downloadSize = connection.getContentLength();
                // You can use the above for progress calculations
                in = connection.getInputStream();
                while ((bytesRead = in.read(buffer)) > -1)
                {
                      fos.write (buffer, 0, bytesRead);
                      // A callback could be called here with fos.size() / downloadSize
                }
          }
          finally
          {
                try
                {
                      if (in != null) in.close ();
                      fos.close ();
                }
                catch (IOException ioe)
                {
                      /* ignore this */
                }
          }
          return fos.toString ();
    }
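The callback mentioned in the comment inside the loop could look something like the sketch below. It is only one possible shape: the names `ProgressCopy`, `ProgressListener`, `readAll`, and `percent` are hypothetical, and the total size is assumed to come from `getContentLength()` as in the method above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class ProgressCopy {

    // Hypothetical callback type: receives the running byte count and the expected total.
    interface ProgressListener {
        void onProgress(long bytesSoFar, long totalBytes);
    }

    // Reads the whole stream through an 8 KiB buffer, reporting progress after each chunk.
    static byte[] readAll(InputStream in, long totalBytes, ProgressListener listener)
            throws IOException {
        byte[] buffer = new byte[8 * 1024];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long soFar = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            soFar += n;
            listener.onProgress(soFar, totalBytes);
        }
        return out.toByteArray();
    }

    // Converts a byte count into a whole percentage, capped at 100.
    // Returns -1 when the total is unknown (for example, content length reported as -1).
    static int percent(long bytesSoFar, long totalBytes) {
        if (totalBytes <= 0) {
            return -1;
        }
        return (int) Math.min(100, bytesSoFar * 100 / totalBytes);
    }
}
```

A caller would pass `connection.getInputStream()` and the content length, and update a progress bar from the listener using `percent(...)`. For a single overall bar across all 500 files, the same idea applies with the running total and the sum of all content lengths.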


    0
     

    Author Comment

    by:popnfresh86
    Thanks for the responses, guys. I figured out most of what I asked on my own, but CEHJ gets the points because he helped with the speed concern.

    Thanks again, pop.
    0
     
    LVL 86

    Expert Comment

    by:CEHJ
    8-)
    0
