How to: Read Headers and Grab HTML???

Hello,
I am trying to create a Java application that will retrieve HTML source (from HTML files uploaded by staff members) from a slow server. There are approx. 500 of these files, and they are rather long because they store lots of information. I need to (a) figure out how to download each file into a string (each file into one element of a string array) as fast as possible (I know there are several ways to do this, but I need the fastest method), and (b) figure out how to read the headers of all the files before I begin downloading so that I can show the download progress in %.

Also, if possible, I need to figure out a way to store the downloaded content in a Microsoft Access database.

The data is stored in the following manner: Firstname - Lastname - Date - Phone Number.
I can parse the content easily, but I need a very fast and effective method for storing the data for faster access in the future. The files are updated every week, so I need to get the date of the first item in the database and compare it to the date in each line of the updated page so that I don't download unnecessary content.

I know this needs a long solution, so I have set the points as high as I could. Also, since there are so many parts, I will most likely be distributing the points.

Thanks for your help, pop.
popnfresh86 Asked:

Javatm Commented:
You can read the file itself after it has been downloaded and display it in a JTextComponent like this:

        // Needs java.io.* and javax.swing.*; runs inside a Swing component ("this").
        JTextPane textPane = new JTextPane();

        try
        {
            File file = new File("Sample.htm");
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line = null;
            StringBuffer allText = new StringBuffer();

            // Read the file line by line and rebuild it with CRLF line endings.
            while ((line = br.readLine()) != null)
            {
                allText.append(line).append("\r\n");
            }
            br.close();
            textPane.setText(allText.toString());
        }
        catch (IOException e)
        {
            JOptionPane.showMessageDialog(this, "Error opening file . . .",
                "Warning . . .", JOptionPane.ERROR_MESSAGE);
        }

Now you can use JDBC to store the Firstname - Lastname - Date - Phone Number data in an Access database. Our help might be limited, because
we cannot give you the whole answer; you have to learn it. I'll give you several links to tutorials.
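
As a rough illustration of the JDBC suggestion above, here is a minimal sketch that splits one "Firstname - Lastname - Date - Phone Number" line into fields and inserts a batch of lines with a PreparedStatement. The class name StaffRecordStore, the table name staff and its column names are hypothetical, and how the Connection to the Access database is obtained depends on the driver in use, so that part is left out.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper; table and column names are assumptions, not from the thread.
final class StaffRecordStore {

    /** Splits "Firstname - Lastname - Date - Phone Number" into its four fields. */
    static String[] parseLine(String line) {
        // Split only on hyphens surrounded by whitespace, so hyphens inside
        // a date or phone number (e.g. 555-1234) are left alone.
        String[] fields = line.trim().split("\\s+-\\s+", 4);
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return fields;
    }

    /** Inserts all lines in one batch and one transaction. */
    static void insertAll(Connection conn, List<String> lines) throws SQLException {
        String sql = "INSERT INTO staff (first_name, last_name, entry_date, phone) "
                   + "VALUES (?, ?, ?, ?)";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String line : lines) {
                String[] f = parseLine(line);
                for (int i = 0; i < f.length; i++) {
                    ps.setString(i + 1, f[i]);
                }
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // leave the table unchanged if any row fails
            throw e;
        }
    }
}
```

Batching the inserts in a single transaction is one common way to cut per-row overhead; whether it helps with a particular Access driver would have to be measured.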
tdisessa Commented:
Check out this package:

http://www.matuschek.net/software/jobo/index.html

It is a web-spider package, but it contains a couple of classes that will help you out.

Basically, all you would need is:

import net.matuschek.http.*;

public class getPage {

   public static void main (String [] urls) throws Exception
   {
      HttpTool oHTool = new HttpTool();
      HttpDoc oHDoc = oHTool.retrieveDocument(urls[0],
                                HttpConstants.GET,
                                "");

      String sContentLength = oHDoc.getHeader("Content-length").getValue();
      String sContents = new String (oHDoc.getContents());
   }
}

Though this might not help you completely, because oHDoc.getContents() might
not return until the complete contents are retrieved.


popnfresh86 (Author) Commented:
Thanks for the quick responses, guys. I'm going to check out your suggestions and see what I come up with.

Let me show you what I am currently using...

I have the following imports:
import java.awt.*;
import java.io.*;
import java.net.*;

This method retrieves the source.
    public static String copy (String urlstring)
    {
        DataInputStream dis = null;
        ByteArrayOutputStream fos = null;
        try
        {
            URL url = new URL (urlstring);
            URLConnection connection = url.openConnection ();
            dis = new DataInputStream (connection.getInputStream ());
            fos = new ByteArrayOutputStream ();
            while (true)
            {
                fos.write (dis.readByte ());
            }
        }
        catch (EOFException eofe)
        {
            try
            {
                dis.close ();
                fos.close ();
            }
            catch (IOException ioe)
            {
                System.out.println ("IOException: " + ioe.getMessage ());
            }
        }
        catch (MalformedURLException murle)
        {
            System.out.println ("MalformedURLException: " + murle.getMessage ());
        }
        catch (IOException ioe)
        {
            System.out.println ("IOException: " + ioe.getMessage ());
        }
        return fos.toString ();
    }

This method retrieves the site size.

    public static int size (String urlstring)
    {
        int size = 0;
        try
        {
            URL url = new URL (urlstring);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection ();
            size = connection.getContentLength ();
        }
        catch (MalformedURLException murle)
        {
            System.out.println ("MalformedURLException: " + murle.getMessage ());
        }
        catch (IOException ioe)
        {
            System.out.println ("IOException: " + ioe.getMessage ());
        }
        return size;

    }

The problem is that these methods are very slow with the files I am retrieving. I would list the files, but they are confidential, so I can't.

I implemented a timer to show how long it takes to get the output of the copy method and size method.

The size method takes 20269 milliseconds for a given file and outputs the size as 644062 bytes.
The copy method takes 32967 milliseconds for a given file and outputs the source of the file.

I don't know if this is as fast as possible, but if it is, then I'm screwed, because I have 500+ files, and 500 x 33 seconds is 4 hours and 35 minutes. I can't wait that long.

I know that using readByte is faster than readLine, because I tried my copy method with both and readByte takes much less time than readLine.

Anyway, I'm not looking for a "how to" but rather a "how fast" answer to my question.

Thanks again for the posts. I will accept an answer(s) shortly.
CEHJ Commented:
Why do you have a separate size method? The fastest way would be to download it directly
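
If the sizes of all files are wanted before any download starts, as in part (b) of the question, one option is to ask for the headers only with an HTTP HEAD request. The following is a minimal sketch, not a tested solution for this server: the class name HeaderSizes and its methods are hypothetical, and whether HEAD is any faster than GET here depends on how the slow server handles it.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for reading sizes from headers without downloading bodies.
final class HeaderSizes {

    /** Returns the Content-Length the server reports, or -1 if it reports none. */
    static long contentLength(String urlString) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(urlString).openConnection();
        try {
            conn.setRequestMethod("HEAD"); // headers only, no response body
            return conn.getContentLengthLong();
        } finally {
            conn.disconnect();
        }
    }

    /** Sums the known sizes; entries of -1 (size not reported) are skipped. */
    static long totalKnown(long[] sizes) {
        long total = 0;
        for (long size : sizes) {
            if (size >= 0) {
                total += size;
            }
        }
        return total;
    }
}
```

The total from totalKnown can then serve as the denominator for an overall progress percentage across all files.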
CEHJ Commented:
I've answered my own question about the size thing. You can speed things up by getting rid of the DataInputStream (its features are unused) and reading into a larger buffer instead of one byte at a time:

public static String copy (String urlstring) throws Exception
{
      final int BUF_SIZE = 1 << 10 << 3; // 8KiB buffer
      int bytesRead = -1;
      long totalRead = 0;
      byte[] buffer = new byte[BUF_SIZE];
      InputStream in = null;
      ByteArrayOutputStream fos = new ByteArrayOutputStream ();
      try
      {
            URL url = new URL (urlstring);
            URLConnection connection = url.openConnection();
            // Content-Length from the headers, or -1 if the server does not send it
            int downloadSize = connection.getContentLength();
            // You can use the above for calculations
            in = connection.getInputStream();
            while ((bytesRead = in.read(buffer)) > -1)
            {
                  fos.write (buffer, 0, bytesRead);
                  totalRead += bytesRead;
                  // A callback could be called here with totalRead and downloadSize
            }
      }
      finally
      {
            try
            {
                  if (in != null)
                  {
                        in.close ();
                  }
            }
            catch (IOException ioe)
            {
                  /* ignore this */
            }
      }
      // Decodes the bytes with the platform default charset
      return fos.toString ();
}
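
The callback mentioned in the comment above could look roughly like the following. This is only a sketch under assumptions: the names ProgressCopy, ProgressListener and percent are hypothetical, and the stream and expected size are passed in so the same loop works for any InputStream, including the one returned by connection.getInputStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical names; a sketch of a buffered copy that reports progress.
final class ProgressCopy {

    interface ProgressListener {
        void onProgress(long bytesSoFar, long totalBytes);
    }

    /** Reads the whole stream in 8 KiB chunks, notifying the listener after each chunk. */
    static byte[] copy(InputStream in, long totalBytes, ProgressListener listener)
            throws IOException {
        byte[] buffer = new byte[8192];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long bytesSoFar = 0;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
            bytesSoFar += bytesRead;
            listener.onProgress(bytesSoFar, totalBytes);
        }
        return out.toByteArray();
    }

    /** Whole-number percentage, or -1 when the total size is unknown. */
    static int percent(long bytesSoFar, long totalBytes) {
        if (totalBytes <= 0) {
            return -1;
        }
        return (int) Math.min(100, bytesSoFar * 100 / totalBytes);
    }
}
```

Returning the raw bytes leaves the choice of charset to the caller, which may matter if the pages are not in the platform default encoding.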


popnfresh86 (Author) Commented:
Thanks for the responses, guys. I kind of figured out how to do the majority of what I asked on my own, but CEHJ gets the points because he helped with the speed concern.

Thanks again, pop.
CEHJ Commented:
8-)