Solved

How to: Read Headers and Grab HTML???

Posted on 2004-10-24
Last Modified: 2011-09-20
Hello,
I am trying to create a Java application that retrieves HTML source (from HTML files uploaded by staff members) from a slow server. There are approximately 500 of these files, and they are rather long because they store a lot of information. I need to (a) download each file into a string (one file per element of a string array) as fast as possible (I know there are several ways to do this, but I need the fastest one), and (b) read the headers of all the files before I begin downloading so that I can show the download progress as a percentage.

Also, if possible, I need to figure out a way to store the downloaded content in a Microsoft Access database.

The data is stored in the following manner: Firstname - Lastname - Date - Phone Number.
I can parse the content easily, but I need a fast and effective method for storing the data so that it can be accessed quickly in the future. The files are updated every week, so I need to get the date of the first item in the database and compare it to the date in each line of the updated page so that I don't download unnecessary content.
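The date comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (RecordFilter, Entry, parse, newerThan) are made up, the fields are assumed to be separated by a hyphen with whitespace on both sides, and the date is assumed to be in ISO form (yyyy-MM-dd), which the question does not specify.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names, separator and date format are assumptions.
class RecordFilter {

    /** One parsed line: first name, last name, date, phone number. */
    static final class Entry {
        final String firstName;
        final String lastName;
        final LocalDate date;
        final String phone;

        Entry(String firstName, String lastName, LocalDate date, String phone) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.date = date;
            this.phone = phone;
        }
    }

    /** Splits a line such as "Ann - Lee - 2004-10-24 - 555-0100" into its four fields. */
    static Entry parse(String line) {
        // Only hyphens surrounded by whitespace count as separators,
        // so the hyphens inside the date and phone number are left alone.
        String[] parts = line.trim().split("\\s+-\\s+", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new Entry(parts[0], parts[1], LocalDate.parse(parts[2]), parts[3]);
    }

    /** Keeps only the entries dated after the newest date already stored. */
    static List<Entry> newerThan(List<String> lines, LocalDate newestStored) {
        List<Entry> fresh = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Entry entry = parse(line);
            if (entry.date.isAfter(newestStored)) {
                fresh.add(entry);
            }
        }
        return fresh;
    }
}
```

Note that a filter like this only limits what gets stored; the updated page still has to be fetched before its lines can be compared.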

I know this is a long solution, so I have set the points as high as I could. Also, since there are so many parts, I will most likely be distributing the points.

Thanks for your help, pop.
Question by:popnfresh86
8 Comments
 
LVL 15

Expert Comment

by:Javatm
ID: 12396694
You can read the file itself after it is downloaded and display it in a JTextComponent, like this:


        // Needs: import java.io.*; import javax.swing.*;
        JTextPane t2 = new JTextPane();

        try {
            File fileName = new File("Sample.htm");
            BufferedReader br = new BufferedReader(new FileReader(fileName));
            String text = null;
            StringBuffer allText = new StringBuffer();

            // Read the file line by line, restoring the line breaks
            while ((text = br.readLine()) != null) {
                allText.append(text).append("\r\n");
            }
            t2.setText(allText.toString());
            br.close();
        } catch (IOException e) {
            // "this" must be a Component, e.g. the enclosing frame
            JOptionPane.showMessageDialog(this, "Error opening file . . .",
                    "Warning . . .", JOptionPane.ERROR_MESSAGE);
        }

Now you can use JDBC to store the Firstname - Lastname - Date - Phone Number records in an Access database. Our help may be limited because we cannot give you the whole answer; you have to learn it. I'll give you several links to tutorials.
0
 
LVL 15

Expert Comment

by:Javatm
ID: 12396703
0
 
LVL 2

Expert Comment

by:tdisessa
ID: 12396914
Check out this package:

http://www.matuschek.net/software/jobo/index.html

It is a web-spider package, but it contains a couple of classes that will help you out.

Basically, all you would need is:

import net.matuschek.http.*;

public class getPage {

   public static void main (String [] urls) throws Exception
   {
      HttpTool oHTool = new HttpTool();
      // Called on the HttpTool instance; an unqualified call would not compile here
      HttpDoc oHDoc = oHTool.retrieveDocument(urls[0],
                                HttpConstants.GET,
                                "");

      String sContentLength = oHDoc.getHeader("Content-length").getValue();
      String sContents = new String (oHDoc.getContents());
   }
}

Though, this might not help you completely, because the oHDoc.getContents() might
not return until the complete contents are retrieved.

0
 

Author Comment

by:popnfresh86
ID: 12397026
Thanks for the quick responses, guys. I'm going to check out your suggestions and see what I come up with.

Let me show you what I am currently using...

I have the following imports:
import java.awt.*;
import java.io.*;
import java.net.*;

This method retrieves the source.
    public static String copy (String urlstring)
    {
        DataInputStream dis = null;
        ByteArrayOutputStream fos = null;
        try
        {
            URL url = new URL (urlstring);
            URLConnection connection = url.openConnection ();
            dis = new DataInputStream (connection.getInputStream ());
            fos = new ByteArrayOutputStream ();
            while (true)
            {
                fos.write (dis.readByte ());
            }
        }
        catch (EOFException eofe)
        {
            try
            {
                dis.close ();
                fos.close ();
            }
            catch (IOException ioe)
            {
                System.out.println ("IOException: " + ioe.getMessage ());
            }
        }
        catch (MalformedURLException murle)
        {
            System.out.println ("MalformedURLException: " + murle.getMessage ());
        }
        catch (IOException ioe)
        {
            System.out.println ("IOException: " + ioe.getMessage ());
        }
        return fos.toString ();
    }

This method retrieves the site size.

    public static int size (String urlstring)
    {
        int size = 0;
        try
        {
            URL url = new URL (urlstring);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection ();
            size = connection.getContentLength ();
        }
        catch (MalformedURLException murle)
        {
            System.out.println ("MalformedURLException: " + murle.getMessage ());
        }
        catch (IOException ioe)
        {
            System.out.println ("IOException: " + ioe.getMessage ());
        }
        return size;

    }

The problem is that these methods are very slow with the files I am retrieving. I would list the files but because they are confidential, I can't.

I implemented a timer to show how long it takes to get the output of the copy method and size method.

The size method takes 20269 milliseconds for a given file and outputs the size as 644062 bytes.
The copy method takes 32967 milliseconds for a given file and outputs the source of the file.
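Since the size method above opens an ordinary connection just to read one header, one thing that might be worth trying is an HTTP HEAD request, which asks the server for the headers only. The sketch below is an assumption-laden illustration, not a measured fix: the class and method names (HeaderSizes, contentLength, percent) are invented here, it only works for http/https URLs, and whether it is any faster depends on how the slow server handles HEAD.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper; names are illustrative only.
class HeaderSizes {

    /** Returns the Content-Length reported for a HEAD request, or -1 if none is reported. */
    static long contentLength(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body is transferred
        try {
            return conn.getContentLengthLong();
        } finally {
            conn.disconnect();
        }
    }

    /** Share of totalBytes already downloaded, clamped to 0..100; 0 if the total is unknown. */
    static int percent(long doneBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 0;
        }
        long p = doneBytes * 100 / totalBytes;
        return (int) Math.max(0, Math.min(100, p));
    }

    public static void main(String[] args) throws IOException {
        long total = 0;
        for (String url : args) {
            long length = contentLength(url);
            if (length > 0) {
                total += length; // ignore files whose size is not reported
            }
        }
        System.out.println("Total bytes to download: " + total);
    }
}
```

Summing the sizes up front gives the denominator for an overall progress figure; percent can then be called with the running byte count as each file comes in.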

I don't know if this is as fast as possible, but if it is, then I'm screwed, because I have 500+ files, and 500 x 33 seconds is 4 hours and 35 minutes. I can't wait that long.
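The 4 hours and 35 minutes figure assumes the files are fetched strictly one after another. If most of that time is spent waiting on the server rather than on bandwidth, running several downloads at once might cut the wall-clock time. The following is a rough sketch of that idea using a fixed thread pool; ParallelFetch and fetchAll are hypothetical names, the fetcher argument stands in for a method like copy, and the right thread count for this particular server is unknown and would have to be found by experiment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; names are illustrative only.
class ParallelFetch {

    /**
     * Applies fetcher to every URL using a fixed pool of worker threads and
     * returns the results in the same order as the input list.
     */
    static String[] fetchAll(List<String> urls, Function<String, String> fetcher, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(pool.submit(task));
            }
            String[] results = new String[urls.size()];
            for (int i = 0; i < results.length; i++) {
                results[i] = pending.get(i).get(); // blocks until download i is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Too many simultaneous requests can make an already slow server slower, so a small pool (a handful of threads) is probably the place to start.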

I know that using readByte is faster than readLine because I tried my copy method with both, and readByte takes much less time than readLine.

Anyways, I'm not looking for a "how to" but rather a "how fast" answer to my question.

Thanks again for the posts. I will accept an answer(s) shortly.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 12399059
Why do you have a separate size method? The fastest way would be to download it directly
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 1500 total points
ID: 12399157
I've answered my own question about the size thing. You can speed things up by getting rid of DataInputStream (its extra features are unused) and using a larger buffer:

public static String copy (String urlstring) throws Exception
{
      final int BUF_SIZE = 1 << 10 << 3; // 8KiB buffer
      int bytesRead = -1;
      byte[] buffer = new byte[BUF_SIZE];
      InputStream in = null;
      ByteArrayOutputStream fos = new ByteArrayOutputStream ();
      try
      {
            URL url = new URL (urlstring);
            URLConnection connection = url.openConnection();
            int downloadSize = connection.getContentLength();
            // You can use the above for calculations
            in = connection.getInputStream();
            while ((bytesRead = in.read(buffer)) > -1)
            {
                  fos.write (buffer, 0, bytesRead);
                  // A callback could be called here with the running total of
                  // bytes read so far as a fraction of downloadSize
            }
      }
      finally
      {
            try
            {
                  // in is still null if the connection could not be opened
                  if (in != null) in.close ();
                  fos.close ();
            }
            catch (IOException ioe)
            {
                  /* ignore this */
            }
      }
      return fos.toString ();
}
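The callback mentioned in the comment inside the loop could look something like the sketch below. It is a minimal illustration, not part of the original answer: ProgressCopy and readAll are made-up names, and the in-memory stream in main simply stands in for connection.getInputStream() so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.LongConsumer;

// Hypothetical helper; names are illustrative only.
class ProgressCopy {

    /** Reads the stream to the end in 8 KiB chunks, reporting the running byte total. */
    static byte[] readAll(InputStream in, LongConsumer onProgress) throws IOException {
        byte[] buffer = new byte[8 * 1024];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long total = 0;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
            total += bytesRead;
            onProgress.accept(total); // caller turns this into a percentage
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = new byte[20000];       // stands in for a downloaded file
        long expected = fake.length;         // stands in for getContentLength()
        readAll(new ByteArrayInputStream(fake),
                done -> System.out.println(done * 100 / expected + "%"));
    }
}
```

The percentage is the running total divided by the expected size; the size of the most recent chunk alone says nothing about overall progress.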


0
 

Author Comment

by:popnfresh86
ID: 12430430
Thanks for the responses guys. I kinda figured out how to do the majority of what I asked on my own but CEHJ gets the points cuz he helped with the speed concern.

Thanks again, pop.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 12432733
8-)
0
