Solved

Parsing HTML - extracting <title> contents

Posted on 2002-06-27
Last Modified: 2013-11-23
I need to extract the contents of the <title> tag from several HTML documents. Up to now I've been using a home-made construct, which is starting to give me grief (especially when the title has attributes, which it seems to have on microsoft.com). Anyway, I came across the HTMLEditorKit.Parser class, which might do the job. However, from what I understand, the parse() method runs in a separate thread. I need it to run in sequence (my getTitle() method returns the title as a String) to fit in with the rest of the program. Can someone give an example of fetching the HTML title from a document, given an InputStreamReader on the URL as input?
Question by:boomerang061797
23 Comments
 
LVL 92

Expert Comment

by:objects
Use HTMLEditorKit.Parser and just have your getTitle() method wait until the title has been parsed before returning.
 

Author Comment

by:boomerang061797
How do I do that? I don't create the thread so I can't possibly Thread.join(). The documentation says that this method should be implemented to be thread-safe. How do you do that? I'm not too familiar with multithreading.
 
LVL 92

Expert Comment

by:objects
Something like this:

class MyParser extends HTMLEditorKit.ParserCallback
{
   String Title;

   public String getTitle() throws IOException, InterruptedException
   {
      // kick off your parse passing
      // yourself as listener for callbacks
      // (parser is an HTMLEditorKit.Parser, reader is your input)

      parser.parse(reader, this, false);

      // Suspend thread until title received.
      // Checking Title in a loop means a notify()
      // sent before we get here is not missed.

      synchronized (this)
      {
         while (Title == null)
         {
            wait();
         }
      }
      return Title;
   }

   // Callback for parser

   public void handleXXXX()
   {
      if (got title)
      {
         // wake up the suspended thread
         // now that we have the title

        synchronized (this)
        {
           Title = ...;
           notify();
        }
      }
   }
}
 

Author Comment

by:boomerang061797
I understand what you're saying. However, I seem to be suffering from javax.swing.text.ChangedCharSetException exceptions when parsing the HTML at the moment.
 
LVL 86

Expert Comment

by:CEHJ
I have experience of doing the sort of thing you're doing and have found that my own parsing techniques are more reliable than using the HTMLEditorKit. What version of the JDK are you using?
 
LVL 9

Expert Comment

by:Ovi
Simpler than using the parser callback is something like:

HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

// Depending on what you have (InputStream or String), use either
// read(InputStream, doc, pos) or read(new StringReader("htmlText"), doc, pos)
// to read the HTML into the document.

String html = "<html><title>Test</title><body></body></html>";
kit.read(new StringReader(html), doc, 0);

// As you probably know, the document structure is a tree, so you need
// to navigate it until you find your title tag.

Element root = doc.getRootElements()[0];

searchTitle(root);

protected String searchTitle(Element root) {
 if(root.getAttributes().getAttribute(StyleConstants.NameAttribute) == HTML.Tag.TITLE)
  return retrieveContent(root); // placeholder helper: retrieve the content here
 // otherwise check the child elements, calling this method recursively for each child
 for(int i = 0; i < root.getElementCount(); i++) {
  String title = searchTitle(root.getElement(i));
  if(title != null) return title;
 }
 return null;
}
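
A minimal sketch of the document-based route, for comparison. It assumes that HTMLDocument records the title under the Document.TitleProperty key while loading, so no tree walk is needed; DocumentTitle and titleOf are hypothetical names, and the "IgnoreCharsetDirective" property is set so that a charset <meta> tag does not interrupt the load.

```java
import java.io.StringReader;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper: loads HTML into an HTMLDocument and reads its title property.
class DocumentTitle {
    static String titleOf(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Ask the kit not to abort the load when a charset directive is found.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not load the HTML", e);
        }
        // Assumption: the loader stores the <title> text under this key (null if absent).
        return (String) doc.getProperty(Document.TitleProperty);
    }
}
```

Note that this still parses the whole document, which is the cost discussed in the rest of the thread.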
 
LVL 9

Expert Comment

by:Ovi
Why is this easier than the parser callback method? Because your callback class methods will not necessarily be called in a specific order, so until you get your title content you could receive a lot of other information, which may confuse you. For more details, read the following article:
http://java.sun.com/products/jfc/tsc/articles/bookmarks/

... or
http://java.sun.com/products/jfc/tsc/articles/tictactoe/index.html
for details about navigating an HTMLDocument.
 
LVL 92

Expert Comment

by:objects
> Because your callback class methods will be not necesary
> called in a specific order so until you'll get your
> title content you could receive a lot of other
> information

So you just ignore it; what's the problem there?
And as the title is generally early in the stream, it saves you reading the entire document.
 
LVL 92

Expert Comment

by:objects
> I seem to be suffering from javax.swing.text.ChangedCharSetException exceptions

Try specifying to ignore char sets in the parse call:

parser.parse(reader, this, true);
 

Author Comment

by:boomerang061797
Objects, you say that I don't have to read the entire document, and that may be true, but I have to download and parse the entire document. Anyway, the synchronized block isn't working the way it should. The program hangs on the wait statement (it never gets a call from notify, it seems), so only the first title-fetching statement works (in main).

Is there any way to stop the parser when the title is found? Surely closing the stream won't do that?

The ChangedCharSetException is gone now.

Anyway, here's my code so far (I'm using jdk 1.3 - eclipse IDE)


import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class TitleTest extends HTMLEditorKit.ParserCallback {
     
 private boolean found = false;
 private String  title = null;

 public static void main(String[] args) throws Exception {
   TitleTest tt = new TitleTest();
   System.out.println(tt.getTitle("http://java.sun.com/"));
   System.out.println(tt.getTitle("http://www.ibm.com/"));
 }
 
 public String getTitle(String link) throws Exception {
   System.out.println("Starting getTitle()");
       
   URL url = new URL(link);
   InputStreamReader in = new InputStreamReader(url.openStream());
             
   // kick off your parse passing
   // yourself as listener for callbacks
   HTMLEditorKit.Parser parser = new ParserGetter().getParser();
   parser.parse(in, this, true);

   // Suspend thread until title received
   synchronized (this) {
     System.out.println("Waiting for signal");
     wait();
   }
   
   in.close();

   System.out.println("Returing from getTitle()");
   return title;
 }
 
 public void handleStartTag(HTML.Tag tag, MutableAttributeSet attibutes, int position) {
  if (tag == HTML.Tag.TITLE) {
    System.out.println("Title start tag found");
    found = true;
  }
 }
 
 public void handleEndTag(HTML.Tag tag, int position) {
   if (tag == HTML.Tag.TITLE) {
     System.out.println("Title end tag found");
     found = false;
   
     synchronized (this) {
       System.out.println("Sending signal");
       notifyAll();
     }
   }
 }
 
 public void handleText(char[] text, int position) {
   if (found) {
     title = new String(text);
     found = false;  
   }
 }
 
 class ParserGetter extends HTMLEditorKit {
   public HTMLEditorKit.Parser getParser() {
     return super.getParser();
   }
 }
}
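
A minimal sketch of the same callback idea without wait()/notify(). It assumes that ParserDelegator.parse() drives the callbacks on the calling thread and only returns once parsing has finished, so the title is already available when parse() returns. SyncTitleParser, getTitle and titleOf are hypothetical names. Closing the reader once the closing title tag is seen follows the suggestion made in this thread; how the parser reacts to that is not something this sketch relies on, so an IOException is tolerated only after the title has been captured.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the <title> text during a synchronous parse.
class SyncTitleParser extends HTMLEditorKit.ParserCallback {
    private final StringBuilder title = new StringBuilder();
    private final Reader in;
    private boolean inTitle = false;
    private boolean done = false;

    SyncTitleParser(Reader in) {
        this.in = in;
    }

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (!done && tag == HTML.Tag.TITLE) inTitle = true;
    }

    public void handleText(char[] text, int pos) {
        if (inTitle) title.append(text);
    }

    public void handleEndTag(HTML.Tag tag, int pos) {
        if (inTitle && tag == HTML.Tag.TITLE) {
            inTitle = false;
            done = true;
            // Try to cut the input short; later reads on the closed reader will fail.
            try { in.close(); } catch (IOException ignored) { }
        }
    }

    static String getTitle(Reader in) throws IOException {
        SyncTitleParser callback = new SyncTitleParser(in);
        try {
            new ParserDelegator().parse(in, callback, true); // true: ignore charset directives
        } catch (IOException e) {
            // An error after the title was captured is the expected result of close().
            if (!callback.done) throw e;
        } finally {
            in.close();
        }
        return callback.done ? callback.title.toString() : null;
    }

    // Convenience wrapper for HTML that is already in memory.
    static String titleOf(String html) {
        try {
            return getTitle(new StringReader(html));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a URL, getTitle(new InputStreamReader(url.openStream())) would be the equivalent call. Each call uses a fresh callback object, so no state leaks from one page to the next.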

 
LVL 86

Accepted Solution

by:
CEHJ earned 100 total points
>>Objects, you say that I don't have to read the entire document and that may be true, but I have to download and parse the entire document.

That's one very good reason not to use the HTMLEditorKit. Another one is that the kit will fall over very easily due to bad html coding. The following will get you the title in a few lines of code. Most of it [the latter part] is just a wrapper class of mine to read from a url.


/**
 * Description of the Class
 *
 * Get the <title> tag from a url
 */
public class TitleGetter {

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: java TitleGetter <url to get title from>");
      System.exit(-1);
    }
    URLReader reader = new URLReader(args[0]);
    // use the default buffer size of URLReader (1024 should be big enough to get the title)
    reader.setReadAll(false);
    StringBuffer sb = reader.read();
    // clone a lower case version of the file head
    String lcString = sb.toString().toLowerCase();
    // note the tag not closed - sometimes title mangled with attributes
    int titleStart = lcString.indexOf("<title");
    if (titleStart < 0) {
      System.err.println("No title tag found");
      System.exit(-1);
    }
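    int titleEnd = lcString.indexOf("</title>", titleStart);
    if (titleEnd < 0) {
      // the closing tag may lie beyond the part of the page that was read
      System.err.println("No closing title tag found");
      System.exit(-1);
    }
    System.out.println("Title follows:");
    System.out.println(sb.toString().substring(titleStart, titleEnd + 8));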
  }
}

/**
* A wrapper to read from a URL
*/

import java.net.*;
import java.io.*;
import java.util.*;

public class URLReader {
  URLConnection conn;
  URL url;
  protected Reader in = null;
  protected final int defBufSize = 1024;
  protected boolean unprettyPrint = false;
  int bufSize;
  char[] buf;
  char[] blankBuf;
  boolean readAll = true;
  StringBuffer stringBuffer;

  public URLReader(String strURL){
    try {
      bufSize = defBufSize;
      url = new URL(strURL);
    }
    catch(MalformedURLException eURL) {
      System.err.println("\nThe URL specified to URLReader() is invalid");
    }
    try {
      conn = url.openConnection();
      conn.connect();
    }
    catch(IOException eIO){
      System.err.println("Failed to connect to URL specified in URLReader(): [" + strURL + "] - Are you connected to the network\\internet?");
    }
  }

  public URLReader(URL url){
    this(url.toString());
  }



  public Reader getReader() throws IOException {
    //DEBUG
    /**
    * A BufferedReader buffer size of 0x8000 (32768) will
    * hold all of the first search page for debugging purposes
    */


    if(unprettyPrint){
      return (in = new UnprettyPrintReader(new BufferedReader(new InputStreamReader((conn.getInputStream())),bufSize)));
    }
    else {
      return (in = new BufferedReader(new InputStreamReader(conn.getInputStream())));
    }
  }

  /**
  * We determine the amount of the url to read by calling setReadAll().
  * If setReadAll(true) is called, the whole file is read.
  * If setReadAll(false) is called, only bufSize characters of the url are read,
  * where bufSize is defBufSize unless changed with a call to setBufSize().
  */

  public boolean getUnprettyPrint(){
    return unprettyPrint;
  }

  public void setUnprettyPrint(boolean value){
    unprettyPrint = value;
  }

  public void setReadAll(boolean readAll){
    this.readAll = readAll;
  }

  public boolean getReadAll(){
    return readAll;
  }

  public void setBufSize(int bufSize){
    this.bufSize = bufSize;
  }

  public int getBufSize(){
    return bufSize;
  }


  public StringBuffer read() throws IOException {
    /**
    * Get a StringBuffer to hold the whole file
    * and size the buffer
    */
    stringBuffer = new StringBuffer(bufSize);
    buf = new char[bufSize];
    blankBuf = new char[bufSize];
    /**
    * initialise the reader if has not already been done
    */
    try {
      if(in == null) getReader();
      if(readAll){
        int c = 0;
        // read the whole URL
        while((c = in.read()) > -1){
          stringBuffer.append((char)c);
        }
      }
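      else {
        // read up to bufSize chars; a single read() may return fewer
        int total = 0;
        int n;
        while (total < bufSize && (n = in.read(buf, total, bufSize - total)) > -1) {
          total += n;
        }
        // append only what was actually read, so no trailing nulls get in
        stringBuffer.append(buf, 0, total);
      }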
    }
    finally {
      in.close();
    }
    /**
    * Ensure buffer not holding trailing nulls
    */
    //return StringUtil.trimStringBuffer(stringBuffer);
    return stringBuffer;
  }

}// end class URLReader
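
A minimal sketch of the same string-search idea that returns only the text between the tags and tolerates attributes on the opening tag. TitleText and extractTitle are hypothetical names, and the regular expression is an assumption about what counts as a title tag; like the code above, it cannot tell a real <title> from one that appears inside a script or comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the title text out of the first part of a page.
class TitleText {
    // "<title", optional attributes, ">", the shortest possible content, "</title>".
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text, or null if no complete title element is present. */
    static String extractTitle(String head) {
        Matcher m = TITLE.matcher(head);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

A call such as TitleText.extractTitle(reader.read().toString()) would combine it with the URLReader above. A null result can also mean that the closing tag lies beyond the part of the page that was read.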
 
LVL 86

Expert Comment

by:CEHJ
Sorry, I just noticed some references to UnprettyPrintReader. This is just a class to remove unneeded whitespace from an HTML file. You'll have to comment these references out (unless you want me to send it to you :-))
 
LVL 92

Expert Comment

by:objects
> The program hangs on the wait statement (never gets a
> call from notify it seems) so only the first title
> fetching statement works (in main)

Are you saying the first call works OK, but the second one hangs?
If so, maybe the parser's not finding a title tag. You need to add a notify to the EOF callback (flush) to handle this case.

Not sure about stopping the parsing, how does it behave if you close the stream?

 

Author Comment

by:boomerang061797
To CEHJ: That's not too far off what I am already doing. However, I've been searching for the "<Title>" tag, not "<Title", meaning that I wasn't getting a title for Microsoft web pages (they use some kind of XML syntax with attributes for <Title>). However, if someone writes <title> elsewhere in the header (e.g. in JavaScript code), the parser can't tell the difference. I'm not saying that the Sun HTML parser can either, just so that's said.

To Objects: The first call worked OK; the second and those after it hang. So what you're saying is to override HTMLEditorKit.ParserCallback.flush() with a call to notifyAll()? I haven't tried closing the stream, but I imagine it wouldn't be a smart thing to do. I still don't like the fact that pages must be downloaded in full, though. For links to PDF files etc., the I/O overhead would be terrible. That said, I don't know if the Sun HTML parser reads the content type before downloading.

Why couldn't the title be part of the HTTP header? It would definitely be much easier. :-)
 
LVL 92

Expert Comment

by:objects
> Haven't tried to close the stream, but I imagine it
> wounldn't be a smart thing to do.

It'll certainly stop the download ;-)

> For links to pdf files etc. the I/O overhead would be terrible.

Can't you check the content type before parsing?
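
A minimal sketch of such a check, assuming the value comes from URLConnection.getContentType() and may carry parameters such as a charset. ContentTypes and isHtml are hypothetical names, and the list of accepted MIME types is an assumption.

```java
import java.util.Locale;

// Hypothetical helper: decides from a Content-Type header value whether to parse at all.
class ContentTypes {
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no usable type
        }
        // Drop parameters such as "; charset=ISO-8859-1".
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}

// Possible use before any parsing starts:
//   URLConnection conn = url.openConnection();
//   if (ContentTypes.isHtml(conn.getContentType())) { /* read and parse */ }
```

Only the response headers are needed for this decision, so a PDF or image link can be skipped before its body is read.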
 
LVL 86

Expert Comment

by:CEHJ
Another approach would be to try SAX parsing the page. Since it does not attempt to build a DOM it will be possible to return from the function as soon as you hit the closing title tag.
 

Author Comment

by:boomerang061797
To objects: yes of course I could check the content type before parsing (that's what I already do in my existing solution). Sorry about that.

To CEHJ: Although it's a good idea, doing this would mean introducing another API and another JAR file. I don't know about you, but distributing these files and referencing them in the classpath is a pain. I haven't used SAX in any projects (up to now), but from what I've seen it looks pretty much like the same procedure as extending HTMLEditorKit.ParserCallback and its callback methods, and therefore could lead to the same kind of problem: can this parser be stopped once started? You say it would be possible to return from the method as soon as you hit the closing tag, but can you stop the parser, or does it run in its own thread like the Sun HTML parser does?
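
A minimal sketch of how a SAX parse can be cut short, assuming the JAXP SAX parser that ships with later JDKs and input that is well-formed XML up to the closing title tag (ordinary HTML often is not). SaxTitle and titleOf are hypothetical names. The handler runs on the thread that called parse(), and throwing a SAXException from a callback makes parse() return early.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: stops a SAX parse as soon as </title> has been seen.
class SaxTitle {
    static String titleOf(String xhtml) {
        final StringBuilder title = new StringBuilder();
        final boolean[] inTitle = { false };
        final boolean[] done = { false };

        DefaultHandler handler = new DefaultHandler() {
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("title".equalsIgnoreCase(qName)) inTitle[0] = true;
            }

            public void characters(char[] ch, int start, int length) {
                if (inTitle[0]) title.append(ch, start, length);
            }

            public void endElement(String uri, String local, String qName) throws SAXException {
                if (inTitle[0] && "title".equalsIgnoreCase(qName)) {
                    done[0] = true;
                    // Aborts the parse; nothing after this tag is processed.
                    throw new SAXException("title found, stopping");
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            // Only the deliberate stop is swallowed; real parse errors are reported.
            if (!done[0]) throw new IllegalStateException(e);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return done[0] ? title.toString() : null;
    }
}
```

The done flag separates the deliberate stop from a genuine parse error, so malformed markup before the title is still reported.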
 
LVL 86

Expert Comment

by:CEHJ
>>You say it would be possible to return from the method as soon as you hit the closing tag, but can you stop the parser or does it run in it's own thread like the Sun HTML parser does?

Maybe I was too hasty with that supposition. I'd have to check.

>>and referencing these files in the classpath is a pain

Actually that shouldn't be necessary [if you've got a reasonably recent JDK] if you put them in the ext directory.
 

Author Comment

by:boomerang061797
Sorry Objects, but the HTMLParser threaded solution isn't working for me and I'm not happy with having to download the whole file. Therefore I feel I have to award the points to CEHJ (close call). Anyway, thank you both for your time and help.

CEHJ: Maybe you could post the code to UnprettyPrintReader so I can have a look?
 
LVL 92

Expert Comment

by:objects
> I'm not happy with having to download the whole file.

You don't need to?

Your question as I understood it was how to use HTMLEditorKit, which I answered (not my fault you're not happy with it). And you then accept an answer that's basically what you were already doing?
 

Author Comment

by:boomerang061797
If I could give points to everyone then I would, but I can't.
Yes, I asked how to use the HTMLEditorKit and you answered. However, the solution you suggested doesn't work even when flushing the parser, so what can I do?
 
LVL 86

Expert Comment

by:CEHJ
Here we are boomerang: maybe it will be of some use:

import java.io.*;

/**
 *Description of the Class
 *
 * @author     Charles Johnson
 * @created    04 July 2001
 */
public class UnprettyPrintReader extends FilterReader {
     /**
      * The last character read
      */
     protected int lastChar;
     /**
      * How many times characters have been skipped
      */
     protected int recurseCount;
     /**
      * How many characters have been read
      */
     protected int readCount;


     /**
      *Constructor for the UnprettyPrintReader object
      *
      * @param  in  The Reader to be chained
      */
     public UnprettyPrintReader(Reader in) {
          super(in);
     }


     /**
      * Read the stream, ignoring redundant characters
      *
      * @return                  See docs for FilterReader
      * @exception  IOException  See docs for FilterReader
      */
     public int read() throws IOException {
          /*
           *  If the character read is redundant, we
           *  recurse to read the next one
           */
          synchronized (lock) {
               int buf = in.read();
               readCount++;
               if (buf == -1) {
                    return -1;
               } else if (buf == 0x0D) {
                    // carriage return
                    if (lastChar == 0x0D || lastChar == 0x0A) {
                         recurseCount++;
                         return read();
                    } else {
                         lastChar = buf;
                         return buf;
                    }
               } else if (buf == 0x0A) {
                    if (lastChar == 0x0A) {
                         // line feed
                         recurseCount++;
                         return read();
                    } else {
                         lastChar = buf;
                         return buf;
                    }
               } else if (buf == 0x20) {
                    // only one space allowed
                    if (lastChar == 0x20) {
                         recurseCount++;
                         return read();
                    } else {
                         lastChar = buf;
                         return buf;
                    }
               } else {
                    lastChar = buf;
                    return buf;
               }
          }
          // end synchronized
     }


     /**
      *Description of the Method
      *
      * @param  buf              See docs for FilterReader
      * @param  off              See docs for FilterReader
      * @param  len              See docs for FilterReader
      * @return                  See docs for FilterReader
      * @exception  IOException  See docs for FilterReader
      */
     public int read(char[] buf, int off, int len) throws IOException {
          int charsRead = 0;
          int c = -1;
          synchronized (lock) {
               if (
                         (off < 0) ||
                         (off > buf.length) ||
                         (len < 0) ||
                         ((off + len) > buf.length) ||
                         ((off + len) < 0)
                         ) {
                    throw new IndexOutOfBoundsException();
               } else if (len == 0) {
                    return 0;
               }
               // test the count first so no character is read and then dropped
               while (charsRead < len && (c = this.read()) > -1) {
                    buf[off + charsRead] = (char) c;
                    charsRead++;
               }
               return charsRead > 0 ? charsRead : -1;
          }
     }

}
 
LVL 92

Expert Comment

by:objects
My solution works fine here w/out downloading all the file.

My apologies for answering the question as asked :-)