Solved

HTMLEditorKit to extract text in <p> tags only

Posted on 2009-07-15
Last Modified: 2012-05-07
HTMLEditorKit extracts all of the text from a web page. For example, for this link:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html
it extracts the text shown in the attachment extract.txt.
But I want to extract only the text shown in the attachment description.txt.
What I mean is that I only want the text that is inside <p> tags, excluding the text in other tags such as <h>, etc.

Please see the code below. This is not my own code; it is an example I got from one of my former questions. Thanks to objects! Sorry for posting this question again.

Any ideas, please?
Thanks!
import javax.swing.text.*;
import java.io.*;
import javax.swing.text.html.*;
import java.net.*;
import java.util.*;

public class HTMLTest extends HTMLDocument
{
    // stores any text found in document
    public StringBuilder text = new StringBuilder();

    /**
     *  Returns any text found in the document during parsing
     */
    public String getText()
    {
        return text.toString();
    }

    public HTMLEditorKit.ParserCallback getReader(int pos)
    {
        return new TextCallBack();
    }

    public static void main(String args[]){
        ArrayList result=new ArrayList();
        try{
            URL url = new URL("http://www.informit.com/articles/article.aspx?p=31059");
            Reader reader = new InputStreamReader(
               url.openConnection().getInputStream());
            EditorKit editorKit = new HTMLEditorKit();
            HTMLTest htmlText = new HTMLTest();
            htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            // Parse the HTML
            editorKit.read(reader, htmlText, 0);

            // Get the extracted text
            String text = htmlText.getText();
            System.out.println(text);
        }catch(Exception e){System.out.println(e);}
    }

    class TextCallBack extends HTMLEditorKit.ParserCallback
    {
        public TextCallBack(){}

        /** Invoked when text is encountered during parsing */
        public void handleText(char[] data, int pos)
        {
            text.append(data);
            text.append('\n');
        }
    }
}


extract.txt
description.txt
Question by:Juuno
16 Comments
 
LVL 17

Expert Comment

by:Thomas4019
ID: 24861309
I think using a real XML parser might be an easier solution. They are built into Java as well.
 
LVL 86

Expert Comment

by:CEHJ
ID: 24861330
Implement handleStartTag too, saving the current tag in an instance variable. Only append the text if that tag is P.
 
LVL 92

Expert Comment

by:objects
ID: 24866271
try this:

import javax.swing.text.*;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTML.Tag;

import java.net.*;
import java.util.*;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }

      class TextCallBack extends HTMLEditorKit.ParserCallback {
            /** True while the parser is inside a p element */
            private boolean inP = false;
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (t.equals(Tag.P)) {
                        inP = false;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (t.equals(Tag.P)) {
                        inP = true;
                  }
            }


            /** Invoked when text is encountered during parsing */
            @Override
            public void handleText(char[] data, int pos) {
                  if (inP) {
                        text.append(data);
                        text.append('\n');
                  }
            }
      }
}
 

Author Comment

by:Juuno
ID: 24866401
@ objects - Thanks! But it still generates the same results.
 
LVL 92

Expert Comment

by:objects
ID: 24866420
what url are you testing with?
 
LVL 92

Expert Comment

by:objects
ID: 24866425
the url in your question doesn't actually use p tags, so you'll need more than just looking for p tags.
 

Author Comment

by:Juuno
ID: 24866434
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html

It seems that it also outputs the text inside <h> tags.

 
LVL 92

Expert Comment

by:objects
ID: 24866456
so would you be better off stripping out tags instead of just including p tags?
 

Author Comment

by:Juuno
ID: 24866547
Yes, I think I need to strip out all the other tags which are not p tags.
 
LVL 92

Expert Comment

by:objects
ID: 24866556
that will still strip out most of the text on that page, as the majority of the text is not inside a p.
 
 

Author Comment

by:Juuno
ID: 24866563
Yes, it strips out all the other text but not the <h> text. I think that's how HTMLEditorKit works. Is that right?
 
LVL 92

Expert Comment

by:objects
ID: 24866570
the code I posted does strip out the h tags (and all others), it only returns text that is inside a p tag
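
A quick way to check that behaviour without a network connection is to run the same start/end-tag flag over an in-memory string. The following is only a minimal sketch, not the code from this thread: the class name PTextDemo, the method extractParagraphText, and the sample HTML are made up for illustration, and it drives ParserDelegator directly instead of subclassing HTMLDocument. It assumes well-formed input where each p element wraps its text.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: keeps only text reported between <p> and </p>.
public class PTextDemo {

    public static String extractParagraphText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // true while the parser is inside a p element
            private boolean inP = false;

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                if (t == Tag.P) {
                    inP = true;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                if (t == Tag.P) {
                    inP = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inP) {
                    text.append(data).append('\n');
                }
            }
        };
        // The third argument tells the parser to ignore any charset directive.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page with headings and properly wrapped paragraphs
        String html = "<html><body><h1>Title</h1><p>First paragraph.</p>"
                + "<h2>Section</h2><p>Second paragraph.</p></body></html>";
        System.out.print(extractParagraphText(html));
    }
}
```

On a page like the sample, the heading text is dropped and the paragraph text is kept. Pages whose markup is not well formed may be reported differently by the parser, so results on real pages can differ.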
 
LVL 92

Expert Comment

by:objects
ID: 24866587
think I see the problem, the html in that page is a mess

http://validator.w3.org/check?uri=http%3A%2F%2Fsunsite.nus.edu.sg%2FSEAlinks%2Fburma-info.html&charset=%28detect+automatically%29&doctype=Inline&group=0

The problem in your case is that the p tag is being used to mark a break; it is not being used to 'wrap' paragraph text as it should be.
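
One way to see how the parser actually reports a page's markup is to log every callback it makes. The sketch below is an illustration under stated assumptions, not code from this thread: the class name TagTrace, the method trace, and the sample string in main are hypothetical. It records start tags, end tags, simple tags, and text in order, and marks start tags that carry the HTMLEditorKit.ParserCallback.IMPLIED attribute, which the parser sets on elements it inserted itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical diagnostic class: lists the parser events for a piece of HTML.
public class TagTrace {

    public static List<String> trace(String html) throws IOException {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                // IMPLIED marks a start tag the parser synthesized itself
                boolean implied = a.isDefined(HTMLEditorKit.ParserCallback.IMPLIED);
                events.add("<" + t + ">" + (implied ? " (implied)" : ""));
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                events.add("</" + t + ">");
            }

            @Override
            public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
                events.add("<" + t + "/>");
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text: " + new String(data));
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return events;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample where p is used as a separator rather than a wrapper
        String html = "<html><body>First block<p>Second block<p>Third block</body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```

Reading the printed events shows where the p start and end events fall relative to each piece of text, which is what a start/end-tag flag depends on.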
 
LVL 92

Accepted Solution

by:
objects earned 500 total points
ID: 24866749
You may be better off stripping out the items you don't want instead of looking for text inside p tags:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTML.Tag;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }


      class TextCallBack extends HTMLEditorKit.ParserCallback {
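            /** False at the start and while inside an excluded element; set to true when an excluded element ends */
            private boolean interested = false;
            /** Tags whose text should be skipped */
            private final Set<Tag> notInterested = new HashSet<Tag>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));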
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (tagTest(t)) {
                        interested = true;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (tagTest(t)) {
                        interested = false;
                  }
            }

            private boolean tagTest(Tag t) {
                  return notInterested.contains(t);
            }
            
            /** Invoked when text is encountered during parsing */
            @Override
            public void handleText(char[] data, int pos) {
                  if (interested) {
                        text.append(data);
                        text.append('\n');
                  } else {
                        System.out.println("skipping "+new String(data));
                  }
            }
      }
}
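
One thing to watch with a single boolean flag is nesting: if an excluded element contains another excluded element (for example a link inside a heading), the inner end tag turns the flag back on while the parser is still inside the outer element. A possible variation, sketched here as an assumption rather than as part of the accepted answer, counts how many excluded elements are currently open. The class name ExcludeTagsDemo, the method extractText, and the sample HTML are hypothetical. Note one deliberate difference: the counter starts at zero, so text before the first excluded tag is kept, whereas the flag in the code above starts as false.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: skips text inside any excluded element, even when nested.
public class ExcludeTagsDemo {

    private static final Set<Tag> EXCLUDED =
            new HashSet<>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));

    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // number of excluded elements that are currently open
            private int excludedDepth = 0;

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                if (EXCLUDED.contains(t)) {
                    excludedDepth++;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                // the guard keeps a stray end tag from driving the count negative
                if (EXCLUDED.contains(t) && excludedDepth > 0) {
                    excludedDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (excludedDepth == 0) {
                    text.append(data).append('\n');
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: a link nested inside a heading, followed by a paragraph
        String html = "<html><body><h1><a href=\"x.html\">Link</a> heading tail</h1>"
                + "<p>Body text</p></body></html>";
        System.out.print(extractText(html));
    }
}
```

With the counter, the text after the nested link but still inside the heading stays excluded, because the count only returns to zero when the heading itself ends.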
 
LVL 86

Expert Comment

by:CEHJ
ID: 24867156
You would be much better off using a proper API. The code below, which uses HttpUnit, produces the output in the attached file (burma.txt):
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.HTMLElement;

public class Burma {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
        HTMLElement[] paras = wr.getElementsByTagName("p");
        for (HTMLElement para : paras) {
            System.out.println(para.getText());
        }
    }
}


burma.txt