Using HTMLEditorKit to extract text in <p> tags only

Using HTMLEditorKit, I can extract only the text of a web page. For example, for this link:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html
it extracts the text shown in the attachment extract.txt.
But I want to extract only the text shown in the attachment description.txt.
That is, I want only the text that sits inside one kind of tag, say <p>, and not the text inside other tags such as the <h> heading tags.

Please see the code below. It is not my own code; it is an example I got from one of my former questions. Thanks to objects!! Sorry for posting this question again.

Any ideas, please?
Thanks!!
import javax.swing.text.*;
import java.io.*;
import javax.swing.text.html.*;
import java.net.*;
import java.util.*;
 
 
	public class HTMLTest extends HTMLDocument
	{
	    // stores any text found in document
 
		public StringBuilder text = new StringBuilder();
 
	    /**
	    *  Returns any text found in the document during parsing
	    */
 
	    public String getText()
	    {
	        return text.toString();
	    }
 
	    public HTMLEditorKit.ParserCallback getReader(int pos)
	    {
	        return new TextCallBack();
	    }
	    
	    public static void main(String args[]){
	    	ArrayList result=new ArrayList();
	    	try{
	    	URL url = new URL("http://www.informit.com/articles/article.aspx?p=31059");
	    	Reader reader = new InputStreamReader(
	    	   url.openConnection().getInputStream());
	    	EditorKit editorKit = new HTMLEditorKit();
	    	HTMLTest htmlText = new HTMLTest();
	    	htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
 
	    	// Parse the HTML
 
	    	editorKit.read(reader, htmlText, 0);
 
	    	// Get the extracted text
 
	    	String text = htmlText.getText();
	    	System.out.println(text);
	    	}catch(Exception e){System.out.println(e);}
	    }
	    
	    class TextCallBack extends HTMLEditorKit.ParserCallback
	    {
	       /** Invoked when text is encountered during parsing */
	    	
	    	public TextCallBack(){}
 
	       public void handleText(char[] data, int pos)
	       {
	          text.append(data);
	          text.append('\n');
	       }
	    }
	}


extract.txt
description.txt
Asked by: Juuno
Thomas4019 commented:
I think using a real XML parser might be an easier solution. They are built into Java as well.
CEHJ commented:
Implement handleStartTag too, saving the current tag in an instance variable. Only append the text if that tag is P.
Mick Barry (Java Developer) commented:
Try this:

import javax.swing.text.*;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTML.Tag;

import java.net.*;
import java.util.*;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }

      class TextCallBack extends HTMLEditorKit.ParserCallback {
            /** Invoked when text is encountered during parsing */

            private boolean inP = false;
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (t.equals(Tag.P)) {
                        inP = false;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (t.equals(Tag.P)) {
                        inP = true;
                  }
            }


            public void handleText(char[] data, int pos) {
                  if (inP) {
                        text.append(data);
                        text.append('\n');
                  }
            }
      }
}
Juuno (author) commented:
@ objects - Thanks! But it still generates the same results.
Mick Barry (Java Developer) commented:
What URL are you testing with?
Mick Barry (Java Developer) commented:
The URL in your question doesn't actually use p tags, so you'll need more than just looking for p tags.
Juuno (author) commented:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html

It seems that it also outputs the text inside the <h> tags.
Mick Barry (Java Developer) commented:
So would you be better off stripping out tags instead of just including p tags?
Juuno (author) commented:
Yes, I think I need to strip out all the other tags that are not p tags.
Mick Barry (Java Developer) commented:
That will still strip out most of the text on that page, as the majority of the text is not inside a p.
 
Juuno (author) commented:
Yes, it strips out all the other text but not the <h> text. I think that's how HTMLEditorKit works. Is that right?
Mick Barry (Java Developer) commented:
The code I posted does strip out the h tags (and all others); it only returns text that is inside a p tag.
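
One way to check this without depending on the live page is to run the same inP logic over a small inline HTML string. The following is only a minimal sketch: the class name PTextCheck and the sample markup are made up for illustration, and it drives the parser directly through javax.swing.text.html.parser.ParserDelegator instead of subclassing HTMLDocument as the posted code does.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the code posted in this thread.
class PTextCheck {

    /** Returns the text seen between a p start tag and its end tag, one chunk per line. */
    static String extractParagraphText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inP = false;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t.equals(HTML.Tag.P)) {
                    inP = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t.equals(HTML.Tag.P)) {
                    inP = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inP) {
                    out.append(data).append('\n');
                }
            }
        };
        try {
            // true = ignore any charset directive in the markup
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample markup is an assumption: a heading followed by a properly wrapped paragraph.
        String html = "<html><body><h1>Title</h1><p>Hello world</p></body></html>";
        System.out.print(extractParagraphText(html));
    }
}
```

With well-formed markup like the sample, only the text between <p> and </p> should be collected. A page that does not wrap its text in p elements can give quite different results.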
Mick Barry (Java Developer) commented:
I think I see the problem: the HTML in that page is a mess.

http://validator.w3.org/check?uri=http%3A%2F%2Fsunsite.nus.edu.sg%2FSEAlinks%2Fburma-info.html&charset=%28detect+automatically%29&doctype=Inline&group=0

The problem in your case is that the p tag is being used to mark a break, rather than to 'wrap' paragraph text as it should.
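
To see where the parser actually places the p start and end tags on markup like this, it can help to log every callback for a small fragment. The sketch below is only an illustration: the class name ParserEventTrace is hypothetical, the fragment in main is a made-up imitation of that style (a heading, bare text, then a lone <p>), and the exact events printed depend on how the Swing parser repairs the markup, so no output is shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical debugging aid, not part of the code posted in this thread.
class ParserEventTrace {

    /** Parses the given HTML and returns one line per parser callback, in order. */
    static List<String> trace(String html) {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start " + t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end " + t);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("simple " + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text " + new String(data));
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Assumed fragment: <p> used as a separator, never closed.
        String html = "<html><body><h2>Heading</h2>First block<p>Second block</body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```

Reading the trace for a real page shows which text chunks fall between a p start and end event, which is what the inP flag relies on.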
Mick Barry (Java Developer) commented:
You may be better off stripping out items you don't want instead of looking for text inside p tags.

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTML.Tag;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }


      class TextCallBack extends HTMLEditorKit.ParserCallback {
            /** Invoked when text is encountered during parsing */

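            // starts false, so text before the first excluded end tag is skipped
            private boolean interested = false;
            // text inside these tags is not collected
            private final Set<Tag> notInterested = new HashSet<Tag>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));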
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (tagTest(t)) {
                        interested = true;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (tagTest(t)) {
                        interested = false;
                  }
            }

            private boolean tagTest(Tag t) {
                  return notInterested.contains(t);
            }
            
            public void handleText(char[] data, int pos) {
                  if (interested) {
                        text.append(data);
                        text.append('\n');
                  } else {
                        System.out.println("skipping "+new String(data));
                  }
            }
      }
}
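
A possible variation on the code above, given only as a sketch: it counts how many excluded tags are currently open instead of keeping a single boolean, so text inside nested excluded tags (for example a link inside a heading) stays skipped until the outermost one closes. Unlike the code above, it collects text from the start of the document rather than waiting for the first excluded end tag. The class name ExcludeTagsSketch and the sample markup are assumptions for illustration, it uses ParserDelegator directly instead of HTMLDocument, and on badly malformed pages the counter may not balance.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical variation, not the code posted in this thread.
class ExcludeTagsSketch {

    // Same exclusion list as the code above.
    private static final Set<HTML.Tag> EXCLUDED = new HashSet<HTML.Tag>(Arrays.asList(
            HTML.Tag.H1, HTML.Tag.H2, HTML.Tag.H3, HTML.Tag.H4, HTML.Tag.A));

    /** Returns all text that is not inside any excluded tag, one chunk per line. */
    static String extract(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // number of excluded tags currently open
            private int excludedDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (EXCLUDED.contains(t)) {
                    excludedDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                // guard against an end tag that had no matching start tag
                if (EXCLUDED.contains(t) && excludedDepth > 0) {
                    excludedDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (excludedDepth == 0) {
                    out.append(data).append('\n');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample markup is an assumption: a heading plus a paragraph containing a link.
        String html = "<html><body><h1>Title</h1>"
                + "<p>Keep <a href=\"x.html\">link</a> this</p></body></html>";
        System.out.print(extract(html));
    }
}
```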
CEHJ commented:
You would be much better off using a proper API. The code below, which uses HttpUnit, produces the output in the attached file.
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.HTMLElement;
 
public class Burma {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
        HTMLElement[] paras = wr.getElementsByTagName("p");
        for (HTMLElement para : paras) {
            System.out.println(para.getText());
        }
 
    }   
}


burma.txt