Solved

HTMLEditorKit to extract text in <p> tags only

Posted on 2009-07-15
Last Modified: 2012-05-07
HTMLEditorKit extracts all of the text of a web page. For example, for this link:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html
it extracts the text shown in the attachment extract.txt.
But I want to extract only the text shown in the attachment description.txt.
What I mean is that I only want the text that is inside, say, <p> tags, excluding the text in other tags such as <h>, etc.

Please see the code below. It is not my own code; it is an example I got from one of my former questions. Thanks to objects!! Sorry for posting this question again.

Any ideas, please?
Thanks!!
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class HTMLTest extends HTMLDocument
{
    // stores any text found in the document
    public StringBuilder text = new StringBuilder();

    /**
     * Returns any text found in the document during parsing
     */
    public String getText()
    {
        return text.toString();
    }

    // the editor kit feeds parser events to the callback returned here
    @Override
    public HTMLEditorKit.ParserCallback getReader(int pos)
    {
        return new TextCallBack();
    }

    public static void main(String args[])
    {
        try {
            URL url = new URL("http://www.informit.com/articles/article.aspx?p=31059");
            Reader reader = new InputStreamReader(
                url.openConnection().getInputStream());
            EditorKit editorKit = new HTMLEditorKit();
            HTMLTest htmlText = new HTMLTest();
            htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            // Parse the HTML
            editorKit.read(reader, htmlText, 0);

            // Get the extracted text
            String text = htmlText.getText();
            System.out.println(text);
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    class TextCallBack extends HTMLEditorKit.ParserCallback
    {
        /** Invoked when text is encountered during parsing */
        @Override
        public void handleText(char[] data, int pos)
        {
            text.append(data);
            text.append('\n');
        }
    }
}


extract.txt
description.txt
Question by:Juuno
16 Comments
 
LVL 17

Expert Comment

by:Thomas4019
ID: 24861309
I think using a real XML parser might be an easier solution. They are built into Java as well.
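For what it's worth, here is a minimal sketch of that idea using the DOM parser that ships with the JDK (javax.xml.parsers). It assumes the input is well-formed XHTML. An XML parser will typically reject ordinary "tag soup" HTML, so this may well fail on the page in the question as-is. The class name XmlParagraphSketch and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original thread.
final class XmlParagraphSketch {

    /** Returns the trimmed text content of every <p> element in a well-formed XHTML string. */
    static List<String> paragraphs(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse() throws a SAXException if the markup is not well-formed XML
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = doc.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: every tag is closed, so it is valid XML
        String sample = "<html><body><h1>Title</h1>"
                + "<p>First paragraph</p><p>Second paragraph</p></body></html>";
        for (String p : paragraphs(sample)) {
            System.out.println(p);
        }
    }
}
```

The heading text is left out simply because only the p elements are queried. Unclosed <p> tags would make the parse call throw before any text is returned.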
 
LVL 86

Expert Comment

by:CEHJ
ID: 24861330
Implement handleStartTag too, saving the current tag in an instance variable. Only append the text if that tag is P.
 
LVL 92

Expert Comment

by:objects
ID: 24866271
try this:

import javax.swing.text.*;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTML.Tag;

import java.net.*;
import java.util.*;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }

      class TextCallBack extends HTMLEditorKit.ParserCallback {
            // true while the parser is between a <p> start tag and its end tag
            private boolean inP = false;
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (t.equals(Tag.P)) {
                        inP = false;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (t.equals(Tag.P)) {
                        inP = true;
                  }
            }


            /** Invoked when text is encountered during parsing */
            @Override
            public void handleText(char[] data, int pos) {
                  if (inP) {
                        text.append(data);
                        text.append('\n');
                  }
            }
      }
}

Author Comment

by:Juuno
ID: 24866401
@ objects - Thanks! But it still generates the same results.
 
LVL 92

Expert Comment

by:objects
ID: 24866420
what url are you testing with?
 
LVL 92

Expert Comment

by:objects
ID: 24866425
the url in your question doesn't actually use p tags to wrap its text, so you'll need more than just looking for p tags.
 

Author Comment

by:Juuno
ID: 24866434
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html

It seems that it also outputs the text inside <h> tags.
 
LVL 92

Expert Comment

by:objects
ID: 24866456
so would you be better off stripping out tags instead of just including p tags?
 

Author Comment

by:Juuno
ID: 24866547
Ya.. I think I need to strip out all the other tags which are not p tags.
 
LVL 92

Expert Comment

by:objects
ID: 24866556
that will still strip out most of the text on that page, as the majority of the text is not inside a p.
 
 

Author Comment

by:Juuno
ID: 24866563
Ya.. it strips out all the other text but not <h>. I think that's how HTMLEditorKit works. Is that right?
 
LVL 92

Expert Comment

by:objects
ID: 24866570
the code I posted does strip out the h tags (and all others), it only returns text that is inside a p tag
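One way to check this offline is to run the same start/end-tag logic against a small, well-formed HTML string instead of a live URL. The sketch below is only an illustration and not the code posted earlier in the thread: it drives the callback with ParserDelegator directly rather than through an HTMLDocument subclass, and the class name PTextOnlySketch and the sample markup are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original thread.
final class PTextOnlySketch {

    /** Returns the text reported while inside a <p> element, one chunk per line. */
    static String extract(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // true between a <p> start tag and its end tag
            private boolean inP = false;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.P) {
                    inP = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.P) {
                    inP = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inP) {
                    text.append(data).append('\n');
                }
            }
        };
        // true = ignore any charset directive in the markup
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample input with properly closed <p> elements
        String sample = "<html><body><h1>Title</h1><p>Para one</p>"
                + "<h2>Sub</h2><p>Para two</p></body></html>";
        System.out.print(extract(sample));
    }
}
```

With properly wrapped paragraphs the heading text is dropped, which is the behaviour described above. Pages that do not close or wrap their paragraphs may behave differently.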
 
LVL 92

Expert Comment

by:objects
ID: 24866587
I think I see the problem: the html in that page is a mess

http://validator.w3.org/check?uri=http%3A%2F%2Fsunsite.nus.edu.sg%2FSEAlinks%2Fburma-info.html&charset=%28detect+automatically%29&doctype=Inline&group=0

The problem in your case is that the p tag is being used to mark a break, and is not being used to 'wrap' paragraph text as it should be.
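To see what the parser actually reports for markup like this, it can help to log every callback event. The following is a rough sketch under stated assumptions: the class name ParserEventTrace and the sample string (with unclosed <p> tags used as separators) are invented for illustration. No expected output is shown for the sample, because how the parser recovers from unclosed tags depends on its own error handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical debugging helper, not part of the original thread.
final class ParserEventTrace {

    /** Parses the HTML and returns one entry per parser callback, in order. */
    static List<String> trace(String html) throws IOException {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("START:" + t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("END:" + t);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("SIMPLE:" + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("TEXT:" + new String(data));
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return events;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample: <p> used as a separator and never closed
        String sample = "<html><body><h2>Heading</h2>"
                + "First block of text<p>Second block of text<p>Third block</body></html>";
        for (String event : trace(sample)) {
            System.out.println(event);
        }
    }
}
```

Comparing where the TEXT entries fall relative to the START:p and END:p entries shows which text a p-only filter would keep for a given page.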
 
LVL 92

Accepted Solution

by:
objects earned 500 total points
ID: 24866749
You may be better off stripping out the items you don't want instead of looking for text inside p tags.

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTML.Tag;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }


      class TextCallBack extends HTMLEditorKit.ParserCallback {
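            // stays false until the first excluded element closes, so text before that point is skipped
            private boolean interested = false;
            // text inside these elements is left out of the result
            private final Set<Tag> notInterested = new HashSet<Tag>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));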
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (tagTest(t)) {
                        interested = true;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (tagTest(t)) {
                        interested = false;
                  }
            }

            private boolean tagTest(Tag t) {
                  return notInterested.contains(t);
            }
            
            /** Invoked when text is encountered during parsing */
            @Override
            public void handleText(char[] data, int pos) {
                  if (interested) {
                        text.append(data);
                        text.append('\n');
                  } else {
                        System.out.println("skipping "+new String(data));
                  }
            }
      }
}
 
LVL 86

Expert Comment

by:CEHJ
ID: 24867156
You would be much better off using a proper API. The code below produces the output in the attached file (burma.txt).
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.HTMLElement;
 
public class Burma {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
        HTMLElement[] paras = wr.getElementsByTagName("p");
        for (HTMLElement para : paras) {
            System.out.println(para.getText());
        }
 
    }   
}


burma.txt