• Status: Solved

HTMLEditorKit: extracting only the text inside <p> tags

Using HTMLEditorKit, I can extract the text of a web page. For example, for this link:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html
it extracts this: (please see the attachment, extract.txt)
But I want to extract only this: (please see the attachment, description.txt)
What I mean is that I want only the text that is inside, say, <p> tags, excluding text inside other tags such as <h>.

Please see the code below. This is not my own code; it is an example I got from one of my former questions. Thanks to objects! Sorry for posting this question again.

Any ideas, please?
Thanks!
import javax.swing.text.*;
import java.io.*;
import javax.swing.text.html.*;
import java.net.*;
import java.util.*;
 
 
	public class HTMLTest extends HTMLDocument
	{
	    // stores any text found in document
 
		public StringBuilder text = new StringBuilder();
 
	    /**
	    *  Returns any text found in the document during parsing
	    */
 
	    public String getText()
	    {
	        return text.toString();
	    }
 
	    public HTMLEditorKit.ParserCallback getReader(int pos)
	    {
	        return new TextCallBack();
	    }
	    
	    public static void main(String args[]){
	    	ArrayList result=new ArrayList();
	    	try{
	    	URL url = new URL("http://www.informit.com/articles/article.aspx?p=31059");
	    	Reader reader = new InputStreamReader(
	    	   url.openConnection().getInputStream());
	    	EditorKit editorKit = new HTMLEditorKit();
	    	HTMLTest htmlText = new HTMLTest();
	    	htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
 
	    	// Parse the HTML
 
	    	editorKit.read(reader, htmlText, 0);
 
	    	// Get the extracted text
 
	    	String text = htmlText.getText();
	    	System.out.println(text);
	    	}catch(Exception e){System.out.println(e);}
	    }
	    
	    class TextCallBack extends HTMLEditorKit.ParserCallback
	    {
	       /** Invoked when text is encountered during parsing */
	    	
	    	public TextCallBack(){}
 
	       public void handleText(char[] data, int pos)
	       {
	          text.append(data);
	          text.append('\n');
	       }
	    }
	}


extract.txt
description.txt
Asked by: Juuno
 
Thomas4019 commented:
I think using a real XML parser might be an easier solution. They are built into Java as well.
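
As a rough illustration of the XML-parser route, here is a minimal sketch using the JDK's built-in DOM parser. The class name `ParagraphXml` and the helper `paragraphs` are made up for this example. It assumes the input is well-formed XHTML; the JDK's XML parsers throw an exception on markup that is not well-formed, so a page with unclosed tags would have to be cleaned up first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphXml {
    // Hypothetical helper: returns the trimmed text content of every <p> element.
    // Assumption: the markup is well-formed XML; parse() throws otherwise.
    public static List<String> paragraphs(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = doc.getElementsByTagName("p");
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() concatenates the text of all descendant nodes
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Title</h1>"
                + "<p>First para</p><p>Second <b>para</b></p></body></html>";
        for (String p : paragraphs(page)) {
            System.out.println(p);
        }
    }
}
```

The heading text is never visited because only `p` elements are selected; nested inline markup such as `<b>` is flattened into the paragraph's text.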

CEHJ commented:
Implement handleStartTag too, saving the current tag in an instance variable. Only append if the value is P.

objects commented:
Try this:

import javax.swing.text.*;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTML.Tag;

import java.net.*;
import java.util.*;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }

      class TextCallBack extends HTMLEditorKit.ParserCallback {
            // true while the parser is between a <p> start tag and its end tag
            private boolean inP = false;
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (t.equals(Tag.P)) {
                        inP = false;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (t.equals(Tag.P)) {
                        inP = true;
                  }
            }


            public void handleText(char[] data, int pos) {
                  if (inP) {
                        text.append(data);
                        text.append('\n');
                  }
            }
      }
}
 
Juuno (Author) commented:
@objects - Thanks! But it still generates the same results.

objects commented:
What URL are you testing with?

objects commented:
The page at the URL in your question doesn't actually use p tags, so you'll need more than just looking for p tags.

Juuno (Author) commented:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html

It seems that it also extracts the text in <h> tags.

objects commented:
So would you be better off stripping out tags instead of just including p tags?

Juuno (Author) commented:
Yes, I think I need to strip out all other tags that are not p tags.

objects commented:
That will still strip out most of the text on that page, as the majority of the text is not inside a p.

Juuno (Author) commented:
Yes, it strips out all other text but not <h>. I think that's how HTMLEditorKit works. Is that right?

objects commented:
The code I posted does strip out the h tags (and all others); it only returns text that is inside a p tag.

objects commented:
I think I see the problem: the HTML in that page is a mess.

http://validator.w3.org/check?uri=http%3A%2F%2Fsunsite.nus.edu.sg%2FSEAlinks%2Fburma-info.html&charset=%28detect+automatically%29&doctype=Inline&group=0

The problem in your case is that the p tag is being used to mark a break, and is not being used to 'wrap' paragraph text as it should be.
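
To experiment with the difference between wrapping and separating use of p, a small in-memory test can help. The sketch below is an assumption-laden illustration, not code from this thread: the class `PWrapDemo` and helper `textInsideP` are invented names, and it drives the Swing parser directly through `ParserDelegator` on a string instead of fetching the page. It applies the same start-tag/end-tag flag idea as the code posted above.

```java
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PWrapDemo {
    // Hypothetical helper: collects the text reported between a P start tag
    // and the next P end tag, one chunk per line.
    public static String textInsideP(String html) throws Exception {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            private boolean inP = false;

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                if (Tag.P.equals(t)) inP = true;
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                if (Tag.P.equals(t)) inP = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inP) out.append(data).append('\n');
            }
        };
        // third argument: ignore any charset directive in the markup
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Case 1: <p> wraps the paragraph text.
        System.out.println(textInsideP(
                "<html><body><h1>Title</h1><p>Hello world</p></body></html>"));
        // Case 2: <p> used as a separator and never closed. What is reported
        // here depends on the start/end tags the parser infers for the
        // malformed markup, so run it and inspect the output.
        System.out.println(textInsideP(
                "<html><body><h1>Title</h1>Intro<p>Second<p>Third</body></html>"));
    }
}
```

Comparing the two outputs shows how much the result depends on the page's markup quality rather than on the callback logic.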

objects commented:
You may be better off stripping out the items you don't want instead of looking for text inside p tags:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTML.Tag;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }


      class TextCallBack extends HTMLEditorKit.ParserCallback {
            // Text is appended only while this is true. It starts false, flips to
            // true when an excluded tag closes, and back to false when one opens.
            private boolean interested = false;
            // Tags whose text should be skipped
            private final Set<Tag> notInterested = new HashSet<Tag>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (tagTest(t)) {
                        interested = true;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (tagTest(t)) {
                        interested = false;
                  }
            }

            private boolean tagTest(Tag t) {
                  return notInterested.contains(t);
            }
            
            public void handleText(char[] data, int pos) {
                  if (interested) {
                        text.append(data);
                        text.append('\n');
                  } else {
                        System.out.println("skipping "+new String(data));
                  }
            }
      }
}

CEHJ commented:
You would be much better off using a proper API. The code below produces the output in the attached file.
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.HTMLElement;
 
public class Burma {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
        HTMLElement[] paras = wr.getElementsByTagName("p");
        for (HTMLElement para : paras) {
            System.out.println(para.getText());
        }
 
    }   
}

burma.txt