Solved

HTMLEditorKit to extract text in <p> tags only

Posted on 2009-07-15
Last Modified: 2012-05-07
HTMLEditorKit can extract just the text of a web page. For example, for this link:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html
it extracts the text shown in the attachment extract.txt.
But I want to extract only the text shown in the attachment description.txt.
What I mean is that I only want the text of a web page that is inside, say, <p> tags, excluding the text in other tags such as <h>, etc.

Please see the code below. This is not my own code; it is an example I got from one of my former questions. Thanks to objects! Sorry for posting this question again.

Any ideas, please?
Thanks!
import javax.swing.text.*;
import java.io.*;
import javax.swing.text.html.*;
import java.net.*;

public class HTMLTest extends HTMLDocument
{
    // stores any text found in document
    public StringBuilder text = new StringBuilder();

    /**
     *  Returns any text found in the document during parsing
     */
    public String getText()
    {
        return text.toString();
    }

    public HTMLEditorKit.ParserCallback getReader(int pos)
    {
        return new TextCallBack();
    }

    public static void main(String args[]){
        try{
            URL url = new URL("http://www.informit.com/articles/article.aspx?p=31059");
            Reader reader = new InputStreamReader(
               url.openConnection().getInputStream());
            EditorKit editorKit = new HTMLEditorKit();
            HTMLTest htmlText = new HTMLTest();
            htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            // Parse the HTML
            editorKit.read(reader, htmlText, 0);

            // Get the extracted text
            String text = htmlText.getText();
            System.out.println(text);
        }catch(Exception e){System.out.println(e);}
    }

    class TextCallBack extends HTMLEditorKit.ParserCallback
    {
        public TextCallBack(){}

        /** Invoked when text is encountered during parsing */
        public void handleText(char[] data, int pos)
        {
            text.append(data);
            text.append('\n');
        }
    }
}


extract.txt
description.txt
Question by:Juuno
16 Comments
 
LVL 17

Expert Comment

by:Thomas4019
ID: 24861309
I think using a real XML parser might be an easier solution. They are built into Java as well.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 24861330
Implement handleStartTag too, saving the current tag in an instance variable. Only append if the value is P
0
 
LVL 92

Expert Comment

by:objects
ID: 24866271
Try this:

import javax.swing.text.*;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTML.Tag;

import java.net.*;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }

      class TextCallBack extends HTMLEditorKit.ParserCallback {

            // true while the parser is between a <p> start tag and its end tag
            private boolean inP = false;
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (t.equals(Tag.P)) {
                        inP = false;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (t.equals(Tag.P)) {
                        inP = true;
                  }
            }


            /** Invoked when text is encountered during parsing; keeps it only inside a p */
            @Override
            public void handleText(char[] data, int pos) {
                  if (inP) {
                        text.append(data);
                        text.append('\n');
                  }
            }
      }
}
0
 

Author Comment

by:Juuno
ID: 24866401
@ objects - Thanks! But it still generates the same results.
0
 
LVL 92

Expert Comment

by:objects
ID: 24866420
what url are you testing with?
0
 
LVL 92

Expert Comment

by:objects
ID: 24866425
The URL in your question doesn't actually use p tags, so you'll need more than just looking for p tags.
0
 

Author Comment

by:Juuno
ID: 24866434
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html

It seems that it also outputs the text inside <h> tags.
0
 
LVL 92

Expert Comment

by:objects
ID: 24866456
so would you be better off stripping out tags instead of just including p tags?
0
 

Author Comment

by:Juuno
ID: 24866547
ya.. i think i need to strip out all other tags which are not p tags.
0
 
LVL 92

Expert Comment

by:objects
ID: 24866556
That will still strip out most of the text on that page, as the majority of the text is not inside a p.
 
0
 

Author Comment

by:Juuno
ID: 24866563
ya.. it strips out all the other text, but not the text in <h> tags. I think that's how HTMLEditorKit works. Is that right?
0
 
LVL 92

Expert Comment

by:objects
ID: 24866570
the code I posted does strip out the h tags (and all others), it only returns text that is inside a p tag
0
 
LVL 92

Expert Comment

by:objects
ID: 24866587
I think I see the problem: the HTML in that page is a mess.

http://validator.w3.org/check?uri=http%3A%2F%2Fsunsite.nus.edu.sg%2FSEAlinks%2Fburma-info.html&charset=%28detect+automatically%29&doctype=Inline&group=0

The problem in your case is that the p tag is being used to mark a break, and is not being used to 'wrap' paragraph text as it should be.
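
One way to check how the parser sees a page like this is to log every callback it makes. The following is only a minimal sketch, not code from this thread: the class name ParserEventLog, the method trace and the sample HTML string are made up for illustration, and it drives the Swing parser directly through ParserDelegator instead of going through an HTMLDocument subclass. The exact sequence of events (including any end tags the parser implies for an unclosed p) depends on the parser, so run it against the real markup before relying on it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (name is an assumption): records the parser callbacks in order.
final class ParserEventLog {

    /** Parses the given HTML and returns one line per parser callback. */
    static String trace(String html) {
        final StringBuilder log = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                log.append("START ").append(t).append('\n');
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                log.append("END ").append(t).append('\n');
            }

            @Override
            public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
                log.append("SIMPLE ").append(t).append('\n');
            }

            @Override
            public void handleText(char[] data, int pos) {
                log.append("TEXT ").append(data).append('\n');
            }
        };
        try {
            // true = ignore any charset directive found in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: <p> is used as a separator and is never closed.
        String html = "<html><body>Intro text<p>After break"
                + "<h2>Heading</h2>More text</body></html>";
        System.out.print(trace(html));
    }
}
```

Comparing where the TEXT lines fall relative to the START p and END p lines shows which text an "only inside p" filter would keep for a given page.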
0
 
LVL 92

Accepted Solution

by:
objects earned 500 total points
ID: 24866749
You may be better off stripping out the items you don't want instead of looking for text inside p tags:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTML.Tag;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }


      class TextCallBack extends HTMLEditorKit.ParserCallback {

            // Starts as false, so text is only collected once the first
            // excluded tag has been closed.
            private boolean interested = false;
            // Tags whose text should be left out of the result
            private final Set<Tag> notInterested = new HashSet<Tag>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  // leaving an excluded tag: start collecting text again
                  if (tagTest(t)) {
                        interested = true;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  // entering an excluded tag: stop collecting text
                  if (tagTest(t)) {
                        interested = false;
                  }
            }

            private boolean tagTest(Tag t) {
                  return notInterested.contains(t);
            }
            
            /** Invoked when text is encountered during parsing */
            @Override
            public void handleText(char[] data, int pos) {
                  if (interested) {
                        text.append(data);
                        text.append('\n');
                  } else {
                        System.out.println("skipping "+new String(data));
                  }
            }
      }
}
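
The single boolean flag above is switched back on by the end tag of any excluded element, so text that follows an <a> nested inside an <h1> would be collected even though the heading is still open. A possible variation, shown here only as a sketch and not as part of the accepted code, counts how many excluded elements are currently open and collects text only when that count is zero. The class name SkipTagsExtractor, the method extract and the sample HTML are assumptions made for illustration, and the sketch uses ParserDelegator directly on a string instead of the HTMLDocument subclass used in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class (name is an assumption), a variation on the TextCallBack idea.
final class SkipTagsExtractor {

    // Same excluded tags as in the accepted answer.
    private static final Set<Tag> SKIPPED =
            new HashSet<Tag>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));

    /** Returns the text of the HTML that is not inside any excluded tag. */
    static String extract(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Number of excluded elements that are currently open.
            private int depth = 0;

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                if (SKIPPED.contains(t)) {
                    depth++;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                // Guard against a stray end tag pushing the counter below zero.
                if (SKIPPED.contains(t) && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth == 0) {
                    text.append(data).append('\n');
                }
            }
        };
        try {
            // true = ignore any charset directive found in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up sample with a link nested inside a heading.
        String html = "<html><body><h1>Title <a href=\"x\">link</a> tail</h1>"
                + "<p>Body text</p></body></html>";
        System.out.print(extract(html));
    }
}
```

Unlike the flag in the accepted code, the counter starts at zero, so text that appears before the first excluded tag is collected as well. Whether that is wanted depends on the page.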
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 24867156
You would be much better off using a proper API. The code below (which uses HttpUnit) produces the output in the attached file:

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.HTMLElement;

public class Burma {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
        // Collect every <p> element and print its text
        HTMLElement[] paras = wr.getElementsByTagName("p");
        for (HTMLElement para : paras) {
            System.out.println(para.getText());
        }
    }
}


burma.txt
0
