Solved

HTMLEditorKit to extract text in <p> tags only

Posted on 2009-07-15
Last Modified: 2012-05-07
HTMLEditorKit can extract just the text of a web page. For example, for this link:
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html
it extracts text like this: (please see the attachment, extract.txt)
But I want to extract only this: (please see the attachment, description.txt)
What I mean is that I want only the text inside certain tags, say <p>, excluding the text in other tags such as the <h> headings.

Please see the code. This is not my own code; it is an example I got from my former questions. Thanks to objects!! Sorry for posting this question again.

Any ideas, please?
Thanks!!
import javax.swing.text.*;
import java.io.*;
import javax.swing.text.html.*;
import java.net.*;
import java.util.*;
 
 
	public class HTMLTest extends HTMLDocument
	{
	    // stores any text found in document
 
		public StringBuilder text = new StringBuilder();
 
	    /**
	    *  Returns any text found in the document during parsing
	    */
 
	    public String getText()
	    {
	        return text.toString();
	    }
 
	    public HTMLEditorKit.ParserCallback getReader(int pos)
	    {
	        return new TextCallBack();
	    }
	    
	    public static void main(String args[]){
	    	ArrayList result=new ArrayList();
	    	try{
	    	URL url = new URL("http://www.informit.com/articles/article.aspx?p=31059");
	    	Reader reader = new InputStreamReader(
	    	   url.openConnection().getInputStream());
	    	EditorKit editorKit = new HTMLEditorKit();
	    	HTMLTest htmlText = new HTMLTest();
	    	htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
 
	    	// Parse the HTML
 
	    	editorKit.read(reader, htmlText, 0);
 
	    	// Get the extracted text
 
	    	String text = htmlText.getText();
	    	System.out.println(text);
	    	}catch(Exception e){System.out.println(e);}
	    }
	    
	    class TextCallBack extends HTMLEditorKit.ParserCallback
	    {
	       /** Invoked when text is encountered during parsing */
	    	
	    	public TextCallBack(){}
 
	       public void handleText(char[] data, int pos)
	       {
	          text.append(data);
	          text.append('\n');
	       }
	    }
	}


extract.txt
description.txt
Question by:Juuno
16 Comments
 
LVL 17

Expert Comment

by:Thomas4019
ID: 24861309
I think using a real XML parser might be an easier solution. They are built into Java as well.
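
A minimal sketch of that idea, using the JDK's built-in DOM parser. It assumes the input is well-formed XHTML, which real-world pages (quite possibly including the one in the question) often are not; in that case the parse fails with an exception. The class name `XmlParagraphSketch` and the method `extractParagraphs` are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlParagraphSketch {

    /**
     * Returns the trimmed text content of every <p> element.
     * Assumption: the input is well-formed XML (XHTML); otherwise
     * an IllegalArgumentException wrapping the parse error is thrown.
     */
    public static List<String> extractParagraphs(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            NodeList paragraphs = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // getTextContent() also includes text of nested elements
                result.add(paragraphs.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Heading</h1>"
                + "<p>First paragraph.</p><p>Second paragraph.</p></body></html>";
        for (String paragraph : extractParagraphs(xhtml)) {
            System.out.println(paragraph);
        }
    }
}
```

For pages that are not valid XML, a forgiving HTML parser is likely the more practical route.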
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 24861330
Implement handleStartTag too, saving the current tag in an instance variable. Only append if the value is P
0
 
LVL 92

Expert Comment

by:objects
ID: 24866271
try this:

import javax.swing.text.*;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTML.Tag;

import java.net.*;
import java.util.*;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }

      class TextCallBack extends HTMLEditorKit.ParserCallback {
            /** Invoked when text is encountered during parsing */

            private boolean inP = false;
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (t.equals(Tag.P)) {
                        inP = false;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (t.equals(Tag.P)) {
                        inP = true;
                  }
            }


            public void handleText(char[] data, int pos) {
                  if (inP) {
                        text.append(data);
                        text.append('\n');
                  }
            }
      }
}
0

Author Comment

by:Juuno
ID: 24866401
@ objects - Thanks! But it still generates the same results.
0
 
LVL 92

Expert Comment

by:objects
ID: 24866420
what url are you testing with?
0
 
LVL 92

Expert Comment

by:objects
ID: 24866425
the url in your question doesn't actually use p tags so you'll need more than just looking for p tags.
0
 

Author Comment

by:Juuno
ID: 24866434
http://sunsite.nus.edu.sg/SEAlinks/burma-info.html

It seems that it also outputs the text inside <h> tags.
0
 
LVL 92

Expert Comment

by:objects
ID: 24866456
so would you be better off stripping out tags instead of just including p tags?
0
 

Author Comment

by:Juuno
ID: 24866547
ya.. i think i need to strip out all other tags which are not p tags.
0
 
LVL 92

Expert Comment

by:objects
ID: 24866556
that will still strip out most of the text on that page, as the majority of the text is not inside a p.
 
0
 

Author Comment

by:Juuno
ID: 24866563
ya.. it strips out all other text but not <h> i think that's how HTMLEditorKit works. is it right?
0
 
LVL 92

Expert Comment

by:objects
ID: 24866570
the code I posted does strip out the h tags (and all others), it only returns text that is inside a p tag
0
 
LVL 92

Expert Comment

by:objects
ID: 24866587
think I see the problem, the html in that page is a mess

http://validator.w3.org/check?uri=http%3A%2F%2Fsunsite.nus.edu.sg%2FSEAlinks%2Fburma-info.html&charset=%28detect+automatically%29&doctype=Inline&group=0

problem in your case is the p tag is being used to mark a break, and is not being used to 'wrap' para text as it should be.
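
One way to check how the parser actually reports such markup is to log every callback for a small test string instead of the live page. The following is only a sketch: the class name `ParserTraceSketch`, the method `trace`, and the sample strings are invented for illustration, and no particular event sequence is claimed for the malformed input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParserTraceSketch {

    /** Parses the HTML string and records each parser callback as one line. */
    public static List<String> trace(String html) {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                events.add("START " + t);
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                events.add("END " + t);
            }

            @Override
            public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
                events.add("SIMPLE " + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("TEXT " + new String(data));
            }
        };
        try {
            // true = ignore any charset directive in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // <p> used as a separator rather than as a wrapper around the text
        String separatorStyle = "<h2>Heading</h2>line one<p>line two<p>line three";
        for (String event : trace(separatorStyle)) {
            System.out.println(event);
        }
    }
}
```

Comparing the trace for separator-style markup with the trace for properly wrapped `<p>...</p>` paragraphs shows which text the parser considers to be inside a p at each point.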
0
 
LVL 92

Accepted Solution

by:
objects earned 2000 total points
ID: 24866749
You may be better off stripping out the items you don't want instead of looking for text inside p tags.

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTML.Tag;

public class HTMLTest extends HTMLDocument {
      // stores any text found in document

      public StringBuilder text = new StringBuilder();

      /**
       * Returns any text found in the document during parsing
       */

      public String getText() {
            return text.toString();
      }

      public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new TextCallBack();
      }

      public static void main(String args[]) {
            ArrayList result = new ArrayList();
            try {
                  URL url = new URL(
                              "http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
                  Reader reader = new InputStreamReader(url.openConnection()
                              .getInputStream());
                  EditorKit editorKit = new HTMLEditorKit();
                  HTMLTest htmlText = new HTMLTest();
                  htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

                  // Parse the HTML

                  editorKit.read(reader, htmlText, 0);

                  // Get the extracted text

                  String text = htmlText.getText();
                  System.out.println(text);
            } catch (Exception e) {
                  System.out.println(e);
            }
      }


      class TextCallBack extends HTMLEditorKit.ParserCallback {
            /** Invoked when text is encountered during parsing */

            private boolean interested = false;
            private final Set<Tag> notInterested = new HashSet<>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));
            
            public TextCallBack() {
            }

            
            @Override
            public void handleEndTag(Tag t, int pos) {
                  super.handleEndTag(t, pos);
                  if (tagTest(t)) {
                        interested = true;
                  }
            }


            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                  super.handleStartTag(t, a, pos);
                  if (tagTest(t)) {
                        interested = false;
                  }
            }

            private boolean tagTest(Tag t) {
                  return notInterested.contains(t);
            }
            
            public void handleText(char[] data, int pos) {
                  if (interested) {
                        text.append(data);
                        text.append('\n');
                  } else {
                        System.out.println("skipping "+new String(data));
                  }
            }
      }
}
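
A possible variation on the code above, offered as a sketch rather than as the accepted answer's method: with a single boolean, the end of an inner excluded tag (for example an `a` inside an `h2`) switches collection back on while the outer tag is still open, and nothing is collected until the first excluded tag has closed. Counting how many excluded tags are currently open avoids both effects. The class name `ExcludeTagsSketch` and the method `extract` are hypothetical, and the input is parsed from a string so it can be tried without the network.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ExcludeTagsSketch {

    // Same tags as in the code above; extend as needed
    private static final Set<Tag> EXCLUDED =
            new HashSet<>(Arrays.asList(Tag.H1, Tag.H2, Tag.H3, Tag.H4, Tag.A));

    /** Returns every text chunk that is not inside one of the excluded tags. */
    public static List<String> extract(String html) {
        final List<String> kept = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Number of excluded tags that are currently open
            private int excludedDepth = 0;

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
                if (EXCLUDED.contains(t)) {
                    excludedDepth++;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                // Guard against stray end tags in malformed pages
                if (EXCLUDED.contains(t) && excludedDepth > 0) {
                    excludedDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (excludedDepth == 0) {
                    kept.add(new String(data));
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Keep me</p>"
                + "<h2><a href=\"#\">link</a> tail</h2><p>Also keep</p></body></html>";
        for (String chunk : extract(html)) {
            System.out.println(chunk);
        }
    }
}
```

One limitation to keep in mind: if a page opens an excluded tag and never closes it, the counter stays above zero and the remaining text is skipped.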
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 24867156
You would be much better off using a proper API. The code below produces the output in the attached file (burma.txt).
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.HTMLElement;
 
public class Burma {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://sunsite.nus.edu.sg/SEAlinks/burma-info.html");
        HTMLElement[] paras = wr.getElementsByTagName("p");
        for (HTMLElement para : paras) {
            System.out.println(para.getText());
        }
 
    }   
}


burma.txt
0
