Solved

HTMLDocument problem. How can I get the HTML body?

Posted on 2002-05-10
Last Modified: 2008-03-10
I have an HTML document. How can I get, as a String, everything between <BODY> and </BODY>?
For example, if I have <BODY>Bla Bla</BODY>, then I need "Bla Bla".
Thanks in Advance!
Best Regards,
Valeri
Question by:Valeri
10 Comments
 

Expert Comment

by:girionis
ID: 7001121
 First of all load the document into a variable. If "body" is a variable that holds the "<HTML><BODY>blah blah</BODY></HTML>" string then the following will do:

System.out.println(body.substring((body.toLowerCase().indexOf("<body>") + 6), body.toLowerCase().lastIndexOf("</body>")));

  Hope it helps.
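
The substring one-liner above only matches a bare "<body>" tag. As a hedged sketch (not part of the original answer), the same idea can be written with a regular expression so that a body tag carrying attributes, such as <body bgcolor="white">, is also handled. The class name BodySubstringSketch and the method name bodyContents are made up for illustration, and the pattern assumes that no attribute value contains a ">" character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodySubstringSketch {

    // Matches from the opening <body ...> tag to the LAST </body>,
    // ignoring case and letting "." span line breaks.
    // Assumption: attribute values in the body tag contain no '>' character.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*)</body\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the markup between the body tags, or null if there is no body. */
    static String bodyContents(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints: Bla Bla
        System.out.println(bodyContents("<HTML><BODY>Bla Bla</BODY></HTML>"));
    }
}
```

This works on the raw string only, so it returns the markup exactly as written and does no HTML parsing or validation.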

Expert Comment

by:Ovi
ID: 7003099
HTMLDocument (like every implementation of the Document interface) stores the logical structure of the HTML as a tree. All you have to do is navigate through that tree until you find the body element. I will post the methods for that soon.

Author Comment

by:Valeri
ID: 7003130
Hi Ovi,
You are right, but I was unable to navigate through that tree...
I'm waiting for your post! :-)

Accepted Solution

by:
Ovi earned 100 total points
ID: 7003136
This test class expects you to put an "x.html" file in the same directory as the compiled code. Otherwise it works perfectly.

import java.awt.*;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class HTMLDocUtils {
 
  public static final Element getBodyElement(HTMLDocument doc) {
    return(findElement(doc.getRootElements()[0], HTML.Tag.BODY));
  }
 
  public static final Element findElement(Element root, HTML.Tag kind) {
    if(root == null) return(null);
    if(matchElementType(root,  kind)) {
      return(root);
    }
    int count = root.getElementCount();
    if(count > 0) {
      for(int i = 0; i<count; i++) {
        Element child = root.getElement(i);
        Element e = findElement(child, kind);
        if(e != null)
          return(e);
      }
    }
    return(null);
  }
 
  public static final boolean matchElementType(Element e, HTML.Tag type) {
    return(e.getAttributes().getAttribute(StyleConstants.NameAttribute) == type);
  }
 
  public static void main(String[] args) {
    HTMLEditorKit kit;
    HTMLDocument doc;
    kit = new HTMLEditorKit();
    doc = (HTMLDocument)kit.createDefaultDocument();
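    try {
      // x.html must sit next to the compiled class; getResource returns null
      // otherwise, and the resulting exception is printed by the catch below.
      URL file = HTMLDocUtils.class.getResource("x.html");
      InputStream is = file.openStream();
      kit.read(is, doc, 0);
    } catch(Exception e) { e.printStackTrace(); }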
    System.out.println("Document content : ");
    doc.dump(System.out);
    Element body = HTMLDocUtils.getBodyElement(doc);
    if(body != null) {
      System.out.println("Body element detected ************************************************** :");
      System.out.println("Starts at : " + body.getStartOffset());
      System.out.println("Ends at : " + body.getEndOffset());
    } else
      System.out.println("Body element not defined! ************************************************");
  }
}
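
The class above reports where the body element starts and ends, but it stops short of returning the body as a String. A minimal sketch of that last step, assuming the same tree search as findElement above: the class name BodyTextSketch and the method name bodyText are hypothetical, and the HTML is read from a String rather than from x.html to keep the example self-contained. Note that Document.getText returns the parsed text content, not the original markup, so inline tags such as <b> are not included in the result.

```java
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class BodyTextSketch {

    /** Depth-first search for the first element tagged with the given HTML tag. */
    static Element findElement(Element root, HTML.Tag tag) {
        if (root == null) return null;
        if (root.getAttributes().getAttribute(StyleConstants.NameAttribute) == tag) {
            return root;
        }
        for (int i = 0; i < root.getElementCount(); i++) {
            Element found = findElement(root.getElement(i), tag);
            if (found != null) return found;
        }
        return null;
    }

    /** Parses the HTML and returns the text covered by the body element, or null. */
    static String bodyText(String html) {
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            // Avoid a ChangedCharSetException if the page declares a charset.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);

            Element body = findElement(doc.getDefaultRootElement(), HTML.Tag.BODY);
            if (body == null) return null;

            int start = body.getStartOffset();
            // The last element can extend past the document's reported
            // length, so clamp the end offset before calling getText.
            int end = Math.min(body.getEndOffset(), doc.getLength());
            if (end <= start) return "";
            return doc.getText(start, end - start).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the HTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bodyText("<html><body>Bla Bla</body></html>"));
    }
}
```

If the markup itself is needed rather than the text, HTMLEditorKit.write(Writer, Document, int, int) accepts the same offsets, though its output is re-generated HTML rather than a copy of the source.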

Author Comment

by:Valeri
ID: 7003234
Hi Ovi :-)
I will test this class and will probably give you the points! For now, though, I want to leave the question open.
Thanks a lot!
Valeri

Expert Comment

by:Ovi
ID: 7020766
Did you test it?

Expert Comment

by:samuelvd
ID: 8145579
I did test your code, the output is:

Document content :
Body element detected ************************************************** :
Starts at : 1
Ends at : 85

I wrote a BeanShell script to summarize this and to use it myself:

import javax.swing.text.html.*;
import javax.swing.text.Element;

HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();

fr = new FileReader("Test.html");

/*
 * Insert the contents of fr into the document doc, starting at position 0.
 */
kit.read(fr, doc, 0);

// doc.dump(System.out); // Dump the document contents to standard output.

// Get the root element
element = doc.getDefaultRootElement();

// Print the name of every element in the tree, depth first.
void muestraElementos(Element root) {
     print(root.getName());
     // Count the children of the current node, not of the global root element.
     int count = root.getElementCount();
     for (int i=0; i<count; i++) {
          Element child = root.getElement(i);
          if (child != null)
               muestraElementos(child);
     }
}

muestraElementos(element);


/*****************************************************/

The output of this script is
html
head
p-implied
content
body
p-implied
content
content
table
tr
td
p-implied
content
content
td
p-implied
content
content
tr
td
p-implied
content
content
td
p-implied
content
content


I'm a little confused by the javax.swing.text.Element named "p-implied". What do these elements mean? Apart from them, the output seems fine.

Expert Comment

by:Ovi
ID: 8151542
p-implied behaves like a normal <p> (paragraph) element, but it is generated internally by HTMLDocument under some conditions. If you save the HTML again, or simply read it using editorPane.getText(), you'll see that the p-implied elements are omitted.
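
A small sketch that illustrates the point about p-implied, under the assumption that text placed directly inside <body> is what triggers the implied paragraph. The class name ImpliedParagraphSketch and its methods are hypothetical. It lists the element names of a parsed document and then writes the document back out through HTMLEditorKit.write, where the implied paragraphs should no longer appear.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class ImpliedParagraphSketch {

    /** Parses an HTML string into a fresh HTMLDocument. */
    static HTMLDocument parse(HTMLEditorKit kit, String html) {
        try {
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the HTML", e);
        }
    }

    /** Collects element names depth first, e.g. html, head, body, p-implied, content. */
    static void collectNames(Element e, List<String> names) {
        names.add(e.getName());
        for (int i = 0; i < e.getElementCount(); i++) {
            collectNames(e.getElement(i), names);
        }
    }

    static List<String> elementNames(String html) {
        List<String> names = new ArrayList<String>();
        collectNames(parse(new HTMLEditorKit(), html).getDefaultRootElement(), names);
        return names;
    }

    /** Writes the parsed document back out as HTML text. */
    static String writeBack(String html) {
        try {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = parse(kit, html);
            StringWriter out = new StringWriter();
            kit.write(out, doc, 0, doc.getLength());
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not write the HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>Bla Bla</body></html>";
        System.out.println(elementNames(html)); // the tree includes p-implied
        System.out.println(writeBack(html));    // the written HTML does not
    }
}
```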

Expert Comment

by:samuelvd
ID: 8156195
Ovi, one more question: I have programmed XML DOM documents using Xerces. What issues would be involved in implementing an XHTMLDocument class using the W3C DOM API?

Would this require a lot of effort?

Regards!

Expert Comment

by:Ovi
ID: 8157222
Yes and no, depending on what you really want to achieve and whether you are prepared for a considerable effort. The text package is the biggest one in Swing and the most complicated too. As a starting point, I suggest you read the articles from Sun about the text package, especially the one called "Customizing a Text Editor" or something similar, in which Tim Prinzing (the guru of the text package there) implements a Java code editor.

http://java.sun.com/products/jfc/tsc/articles/text/editor_kit/index.html
http://java.sun.com/products/jfc/tsc/articles/

I have implemented a WYSIWYG HTML editor myself, but it took a lot of hard work, and the result is not very competitive.