
How to remove whitespace text-nodes from XML DOM

Hello,

I have two DOM objects and, to make them comparable, I need to remove every text node that contains only whitespace characters.

What is the simplest way to remove these nodes from my DOM?

I think there has to be a routine out there for this.

I'm using Java 1.3 and Apache Xerces 2.5.

Thanks
mos

sudhakar_koundinya

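import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Parses an XML file and returns a DOM document.
// If validating is true, the content is validated against the DTD
// specified in the file. Returns null if parsing fails.
public static Document parseXmlFile(String filename, boolean validating) {
    try {
        // Create a builder factory
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);

        // Create the builder and parse the file
        Document doc = factory.newDocumentBuilder().parse(new File(filename));
        return doc;
    } catch (SAXException e) {
        // A parsing error occurred; the XML input is not valid
    } catch (ParserConfigurationException e) {
        // The parser could not be configured as requested
    } catch (IOException e) {
        // The file could not be read
    }
    return null;
}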

public void remove()
{
    Document doc = parseXmlFile("infilename.xml", false);

    // Remove all <junk> elements
    removeAll(doc, Node.ELEMENT_NODE, "junk");

    // Remove all comment nodes
    removeAll(doc, Node.COMMENT_NODE, null);

    // Normalize the DOM tree to combine all adjacent text nodes
    doc.normalize();
}

// This method walks the document and removes all nodes
// of the specified type and specified name.
// If name is null, then the node is removed if the type matches.
public static void removeAll(Node node, short nodeType, String name) {
    if (node.getNodeType() == nodeType &&
            (name == null || node.getNodeName().equals(name))) {
        node.getParentNode().removeChild(node);
    } else {
        // Visit the children
        NodeList list = node.getChildNodes();
        for (int i=0; i<list.getLength(); i++) {
            removeAll(list.item(i), nodeType, name);
        }
    }
}

sudhakar_koundinya

The code above may help a little, but you will still have to adapt it to find the text nodes that contain only whitespace.

ASKER CERTIFIED SOLUTION

sudhakar_koundinya


CEHJ

Try calling setIgnoringElementContentWhitespace(true) on the parser factory:

http://java.sun.com/j2se/1.4.2/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)

mos

ASKER

CEHJ: That doesn't help, because I already have a finished DOM object and can't use the DocumentBuilderFactory anymore, right?

Sudhakar: Thanks for the code. This looks like the manual approach, which could be a little slow and cause problems for large XML documents. Isn't there an API call for this? I think a lot of people need this functionality...

sudhakar_koundinya

AFAIK, that is the solution. I will post another solution if I find one.

Regards
Sudha

sudhakar_koundinya

setIgnoringElementContentWhitespace() simply does not work if no DTD is specified!
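
To illustrate the point about the DTD, here is a minimal sketch that parses the same small document once with an internal DTD subset and once without, both times with setIgnoringElementContentWhitespace(true). The class name, the sample XML, and the helper method are made up for illustration; only the JAXP calls come from the API discussed above. It assumes the parser is put in validating mode when a DTD is present, since the flag relies on the content model.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class IgnorableWhitespaceDemo {

    // Sample document with indentation between the child elements.
    static final String BODY = "<root>\n  <a>x</a>\n  <b>y</b>\n</root>";

    // Internal DTD subset declaring that <root> has element-only content.
    static final String DTD =
        "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (a, b)>\n"
        + "  <!ELEMENT a (#PCDATA)>\n"
        + "  <!ELEMENT b (#PCDATA)>\n"
        + "]>\n";

    // Parses the XML with the whitespace flag set and returns how many
    // child nodes the root element ends up with.
    static int countRootChildren(String xml, boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }
}
```

Calling countRootChildren(DTD + BODY, true) should report only the two element children, while countRootChildren(BODY, false) still reports the three whitespace text nodes as well, because without a DTD the parser cannot tell which whitespace is ignorable.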

CEHJ

>>That doesn't help, because I have a ready state DOM-Object

I see. Then you'll have to visit the nodes as sudhakar has mentioned

mos

ASKER

Hi sudhakar,

I tried your code, but it doesn't work.

Reason:

You hold the child nodes with NodeList list = node.getChildNodes();

Then you remove nodes with node.getParentNode().removeChild(node);

When you come back to iterate over the NodeList, one element of the list has been removed (the list is live), so the index
now points at the wrong node and the following sibling is skipped. :(
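
One way around the live-list problem is to walk the siblings directly and remember the next sibling before removing anything. The following is only a sketch under that assumption, not code posted in this thread; the class and method names are hypothetical. It treats a text node as whitespace-only when trim() leaves nothing, and it avoids anything newer than Java 1.3.

```java
import org.w3c.dom.Node;

// Hypothetical helper; the names are illustrative, not from the thread.
final class WhitespaceStripper {

    // Removes every text node below 'node' that contains only whitespace.
    static void removeWhitespaceNodes(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling before a possible removal,
            // so removing 'child' cannot make the loop skip a node.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().length() == 0) {
                node.removeChild(child);
            } else {
                // Descend into elements and other container nodes.
                removeWhitespaceNodes(child);
            }
            child = next;
        }
    }
}
```

Calling WhitespaceStripper.removeWhitespaceNodes(doc) on both DOM objects before comparing them leaves text nodes with real content untouched, including their leading and trailing spaces. Iterating the NodeList backwards (from getLength() - 1 down to 0) would be an alternative that also tolerates removals.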

CEHJ

You could also try to turn them both into Strings and do something like

s = s.replaceAll(">\\s+<", "><");
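
A minimal sketch of the string-based idea, with a hypothetical class name. The replacement has to put the angle brackets back, otherwise the markup itself is damaged. Two caveats: String.replaceAll only exists from Java 1.4 on, so it is not available on Java 1.3, and a plain regex cannot tell markup from content, so whitespace between '>' and '<' inside CDATA sections or comments would be removed too. Serializing the DOM to a String first is left out here.

```java
// Hypothetical helper; the names are illustrative, not from the thread.
final class InterTagWhitespace {

    // Drops whitespace-only runs between a '>' and the next '<',
    // keeping both angle brackets in place.
    static String strip(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }
}
```

For example, InterTagWhitespace.strip("<root>\n  <a> x </a>\n  <b/>\n</root>") returns "<root><a> x </a><b/></root>"; the spaces around x survive because that text is not whitespace-only.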