mos

asked on

How to remove whitespace text-nodes from XML DOM

Hello,

I have two DOM objects, and to make them comparable I need to remove every text node that contains only whitespace characters.

What is the simplest way to remove these nodes from my DOM?

I think there has to be a routine out there for this.

I'm using Java 1.3 and Apache Xerces 2.5.

Thanks
mos
sudhakar_koundinya



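    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.SAXException;

    // Parses an XML file and returns a DOM document.
    // If validating is true, the content is validated against the DTD
    // specified in the file. Returns null if parsing fails.
    public static Document parseXmlFile(String filename, boolean validating) {
        try {
            // Create a builder factory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);

            // Create the builder and parse the file
            Document doc = factory.newDocumentBuilder().parse(new File(filename));
            return doc;
        } catch (SAXException e) {
            // A parsing error occurred; the XML input is not valid
        } catch (ParserConfigurationException e) {
            // The requested parser configuration is not available
        } catch (IOException e) {
            // The file could not be read
        }
        return null;
    }
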
    public void remove() {
        Document doc = parseXmlFile("infilename.xml", false);

        // Remove all <junk> elements
        removeAll(doc, Node.ELEMENT_NODE, "junk");

        // Remove all comment nodes
        removeAll(doc, Node.COMMENT_NODE, null);

        // Normalize the DOM tree to combine all adjacent text nodes
        doc.normalize();
    }

    // This method walks the document and removes all nodes
    // of the specified type and specified name.
    // If name is null, then the node is removed if the type matches.
    public static void removeAll(Node node, short nodeType, String name) {
        if (node.getNodeType() == nodeType &&
                (name == null || node.getNodeName().equals(name))) {
            node.getParentNode().removeChild(node);
        } else {
            // Visit the children
            NodeList list = node.getChildNodes();
            for (int i = 0; i < list.getLength(); i++) {
                removeAll(list.item(i), nodeType, name);
            }
        }
    }

The code above may help a little, but you still have to identify the whitespace-only text nodes yourself.
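Finding the whitespace-only text nodes can be sketched as follows. This is a minimal sketch, not code from the thread: the class `WhitespaceStripper` and its methods `isXmlWhitespace`, `stripWhitespace`, and `parse` are hypothetical names chosen for illustration. It uses only DOM Level 2 calls (`getFirstChild`, `getNextSibling`, `removeChild`) and no generics, so it should be adaptable to an older JDK, though that is an assumption and has not been checked against Java 1.3.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class; the names are illustrative, not from the thread.
class WhitespaceStripper {

    // True if s consists only of XML whitespace (space, tab, CR, LF).
    static boolean isXmlWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    // Recursively removes every text node that contains only whitespace.
    // CDATA sections have a different node type and are left untouched.
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling before the child is possibly removed.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && isXmlWhitespace(child.getNodeValue())) {
                node.removeChild(child);
            } else {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    // Convenience parser for trying the helper on an XML string.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }
}
```

Calling `WhitespaceStripper.stripWhitespace(doc)` on each of the two existing documents before comparing them works on a DOM that is already built, so no re-parsing is needed. Text such as `" y "` inside an element is kept as-is because it is not whitespace-only.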
mos (ASKER)

CEHJ: That doesn't help, because I have a ready-made DOM object and so can't use the DocumentBuilderFactory anymore, right?!

Sudhakar: Thanks for the code. This looks like the manual way, which could be a little slow and cause problems for large XML documents. Isn't there an API call for this? I think a lot of people need this functionality...
AFAIK, that is the solution. Anyhow, I will try to give another solution if I find one.

Regards
Sudha
setIgnoringElementContentWhitespace() simply does not work if no DTD is specified!
>>That doesn't help, because I have a ready-made DOM object

I see. Then you'll have to visit the nodes as sudhakar has mentioned
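The DTD point about setIgnoringElementContentWhitespace() can be checked with a small experiment. This is a sketch under stated assumptions: the class `IgnorableWhitespaceDemo`, the constants `BODY` and `DTD`, and the method `rootChildCount` are hypothetical names, and the sample document is made up. The factory can only treat whitespace as ignorable when a content model tells it that an element holds child elements only, which is why the setting needs a DTD and validating mode.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; names and sample data are illustrative only.
class IgnorableWhitespaceDemo {

    // Sample document body with whitespace around the <a> element.
    static final String BODY = "<root>\n  <a>x</a>\n</root>";

    // Internal DTD subset declaring that <root> has element-only content.
    static final String DTD =
            "<!DOCTYPE root [<!ELEMENT root (a)><!ELEMENT a (#PCDATA)>]>";

    // Parses xml with setIgnoringElementContentWhitespace(true) and
    // returns how many child nodes the root element ends up with.
    static int rootChildCount(String xml, boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);
        factory.setIgnoringElementContentWhitespace(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return doc.getDocumentElement().getChildNodes().getLength();
    }
}
```

Comparing `rootChildCount(BODY, false)` with `rootChildCount(DTD + BODY, true)` shows the difference: without a DTD the two whitespace text nodes around `<a>` stay in the tree, while with the DTD and validation switched on only the element child remains. This only helps at parse time, so it does not apply to a DOM that already exists.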
mos (ASKER)

Hi sudhakar,

I tried your code, but it doesn't work.

Reason:

You hold the child nodes with NodeList list = node.getChildNodes();

Then you remove nodes with node.getParentNode().removeChild(node);

When control returns to the loop over the NodeList, one element has already been removed from the list (it is live), so the index now points at the wrong node and the following sibling is skipped. :(
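One way around the live NodeList problem is to walk the list backwards, so that removing the item at index i only shifts items that have already been visited. The following is a minimal sketch of that idea applied to the removeAll method from earlier in the thread; the class name `SafeRemover` and the `parse` helper are hypothetical and exist only to make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical wrapper class; only the loop direction differs from the
// removeAll method posted earlier in the thread.
class SafeRemover {

    // Removes all nodes of the given type and name below (and including) node.
    // If name is null, every node of the given type is removed.
    static void removeAll(Node node, short nodeType, String name) {
        if (node.getNodeType() == nodeType
                && (name == null || node.getNodeName().equals(name))) {
            node.getParentNode().removeChild(node);
        } else {
            NodeList list = node.getChildNodes();
            // Iterate from the end: a removal at index i leaves the
            // not-yet-visited indices 0..i-1 unchanged in the live list.
            for (int i = list.getLength() - 1; i >= 0; i--) {
                removeAll(list.item(i), nodeType, name);
            }
        }
    }

    // Convenience parser for trying the method on an XML string.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }
}
```

With a forward loop, two adjacent `<junk/>` siblings leave the second one behind, because it slides into the index that was just processed. The backward loop removes both.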
You could also try to turn them both into Strings and do something like

s = s.replaceAll(">\\s+<", "><");