jstretch

asked on

parsing xml string with DocumentBuilder

I'm using javax.xml.parsers.DocumentBuilder to parse an XML string, but I'm getting unexpected results and have no idea what is wrong. I'm passing the parse() call a ByteArrayInputStream. Is this incorrect? The code below prints the following results:

07/03/29 14:41:49 survey_c length: 13  <--- This should be 1
07/03/29 14:41:49 list_multimedia_c length: 3 <-- this is correct
07/03/29 14:41:49 multimedia_c length: 0 <---this should only print once and should be 3
07/03/29 14:41:49 multimedia_c length: 7
07/03/29 14:41:49 multimedia_c length: 0

The xml string:
"<SURVEY>
  <LIST_MULTIMEDIA>
    <MULTIMEDIA>
      <MULT_TYPE>JPG</MULT_TYPE>
      <MULT_REF>test.jpg</MULT_REF>
      <MULT_DESC>test logo</MULT_DESC>
    </MULTIMEDIA>
  </LIST_MULTIMEDIA>
</SURVEY>"

//the code
            // xmlstring = "<SURVEY>......";
            db = dbf.newDocumentBuilder();
            doc = db.parse(new ByteArrayInputStream(xmlstring.getBytes()));
            SiteReport siteReport = new SiteReport();          
           
            NodeList survey_c = doc.getChildNodes().item(0).getChildNodes();
            System.out.println("survey_c length: " + survey_c.getLength());
            for (int i = 0; i < survey_c.getLength(); i++) {
                Node thisNode = survey_c.item(i);
                // get multimedia references
                if (thisNode.getNodeName().equalsIgnoreCase("LIST_MULTIMEDIA")) {
                    NodeList list_multimedia_c = thisNode.getChildNodes();
                    System.out.println("list_multimedia_c length: " + list_multimedia_c.getLength());
                    for (int j = 0; j < list_multimedia_c.getLength(); j++) {
                        Node multimedia = list_multimedia_c.item(j);
                        NodeList multimedia_c = multimedia.getChildNodes();
                        System.out.println("multimedia_c length: " + multimedia_c.getLength());
                        String type = "";
                        String ref = "";
                        String desc = "";
                        for (int k = 0; k < multimedia_c.getLength(); k++) {
                            Node mediaNode = multimedia_c.item(k);
                            if (mediaNode.getNodeName().toUpperCase().equalsIgnoreCase("MULT_TYPE")) {
                                type = mediaNode.getNodeValue();
                            } else if (mediaNode.getNodeName().toUpperCase().equalsIgnoreCase("MULT_REF")) {
                                ref = mediaNode.getNodeValue();
                            } else if (mediaNode.getNodeName().toUpperCase().equalsIgnoreCase("MULT_DESC")) {
                                desc = mediaNode.getNodeValue();
                            }
                        }
                        siteReport.addMediaFile(new MediaFile(type, ref, desc));
                    }

                    // get survey coordinates
                } else if (thisNode.getNodeName().equalsIgnoreCase("")) {
                    // TODO
                }
            }            
            // add to site report vector
            siteReports.add(siteReport);

SOLUTION
Mayank S
jstretch

ASKER

The parse() method requires an InputStream; passing it a StringReader won't compile.

I tried doc.normalize(), but it still printed the same results. Perhaps this is an encoding issue? I am getting this file out of a zip file using the Java zip classes (ZipFile, ZipEntry, etc.).
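On the encoding question, a minimal sketch may help. It assumes the XML is already held in a String, as in the question. `xmlstring.getBytes()` with no argument uses the platform default charset, while a parser given raw bytes with no XML declaration assumes UTF-8, so the two can disagree for non-ASCII text. The class and method names below are made up for illustration, and `StandardCharsets` needs Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseXmlString {

    // Parse from characters: no byte encoding is involved at all.
    public static Document parseFromReader(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Parse from bytes: encode explicitly as UTF-8 instead of relying on
    // the platform default charset used by the no-argument getBytes().
    public static Document parseFromBytes(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            return db.parse(new ByteArrayInputStream(utf8));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // A non-ASCII character makes encoding differences visible.
        String xml = "<SURVEY><MULT_DESC>caf\u00e9 logo</MULT_DESC></SURVEY>";
        String viaReader = parseFromReader(xml).getDocumentElement().getTextContent();
        String viaBytes = parseFromBytes(xml).getDocumentElement().getTextContent();
        System.out.println("same text from both paths: " + viaReader.equals(viaBytes));
    }
}
```

For ASCII-only XML like the sample in the question, encoding is unlikely to explain wrong node counts. If the XML comes from a zip entry, the stream from ZipFile.getInputStream(entry) can probably be passed to parse() directly, which avoids the String round trip.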

I tried running your code on my machine and it gives me correct results:
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/*
 * Created on Mar 30, 2007
 *
 * To change the template for this generated file go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */

/**
 * @author kchaturv
 *
 * To change the template for this generated type comment go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */
public class TestCase {

    public void TryIt() throws Exception {
        String xmlstring = "<SURVEY><LIST_MULTIMEDIA><MULTIMEDIA><MULT_TYPE>JPG</MULT_TYPE><MULT_REF>test.jpg</MULT_REF>              <MULT_DESC>test logo</MULT_DESC></MULTIMEDIA></LIST_MULTIMEDIA></SURVEY>";

        // the code
        // xmlstring = "<SURVEY>......";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xmlstring)));
        //SiteReport siteReport = new SiteReport();

        NodeList survey_c = doc.getChildNodes().item(0).getChildNodes();
        System.out.println("survey_c length: " + survey_c.getLength());
        for (int i = 0; i < survey_c.getLength(); i++) {
            Node thisNode = survey_c.item(i);
            // get multimedia references
            if (thisNode.getNodeName().equalsIgnoreCase("LIST_MULTIMEDIA")) {
                NodeList list_multimedia_c = thisNode.getChildNodes();
                System.out.println("list_multimedia_c length: " + list_multimedia_c.getLength());
                for (int j = 0; j < list_multimedia_c.getLength(); j++) {
                    Node multimedia = list_multimedia_c.item(j);
                    NodeList multimedia_c = multimedia.getChildNodes();
                    System.out.println("multimedia_c length: " + multimedia_c.getLength());
                    String type = "";
                    String ref = "";
                    String desc = "";
                    for (int k = 0; k < multimedia_c.getLength(); k++) {
                        Node mediaNode = multimedia_c.item(k);
                        if (mediaNode.getNodeName().toUpperCase().equalsIgnoreCase("MULT_TYPE")) {
                            type = mediaNode.getNodeValue();
                        } else if (mediaNode.getNodeName().toUpperCase().equalsIgnoreCase("MULT_REF")) {
                            ref = mediaNode.getNodeValue();
                        } else if (mediaNode.getNodeName().toUpperCase().equalsIgnoreCase("MULT_DESC")) {
                            desc = mediaNode.getNodeValue();
                        }
                    }
                    //siteReport.addMediaFile(new MediaFile(type, ref, desc));
                }

                // get survey coordinates
            } else if (thisNode.getNodeName().equalsIgnoreCase("")) {
                // TODO
            }
        }
        // add to site report vector
        //siteReports.add(siteReport);
    }

    public static void main(String args[]) {
        try {
            new TestCase().TryIt();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


These are the results I got:

survey_c length: 1
list_multimedia_c length: 1
multimedia_c length: 4

So most probably the XML string you are receiving in this method is not what you think it is.
>> the parse() method requires an InputStream, using StringReader wont compile.

Sorry, it has an overload that takes an InputSource; you can use it the way kuldeep has suggested
(the InputSource wraps the StringReader).
Well, that's working. I got better numbers, but they are still a little off. However, the XML I provided was only one node (for simplicity); some of the other nodes have special characters.

What special characters would blow up the parser? Apostrophe, comma? Should I use a regex to replace those characters? (Of course I don't see any < or >, which is obvious.)
The special characters have escape sequences available, e.g., &lt; for <, &gt; for > and &amp; for &. Apostrophes and commas are fine in element text and do not need escaping there.
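To make the escaping concrete, here is a small sketch. It assumes the text is being placed inside an element (or a quoted attribute) before the string is parsed. `escapeXml` and `roundTrip` are hypothetical helpers written for this example, not JDK methods.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EscapeDemo {

    // Hypothetical helper: replace the five characters that have
    // predefined XML entities. Only '<' and '&' are strictly required
    // in element text; the others matter mainly inside attribute values.
    public static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap the escaped text in an element, parse it, and read it back.
    public static String roundTrip(String text) {
        String xml = "<MULT_DESC>" + escapeXml(text) + "</MULT_DESC>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String original = "Tom's <logo> & co, ltd";
        System.out.println("escaped:   " + escapeXml(original));
        System.out.println("read back: " + roundTrip(original));
    }
}
```

The parser turns the entities back into the original characters, so code reading the DOM sees the unescaped text.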
Yeah, it's just more spaces. I had only removed the line returns, but the spaces were messing it up as well.
>>Yeah its just more spaces, I just removed line returns, but spaces was messing it up also.

As I said, the XML parser counts whitespace as valid nodes; that's why normalize() should be used,

or a pre-parser that takes out the spaces and linefeeds from the source.
ASP.NET is much better in this regard. :-)
I guess as per the DOM specification, it is actually supposed to count them :)
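Since the DOM reports whitespace between tags as text nodes, one option is to leave the input alone and skip everything that is not an element while walking the tree. The sketch below uses the XML from the question. `parseRoot` and `childElements` are helper names made up for this example. It also reads values with getTextContent(), because getNodeValue() on an element node is defined to return null.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementChildren {

    // The pretty-printed XML from the question, line breaks and indentation included.
    public static final String XML =
            "<SURVEY>\n  <LIST_MULTIMEDIA>\n    <MULTIMEDIA>\n"
            + "      <MULT_TYPE>JPG</MULT_TYPE>\n"
            + "      <MULT_REF>test.jpg</MULT_REF>\n"
            + "      <MULT_DESC>test logo</MULT_DESC>\n"
            + "    </MULTIMEDIA>\n  </LIST_MULTIMEDIA>\n</SURVEY>";

    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Return only the element children, ignoring text and comment nodes.
    public static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Element survey = parseRoot(XML);
        System.out.println("all child nodes: " + survey.getChildNodes().getLength());
        System.out.println("child elements:  " + childElements(survey).size());
        for (Element list : childElements(survey)) {
            for (Element media : childElements(list)) {
                for (Element field : childElements(media)) {
                    // getTextContent() returns the text inside the element.
                    System.out.println(field.getNodeName() + " = " + field.getTextContent());
                }
            }
        }
    }
}
```

With this input, MULTIMEDIA has seven child nodes (three elements plus four whitespace text nodes), which matches the "multimedia_c length: 7" line in the question.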
Well a simple regex replace with the whitespace char should fix it.

normalize() was not removing the white space...at least with my implementation.
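That matches how the DOM specification describes normalize(): it merges adjacent text nodes and drops empty ones, but a text node holding only spaces or line breaks is not empty, so it stays. The sketch below shows two possible workarounds. One is the regex pre-pass mentioned above; the other removes whitespace-only text nodes after parsing. All helper names are made up for this example. The regex is a blunt tool: it also rewrites whitespace inside comments or CDATA sections if they contain "> <" sequences.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StripWhitespace {

    public static final String XML =
            "<SURVEY>\n  <LIST_MULTIMEDIA>\n    <MULTIMEDIA>\n"
            + "      <MULT_DESC>test logo</MULT_DESC>\n"
            + "    </MULTIMEDIA>\n  </LIST_MULTIMEDIA>\n</SURVEY>";

    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Option 1: pre-process the string, dropping whitespace between tags.
    public static String collapseBetweenTags(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }

    // Option 2: after parsing, remove text nodes that contain only whitespace.
    public static void removeBlankTextNodes(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards because the NodeList is live and shrinks on removal.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeBlankTextNodes(child);
            }
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(XML);
        System.out.println("as parsed:        " + root.getChildNodes().getLength());
        root.normalize();
        System.out.println("after normalize:  " + root.getChildNodes().getLength());
        removeBlankTextNodes(root);
        System.out.println("after removal:    " + root.getChildNodes().getLength());

        Element collapsed = parseRoot(collapseBetweenTags(XML));
        System.out.println("regex pre-pass:   " + collapsed.getChildNodes().getLength());
    }
}
```

Both options leave the text inside MULT_DESC ("test logo") untouched, because that text node is not whitespace-only and does not sit directly between two tags.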

I was going to try Xerces, but it looks a little too bloated for what I need.