Solved

Problem while reading xml file using SAX

Posted on 2006-04-18
Last Modified: 2013-11-19
Hi,

I have a huge XML file, around 500 MB in size. I have to parse it and store the node content in a hash table. Because the file is so large, I decided to go for SAX. When I parse the file with SAX, I do not get the entire node content for some nodes. For example, consider the following XML.

<Employee_data>
<employee>
  <name>aaaa</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
<employee>
  <name>bbbb</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
<employee>
  <name>cccc</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
<employee>
  <name>dddd</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
</Employee_data>

When I parse this file, I do not get the <doj> node content intact for some <employee> records. For some records I get partial data: for example, "2005-09-15 09:4" for the employee named aaaa, and then "48:37.46" for the next <employee> node, named bbbb.

Does this mean that the SAX parser does not deliver the complete data to the "characters" callback method when the data is lengthy?

The following is a snippet of my code:

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import org.xml.sax.InputSource;

public class ProcessSAX extends DefaultHandler {

    // Holds the text delivered by the most recent characters() call
    private String strValue;

    public static void loadXML(String strFileName) {
        SAXParser parser = null;
        try {
            parser = new SAXParser();
            parser.setContentHandler(new ProcessSAX());
            parser.setErrorHandler(new ProcessSAX());
            parser.parse(strFileName);
        } catch (Exception objException) {
            // Parameters is presumably the poster's own logging class (not shown)
            Parameters.logErrorMessage(" Error while parsing the xml file in ProcessSAX.loadXML(): " + objException.toString());
        } finally {
            parser = null;
        }
    }

    public void startElement(String uri, String localName, String qName, Attributes attr) {}

    public void endElement(String uri, String localName, String qName) {}

    // Overwrites strValue on every call, so only the latest chunk of text is kept
    public void characters(char[] chars, int start, int length) {
        strValue = new String(chars, start, length);
        System.out.println("\n Value: " + strValue);
    }

    public static void main(String args[]) {
        loadXML("emp.xml");
    }
}

How can I get the content of all nodes intact? Can you suggest a solution?

Thanks
Question by:hemanexp26
LVL 6

Expert Comment

by:avinthm
ID: 16477350
You say that your XML is around 500 MB. That is a very large file.
Maybe the value in the XML itself is not correct (this is just a possibility).

Did you compare the data your application produced against the data in the XML?

Here is a very good link for your reference:
http://cafeconleche.org/books/xmljava/chapters/
On SAX: http://cafeconleche.org/books/xmljava/chapters/ch06.html





 

Author Comment

by:hemanexp26
ID: 16484898
Yes. I verified the data I got after running the application. Not all node contents come through intact.
 
LVL 6

Expert Comment

by:avinthm
ID: 16484928
Are you just displaying the values, or storing them in a variable?
Can you paste the exact code?

The behaviour is strange.
You can't conclude that the "SAX parser doesn't handle the data in the "characters" callback method if the data is lengthy".

Author Comment

by:hemanexp26
ID: 16495359

>Are you just displaying the values, or storing them in a variable?

In the "characters" callback I am assigning the contents of the char[] array to a string variable and printing the variable.

public void characters(char[] chars, int start, int length){
     strValue = new String(chars, start, length);
     System.out.println("\n Value: "+strValue);    
}


The reference http://cafeconleche.org/books/xmljava/chapters/ch06s07.html

says "when there’s a large amount of text between two tags with no intervening markup, the parser may choose to call characters() multiple times even though it doesn’t need to. Xerces generally won’t pass more than 16K of text in one call. Crimson is limited to about 8K of text per call. At the extreme, I have even seen a parser pass a single character at a time to the characters() method. You must not assume that the parser will pass you the maximum contiguous run of text in a single call to characters(). "
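To make the quoted behaviour concrete, here is a minimal sketch (not the poster's code) that simulates a parser delivering one <doj> value in two characters() calls. The class name ChunkedTextDemo, the helper feedInTwoChunks and the split point are all made up for illustration. The handler appends each chunk to a StringBuilder instead of overwriting a String, and treats the text as complete only in endElement().

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: accumulates text instead of overwriting it.
class ChunkedTextDemo extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private String lastValue;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attr) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        buffer.append(chars, start, length); // append: a later call may bring more text
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lastValue = buffer.toString(); // the text is known to be complete only here
    }

    // Simulates a parser that splits one text node across two characters() calls.
    static String feedInTwoChunks(String first, String second) {
        ChunkedTextDemo handler = new ChunkedTextDemo();
        handler.startElement("", "doj", "doj", null);
        handler.characters(first.toCharArray(), 0, first.length());
        handler.characters(second.toCharArray(), 0, second.length());
        handler.endElement("", "doj", "doj");
        return handler.lastValue;
    }

    public static void main(String[] args) {
        // The split point is arbitrary; a real parser may split anywhere.
        System.out.println(feedInTwoChunks("2005-09-15 09:4", "8:37.46"));
    }
}
```

With `strValue = new String(chars, start, length)` in place of the append, only the second chunk would survive, which matches the partial values described in the question.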

 
LVL 6

Accepted Solution

by:
avinthm earned 750 total points
ID: 16495497
The site talks about the amount of text between two tags, and the Xerces limit of 16K per call.
But the date or name given is much shorter than 16K, I guess.
Moreover, if this were the problem, the values printed for all doj tags should be of the same length.

You have overridden characters(); try not overriding it.
http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)

For debugging purposes you can add one more line to your characters() method:

     // Added debug line; note that chars is printed as an array reference, not as its contents
     System.out.println(chars + ", offset = " + start + ", length= " + length);
     strValue = new String(chars, start, length);
     System.out.println("\n Value: "+strValue);
 

Author Comment

by:hemanexp26
ID: 16541173

Thanks, everyone. I found the solution. The problem was that I did not clear the buffer "strValue" between startElement() calls. Once I cleared the buffer on each startElement() call, I got the tag values intact.
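
The poster's final code is not shown, so the following is only a sketch of what such a fix might look like, under some assumptions: the buffer is a StringBuilder that is cleared in startElement() and appended to in characters(), and each employee's <doj> is stored in a HashMap keyed by <name>. The class name EmployeeHandler, the parse helper and the choice of map key are hypothetical, and it uses the JDK's built-in JAXP parser rather than the Xerces SAXParser class from the question.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps each employee name to its doj value.
class EmployeeHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> dojByName = new HashMap<>();
    private String currentName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attr) {
        text.setLength(0); // clear the buffer at the start of every element
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        text.append(chars, start, length); // may be called several times per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            currentName = text.toString().trim();
        } else if ("doj".equals(qName)) {
            // Assumption: names are unique; duplicates would overwrite earlier entries
            dojByName.put(currentName, text.toString().trim());
        }
    }

    // Parses an XML string and returns the name -> doj map.
    static Map<String, String> parse(String xml) {
        EmployeeHandler handler = new EmployeeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.dojByName;
    }

    public static void main(String[] args) {
        String xml = "<Employee_data>"
                + "<employee><name>aaaa</name><doj>2005-09-15 09:48:37.46</doj></employee>"
                + "<employee><name>bbbb</name><doj>2005-09-15 09:48:37.46</doj></employee>"
                + "</Employee_data>";
        System.out.println(parse(xml));
    }
}
```

For a large file on disk, the same handler could be passed to the SAXParser.parse(File, DefaultHandler) overload instead of an in-memory string, so the document is streamed rather than loaded whole.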
