Problem while reading xml file using SAX

Hi,

I have a huge XML file of around 500 MB. I have to parse this file and store the node content in a hash table. Since the file is very big, I decided to go for SAX. When I parse the XML file using SAX, I am not getting the entire node content for some nodes. For example, consider the following XML.

<Employee_data>
<employee>
  <name>aaaa</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
 <employee>
  <name>bbbb</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
<employee>
  <name>cccc</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
<employee>
  <name>dddd</name>
  <doj>2005-09-15 09:48:37.46</doj>
  .......
</employee>
</Employee_data>

When I parse this file, for some <employee> records I am not getting the <doj> node content intact. For some <employee> records I get only partial data, such as 2005-09-15 09:4 for the <employee> named aaaa and 48:37.46 for the next <employee> node, named bbbb.

Does this mean that the SAX parser does not deliver all of the data in the "characters" callback method when the data is long?

Following is a snippet of my code:

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import org.xml.sax.InputSource;

public class ProcessSAX extends DefaultHandler{

// Text passed to the most recent characters() call.
private String strValue;

public static void loadXML(String strFileName){
 SAXParser parser =null;
 try{
        parser = new SAXParser();
        parser.setContentHandler(new ProcessSAX());
        parser.setErrorHandler(new ProcessSAX());
        parser.parse(strFileName);
}catch(Exception objException){
                  Parameters.logErrorMessage(" Error while parsing the xml file in ProcessSAX.loadXML(): "+objException.toString());
}
finally{
       parser = null;
    }
}
public void startElement(String uri, String localName, String qName, Attributes attr){}
      
public void endElement(String uri, String localName, String qName){}

public void characters(char[] chars, int start, int length){
      strValue = new String(chars, start, length);
      System.out.println("\n Value: "+strValue);      
}
      
public static void main(String args[]){
      loadXML("emp.xml");
}
}

How can I get the content of all nodes intact? Can you suggest a solution?

Thanx
hemanexp26Asked:
avinthmCommented:
You say that your XML is around 500 MB. That is a very large file.
Maybe the value in the XML is not correct (it's just a possibility).

Did you verify the data that you got by running your application against the data in the XML?

Here is a very good link for your reference:
http://cafeconleche.org/books/xmljava/chapters/
On SAX: http://cafeconleche.org/books/xmljava/chapters/ch06.html





hemanexp26Author Commented:
Yes. I verified the data that I got after running the application. It is not returning all node contents intact.
avinthmCommented:
Are you just displaying the values, or storing them in a variable?
Can you paste the exact code?

The behaviour is strange.
You can't say "SAX parser doesnt handle the data in "characters" callback method if the data is lengthy one".
hemanexp26Author Commented:

>are u just displaying the values or storing the value in some variable ?

In the "characters" callback I am assigning the char[] array to a String variable and printing the variable.

public void characters(char[] chars, int start, int length){
     strValue = new String(chars, start, length);
     System.out.println("\n Value: "+strValue);    
}


The reference http://cafeconleche.org/books/xmljava/chapters/ch06s07.html

says "when there’s a large amount of text between two tags with no intervening markup, the parser may choose to call characters() multiple times even though it doesn’t need to. Xerces generally won’t pass more than 16K of text in one call. Crimson is limited to about 8K of text per call. At the extreme, I have even seen a parser pass a single character at a time to the characters() method. You must not assume that the parser will pass you the maximum contiguous run of text in a single call to characters(). "
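The behaviour described in that passage can be reproduced without a parser by calling the callback by hand. The following is only a sketch: the class name ChunkDemo and the exact split point of the <doj> value are assumptions made for illustration, not something taken from the poster's run.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkDemo extends DefaultHandler {
    // Each characters() call is recorded separately, like the
    // "new String(chars, start, length)" assignment in the question.
    final List<String> calls = new ArrayList<>();

    @Override
    public void characters(char[] chars, int start, int length) {
        calls.add(new String(chars, start, length));
    }

    public static void main(String[] args) {
        ChunkDemo handler = new ChunkDemo();
        // Simulate a parser whose internal buffer ends in the middle of
        // one <doj> value (hypothetical split point).
        char[] first = "2005-09-15 09:4".toCharArray();
        char[] second = "8:37.46".toCharArray();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);

        // Two separate fragments arrive; only their concatenation is the value.
        System.out.println(handler.calls.size() + " calls: " + handler.calls);
        System.out.println("joined: " + String.join("", handler.calls));
    }
}
```

Neither recorded fragment is the full timestamp on its own, which matches the partial values seen in the question.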

avinthmCommented:
The site mentions the amount of text between two tags, and the Xerces limit of 16K per call.
But the date or name given is much shorter than 16K, I guess.
Moreover, if this were the problem, then the values printed for all doj tags should be the same length.

You have overridden characters(); try not overriding it.
http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)

For debugging purposes you can add one more line to your characters() method:

>   System.out.println(chars + ", offset = " + start + ", length= " + length);
     strValue = new String(chars, start, length);
     System.out.println("\n Value: "+strValue);    
hemanexp26Author Commented:

Thanks, everyone. I got the solution. The problem was that I did not clear the buffer "strValue" between startElement() calls. Once I cleared the buffer at each startElement() call, I got the tag values intact.
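For readers with the same problem, one common way to apply that fix is sketched below. It is not the poster's actual code: the class name EmployeeHandler, the StringBuilder buffer, the name-to-doj map, and the use of the JDK's javax.xml.parsers.SAXParserFactory instead of the Xerces SAXParser class are all assumptions made to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EmployeeHandler extends DefaultHandler {
    // Accumulates the text of the element currently being read.
    private final StringBuilder text = new StringBuilder();
    // Hypothetical target table: employee name -> date of joining.
    private final Map<String, String> dojByName = new HashMap<>();
    private String currentName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attr) {
        text.setLength(0); // clear the buffer at every start tag
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // May be called several times for one element, so append rather than assign.
        text.append(chars, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element's text is complete only once its end tag is reached.
        if ("name".equals(qName)) {
            currentName = text.toString().trim();
        } else if ("doj".equals(qName)) {
            dojByName.put(currentName, text.toString().trim());
        }
        text.setLength(0);
    }

    public static Map<String, String> parse(String xml) throws Exception {
        EmployeeHandler handler = new EmployeeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.dojByName;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Employee_data>"
                + "<employee><name>aaaa</name><doj>2005-09-15 09:48:37.46</doj></employee>"
                + "<employee><name>bbbb</name><doj>2005-09-15 09:48:37.46</doj></employee>"
                + "</Employee_data>";
        System.out.println(parse(xml));
    }
}
```

The value is read in endElement() rather than in characters(), so it does not matter how many fragments the parser delivers. For a large file on disk, the same handler could be passed to the parse(File, DefaultHandler) overload instead of wrapping a string.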

Question has a verified solution.
