Solved

Read and Modify a very Big CSV file

Posted on 2005-04-13
Last Modified: 2012-08-14
I have a task involving many bzip files, each containing a big CSV file (50MB or more when uncompressed). I have to uncompress them, parse them, replace some of the tokens with new text taken from an Excel sheet, and compress them again.

I am handling the zipping/unzipping part with a Runtime process call, and that works fine. My problem is with the following code, which reads the CSV and makes changes to it. When run normally it exits with an OutOfMemoryError. I then ran it as java -Xms50m -Xmx100m  com.mypkg.utility.CSVReader. It works, but is very slow.

A few questions, then:
(1) How can I manage it without increasing the heap size?
(2) How can the following code be improved?
(3) Is storing the stuff in a StringBuffer and then writing it in a single go a good approach?

Thanks for your time,

Debashish

Here is the code snippet:

[code]
BufferedReader reader = new BufferedReader(new FileReader(csvFileName));
int lineCount = 1;
String DELIMITER = ",";

//Iterate through all lines
for(String line = null; (line = reader.readLine()) != null; lineCount++) {
    int columnCount = 1;

    //Iterate through all tokens in the line
    for(StringTokenizer tokenizer = new StringTokenizer(line, DELIMITER); tokenizer.hasMoreTokens(); columnCount++) {
        Object objElement =  tokenizer.nextElement();
        String strToPrint = (String) objElement;
        //System.out.println("[" + lineCount + DELIMITER + columnCount + "] = " + strToPrint);

        //Some processing to replace an existing token with some new text
        if(conditionSucceeds){
            strToPrint =  "NewText";
            sbResult.append(strToPrint);
        }
        else{
            sbResult.append(strToPrint);
        }

        //We need to take care if the string itself has a delimiter (we know it
        //by the surrounding quotes ""). If it does, append the next token as well.
        if((strToPrint.indexOf("\"") == 0) && hasDelimiter(strToPrint)) {
            //System.out.println("This string has delimiter at " + strToPrint.indexOf(DELIMITER));
            objElement =  tokenizer.nextElement();
            strToPrint = (String) objElement;
            sbResult.append(DELIMITER).append(strToPrint);
        }
        if(columnCount != LASTCOLUMN) sbResult.append(DELIMITER);
    }
    //Now append a new line char
    sbResult.append(LN);
}
String strTemp = sbResult.toString();
//Note: the result of substring() is not assigned, so strTemp is unchanged here
strTemp.substring(0, strTemp.lastIndexOf(DELIMITER));

//Write the CSV to a temporary location
writeToFile(strTemp, strOutFile);
[/code]
Question by:debuchakrabarty
2 Comments
 
LVL 92

Accepted Solution

by:objects (earned 250 total points)
ID: 13770516
Write each processed line to the output file immediately after you have finished processing it.
That way you only need one line in memory at a time.
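
A minimal sketch of that streaming approach, under some assumptions: the class name StreamingCsvRewriter, the replaceToken helper, and the "OldText" match are hypothetical stand-ins for the asker's own condition, and quoted fields containing the delimiter are not handled here.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.StringTokenizer;

class StreamingCsvRewriter {
    static final String DELIMITER = ",";

    // Hypothetical stand-in for the "conditionSucceeds" check in the question.
    static String replaceToken(String token) {
        return token.equals("OldText") ? "NewText" : token;
    }

    // Reads one line, rewrites it, and writes it out before reading the next,
    // so only a single line is held in memory at any time.
    static void rewrite(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            StringBuilder sb = new StringBuilder(line.length());
            // Note: StringTokenizer skips empty fields (",,"), as in the original code.
            StringTokenizer tokenizer = new StringTokenizer(line, DELIMITER);
            while (tokenizer.hasMoreTokens()) {
                sb.append(replaceToken(tokenizer.nextToken()));
                if (tokenizer.hasMoreTokens()) {
                    sb.append(DELIMITER);
                }
            }
            writer.write(sb.toString());
            writer.write('\n');
        }
        // Flush buffered output; the caller owns and closes the underlying streams.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java StreamingCsvRewriter <in.csv> <out.csv>");
            return;
        }
        try (Reader in = new FileReader(args[0]); Writer out = new FileWriter(args[1])) {
            rewrite(in, out);
        }
    }
}
```

Because each line is written as soon as it is processed, memory use stays roughly constant regardless of file size, and no large -Xmx setting should be needed.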
LVL 15

Expert Comment

by:aozarov
ID: 13775029
>> (3) Is storing the stuff in a StringBuffer and then writing it in a single go a good approach?
Simply appending the content to a string buffer suggests that you don't need to apply cross-reference logic between the CSV lines (which might require keeping values in memory).
Therefore, in your case it is the wrong approach, and you should use the technique suggested by objects above.
Also, if you have more memory and prefer not to change your approach (because you want to add cross-reference logic, or for any other reason), then -Xmx100m is definitely not enough for a 50MB input file.
Try using -Xms512m -Xmx512m and you will probably see an increase in performance (though again, that is not the way to go for your simple case).
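
If you do change the heap flags, a quick way to confirm the JVM actually picked them up is to print the maximum heap from inside the program. This is only a small sketch; the class name HeapCheck is made up for illustration.

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same flags as the real program, for example java -Xms512m -Xmx512m HeapCheck, and compare the printed value with what you asked for.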

BTW,
Object objElement =  tokenizer.nextElement();
String strToPrint = (String) objElement;
can be replaced with
String strToPrint = tokenizer.nextToken();