Read and Modify a Very Big CSV File

I have a task involving many bzip files, each containing a large CSV file (on the order of 50MB or more when uncompressed). I have to uncompress them, parse them, replace some of the tokens with new text taken from an Excel sheet, and compress them again.

I am currently handling the compression/decompression by spawning external processes via Runtime, and that part works fine. My problem is with the following code, which reads the CSV and makes changes to it. When run normally it exits with an OutOfMemoryError. I then ran it as java -Xms50m -Xmx100m com.mypkg.utility.CSVReader. That works, but is very slow.

A few questions, then:
(1) How can I manage this without increasing the heap size?
(2) How can the following code be improved?
(3) Is storing everything in a StringBuffer and then writing it out in a single go a good approach?

Thanks for your time,


Here is the code snippet:

StringBuffer sbResult = new StringBuffer();
BufferedReader reader = new BufferedReader(new FileReader(csvFileName));
String DELIMITER = ",";
int lineCount = 1;

//Iterate through all lines
for (String line = null; (line = reader.readLine()) != null; lineCount++) {
    int columnCount = 1;

    //Iterate through all tokens in the line
    for (StringTokenizer tokenizer = new StringTokenizer(line, DELIMITER); tokenizer.hasMoreTokens(); columnCount++) {
        String strToPrint = tokenizer.nextToken();
        //Some processing to replace an existing token with some new text
        strToPrint = "NewText";
        //We need to take care in case the string itself contains the delimiter (we know it
        //by the surrounding quotes ""). If it does, append the next token as well.
        if (strToPrint.indexOf("\"") == 0 && hasDelimiter(strToPrint)) {
            strToPrint = tokenizer.nextToken();
        }
        sbResult.append(strToPrint);
        if (columnCount != LASTCOLUMN) sbResult.append(DELIMITER);
    }
    //Now append a newline char
    sbResult.append("\n");
}
reader.close();

String strTemp = sbResult.toString();
//substring() returns a new string, so the result must be assigned
strTemp = strTemp.substring(0, strTemp.lastIndexOf(DELIMITER));

//Write the CSV to a temporary location
writeToFile(strTemp, strOutFile);
Write each processed line to the output file immediately after you have finished processing it.
That way you only need one line in memory at a time.
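A minimal sketch of that streaming approach (the class name and the token-replacement step are mine, not from the thread; upper-casing stands in for the real Excel-driven lookup, which would go in replaceToken):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class CsvStreamer {
    private static final String DELIMITER = ",";

    // Rebuild one line, replacing each token; only this line is ever in memory.
    static String processLine(String line) {
        String[] tokens = line.split(DELIMITER, -1); // -1 keeps trailing empty fields
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) out.append(DELIMITER);
            out.append(replaceToken(tokens[i]));
        }
        return out.toString();
    }

    // Placeholder for the real lookup against the Excel sheet.
    static String replaceToken(String token) {
        return token.toUpperCase();
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             BufferedWriter writer = new BufferedWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(processLine(line)); // write immediately, no accumulation
                writer.newLine();
            }
        }
    }
}
```

Because each line is written as soon as it is processed, the heap usage stays constant regardless of file size.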
>> (3) Is storing the stuff in a StringBuffer and then writing it in a single go a good approach?
Just adding the content to a string buffer suggests that you don't need any cross-reference logic between the CSV lines (which might require keeping values in memory). For your case it is therefore the wrong approach, and you should use the technique suggested by objects above.
Also, if you have more memory and you prefer not to change your approach (because you want to add cross-reference logic, or for any other reason), then -Xmx100m is definitely not enough for a 50MB input file. Try -Xms512m -Xmx512m and you will probably see a performance increase (though, again, that is not the way to go for your simple case).

The two lines
Object objElement = tokenizer.nextElement();
String strToPrint = (String) objElement;
can be replaced with the single line
String strToPrint = tokenizer.nextToken();
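One caveat worth adding (my aside, not from the answer above): besides avoiding the cast, be aware that StringTokenizer treats runs of adjacent delimiters as one, so empty CSV fields are silently dropped. A small sketch contrasting it with String.split:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        String line = "a,,c"; // the middle field is empty

        List<String> viaTokenizer = new ArrayList<>();
        for (StringTokenizer t = new StringTokenizer(line, ","); t.hasMoreTokens(); ) {
            viaTokenizer.add(t.nextToken()); // no cast needed, unlike nextElement()
        }

        String[] viaSplit = line.split(",", -1); // limit -1 keeps empty fields

        System.out.println(viaTokenizer);    // [a, c]  -- empty field lost
        System.out.println(viaSplit.length); // 3       -- empty field kept
    }
}
```

If the data can contain empty fields, String.split (or a proper CSV parser) is the safer choice.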