Read and modify a very big CSV file

Posted on 2005-04-13
Last Modified: 2012-08-14
I have a task involving a lot of bzip files, each containing a big CSV file (around 50MB or more when uncompressed). I have to uncompress them, parse them, replace some of the tokens with new text taken from an Excel sheet, and compress them again.

For the zipping/unzipping part I call an external process through Runtime, and that works fine. My problem is with the following code, which reads the CSV and makes changes to it. When run normally it exits with an OutOfMemoryError. I then ran it as java -Xms50m -Xmx100m com.mypkg.utility.CSVReader. It works, but is very slow.

A few questions, then:
(1) How can I manage it without increasing the heap size?
(2) How can the following code be improved?
(3) Is storing the stuff in a StringBuffer and then writing it in a single go a good approach?

Thanks for your time,


Here is the code snippet:

BufferedReader reader = new BufferedReader(new FileReader(csvFileName));
StringBuffer sbResult = new StringBuffer();
int lineCount = 1;
String DELIMITER = ",";

//Iterate through all lines
for(String line = null; (line = reader.readLine()) != null; lineCount++) {
      int columnCount = 1;

      //Iterate through all tokens in the line
      for(StringTokenizer tokenizer = new StringTokenizer(line, DELIMITER); tokenizer.hasMoreTokens(); columnCount++) {
            Object objElement =  tokenizer.nextElement();
            String strToPrint = (String) objElement;
            //System.out.println("[" + lineCount + DELIMITER + columnCount + "] = " + strToPrint);
            //Some processing to replace an existing token with some new text
            strToPrint =  "NewText";
            //We need to take care if the string itself has a delimiter (we know it
            //by the surrounding quotes ""). If it does, append the next token as well.
            if((strToPrint.indexOf("\"") == 0) && hasDelimiter(strToPrint)) {
                  //System.out.println("This string has delimiter at " + strToPrint.indexOf(DELIMITER));
                  objElement =  tokenizer.nextElement();
                  strToPrint = (String) objElement;
            }
            //Collect the token in the buffer that holds the whole file
            sbResult.append(strToPrint);
            if(columnCount != LASTCOLUMN) sbResult.append(DELIMITER);
      }
      //Now append a new line char
      sbResult.append("\n");
}
String strTemp = sbResult.toString();
strTemp.substring(0, strTemp.lastIndexOf(DELIMITER));

//Write the CSV to a temporary location
writeToFile(strTemp, strOutFile);

Question by: debuchakrabarty

    Accepted Solution

    Write each processed line to the output file immediately after you have finished processing it.
    That way you only need one line in memory at a time.
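
A minimal sketch of that line-at-a-time approach, under some assumptions: the class StreamingCsvRewriter and its methods are hypothetical names, processLine is only a placeholder for the real token replacement (here it swaps the literal token "old" for "NewText"), and the naive comma split ignores quoted fields.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper; none of these names come from the original post.
class StreamingCsvRewriter {

    // Placeholder for the per-line token replacement described in the question.
    // Assumption: replaces the literal token "old" with "NewText".
    static String processLine(String line) {
        StringBuilder out = new StringBuilder(line.length());
        String[] tokens = line.split(",", -1); // -1 keeps trailing empty columns
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(tokens[i].equals("old") ? "NewText" : tokens[i]);
        }
        return out.toString();
    }

    // Reads a line, processes it, writes it, then moves on to the next one.
    // Only the current line (plus the I/O buffers) is held in memory.
    static long rewrite(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long lineCount = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(processLine(line));
            writer.write('\n'); // fixed line terminator, to keep the output predictable
            lineCount++;
        }
        writer.flush();
        return lineCount;
    }
}
```

With files, this would be called as StreamingCsvRewriter.rewrite(new FileReader(csvFileName), new FileWriter(strOutFile)), closing both streams afterwards. Memory use then depends on the longest line rather than on the file size.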

    Expert Comment

    >> (3) Is storing the stuff in a StringBuffer and then writing it in a single go a good approach?
    Simply appending the content to a StringBuffer suggests that you don't need any cross-reference logic between the CSV lines (which might require keeping values in memory).
    For your case it is therefore the wrong approach, and you should use the technique suggested by objects above.
    Also, if you have more memory available and prefer not to change your approach (because you want to add cross-reference logic, or for any other reason), then -Xmx100m is definitely not enough for a 50MB input file.
    Try -Xms512m -Xmx512m and you will probably see better performance (though again, that is not the way to go for your simple case).
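
As a rough illustration of why 100MB of heap is too tight, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not a measurement: it assumes one char per input byte, 2 bytes per char (UTF-16 storage, as on JVMs without compact strings), and that toString() makes a full copy of the buffer contents. HeapEstimate is a hypothetical name, and the temporary arrays allocated while the buffer grows are not counted.

```java
// Rough estimate only; real heap use depends on the JVM version and settings.
class HeapEstimate {

    // Assumption: chars are stored as 2 bytes each (UTF-16).
    static final long BYTES_PER_CHAR = 2;

    // Bytes needed to hold the whole file in a StringBuffer and then
    // copy it into a String, assuming one char per input byte.
    static long estimateBytes(long fileSizeBytes) {
        long bufferBytes = fileSizeBytes * BYTES_PER_CHAR; // StringBuffer contents
        long stringBytes = fileSizeBytes * BYTES_PER_CHAR; // copy made by toString()
        return bufferBytes + stringBytes;
    }

    public static void main(String[] args) {
        long fileSize = 50L * 1024 * 1024; // the 50MB file from the question
        long mb = 1024 * 1024;
        System.out.println("Estimated MB needed: " + estimateBytes(fileSize) / mb);
        System.out.println("Max heap MB for this JVM: " + Runtime.getRuntime().maxMemory() / mb);
    }
}
```

Under these assumptions a 50MB file needs about 200MB just for the buffer and the final string, which is consistent with the advice that -Xmx100m is not enough.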

    Object objElement =  tokenizer.nextElement();
    String strToPrint = (String) objElement;
    can be replaced with
    String strToPrint = tokenizer.nextToken();
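
One caveat with StringTokenizer for CSV: it treats consecutive delimiters as a single delimiter, so empty columns are skipped and columnCount drifts, and it knows nothing about quotes. A minimal sketch of a quote-aware split follows, assuming fields never span lines. QuotedCsvSplitter is a hypothetical name, and this is a simplification rather than a full CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a simplified splitter, not a complete CSV parser.
class QuotedCsvSplitter {

    // Splits one line on commas, treating commas inside double quotes as data.
    // Quote characters are kept, so each field can be written back unchanged.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // a doubled quote ("") toggles twice, so the state is preserved
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // end of field, may be empty
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field on the line
        return fields;
    }
}
```

For example, splitting a,"b,c",,d this way gives four fields (a, "b,c", an empty field, and d), whereas a StringTokenizer on "," would break the quoted field in two and drop the empty one.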
