Java memory problem while parsing a large text file

I'm creating an application which will parse a text file and pick out appropriate lines.
I'm using a BufferedReader to read in each line one at a time. I scan the line and, if it's what I'm looking for, I add it to a Vector.
I don't think the problem is to do with the Vector getting too big; I think it's to do with the BufferedReader. In any case, the problem definitely lies within the while loop.


The method works fine until I use a text file greater than 20 MB, at which point I get the following error.
I know you can increase the size of the VM, however I'd like to avoid that.
Exception occurred during event dispatching:
java.lang.OutOfMemoryError
      <<no stack trace available>>


The method is below.

//
//       extractText             -      Extracts the text occurring between two strings within a line
//                                          and adds each of the strings found to a vector
//
//       Parameters                  -       String : The file address. String: the string occurring before the word
//                                          String : The string occurring after the word
//
//      Returns                        -        A vector of the extracted strings
//
            
      public  Vector extractText(String localFile,String before, String after)
      {
            File targetfile = new File(localFile);
            //
            // Check if file is there
            //
            
            if(!targetfile.exists() || !targetfile.canRead())
            {
                  System.out.println("The file is either not there or cannot be read");
            }
            
            // Begin .....
            try {
                        FileReader inFile = new FileReader(targetfile);
                        BufferedReader buf_IN = new BufferedReader(inFile);

                        String line;
                        Vector vRetlines = new Vector();
                        int linesread = 0;

                        // While I've not gotten to the end of the file
                        while ((line = buf_IN.readLine())!=null)
                        {
                              
                              linesread++;

                              // The application reads up to 63877 lines and then outputs the error
                              System.out.println(" Lines scanned="+linesread);
                              // Make sure that the line is a valid format, i.e. contains at
                              // least 2 tag types and at least one character for each tag.
                              // There is a problem if there is no substring:
                              // -1 is returned.
                              
                              int min = before.length()+after.length()+2;
                              if(line.length()>=min)
                              {
                        
                                    //Determine the position of each substring
                                    //So I can extract the desired text from the line.
                                          int pos1 = line.indexOf(before);
                                          int pos2 = line.indexOf(after);
                                          if(pos1 == -1 )
                                          {
                                                continue;
                                          }
                                          if(pos2 == -1)
                                          {
                                                pos2= line.length();      
                                          }
                                          pos1+=before.length();
                                          
                                    String substring = line.substring(pos1,pos2);
                                                                        
                                    // To avoid adding duplicates, check with duplicateExists()
                                    if(duplicateExists(vRetlines,substring)==false)
                                    {

                                          //Desired string should begin at the end of the first substring
                                          //and it should end at the start of the second substring
                                          System.out.println(pos1+" "+pos2);

                                          System.out.println("Match found on line " + linesread +"Added " + substring);
                                          vRetlines.add(substring);
                                    }
                              }
                        }
                        // Close the buffer when finished            
                        buf_IN.close();      
                        
                        if(vRetlines.size()!=0)
                        {
                              
                              return vRetlines;      
                        }
                        
                  }catch(Exception e){ e.printStackTrace();}
            
            return null;
            
      }
conorocallaghanAsked:
 
guitaristxCommented:
I see two strings being created and discarded each time through the loop, which is costly in Java.  These are the only objects that would be continually consuming memory in this loop, which is where the problem lies.  OutOfMemoryExceptions occur when Java *gasp* runs out of memory.  By creating String objects for every line in the file, and discarding them, there is a MINIMUM of the input file size that is allocated as dynamic memory.

Rule of thumb:
In Java, if memory usage is a problem, avoid object creation -
* Reuse objects as much as possible
* Avoid using methods that return objects

If you use StringBuffers and chars instead of Strings (provided that you reuse the StringBuffers), your memory consumption will be kept to a minimum.  Increasing the Java heap size is a dodgy fix.
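
For example, the difference between discarding and reusing looks roughly like this (a generic illustration of the idiom only, nothing from the asker's code; 'parts' is a made-up array):

    String[] parts = { "one", "two", "three" };

    // Wasteful: each '+' inside the loop builds and throws away an intermediate String
    String joined = "";
    for (int i = 0; i < parts.length; i++)
    {
        joined = joined + parts[i];
    }

    // Cheaper: one StringBuffer allocated once, grown in place, converted to a String at the end
    StringBuffer buf = new StringBuffer();
    for (int i = 0; i < parts.length; i++)
    {
        buf.append(parts[i]);
    }
    String joined2 = buf.toString();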
 
sciuriwareCommented:
I think you simply cross the 64Mb (total) limit that is the default;
run java with a higher limit, say 100Mb.
;JOOP!
 
sciuriwareCommented:
In effect:   java -Xmx100M <what you used to add>
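
If you want to confirm what limit the VM actually ended up with, you can print it from inside the application (standard Runtime call, nothing specific to this code):

    // Maximum heap the VM will try to use, reported in megabytes (Runtime.maxMemory() needs JDK 1.4+)
    long maxBytes = Runtime.getRuntime().maxMemory();
    System.out.println("Max heap = " + (maxBytes / (1024 * 1024)) + " MB");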

;JOOP!
 
TimYatesCommented:
how many matches do you get?

I wouldn't have thought it should run out of memory that quickly :-(
 
conorocallaghanAuthor Commented:
TimYates

I changed the type of search. One search got 270 matches and another one got 180 matches. These matches are full lines which are saved to the vector.
However, I don't think it is to do with the number of matches, as the application bails out after getting to line 63877 regardless of the number of matches found...

Sciuriware
I was wondering, if I change the Java VM's limit, will that affect other users when they try to run the application on their machines?
 
MogalManicCommented:
It might be in your duplicateExists() function.  I implemented mine like this:

    private boolean duplicateExists(Vector vRetlines, String substring)
    {
        return vRetlines.contains(substring);
    }


I am currently searching a 63000 byte file using a search string that results in a 10% hit ratio and so far I am at line 100448 with 11370 found lines.  The Java application has allocated approximately 13K of memory so far.

In my test data the line size is small (less than 80 characters) so that also might be a factor.
 
aozarovCommented:
>> I was wondering if I change the Java VMs' limit will that affect other users when they try to run the application on their machine??
Setting -Xmx to 100m means that your process can consume up to 100MB before it gets an OutOfMemoryError.
If those users don't have 100M available on their machines then Java will not start the application.
I also think, like sciuriware, that you should increase your max memory limit.

Also change:
String substring = line.substring(pos1,pos2);
to
String substring = new String(line.substring(pos1,pos2));
// As weird as it sounds/looks it has an effect on your memory usage (look at the source code of java.lang.String )

Also, why keep the matches, vRetlines, in a Vector? To check for a match you are going to perform O(N) operations.
I think a HashSet will perform much better in your case, and instead of calling the duplicateExists function you can just call if (vRetlinesSet.contains(substring)) ....
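
Roughly like this (just a sketch; extractTextSet and vRetlinesSet are names I made up, and it assumes the same localFile/before/after parameters as your current method):

    // needs: java.io.BufferedReader, java.io.FileReader, java.util.HashSet, java.util.Set
    public Set extractTextSet(String localFile, String before, String after) throws Exception
    {
        Set vRetlinesSet = new HashSet();
        BufferedReader buf_IN = new BufferedReader(new FileReader(localFile));
        String line;
        while ((line = buf_IN.readLine()) != null)
        {
            int pos1 = line.indexOf(before);
            if (pos1 == -1) continue;
            pos1 += before.length();
            int pos2 = line.indexOf(after, pos1);
            if (pos2 == -1) pos2 = line.length();

            // new String(...) detaches the match from the full line's backing char array
            // (matters on JDKs where substring() shares the parent String's array)
            String substring = new String(line.substring(pos1, pos2));

            // add() silently ignores duplicates, and contains()/add() are O(1),
            // so no duplicateExists() scan over a Vector is needed
            vRetlinesSet.add(substring);
        }
        buf_IN.close();
        return vRetlinesSet;
    }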
 
sciuriwareCommented:
aozarov, I disagree: java will even start with a limit of 1400M on a 200M system
but will simply not grant requests over 200M.

conorocallaghan, always be aware of how much memory (peak) your process ought
to use. If it's much more, then you are not returning abandoned memory.

Whenever you run an application that grows bigger than the machine (virtual),
you are in trouble.
;JOOP!
 
aozarovCommented:
>> aozarov, I disagree: java will even start with a limit of 1400M on a 200M system

Small test to prove my claim:
G:\java-temp\jni>java -Xmx1024m Prompt
Class 10719912Class 10719916Class 10719924 (works fine)

G:\java-temp\jni>java -Xmx2024m Prompt
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
 
sciuriwareCommented:
Right, 2024 doesn't work for me either.
;JOOP!
 
guitaristxCommented:
I would recommend re-architecting the code so that you're not using readLine().  Essentially, you're discarding a String object each iteration through the loop, which requires garbage collection.  Instead, read a char at a time, and process it accordingly using char and StringBuffer (or StringBuilder, if you're using Java 1.5), and avoid using String altogether.  That will maximize your memory efficiency.
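
Very roughly, something along these lines (a sketch only, assuming the same localFile/before/after parameters as your method; StringBuffer.indexOf needs JDK 1.4+, and each extracted match still becomes one small String):

    BufferedReader buf_IN = new BufferedReader(new FileReader(localFile));
    StringBuffer lineBuf = new StringBuffer(256);    // allocated once, reused for every line
    int ch;
    while ((ch = buf_IN.read()) != -1)               // one character at a time, no readLine()
    {
        if (ch == '\n')                              // end of line: scan the buffer, then recycle it
        {
            int pos1 = lineBuf.indexOf(before);
            if (pos1 != -1)
            {
                pos1 += before.length();
                int pos2 = lineBuf.indexOf(after, pos1);
                if (pos2 == -1) pos2 = lineBuf.length();
                String match = lineBuf.substring(pos1, pos2);   // only real matches become Strings
                // ... store 'match' in your result collection here ...
            }
            lineBuf.setLength(0);                    // clear the contents, keep the allocated capacity
        }
        else if (ch != '\r')
        {
            lineBuf.append((char) ch);
        }
    }
    // (a last line without a trailing '\n' would need the same scan once more after the loop)
    buf_IN.close();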
 
sciuriwareCommented:
That can't be the real problem; I've processed terabytes using strings and the garbage collector
is probably the smartest thing in the VM.
It's objects that are never released that cause out-of-memory.
;JOOP!
 
aozarovCommented:
>> * Reuse objects as much as possible
Read http://www-128.ibm.com/developerworks/java/library/j-jtp01274.html and you might change your mind.

>> * Avoid using methods that return objects
Why? Where did you get that from?
 
guitaristxCommented:
>> Read http://www-128.ibm.com/developerworks/java/library/j-jtp01274.html and you might change your mind.
What, in particular, about that is supposed to change my mind?  I found nothing that even addresses this issue.

If memory is running out, that means that this algorithm is consuming too much memory.  The methods that I posted above will help keep memory consumption to a minimum.  Notice, I didn't talk about object pooling.

>> * Reuse objects as much as possible
Because discarding objects frivolously is wasteful, especially in loops that will be executed thousands of times.  THIS IS WHY THIS ALGORITHM IS RUNNING OUT OF MEMORY.

>> * Avoid using methods that return objects
Methods that return objects often create objects.  See above.
 
aozarovCommented:
>> I found nothing that even addresses this issue.
The concept of reusing the same objects for a long duration is similar to object pooling as far as the JVM GC is concerned.

>> THIS IS WHY THIS ALGORITHM IS RUNNING OUT OF MEMORY
I don't think so. Short-lived references will not cause you to run out of memory. Long-lived ones might.

>>  The methods that I posted above ...
Not sure how you want to implement that. Can you provide a code sample? Reusing a StringBuffer and clearing it (setLength(0)) between iterations will consume
as much short-term memory as the above approach. It would be nice to see how you implement such logic while avoiding creating short-lived objects (though I don't think
they are the problem).
 
guitaristxCommented:
Considering that there are no long-lived references, the only references being consumed ARE the short-lived ones.

First, setLength(0) (unless the implementors of the StringBuffer class are morons, which I doubt) shouldn't change the _capacity_ of the buffer.  However, let's just assume that it does, since there's not an explicit guarantee that it doesn't.

char[] to the rescue!

I'm not going to write the code for you, but you're guaranteed to have control of memory allocations if you use a char[].
 
sciuriwareCommented:
Whatever links you cite, take this:
last year I worked on a huge project running on an 8-processor Xeon system running W2003.
I had to process around 48Tb and I heavily 'leaned' on String creation,
so I must have created and discarded some 350Tb of objects.
Neither speed nor memory (1Gb) was ever a problem: the SQL database appeared to be the bottleneck.
So, you just can't run slowly or out of memory merely because you create lots of objects.
If you don't believe me, do a decent project and see it for yourself.
;JOOP!