Solved

optimizing text file read in java

Posted on 2006-06-14
12
307 Views
Last Modified: 2012-08-13
hi all--

Using JProfiler, I've figured out that a major bottleneck in a program I've written is how it reads files. I've tried 2 different versions of my read code without much difference in the results (see code below).

I'm wondering if the resources used to read each file are never being released, thus causing bad memory leaks. Perhaps I'm missing the trick to 'finalizing' a file read? Can anyone advise me on how to optimize these file reads?

Here are the 2 versions of my file read.
(Note: there's some other functionality included here; namely, I need to create a list of 'dogIDs' from the files.)

1 (pretty standard)---

                   BufferedReader in = new BufferedReader(_fileReader);
                   StringBuffer sbAllOfLog = new StringBuffer();
                   String line;
                   String sNowDogID;
                   int intDogIDCnt = 0;

                   while ((line = in.readLine()) != null) {
                       // kill 2 birds with one stone: while each line of the text file is handy, check it for a dogID
                       if (line.indexOf("DogID:") != -1) {
                           int stPos = line.indexOf("DogID:") + 6;
                           sNowDogID = line.substring(stPos); // no 'String' here: re-declaring sNowDogID would not compile
                           lstDogIDs.add(sNowDogID);
                           intDogIDCnt++;
                       }
                       sbAllOfLog.append(line + '\n');
                   }
                   in.close(); // also closes the wrapped _fileReader


version 2---

                   File file = new File(sNowFile);
                   InputStream stream = new BufferedInputStream(new FileInputStream(file));
                   byte[] data = new byte[(int) file.length()];
                   int off = 0;
                   // read() may return fewer bytes than requested, so loop until the buffer is full
                   while (off < data.length) {
                       int n = stream.read(data, off, data.length - off);
                       if (n == -1) break;
                       off += n;
                   }
                   stream.close(); // release the file handle as soon as the bytes are in memory
                   String sAllOfLog = new String(data);

                   boolean allDogsFnd = false;
                   int stPos = 0;
                   int fndPos;
                   String sNowDogID;
                   while (!allDogsFnd) {
                       fndPos = sAllOfLog.indexOf("DogID:", stPos);
                       if (fndPos > -1) {
                           int eol = sAllOfLog.indexOf('\n', fndPos);
                           if (eol == -1) eol = sAllOfLog.length(); // the last line may have no trailing newline
                           sNowDogID = sAllOfLog.substring(fndPos + 6, eol).trim();
                           lstDogIDs.add(sNowDogID);
                           stPos = fndPos + 6;
                       } else {
                           allDogsFnd = true;
                       }
                   }
Question by:jacobbdrew
12 Comments
 
LVL 16

Expert Comment

by:suprapto45
ID: 16908277
Well,

You can try using a regex to do it. I think that regex is well suited to this kind of extraction.

David
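A minimal sketch of the regex approach suggested above, assuming each ID is the run of non-whitespace characters following "DogID:" (the class name and method here are invented for illustration, not from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DogIdExtractor {
    // Compiled once and reused; group(1) captures the non-whitespace run after "DogID:"
    private static final Pattern DOG_ID = Pattern.compile("DogID:(\\S+)");

    public static List<String> extract(String log) {
        List<String> ids = new ArrayList<String>();
        Matcher m = DOG_ID.matcher(log);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String log = "line one\nDogID:Rex\nother line\nDogID:Fido\n";
        System.out.println(DogIdExtractor.extract(log)); // [Rex, Fido]
    }
}
```

If the IDs can contain spaces, the pattern would need `DogID:(.*)` plus a `trim()` instead, mirroring the substring logic in the original code.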
 
LVL 1

Author Comment

by:jacobbdrew
ID: 16908586
okay. I'll give it a whirl. Can you point me to any docs/tutorials with info on what you're talking about?
 
LVL 4

Expert Comment

by:v_karthik
ID: 16908682
How big is your file? Does the memory usage stay high the whole time, or does it peak during a certain period?

Regular expression should give you more speed and flexibility in string parsing, but I'm not sure about its memory savings. I see that you require a very simple string operation, so using the string class looks alright to me.

One quick optimization:

int ind = -1;
if ((ind = line.indexOf("DogID:")) != -1) {
    int stPos = ind + 6;
    String sNowDogID = line.substring(stPos);
    lstDogIDs.add(sNowDogID);
    intDogIDCnt++;
}

Basically, I've assigned your first indexOf call to a variable to reuse it later if required.

Is lstDogIDs a Vector? One thing about using a Vector: a Vector is thread-safe, which means it has built-in code to synchronize access to its elements. Synchronization slows down your code drastically. If you are not worried about multiple threads accessing the list, using ArrayList should give you better performance.
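To make the Vector-versus-ArrayList point concrete, here is a small sketch (the class and method names are invented for illustration). `Collections.synchronizedList` is the usual middle ground if several threads really do append to one shared list:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ListChoice {
    // For a list touched by a single thread, plain ArrayList avoids
    // Vector's per-call synchronization overhead.
    public static List<String> newDogIdList(boolean sharedAcrossThreads) {
        List<String> ids = new ArrayList<String>();
        // Only pay for synchronization when several parser threads share the list
        return sharedAcrossThreads ? Collections.synchronizedList(ids) : ids;
    }

    public static void main(String[] args) {
        List<String> lstDogIDs = newDogIdList(false);
        lstDogIDs.add("Rex");
        System.out.println(lstDogIDs); // [Rex]
    }
}
```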

 
LVL 16

Expert Comment

by:suprapto45
ID: 16908704
I think that v_karthik has the points.

>>"I've figured out a major bottleneck in a program i've written is how it reads files"
As you said, the bottleneck is in how it reads files, not in the string parsing.

>>"Regular expression should give you more speed and flexibility in string parsing, but I'm not sure about its memory savings."
You are right v_karthik.

Yes, how big is your file?

David
 
LVL 16

Expert Comment

by:suprapto45
ID: 16908752
http://www.precisejava.com/javaperf/j2se/IO.htm

If your file is really big, try to use buffering. Honestly, I have not really tried the above URL ;)

David
 
LVL 4

Expert Comment

by:v_karthik
ID: 16908778
>>If your file is really big, try to use buffering.

The link suggests the use of the BufferedXXX classes. I see that the author is already using BufferedReader, so he's on the right track. To my knowledge, BufferedReader is better than BufferedInputStream if you are sure your data is character data and not a mix of types (ints, strings, etc.).
 
LVL 1

Author Comment

by:jacobbdrew
ID: 16909055
the files are between around 50k and 200k, but there are lots of them (up to 2000). the app runs a number of threads which parse the logs, but after a while i simply run out of memory. (in fact, i can't parse much more than a few hundred files in one go)

at this point, i'm testing out ways to limit the number of files each instance parses, letting the app 'finish' so that all the resources are released, then re-starting until all the files are parsed.

if there's nothing i'm missing with respect to 'closing'/'finalizing' how i read the text file, then maybe my memory leaks are coming from somewhere else.

i guess i'm thinking that a perfectly coded app would be able to keep reading files forever so long as the data in memory is released when no longer in use. is this a bogus assumption?

hmm. maybe i'll try reading all the files without actually parsing them and seeing how many i can get through.
 
LVL 4

Accepted Solution

by:
v_karthik earned 500 total points
ID: 16909069
The size is not much, apparently. You should probably try a thread pool model. You can google for a standard thread pool implementation in Java and use that, or write your own. A thread pool works like this: a thread manager has control over, say, n threads. As "jobs" come in, the manager picks one of the free threads and assigns the job to it. The thread finishes its job and returns to the free pool. This way the number of threads spawned doesn't get out of control.
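Worth noting for Java 5 and later: java.util.concurrent already ships a standard thread pool, so there is no need to write one. A minimal sketch of the model described above (class and method names are invented; the per-file parsing is stubbed out as a comment):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LogPool {
    // Submit one job per file; at most poolSize threads ever run at once,
    // and the rest of the jobs wait in the pool's internal queue.
    public static int parseAll(int fileCount, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger parsed = new AtomicInteger(0);
        for (int i = 0; i < fileCount; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    // ... open, read, parse, and CLOSE one log file here ...
                    parsed.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // accept no new jobs, but let queued ones finish
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return parsed.get();
    }

    public static void main(String[] args) {
        System.out.println(LogPool.parseAll(2000, 25)); // 2000
    }
}
```

The key point for the memory problem: a fixed pool caps how many files are open and buffered at the same time, so peak memory stays roughly poolSize times the largest file, regardless of how many files there are in total.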
 
LVL 1

Author Comment

by:jacobbdrew
ID: 16913123
yeah, i'm running a thread pool, but things are still spinning out of control. the 'fix' i've implemented (more a hack) is to only run the app on a limited number of files at a time, let it close, then restart it. not the most elegant solution, but time's a-wastin'. thanks for your help. more points for ya.

but just to check one more time, based on the code above (version 1), there's nothing more i need to do to 'finalize' how i'm reading the files?
 
LVL 4

Expert Comment

by:v_karthik
ID: 16913264
It looks ok to me... just add my indexOf fix to see if it helps. Just to get it right: does that string buffer you have hold the COMPLETE data contained in all your 2000 files, or is there a separate StringBuffer (and Vector) for each file?

And, as an aside (it's not that useful now, but it's better you know):

sbAllOfLog.append(line + '\n');

is faster if used this way -

sbAllOfLog.append(line).append("\n");
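On the 'finalizing' question that keeps coming up: one defensive pattern worth sketching is putting close() in a finally block, so the file handle is released even when readLine() throws mid-file. This is a sketch rather than the author's code; the class and method names are invented:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class SafeRead {
    // Read a whole file into one String, guaranteeing the reader is closed.
    public static String readAll(File f) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(f));
        try {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } finally {
            in.close(); // runs whether the read succeeded or threw
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("log", ".txt");
        java.io.FileWriter w = new java.io.FileWriter(tmp);
        w.write("DogID:Rex\n");
        w.close();
        System.out.print(SafeRead.readAll(tmp)); // prints the file contents back
        tmp.delete();
    }
}
```

Without the finally, any IOException in the loop would leak one open file descriptor per failure, which adds up fast across thousands of files.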

 
LVL 1

Author Comment

by:jacobbdrew
ID: 16913680
cool. no, the string buffer does not contain all the data from all 2000 files. this read is within a spawned thread, and there's a thread for each file (limited to 25 by the pool)

i will add the indexOf fix.

 
LVL 4

Expert Comment

by:v_karthik
ID: 16913733
ok .. will post again if i have a brainwave later.