the_b1ackfox asked:

Java's FileVisitor memory problems

Hello Experts,

I have a Java application that uses FileVisitor to walk the directories and read each file, looking for a pattern with Scanner's findWithinHorizon method. Once a match is made, the program stops searching the current directory and moves on to the next one. The issue I'm having is that it takes up too much memory, and the program can't complete a search without throwing an OutOfMemoryError. Does anyone have a way around the out of memory exception?

Thank you!

Fox

Here is the code:
// Imports needed by this snippet:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

private java.util.List<String> searchFiles2(String FilePath, final String pattern)
        throws FileNotFoundException, AccessDeniedException {
    File file = new File(FilePath);
    if (!file.isDirectory()) {
        throw new IllegalArgumentException("file has to be a directory");
    }

    final List<String> FoundDirectories = new ArrayList<String>();
    try {
        Path startPath = Paths.get(FilePath);
        Files.walkFileTree(startPath, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path file, final BasicFileAttributes attrs)
                    throws IOException {
                try {
                    final String name = file.getFileName().toString();
                    // skip hidden directories:
                    if (name.startsWith(".") || new File(file.toString()).isHidden()) {
                        return FileVisitResult.SKIP_SUBTREE;
                    }
                    return FileVisitResult.CONTINUE;
                } catch (NullPointerException ex) {
                    return FileVisitResult.CONTINUE;
                }
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                File myFile = new File(file.toString());
                boolean hidden = myFile.isHidden();

                if (attrs.isRegularFile() && !hidden) {
                    Scanner scanner = new Scanner(file);
                    System.out.print(file + "\n");
                    // a horizon of 0 means "unbounded", so Scanner may buffer the whole file
                    if (scanner.findWithinHorizon(pattern, 0) != null) {
                        FoundDirectories.add(file.toString());
                        scanner.close();
                        return FileVisitResult.SKIP_SIBLINGS;
                    }
                    scanner.close();
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException e)
                    throws IOException {
                System.err.printf("Visiting failed for %s\n", file);
                return FileVisitResult.SKIP_SUBTREE;
            }
        });
    } catch (Exception ex) {
        System.out.print(ex.getMessage());
        return FoundDirectories;
    }
    return FoundDirectories;
}


SOLUTION by dpearson
It could be the HeapCharBuffer when the search encounters a large file.

So you could potentially test this theory by pointing the search at a directory containing the same number of files as the true target directory, but with nominally *small* files in it.
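For instance, a throwaway snippet along these lines could generate such a scratch directory (the class name, file names, and counts here are all arbitrary, just for the experiment):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SmallFilesTest {
    public static void main(String[] args) throws IOException {
        // Fill a temp directory with many tiny files, then point the search at it.
        Path dir = Files.createTempDirectory("small-files-test");
        for (int i = 0; i < 1000; i++) {
            Files.write(dir.resolve("f" + i + ".txt"),
                    "tiny contents\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Point the search at: " + dir);
    }
}

If that run completes cleanly and the real directory does not, the large-file theory looks good.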

(And in fact, without looking at the API more closely, this line:

<<if (name.startsWith(".") || new File(file.toString()).isHidden()) {>>

is possibly treating the file as a file and not a directory, in which case the toString() call is the trouble.) Doug?
in which case the toString() call is the trouble
It certainly isn't necessary
if (name.startsWith(".") || file.toFile().isHidden()) {


is what you want
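To put that in context, a minimal sketch of the directory check with that substitution applied, as a drop-in for preVisitDirectory in the posted visitor (the null guard is my addition, since getFileName() can return null for a filesystem root):

@Override
public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
    Path fileName = dir.getFileName(); // null for a filesystem root
    String name = (fileName == null) ? "" : fileName.toString();
    // Path.toFile() avoids the round trip through toString() and new File(...)
    if (name.startsWith(".") || dir.toFile().isHidden()) {
        return FileVisitResult.SKIP_SUBTREE;
    }
    return FileVisitResult.CONTINUE;
}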
But your code is hard to read because:
a. the naming is bad (see http://technojeeves.com/index.php/aliasjava1/106-java-style-conventions )
b. you didn't use code tags to post

What is 'pattern' - both its type and content? What kind of files are you reading? It's to be hoped they are all text files.
And yes, depending on what's being done, lots of memory could be used with that approach. What is your GOAL?
The exception I get when running your code is this :

C:\EE_Q_CODE\BMBData.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
        at java.nio.CharBuffer.allocate(CharBuffer.java:335)
        at java.util.Scanner.makeSpace(Scanner.java:840)
        at java.util.Scanner.readInput(Scanner.java:795)
        at java.util.Scanner.findWithinHorizon(Scanner.java:1685)
        at java.util.Scanner.findWithinHorizon(Scanner.java:1635)
        at FileVisitor$1.visitFile(FileVisitor.java:61)
        at FileVisitor$1.visitFile(FileVisitor.java:29)
        at java.nio.file.Files.walkFileTree(Files.java:2670)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at FileVisitor.searchFiles2(FileVisitor.java:29)
        at FileVisitor.main(FileVisitor.java:14)

and it only happens when it encounters a very large file - in my case it's one the size of the Bible.
ASKER CERTIFIED SOLUTION
the_b1ackfox (ASKER):

Dpearson:
I like the println suggestion; it would be helpful to see what the code has been working on.

Krakatoa:
The very large file fail is an excellent observation. The files to be evaluated are supposed to be text files only. I haven't seen the error, or whether it chokes on a specific file.

CEHJ:
This is new ground for another developer and me. We found this code snippet lying around somewhere on the internet and thought it might address the problem we were having.
The goal is to search local drives to identify the location of text files that contain a specific marker inside. (There are currently 2 types of these files, each with a specific marker.) We do not want to list out all the files, but rather the directories where the files are found, in hopes of identifying a root folder from which to start processing the files. There could be multiple root folders.
The potential optimisation probably lies in the details you omit. Buffering files in memory is expensive, and so are regexes.
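For illustration only, a sketch of a cheaper shape for the search (the method name is mine; it assumes the marker always fits on one line, and that the files really are UTF-8 text, since a binary file can make the decoder throw):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Stream the file line by line instead of letting Scanner buffer it whole.
static boolean containsMarker(Path file, Pattern marker) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (marker.matcher(line).find()) {
                return true; // first hit wins; memory use stays one line deep
            }
        }
    }
    return false;
}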
I haven't seen the error,
Err, right . . . .

Anyway, this question seems to be drifting off to another place. I thought the issue under consideration was
Does anyone have a way around the out of memory exception?

So increasing the heap space will almost certainly answer that, and it would be nice if you could give us some feedback on the effect of increasing it, cf. 'at java.util.Scanner.makeSpace(Scanner.java:840)'.

(And if there's a chance you could remember code tags next time, that would be very much . . . appreciated).
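(For the record, the heap ceiling is raised with the JVM's -Xmx switch at launch, e.g.

>java -Xmx2g FileVisitor

where 2g is just an example value and FileVisitor is the class name from your stack trace.)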
Is there a better way she should be doing this?  I'll provide feedback on the changes when we get together on Monday.
Is there a better way she should be doing this?
I can only answer that after you've answered the questions I asked
So it turned out to be related to the size of the files it was scanning. She put a size limit on the files, and there were no more errors.
CEHJ: I can only answer that after you've answered the questions I asked

Hardly true. I told you the goal and the situation; you could have answered and given us a better direction, as we have had 1 day (plus the start of this question, minus the weekend) to look at FileVisitor. Java isn't my forte, and everyone has their start somewhere, so that is something I can change over time. Your answers were less than helpful and a waste of time for both of us (us being you and I). I know far less than you on this subject, but I was still trying to help my friend solve the issue. Also, I helped my friend without the condescending attitude displayed in your posts. Help me or don't help when you see my posts, but don't offer help with strings attached; it's not the point here.

Fox
Thank you for your assistance, Krakatoa and Dpearson. Not every scrap of code you find is the gem you hope for. I am unsure about most of the posted code, as I have not run it personally. But my friend was very thankful for your help. I still don't think it works the way she hopes it will, but she will get better with some more practice.
Help me or don't help when you see my posts, but don't offer help with strings attached
That is a rather strange way of looking at my attempt at discovering possible optimisation paths ;) Anyway, if you prefer to throw more hardware at it rather than do the (potential) optimisation, then that's your choice to make. Bear in mind that what works now might not work elsewhere, or even in the same place at another time.
I must say I don't really get why there might be a problem under other conditions . . . it seems to me that the Scanner decides how much space it needs to allocate for a particular file, and if that is met by what's available to the JVM, then that ought to be enough? From what I can see of the Scanner's behaviour, it doesn't use the extra space allocated at runtime above the default until it encounters the file in question, and then acquires it. Which is why the code only falls over with the OOME when it gets to that file.
Well, it's quite simple really: the problem was solved by allocating more heap memory, so it follows that if that allocation were not possible elsewhere with similar-sized files (of course they might be bigger or smaller), then the problem would remain.
Right, but that is, and has always been, a feature of the finite nature of computers: no-one would want to go back to 32K memory just to satisfy an academic nicety.
When we looked at the size of the expected files, the max was well below 1 MB. So somewhere in the code she added an if statement so that any file larger than 1 MB is skipped. I'll see if I can't get the exact change from her later on today.
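I don't have her exact diff yet, but from her description the guard presumably sits at the top of visitFile, something like this (my reconstruction, not her actual code; the 1 MB constant mirrors what she described):

private static final long MAX_FILE_SIZE = 1024L * 1024L; // 1 MB, per her description

@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
    if (attrs.size() > MAX_FILE_SIZE) {
        // Too big: move on, so Scanner never tries to buffer the whole file.
        return FileVisitResult.CONTINUE;
    }
    // ... the existing Scanner/findWithinHorizon logic, unchanged ...
    return FileVisitResult.CONTINUE;
}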
There is an interesting side-effect of the code you have posted, which is that a horizon setting of > 0 will limit the size of the data in the buffer, so that the search term given in the source code (.java) file will be captured. This is an unlikely scenario admittedly, but it may have other implications. Otherwise it could mean that your horizon need not be set to 0, and so would not require so much heap space in the first place.
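To illustrate, the bounded form inside visitFile would look something like this (8192 is an arbitrary cap, not a recommendation; a marker that sits past the horizon would be missed):

Scanner scanner = new Scanner(file);
try {
    int horizon = 8192; // arbitrary example; size it to where the marker can appear
    // a positive horizon bounds how far Scanner reads, and so bounds its buffer
    if (scanner.findWithinHorizon(pattern, horizon) != null) {
        FoundDirectories.add(file.toString());
        return FileVisitResult.SKIP_SIBLINGS;
    }
} finally {
    scanner.close(); // closes even on the early return above
}
return FileVisitResult.CONTINUE;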

Having said all that, I'm surprised that the system your friend is running all this on evidently cannot cope with files larger than 1 MB. What configuration *are* you running?
It could be illuminating if you ran this:

>java -XX:+PrintFlagsFinal -version | findstr HeapSize
My comment here should be ignored, at least for now.
D:\Kira\seed-project>java -XX:+PrintFlagsFinal -version | findstr HeapSize
    uintx ErgoHeapSizeLimit                         = 0            {product}
    uintx HeapSizePerGCThread                       = 87241520     {product}
    uintx InitialHeapSize                          := 268435456    {product}
    uintx LargePageHeapSizeThreshold                = 134217728    {product}
    uintx MaxHeapSize                              := 4284481536   {product}
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)