
Windows 2000 file-caching bug

We are running Windows 2000 SP2 on several computers, and writing an application which seeks into and reads parts of several large files.  File size totals about 60GB.  A given "query" into these files typically reads a total of 500MB or so in large chunks.

Here's the problem: After performing 20 or so of these operations, the "Available Memory" in the system (as indicated by the task manager) drops by 500MB or so, and continues to drop steadily as more operations are performed.  On a system with 512MB of memory, operations grind to a halt.

I expect file-caching to take up memory, but it is normally released when processes need memory for other purposes.  However, this memory is not released until the process terminates (I haven't tested to see if it is released when the file is closed).

This is not a memory leak in the application!  We know this because:

1) Changing the file-read mode from buffered to unbuffered (in the CreateFile call) makes the problem go away.  But then we lose benefits of OS caching altogether.

2) The VM size reported for our process in the Task Manager hovers around 50MB consistently.

NT/2000 has a history of file-caching problems.  But here's another wrinkle: this issue only seems to show up on dual-CPU machines and not single-CPU machines.  

We have three very different systems exhibiting the problem.  The first is a Dell 650 workstation with a SCSI stripe set.  The second is a SuperMicro workstation with a 3Ware RAID controller.  The third is an Intel Pro motherboard system with a Mylex SAN controller.  Because the disk subsystems are independent of one another, I conclude that this is not a disk-driver issue.

I'm looking for an OS patch or programming workaround to avoid the issue.
Asked by: jlilley

jkr Commented:
Have you tried setting FILE_FLAG_SEQUENTIAL_SCAN (if applicable)?
 
jlilley (Author) Commented:
That makes no difference.
 
fl0yd Commented:
Are you using memory mapped files for reading your files?
 
jlilley (Author) Commented:
I am using ReadFile()
 
fl0yd Commented:
Not the best approach by any means, or do you have a specific reason not to use CreateFile(...)/CreateFileMapping(...)/MapViewOfFile(...)? Are you also using OpenFile(...), which is only there for compatibility with older Windows versions?
 
jlilley (Author) Commented:
A memory-mapped approach is harder to manage because the total file size is 60GB.  It would require using a "special" memory model or moving-window mapping techniques.
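
The "moving-window" bookkeeping mentioned here can be sketched in portable C. This is only an illustration, not code from the application under discussion: `view_window` and `window_for` are hypothetical names, and the granularity argument is assumed to be the value Windows reports as the allocation granularity (commonly 64 KB). The sketch computes where a view would have to start and how large it would be; the actual mapping call is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: describes one view into a very large file. */
typedef struct {
    uint32_t offset_high;  /* high 32 bits of the view's starting offset */
    uint32_t offset_low;   /* low 32 bits of the view's starting offset */
    uint64_t view_size;    /* number of bytes the view must cover */
    uint64_t delta;        /* position of the requested byte inside the view */
} view_window;

/* Compute the view covering [offset, offset + length), clamped to the file.
   granularity must be a power of two; view offsets must be multiples of it. */
view_window window_for(uint64_t offset, uint64_t length,
                       uint64_t file_size, uint64_t granularity)
{
    view_window w;
    uint64_t base, end;

    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    assert(offset < file_size);

    base = offset & ~(granularity - 1);   /* round down to the granularity */
    end = offset + length;
    if (end > file_size)
        end = file_size;                  /* never map past end of file */

    w.offset_high = (uint32_t)(base >> 32);
    w.offset_low = (uint32_t)(base & 0xFFFFFFFFu);
    w.view_size = end - base;
    w.delta = offset - base;
    return w;
}
```

The high/low split mirrors the way the Win32 mapping calls take a 64-bit offset as two 32-bit values. In a 32-bit process each view also has to fit in the available address space, which is why a 60GB data set needs a window that is unmapped and remapped as the read position moves.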

But would it actually solve the problem?  I mean, have you seen these symptoms and know this solves it, or are you making suggestions for experiments that can be run as tests?
 
jlilley (Author) Commented:
The other reason that memory-mapping makes the problem more complex is that our data is compressed, so it's easier to read it as a stream and decompress it as it comes in, rather than map it and convert buffers, although a stream model can certainly be built over the memory-mapped model.  But would it help?
 
fl0yd Commented:
You're right, I'm basically suggesting experiments, as you call them, that can be run. I haven't seen these symptoms myself, but whenever I had to deal with large files I used memory mapping. My code ran on SMP systems with no problems whatsoever. Personally I don't think it makes things any more complex, but you will have to decide for yourself. The same is true for building a stream model on top of the memory-mapped structure.
Would it help? I don't know, but it certainly helps to close in on the cause of the error. If it makes a difference, then your original code was probably, but not necessarily, erroneous. If the symptoms remain, your code was OK. I'd think it's worth a try anyway.
My second point: if you actually use OpenFile(), you should definitely replace it with CreateFile() and see if it makes a difference. OpenFile() is from a time when 33MHz was about as fast as it gets...
 
jlilley (Author) Commented:
fl0yd, thanks for the insights.  I was really hoping to find someone who had seen THE PROBLEM before and knew the answer.  Perhaps that is wishful thinking :-)

We do call CreateFile() with:
FILE_SHARE_READ | FILE_SHARE_WRITE
GENERIC_READ
and always read on 64k boundaries (because the same code is used for non-buffered mode).
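
As a rough illustration of reading on 64k boundaries, the helper below widens an arbitrary request to an aligned range. It is a sketch, not the poster's code: `aligned_range`, `align_request`, and `READ_ALIGN` are hypothetical names. Unbuffered reads require offsets and sizes that are multiples of the volume sector size, and 64 KB is assumed here because it is a multiple of the common sector sizes.

```c
#include <assert.h>
#include <stdint.h>

#define READ_ALIGN 65536u  /* the 64 KB boundary (assumed alignment unit) */

/* Hypothetical result type for an aligned read request. */
typedef struct {
    uint64_t start;   /* aligned file offset to seek to */
    uint64_t length;  /* aligned number of bytes to request */
    uint64_t skip;    /* bytes to discard at the front of the buffer */
} aligned_range;

/* Widen [offset, offset + length) so both ends fall on READ_ALIGN boundaries. */
aligned_range align_request(uint64_t offset, uint64_t length)
{
    aligned_range r;
    uint64_t end;

    assert(length > 0);

    r.start = offset - (offset % READ_ALIGN);                   /* round down */
    end = offset + length;
    end = ((end + READ_ALIGN - 1) / READ_ALIGN) * READ_ALIGN;   /* round up */
    r.length = end - r.start;
    r.skip = offset - r.start;
    return r;
}
```

The caller reads `length` bytes at `start` and then uses the data beginning `skip` bytes into the buffer. With unbuffered I/O the buffer address itself also has to be suitably aligned, which this sketch does not cover.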

Incidentally, simply adding FILE_FLAG_NO_BUFFERING completely erases the problem.  Go figure.

I'd like to leave the question open to see if I can get feedback from someone with direct experience on this problem.
 
fl0yd Commented:
No problem -- since I didn't provide you with a real answer that is perfectly ok.

Something that pops to my mind: Is it necessary for you to open the file with the FILE_SHARE_WRITE flag? Like I said before, I'm kinda lost here so I'm just guessing. But granting write access to the file while reading it could get the system to make a copy of portions of the file -- not sure how this is handled, though, since I have never stumbled across a similar situation.

On the other hand, the problem is very likely connected to the dual-CPU environment. If you read through Intel's PIII white papers, about 98% of the errors concern two CPUs running in parallel. Two more things to test: a BIOS update might eliminate the problem (the motherboard BIOS, not the SCSI BIOS); check with the manufacturer first to be sure. Using the newest compiler available is also highly recommended and probably less risky than a BIOS update.

Bear with me -- I've finally reached the point where I need to know. A psychologist might call it obsession, but then again, I don't really care ;)
 
jlilley (Author) Commented:
I'll check on FILE_SHARE_WRITE to see if that matters.  My summary of the problem is currently this:
1) It is definitely linked to dual CPUs -- the problem does not appear at all on single-CPU machines.  
2) It is independent of disk driver and manufacturer because we have three very different computers (Intel, SuperMicro, Dell) all exhibiting the symptoms.  
3) Turning on FILE_FLAG_NO_BUFFERING eliminates the problem, so it is clearly related to file caching in the OS.
4) I've written a 50-line test case that just opens a bunch of big files using fopen, and randomly seeks and reads them using fseek/fread.  It also shows the symptoms, so this is not a subtle application issue.
5) The locked-down memory use seems to stabilize at around 0.7% of the total file size.  Perhaps there is some data structure used in file caching (page table?) that must remain locked in RAM for dual CPUs to access it.
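
A test case of the kind described in point 4 might look roughly like the sketch below. It is not the poster's actual 50-line program: `random_read_pass` is a hypothetical name, and the sketch handles one file per call, so a driver would call it for each of the big files in turn. Because standard `fseek` takes a `long`, files larger than 2GB would need a 64-bit seek such as `_fseeki64` (MSVC) or `fseeko` (POSIX) instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read `reads` chunks of `chunk` bytes at pseudo-random offsets in one file.
   Returns the total number of bytes read, or -1 if the file cannot be
   opened or the buffer cannot be allocated. */
long random_read_pass(const char *path, long file_size, int reads, size_t chunk)
{
    FILE *f;
    char *buf;
    long total = 0;
    int i;

    assert(file_size > 0 && chunk > 0 && (long)chunk <= file_size);

    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    buf = malloc(chunk);
    if (buf == NULL) {
        fclose(f);
        return -1;
    }

    for (i = 0; i < reads; i++) {
        /* Pick an offset so the whole chunk stays inside the file. */
        long span = file_size - (long)chunk + 1;
        long offset = (long)(((double)rand() / ((double)RAND_MAX + 1.0)) * span);

        if (fseek(f, offset, SEEK_SET) != 0)
            break;
        total += (long)fread(buf, 1, chunk, f);
    }

    free(buf);
    fclose(f);
    return total;
}
```

Watching "Available Memory" in the Task Manager while a loop like this runs over the large files is the experiment described above; the sketch itself only performs the reads.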

I know that I can always turn on FILE_FLAG_NO_BUFFERING and build up some simple caching to make this work, so it is "solved" in a limited sense.

My current experiments:
1) See if SetProcessWorkingSetSize() makes any difference.
2) Turn off FILE_SHARE_WRITE
 
robpitt Commented:
Sounds to me like you may well have found a genuine problem!

Have you tried it under XP?
If it's an unknown bug, it'll probably be present in XP as well (XP = NT 5.1).

Anyway the following link may be of interest to you...
http://www.sysinternals.com/ntw2k/source/cacheset.shtml
 
jlilley (Author) Commented:
I suspect this is more of a "feature" than a bug.  It is probably file-cache page tables or some such.  By the way, we have reproduced this on a single-CPU machine.  Unfortunately we have no XP box with enough disk to test.
 
cwrea Commented:
You'll find this interesting and relevant:

http://www.heise.de/ct/english/97/01/302/

 
cwrea Commented:
Given the length of time this problem has existed, and that Microsoft hasn't provided any remedy other than letting you turn caching entirely off (which kills performance), I suggest you write your own buffered reading and writing API.  Encapsulate your logic so that if/when the problem is ever fixed, you can change implementations easily.  Perhaps provide an override to users of your app so they can choose your implementation or the system implementation.
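
One way to encapsulate such a buffered-reading layer is sketched below as a small direct-mapped block cache. Everything here is an assumption made for illustration: the names `block_cache`, `cache_init`, and `cache_read`, the 64 KB block size, and the slot count. To keep the sketch portable it loads blocks with `fseek`/`fread`; in the scenario discussed in this thread the loader would instead be an unbuffered read on a handle opened with FILE_FLAG_NO_BUFFERING, and `long` offsets would have to become 64-bit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 65536L  /* assumed cache block size */
#define SLOT_COUNT 8       /* assumed number of cache slots */

/* Hypothetical direct-mapped cache: block N lives in slot N % SLOT_COUNT. */
typedef struct {
    FILE *file;
    long tag[SLOT_COUNT];      /* block index held in each slot, -1 if empty */
    size_t valid[SLOT_COUNT];  /* bytes actually loaded into each slot */
    unsigned char data[SLOT_COUNT][BLOCK_SIZE];
    long hits, misses;
} block_cache;

void cache_init(block_cache *c, FILE *file)
{
    int i;
    c->file = file;
    c->hits = c->misses = 0;
    for (i = 0; i < SLOT_COUNT; i++) {
        c->tag[i] = -1;
        c->valid[i] = 0;
    }
}

/* Copy up to `len` bytes starting at `offset` into `out`.
   Returns the number of bytes copied (short at end of file). */
size_t cache_read(block_cache *c, long offset, void *out, size_t len)
{
    unsigned char *dst = out;
    size_t done = 0;

    assert(offset >= 0);
    while (done < len) {
        long pos = offset + (long)done;
        long block = pos / BLOCK_SIZE;
        size_t within = (size_t)(pos % BLOCK_SIZE);
        int slot = (int)(block % SLOT_COUNT);
        size_t n;

        if (c->tag[slot] != block) {   /* miss: load the whole block */
            c->misses++;
            if (fseek(c->file, block * BLOCK_SIZE, SEEK_SET) != 0)
                break;
            c->valid[slot] = fread(c->data[slot], 1, BLOCK_SIZE, c->file);
            c->tag[slot] = block;
        } else {
            c->hits++;
        }
        if (within >= c->valid[slot])
            break;                     /* request starts past end of file */
        n = c->valid[slot] - within;
        if (n > len - done)
            n = len - done;
        memcpy(dst + done, c->data[slot] + within, n);
        done += n;
    }
    return done;
}
```

Because callers only see `cache_init` and `cache_read`, the loader behind them can be switched between the system's buffered reads and an unbuffered implementation without touching the rest of the application, which is the kind of override suggested above. A production cache would likely want a better replacement policy than direct mapping.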

 
jlilley (Author) Commented:
Excellent!  This is the "official" confirmation I was hoping for.  So it is indeed an NT/2000 bug!

I was already on the way to using unbuffered I/O and writing my own cache manager, so this confirms I'm on the right track.