What is the fastest io when you have the off_t offset and numbytes to read? (ignoring mmap)

What is the fastest I/O when you have the off_t offset and the number of bytes to read (ignoring mmap, for reasons of complexity)?  I want to write a function that takes an off_t offset and numbytes (it's really the number of chars) and reads that section into memory.  Since I will be doing this over and over, I need to know which approach is fastest.  Right now it seems that fgets with a 256-char buffer is very fast, but it takes some work, and it would be easier to read the whole section into memory at once.  I could just use fread and be done with it, but in my tests fgets seemed substantially faster, which really matters in this program: the files can be several gigabytes in size, so this function might be called millions of times.

There is also the problem of reading everything into a char buffer and not knowing whether the end of the file has been reached, if in fact it has.  (feof only tells you that the end of the file has been reached once a read has failed, but what if that last read succeeds in filling most of the buffer?)

The amount of data per function call would range between 3000 and 11000 characters, so at a given time I would have to deal with reading 3000 to 11000 characters into memory.

I suggested ignoring mmap for now because of the complexity, but if mmap really is the fastest and someone can show me a simple way to get all the information into memory, given only the off_t offset, I will accept the answer.

Because this question is sort of multitiered and I need an answer quickly, I will put on the full 500 points.  Thank you all for any help you can give!

Kent Olsen (DBA) commented:

You may find the open()/read() is faster than fopen()/fgets().  Depending on the data access.
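As a rough sketch of the open()/read() route, the POSIX pread() call combines the seek and the read into one system call. The helper names read_at and read_region are hypothetical, not from the thread; for millions of calls you would presumably open the file once and call read_at repeatedly rather than reopening it each time.

```c
#define _POSIX_C_SOURCE 200809L  /* asks for pread() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes starting at offset.
 * Returns the number of bytes read (short only at end of file),
 * or -1 on error. The file position of fd is left untouched. */
ssize_t read_at(int fd, off_t offset, char *buf, size_t len)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < len) {
        ssize_t n = pread(fd, buf + total, len - total,
                          offset + (off_t)total);
        if (n == 0)
            break;                  /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, try again */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/* Convenience wrapper for one-off reads: open, read, close. */
ssize_t read_region(const char *path, off_t offset, char *buf, size_t len)
{
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read_at(fd, offset, buf, len);
    close(fd);
    return n;
}
```

A return value smaller than len (but not -1) means the file ended inside the requested range, which also answers the end-of-file question without feof.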

Don't discount mmap() so quickly.  All it does is assign a region of memory over the file so that you can access the entire file as if it were one really big array.  Even better, the system's paging mechanism handles all of the I/O.  Once you've read data from part of the file you can access another part of the file and come back to the first part without having to reread it.  At worst, the system will reread the pages from the file into memory.  At best, it's already in local memory so there's no interrupt.

The mmap() call can be a bit intimidating, particularly understanding all of the options.  But it's a small front-end learning curve that can pay huge dividends down the road.


As a first approach, use fseek() to position the file pointer at the right spot, then use the appropriate function, perhaps fgets(), to read the data. Depending on what happens between the reads, there are possible optimizations using caching. I recommend first trying simply fseek() followed by fread() or fgets(), depending on the data type.
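A minimal sketch of the fseek()/fread() approach, which also covers the end-of-file concern from the question: the byte count returned by fread() tells you how much arrived, and feof() only needs to be consulted when that count is short. The function name read_chunk and its hit_eof parameter are made up for illustration, and fseeko() (the off_t variant of fseek()) is assumed to be available, as it is on POSIX systems.

```c
#define _POSIX_C_SOURCE 200809L  /* asks for fseeko() under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: seek to offset and read up to len bytes.
 * Returns the number of bytes stored in buf. *hit_eof is set to 1
 * when the file ended before len bytes could be read, else 0.
 * A short count with *hit_eof == 0 means an error (seek failed or
 * ferror(fp) is set). */
size_t read_chunk(FILE *fp, off_t offset, char *buf, size_t len,
                  int *hit_eof)
{
    size_t n;

    assert(fp != NULL && buf != NULL && hit_eof != NULL);
    *hit_eof = 0;
    if (fseeko(fp, offset, SEEK_SET) != 0)
        return 0;
    n = fread(buf, 1, len, fp);
    if (n < len && feof(fp))
        *hit_eof = 1;
    return n;
}
```

The buffer is not NUL-terminated; if the data is to be treated as a C string, the caller would need to allocate len + 1 bytes and set buf[n] = '\0' itself.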

Premature optimization is counterproductive.
Kent Olsen (DBA) commented:

Expounding just a bit, (I need to think about something other than the problem that work has put in front of me), managing a large volume of data can be a lot easier when the data is mapped.  The coding is definitely easier, and you often wind up not having to reread the same data over and over again.  If you're perusing a 1GB data file on a system with 2GB real memory, you may well be able to process the entire file repeatedly without ever having to actually read any of the data from disk more than once.

The only snag is strings.  You're dealing with raw data, not zero-terminated strings.  So you'll have to work with character counts instead of relying on the C string functions.  But that's a pretty small price to pay for making the rest of the work so easy!  If needed, you can get line lengths with a simple function.
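One way such a line-length function might look, as a sketch: memchr() takes an explicit byte count, so it is safe on mapped data that has no terminating NUL. The name line_length and its parameters are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the line that starts at data[pos],
 * not counting the newline. data holds size bytes and need not be
 * NUL-terminated. If no newline follows, the rest of the data counts
 * as the line. */
size_t line_length(const char *data, size_t size, size_t pos)
{
    const char *start;
    const char *nl;

    assert(data != NULL && pos <= size);
    start = data + pos;
    nl = memchr(start, '\n', size - pos);
    return nl ? (size_t)(nl - start) : size - pos;
}
```

To step through a buffer line by line, the caller would advance pos by the returned length plus one (for the newline) until pos reaches size.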

Generally, the code goes like this:

/*  Approach 1: explicit seek and read  */
fildes = open(...);

/*  Loop on desired data  */
lseek(fildes, some_offset, SEEK_SET);   //  repeat for each buffer or I/O
read(fildes, buf, len);                 //  repeat for each buffer or I/O

/*  Use data in buf. */
/*  End Loop  */


/*  Approach 2: map the file once  */
fildes = open(...);
address = mmap(0, len, PROT_READ, MAP_PRIVATE, fildes, 0);   // Do once.

/*  Loop on desired data  */

/*  Use data at address + some_offset. */
/*  End Loop  */
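A fuller sketch of the mapped variant, under some assumptions: the struct mapped_file type and the functions map_file, region_at, and unmap_file are invented here for illustration, the whole file is mapped at once, and the process is assumed to have enough address space for it (a multi-gigabyte file will not fit in a 32-bit address space).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for a read-only mapping of an entire file. */
struct mapped_file {
    const char *data;   /* first byte of the file */
    size_t size;        /* file length in bytes */
};

/* Map the whole file read-only. Returns 0 on success, -1 on failure
 * (including an empty file, which cannot be mapped). */
int map_file(const char *path, struct mapped_file *mf)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && mf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);          /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;
    mf->data = p;
    mf->size = (size_t)st.st_size;
    return 0;
}

/* Return a pointer to the bytes at offset and clamp *len so the
 * region ends at the end of the file. Returns NULL when offset is
 * at or past the end of the file. No data is copied. */
const char *region_at(const struct mapped_file *mf, off_t offset,
                      size_t *len)
{
    assert(mf != NULL && len != NULL);
    if (offset < 0 || (size_t)offset >= mf->size)
        return NULL;
    if (*len > mf->size - (size_t)offset)
        *len = mf->size - (size_t)offset;
    return mf->data + offset;
}

void unmap_file(struct mapped_file *mf)
{
    munmap((void *)mf->data, mf->size);
}
```

With this layout, the per-call work for an offset and a byte count reduces to pointer arithmetic, and a clamped *len signals that the request ran into the end of the file.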

ctangent (Author) commented:

Thank you so much, that helps out immensely.

Just for clarification on your example given:

address = mmap(0,len,PROT_READ, MAP_PRIVATE, fildes, 0)
address is a char ptr or does it matter?
0, len -> 0 is the starting point, or, for example off_t offset?

and in the case that when mapping from 0 to len, len goes past the end of the file, what happens?

If you don't have the time or desire to answer these, I can post more clarification questions as new questions.  Either way, this helps immensely; you get all the points.  I'll close the question tonight after work.

Kent Olsen (DBA) commented:

address is a char*.

  char *address;

You should probably call stat() (or fstat() on the open descriptor) to get the file length and pass that length to mmap().  If you pass a value greater than the file length, don't rely on the extra space: the remainder of the last partial page reads as zeros, but touching whole pages beyond the end of the file typically raises SIGBUS.

The last parameter is the file offset.  If you wanted to, you could map part of the file beginning at any offset (as long as it's a multiple of the page size).  But it's usually easiest to just map the entire file.
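If mapping the whole file is not practical, a partial mapping at an arbitrary offset can be sketched by rounding the offset down to a page boundary and skipping the difference. The struct window type and the functions map_window and unmap_window are hypothetical names, and the sketch assumes the caller has already checked that offset + len lies within the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for a partial mapping. */
struct window {
    void *base;         /* page-aligned start of the mapping */
    size_t map_len;     /* mapped length, including the skipped bytes */
    const char *data;   /* the byte at the requested file offset */
};

/* Map len bytes of fd starting at an arbitrary offset.
 * Assumes offset + len does not exceed the file length.
 * Returns 0 on success, -1 on failure. */
int map_window(int fd, off_t offset, size_t len, struct window *w)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t aligned;
    size_t skew;

    assert(w != NULL);
    if (page <= 0 || offset < 0 || len == 0)
        return -1;
    aligned = offset - (offset % page);   /* round down to a page boundary */
    skew = (size_t)(offset - aligned);    /* bytes to skip inside the page */
    w->map_len = len + skew;
    w->base = mmap(NULL, w->map_len, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (w->base == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    w->data = (const char *)w->base + skew;
    return 0;
}

void unmap_window(struct window *w)
{
    munmap(w->base, w->map_len);
}
```

Mapping and unmapping a small window per call costs two system calls each time, so for many small reads this is unlikely to beat a single whole-file mapping; it mainly helps when the file is too large to map at once.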

Kent Olsen (DBA) commented:

Oh.  address can also be a pointer to unsigned char.

  unsigned char *address;
