Solved

most effecient way to read/parse file?

Posted on 2003-11-12
28
2,989 Views
Last Modified: 2010-04-01
Hi,
wanted to know which is most effecient in terms of real-time performance for reading in a very large file?
should i use FILE/fscanf or ifstream/getline or any other?

Whats the best way to store such a huge file in terms of memory? dynamic array? (and how woudl i do that?) linked list?

thanks for ur input...

lostbits
0
Comment
Question by:lost_bits1110
  • 11
  • 9
  • 4
  • +3
28 Comments
 
LVL 6

Expert Comment

by:GloomyFriar
ID: 9733925
Try memory mapped file.
0
 

Expert Comment

by:Masurium
ID: 9734195
How large is your "Huge" file, and what do you intend to do with it?

Linked list should out-perform arrays at large sizes for most purposes, except searching through it.  If you intend to scan through the file from one end to another once you've stored it that wouldlikely be your best bet.

Unfortunately I don't know which reading function is fastest though.
0
 

Author Comment

by:lost_bits1110
ID: 9734885
Thanks for the comments,
I will be scannign through it - its an 8 Meg file of characters/numbers from which i need to search and create lookup tables from the numbers they contain

and what is a memory mapped file and how do i use that?

thanks again
0
 
LVL 30

Expert Comment

by:Axter
ID: 9736321
I agree with GloomyFriar.  A mapped file method is the most effecient.
However, it's implementations specific, so we woud need to know what OS are you useing.

A mapped file lets your program point to a location within the file like you would point to a string or buffer.

This allows you to treat the file like it was in memory (RAM).
The OS does all the hard work for you, and since the OS is usually better at handling files in memory, it generally makes for a more efficient method.
0
 
LVL 30

Expert Comment

by:Axter
ID: 9736336
Here's some example code for WIN32:

#include <windows.h>

void ClipFile(LPCTSTR FileName, DWORD Position, DWORD LenToClip)
{
    SECURITY_ATTRIBUTES sa;
    memset(&sa, 0, sizeof(sa));
    HANDLE h_file = CreateFile(FileName, GENERIC_WRITE|GENERIC_READ, FILE_SHARE_READ, &sa, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    DWORD FileSize = GetFileSize(h_file, NULL);
    HANDLE map_file = CreateFileMapping(h_file, &sa, PAGE_READWRITE, 0, 0, NULL);
    char* Data = (char*)MapViewOfFile(map_file, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    memmove(Data+Position, Data+Position+LenToClip, FileSize-Position);    
     UnmapViewOfFile(Data);
    CloseHandle(map_file);
    SetFilePointer(h_file, FileSize-LenToClip, NULL, FILE_BEGIN);
    SetEndOfFile(h_file);
    CloseHandle(h_file);
}

int main(int argc, char* argv[])
{
      ClipFile("readme.txt", 3, 5);
      
      return 0;
}

0
 
LVL 30

Expert Comment

by:Axter
ID: 9736384
The above code clips data from the middle of the file by using Win32 MapView API functions.

This method is very efficient because it doesn't have to read the data into memory manually.  Instead, the OS does the work of mapping a pointer to the contents of the file, and the program uses that pointer to move data, which in affect deletes the middle of the file.

You can easily use a similar method to point to the data, without having to read the data.
This especially good for large files.

If you're using UNIX or Linux, you can use POSIX map file method.
0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9736693
If you are just doing a single sequential pass through the entire file, good old read() and write() will be the fastest since everything else is layered on top of those.  As for how to map that file into memory, it depends on the structure of the file and how you will be using it.  Reading in the file and using "in place" will have virtually no overhead but may not fit your needs well (plus of course the fact that your program needs to be able to access 8+Mb of memory).  fscanf() will be very slow because it needs to parse the *format* string every time it is called.  fgets() + strtok() or a handrolled parser will almost certainly be far more efficient.

E
0
 
LVL 30

Expert Comment

by:Axter
ID: 9736756
>>good old read() and write() will be the fastest since everything else is layered on top of those.  

That's incorrect.  In fact, it's the opposite.

The open and write functions are an extra layer that sits on top of the OS implementation.

File mapping functions are much more efficient then using open(), write(), and read() functions.
0
 

Author Comment

by:lost_bits1110
ID: 9736856
Thanks for all the input..
I guess I will check out the memory mapped file way - though from the sample code Im' having a hard time understanding how I would use it.. Ultimately, real-time performance is what I want - so if you happen to know of good sites or something that could brief me on how to use these memory mapped files

BTW, i'm using windows 2000 , but i will probably switch to Redhat later on so i need to knwo both ways...

So what if i'm reading in various characters, numbers etc.. and I wanted to go through the file until I come upon a "$", then how could I do this using memory mapped file? Or lets say I wanted to read an integer after the dollar sign,  or read in some floats at some point, or set teh cursor at a certain position in the file and read from there.. etc..?  Would these all be easy to do with a memory mapped file?  If you could give me a basic example that I could follow that would really help me alot..!

p.s. by what factor do you think it will improve the speed, right now using fgetc, checking one by one for a dollar sign, it takes about five seconds before it shows the output..!! (My file is about 8 Megs of chars/numbers)
0
 
LVL 30

Accepted Solution

by:
Axter earned 125 total points
ID: 9737096
Here's an example using ReadOnlyBufferToFile wrapper class for MapView functions:

#include <windows.h>

#include <cstdlib>
#include <iostream>
using namespace std;


class ReadOnlyBufferToFile
{
public:
      ReadOnlyBufferToFile(LPCTSTR FileName) : m_FileHandle(NULL), m_MapViewHandle(NULL), m_Begin(NULL), m_End(NULL)
      {
            if (INVALID_HANDLE_VALUE != (m_FileHandle = CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)))
            {
                  m_SizeBuf_Lo = GetFileSize(m_FileHandle, &m_SizeBuf_Hi);
                  if ((m_MapViewHandle = CreateFileMapping(m_FileHandle, NULL, PAGE_READONLY, 0,0, NULL)) != NULL)
                  {
                        m_Begin = (LPCTSTR)MapViewOfFile(m_MapViewHandle, FILE_MAP_READ, 0, 0, 0);
                        m_End = m_Begin + m_SizeBuf_Lo;
                  }
            }
      }
      ~ReadOnlyBufferToFile()
      {
            if (m_FileHandle && m_FileHandle != INVALID_HANDLE_VALUE) CloseHandle(m_FileHandle);
            if (m_MapViewHandle && m_MapViewHandle != INVALID_HANDLE_VALUE) CloseHandle(m_MapViewHandle);
      }
      LPCTSTR Str(){return m_Begin;}
protected:
      HANDLE m_FileHandle;
      HANDLE m_MapViewHandle;
      LPCTSTR m_Begin;
      LPCTSTR m_End;
      DWORD m_SizeBuf_Lo;
      DWORD m_SizeBuf_Hi; //This class does not use Hi
};


int main(int argc, char* argv[])
{
      const char FileName[] = "c:\\largefile.dat";
      ReadOnlyBufferToFile MyMapViewMngr(FileName);

      const char * Ptr = strchr(MyMapViewMngr.Str(), '$');
      int Pos = Ptr - MyMapViewMngr.Str();

      cout << "Found $ at position " << Pos << endl;

      
      
      system("pause");
      return 0;
}

0
 
LVL 6

Expert Comment

by:GloomyFriar
ID: 9738220
The memory mapped files give you the same access as operation system use when load and execute some exe file or when the system works with virtual memory (page file). So I think it have to be rather fast.
0
 
LVL 3

Assisted Solution

by:guynumber5764
guynumber5764 earned 125 total points
ID: 9738707
If you plan to be portable between Windows and Linux, you should stick to ANSI C.  Ultimately the best you can do is one pass from from the disk and one pass through the data.  I would be curious to see Axter's soln benchmarked: without any evidence to support or contradict it I cannot say who's right there.  I'm reviewing the Linux kernel code but haven't come to any definitive conclusions (it's all pretty twisted together).  And now a word from some real experts...

"The combination of open(), close(), write(), read(), andlseek() provides the basic file access API for Linux. While there are numerous other functions which manipulate files, those described here are used most of the time.
Most programmers use the familiar ANSI C library file functions (such as fopen() and fread()), rather than the lower-level system calls described here. fopen() and fread() are, as you would expect, implemented on top of these system calls in a user-level library. "
- Erik Troan (Red Hat)

"For file input and output use the [ANSI] C library functions whenever possible."
- Charles Petzold

You can probably speed up your program by processing blocks at a time rather than char by char.  It could well be that the overhead of 8M fgetc() function calls is killing you.  

0
 
LVL 30

Expert Comment

by:Axter
ID: 9738790
>>If you plan to be portable between Windows and Linux, you should stick to ANSI C.  

By using wrapper functions, you can still make your code portable between Windows and Linux/UNIX using map file method.
Both Linux and UNIX support file mapping via the POSIX standard.

>>Axter's soln benchmarked: without any evidence to support or contradict it I cannot say who's right there.
Actually, I have benchmark the map file use in both Windows and in UNIX.  And there is no comparison.  A map file method far exceeds using ANSI C/C++ file functions.

>>And now a word from some real experts...
Those may be experts in C, but are they experts in POSIX or WIN32?  I don't thinks so.
That's why they make no mention of POSIX functions, or comparisons to POSIX functions.
Which means they're not a good reference point for the paticular topic we're discussing.
0
 

Author Comment

by:lost_bits1110
ID: 9740736
oh dear.. i guess i will just have to try it out - thanks for all your input..! i really appreciate it..
(p.s. guynumber, can u tell me how exactly i would process blocks at a time? thanks very much!)
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 3

Expert Comment

by:guynumber5764
ID: 9741510
Here's what I tried this morning.  I did not put in timing code because, over a 4 meg file,  just running it produced a clear winner.

main()
{
   FILE *fd1, *fd2;
   char ch, buf[1024];
   int  i, scanned, count, num_read;

   printf("Scanning with fgetc\n");
   scanned=0;
   count = 0;
   fd1 = fopen("./test1.dat", "r");
   while (!feof(fd1))
   {
       ch = fgetc(fd1);
       scanned++;
       if (ch=='$')
           count++;
   }
   fclose(fd1);
   printf("scanned=%d, count=%d\n",scanned, count);

   printf("Scanning with fread\n");
   scanned=0;
   count = 0;
   fd2 = fopen("./test2.dat", "r");
   while(num_read=fread(buf, 1, 1024, fd2))
   {
       for(i=0; i<num_read; i++)
       {
           scanned++;
           if (buf[i]=='$')
           count++;
       }
   }
   fclose(fd1);
   printf("scanned=%d, count=%d\n",scanned, count);
}
0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9741532
Axter...could you pls post the posix equiv?
0
 

Author Comment

by:lost_bits1110
ID: 9741537
Thats great!! Thanks so much for your help!
0
 
LVL 30

Expert Comment

by:Axter
ID: 9744751
>>Axter...could you pls post the posix equiv?

The following are the POSIX file map functions that can be used on Linux/UNIX.

shm_open
mmap
munmap
shm_unlink

You can use these functions on any OS that is POSIX compliant.

Check out your man page for more information.
0
 
LVL 3

Expert Comment

by:Sandra-24
ID: 9745105
Just to back up Axter there, memory mapped files are very fast to work with. They work like the virtual memory on your machine and are the second fastest you can get to real memory. Much faster than the file system. They load the file into memory a "page" (usually a slice of the file that's a few KB in size) at a time. It enables you to quickly access things in the file, even giving you a sort of random access to the file in that you don't need to read through x bytes to get to x+1. And if you modify it, the OS only writes the "dirty" pages of the file back to disk speeding even that process tremendously.

At least this is my understanding of memory mapped files.

-Sandra
0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9746237
>>>File mapping functions are much more efficient then using open(), write(), and read() functions.

Absolutely not.  The following code is actually slightly slower than the fread() code given above (23MBps vs 24MBps)  despite the elimination of the outside loop and the use of the lower level open() vs fopen().  Timing and printing changes were applied to both versions.
Memory mapping is a very effective IPC mechanism and is extremely efficient for in-memory operations but, when it comes to actual disk IO it is bound by exactly the same constraints as any other disk IO.

One alsways has to be very careful when benchmarking disk IO.  Due to caching, the 2nd and subsequent runs are very likely to produce impossibly high throughput.  For example, the tests above and below indicated 90MBps on the 2nd run--almost twice the internal transfer rate my HD.

Also, lost_bits, consider that even in the slowest case I was moving data about 8x faster than your current routine.  It may well be that your bottleneck is not disk IO.


   printf("Scanning with mmap()");
   go = showtime();
   scanned=0;
   count = 0;
   fdi = open("./test3.dat", O_RDONLY);
   fp = filebuf = mmap(NULL,4000000,PROT_READ,MAP_PRIVATE,fdi,0);
   for(li=0;li<4000000;li++,fp++)
   {
           scanned++;
           if (*fp=='$')
                   count++;
   }
   stop = showtime();
   printf("du=%ld bps=%ld\n", stop-go, scanned/(stop-go));
   printf("scanned=%d, count=%d\n",scanned, count);
0
 
LVL 30

Expert Comment

by:Axter
ID: 9746755
>>Absolutely not.  The following code is actually slightly slower than the fread() code given above (23MBps vs 24MBps)

It helps if you know what you're doing before makeing this assumption.
Your code is mapping via a open() function call.  Of course this is going to be slower.

Why do you think I post all 4 mapping functions?
shm_open
mmap
munmap
shm_unlink

I posted them in the order you need to use them.  I also suggested reading your man page for these functions.

Also, before conductiing a test, you actually have to reboot before each test.
Otherwise, your test is invalid, becuase the system's cache will invalidate the test.
A benchmark test is worthless without knowing OS, OS version, computer, quantity of trials, complete test method.
With out all the details, your benchmark has very little weight, because so many things can curupt a test to make it invalid.



0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9751224
I''m done being civil.  Frankly, I thought you posted all 4 functions I thought that you didn't know what the fsck you were talking about:  if you did, you would have posted a code fragment.  All I've seen so far is "you're wrong" with nothing but ignorance to back it up.  I've offered you repeated opportunities to put your money where your mouth is and all you do is stick your foot in it.  I've offered considered explanations and, from your last post, it is clear that you are not even reading them.  I'm listening for a reason or an explanation and all I see is crap like this:
>>> It helps if you know what you're doing before makeing this assumption.
>>> Your code is mapping via a open() function call.  Of course this is going to be slower.


Now that I have that off my chest:

>>> Your code is mapping via a open() function call.  Of course this is going to be slower.
Slower than what?  Why?

>>>Also, before conductiing a test, you actually have to reboot before each test.
Did you glean that from my previous post or did you figure that out yourself?

>>>A benchmark test is worthless without knowing OS..
It's not a great metric but it beats the snot out of an arbitrary statement with nothing to back it up.  The test is relative which means that, as long as the platform does not signficantly change in the 2 seconds it takes to run the program, one can draw some basic conclusions without knowing anything about the platform (Linux 2.4.21-0 secure running; Athlon 1.33 w/ 512Mb; ABit MB; Maxtor 5T060H6:  does that make the test more useful?).

Try to consider the blindingly obvious for a moment:
If memory mapped IO was signficantly faster than standard file IO for *DISK* operations, that's how standard disk IO would be implemented.  Do you think OS designers are idiots?
The test I ran above was within an order of magnitude of the sustained transfer rate for my drive:  in other words, unless your approach magically teleports the bits from the platter to the ram without going through the read head it's just not going to be much faster.  period.


0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9751511
Everyone else.
Apologies for the rant.  I really am interested in comparing the two approaches but, being a scientist, I need something empirical to look at.  Ad hominum logic is only going to p*** me off.

I recognize that memory mapped files can be MUCH faster where the data is moving both ways (both being read and written) since the data doesn't have go through the ide bus.  Since that's not the case here, I think the advantage is lost (see "blindingly obvious" above and parse out the annoyed bits).

I would love it if someone could provide code using Axter's approach for the single pass benchmark above.  Not to prove anything: I just want to see it.  OTOH,  if it actually turns out to be at least an order of magnitude faster (that is twice as fast), I will promptly and humbly apologize to him. Otherwise my statements stand.

Thanks,
E.
0
 
LVL 30

Expert Comment

by:Axter
ID: 9752573
>>>>> Your code is mapping via a open() function call.  Of course this is going to be slower.
>>Slower than what?  Why?

Slower then useing the API function I posted.
I'm sorry, but I just didn't think it would be difficult to figure out that you're suppose to use shm_open() instead of open().

If you're still using open() with mmap, then you're not gaining anything at all, moreover, you'll make your code slower.

I'm currently not working in a facility that has UNIX.

I do have a spare Linux switch drive I can use, and if you're really interested in a working example, then post a question.

I'm not going to go through the trouble of switching out my harddrive and creating code just to prove a point.  I rather spend my time helping questioners who have real world requirements.


I’m sorry for been obnoxious in my previous post, but I do get irritated when an expert uses words like “Absolutely not.” when the expert doesn’t know what he/she is talking about.
You came to an absolute conclusion with out knowing if you’re code was absolutely correct.

>>I''m done being civil.
Frankly, IMHO, saying "Absolutely not.” is not a civil way to present yourself.
Even when I’m completely and absolutely sure I’m right, I still do not contradict other experts by saying they’re absolutely not correct.


>>I would love it if someone could provide code using Axter's approach for the single pass benchmark above.
Post a question.
0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9753508
>>>Even when I’m completely and absolutely sure I’m right, I still do not contradict other experts by saying they’re absolutely not correct.
That's exactly what got my goat in the first place.  Read your post of 11/12/2003 06:55PM PST.
Now glance over the BSD implementation of shm_open below...

int
shm_open (name, flags, mode)
    char    *name;
    int          flags;
    mode_t  mode;
{
    struct stat          st;
    register int    fd;

    if (! name) {
      errno = EINVAL;

      return (-1);
    }

    if ((fd = open (name, flags, mode)) < 0)
      return (-1);

    if (fstat (fd, &st) < 0)
      return (-1);

    /*
     *      The whole difference to the plain open(2) is here.
     *
     *      We check that the file descriptor we just opened
     *      refers to a regular file. We also report invalid
     *      parameter, if the file name did not give us just
     *      a plain regular file.
     *      This is to guarantee that the opened descriptor
     *      really can be mmap()ed later on.
     */

    if (! S_ISREG (st.st_mode)) {
      close (fd);

      errno = EINVAL;

      return (-1);
    }

    return (fd);
}
0
 
LVL 30

Expert Comment

by:Axter
ID: 9753728
>>Now glance over the BSD implementation of shm_open below...

Have you taken a look at the Linux Implementation?

I'm not sure how compliant BSD is to the POSIX standard, so I really can't say much in reference to the above post.
0
 
LVL 30

Expert Comment

by:Axter
ID: 9753734
>>That's exactly what got my goat in the first place.  Read your post of 11/12/2003 06:55PM PST.

I think you would agree that what I said there sounds much better then saying the following:

"That's *absolutely* incorrect.  In fact, it's the opposite."

And I'm not sure how else I could have word it, so as to not to offend you (or get your goat..).


0
 
LVL 3

Expert Comment

by:guynumber5764
ID: 9754810
Axster...
You could have said something like: "...without any evidence to support or contradict it I cannot say who's right there."  or "I think...".  

I haven't yet looked over the Linux implementation because, like every other posix implementation I have ever encountered, IT IS NOT PART OF THE KERNEL and I have not yet installed the source for the relevant library.  That said, I expect they use the GNU implementation which I have looked at and which is very similar.
0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

Introduction This article is a continuation of the C/C++ Visual Studio Express debugger series. Part 1 provided a quick start guide in using the debugger. Part 2 focused on additional topics in breakpoints. As your assignments become a little more …
Basic understanding on "OO- Object Orientation" is needed for designing a logical solution to solve a problem. Basic OOAD is a prerequisite for a coder to ensure that they follow the basic design of OO. This would help developers to understand the b…
The goal of the tutorial is to teach the user how to use functions in C++. The video will cover how to define functions, how to call functions and how to create functions prototypes. Microsoft Visual C++ 2010 Express will be used as a text editor an…
The viewer will learn how to user default arguments when defining functions. This method of defining functions will be contrasted with the non-default-argument of defining functions.

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now