

C efficiency: large file read, character substitution, file write.

I need to correct certain character combinations in a large ASCII file. Where a backslash is followed by a TAB, EOF, or EOL character, we need to insert an additional space between the backslash and the following character.

Current Plan:
Open file for read and file for write
Read char by char - using fgetc()
Look for character combination
Correct it
Write to output file - using putc()
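
For reference, the character-by-character plan above could be sketched roughly as follows in standard C. The function name `fix_stream` is hypothetical, and the sketch assumes EOL shows up as a single '\n' when the file is read through stdio; it is an illustration of the plan, not a tuned implementation.

```c
#include <stdio.h>

/* Copy `in` to `out`, inserting a space after any backslash that is
   immediately followed by TAB, newline, or end of file.
   Returns the number of spaces inserted.
   (fix_stream is a hypothetical name used for illustration.) */
long fix_stream(FILE *in, FILE *out)
{
    long inserted = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        putc(c, out);
        if (c == '\\') {
            int next = getc(in);      /* look one character ahead */
            if (next == '\t' || next == '\n' || next == EOF) {
                putc(' ', out);
                inserted++;
            }
            if (next == EOF)
                break;
            ungetc(next, in);         /* let the main loop handle it */
        }
    }
    return inserted;
}
```

The look-ahead character is pushed back with ungetc() so that a run of consecutive backslashes is still examined one backslash at a time.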

The file writing is the slowest part. Does anyone know the most efficient way?

I need to write this as a C program to run on VAX/VMS. Alas, neither sed nor awk is available.

Many thanks in advance!
Try writing to a temporary file instead of re-writing
your existing file. Once you're done, have the program unlink the old file and rename the temporary.

This would remove all the overhead for re-writing the file
while it is open. (but would take twice the disk space for
a small amount of time)
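
A minimal sketch of that temp-file approach, using only standard C calls: remove() stands in for unlink, and rename() moves the temporary file into place. The names `rewrite_via_temp`, `tmp_path`, and `copy_all` are hypothetical; `copy_all` is just a placeholder for the real character-fixing routine.

```c
#include <stdio.h>

/* Stand-in transform for illustration: copies input to output unchanged.
   Returns 0 on success, -1 on error. */
int copy_all(FILE *in, FILE *out)
{
    int c;
    while ((c = getc(in)) != EOF)
        if (putc(c, out) == EOF)
            return -1;
    return ferror(in) ? -1 : 0;
}

/* Run `transform` from `path` into `tmp_path`, then delete the original
   and rename the temporary file over it. Returns 0 on success. */
int rewrite_via_temp(const char *path, const char *tmp_path,
                     int (*transform)(FILE *in, FILE *out))
{
    FILE *in = fopen(path, "r");
    FILE *out;
    int ok;

    if (in == NULL)
        return -1;
    out = fopen(tmp_path, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    ok = (transform(in, out) == 0);
    fclose(in);
    if (fclose(out) != 0)       /* flush errors show up here */
        ok = 0;

    if (!ok) {
        remove(tmp_path);       /* leave the original untouched */
        return -1;
    }
    if (remove(path) != 0)      /* rename() may fail if the target exists */
        return -1;
    return rename(tmp_path, path);
}
```

Both files are closed before the delete and rename, and the original is only removed once the temporary file has been written and closed without error.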
I've had to do a large amount of file processing on ASCII files in excess of 1GB.  Here are some tips/techniques I've found useful:

1) Open the file in binary mode (which is the only mode on VAX, I believe) and read it into memory in chunks of 16K (or multiples of 16K).  This means you have to create an output buffer large enough to hold your manipulations (the additional characters).  Then write the entire buffer back out.  Use fread() and fwrite().  This will prevent as much disk thrashing as possible.  We usually work with buffers of 64K or so.  This is 90% of the issue...

2) As djacobsen said, write to a temp file (which it sounds like you are already doing).  Once processing is complete, close both files, delete the original file, and rename the temp file...

3) Play with the multiples of 16K - it is OS and system config specific to get the best speed.

Now, I know people will tell me that system caching should be taking care of most of this, but we've found empirically that it is not nearly as efficient as the above approach.
A few notes (pseudocode) on the above:

open files (one for reading- fIn, one for writing - fOut)

n = fread(bufIn, sizeof(char), sizeof(bufIn), fIn)

y = 0
for (x=0; x<n; ++x)
  if x == escape sequence
    bufOut[y++] = bufIn[x++];
    bufOut[y++] = ' ';
    bufOut[y++] = bufIn[x];
  else
    bufOut[y++] = bufIn[x];

fwrite(bufOut, sizeof(char), y, fOut);

That should get you most of the way there... Let me know if you have questions...
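
A runnable sketch of the buffered approach, in standard C. Several details here are assumptions not spelled out in the pseudocode: the name `fix_file_chunked`, the 64K `CHUNK` size, and the `pending` flag, which handles a backslash that happens to be the last byte of one chunk when its following character arrives in the next. Opening the files is left to the caller.

```c
#include <stdio.h>

#define CHUNK (64 * 1024)   /* assumed size; worth tuning per system */

/* Block-buffered filter: read up to CHUNK bytes at a time, build the
   corrected text in bufOut, and write it with one fwrite() per chunk.
   A space is inserted after any backslash followed by TAB, newline,
   or end of file. Returns 0 on success, -1 on a read or write error. */
int fix_file_chunked(FILE *fIn, FILE *fOut)
{
    static char bufIn[CHUNK];
    static char bufOut[2 * CHUNK];  /* worst case: one space per input byte */
    int pending = 0;                /* 1 if the last byte seen was a backslash */
    size_t n, x, y;

    while ((n = fread(bufIn, 1, sizeof bufIn, fIn)) > 0) {
        y = 0;
        for (x = 0; x < n; ++x) {
            char c = bufIn[x];
            if (pending && (c == '\t' || c == '\n'))
                bufOut[y++] = ' ';  /* space goes between '\' and c */
            bufOut[y++] = c;
            pending = (c == '\\');
        }
        if (fwrite(bufOut, 1, y, fOut) != y)
            return -1;
    }
    if (ferror(fIn))
        return -1;
    if (pending && putc(' ', fOut) == EOF)  /* backslash was the final byte */
        return -1;
    return 0;
}
```

The static buffers keep 192K off the stack at the cost of making the function non-reentrant; allocating them with malloc() would be the alternative if that matters.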

Whoops - in the if statement, it should read

if bufIn[x] == escape sequence...

and not

if x == escape sequence

Obviously, you may need to test 2 characters to see if it really is an escape sequence you want to trap.  Sounds like you already know how to do that...
PaulStevens (Author) commented:
Not used this site before; will regrade when I've tested the result. Thanks!
PaulStevens (Author) commented:

Although this method works, gj62, the output files are no longer readable by EDT or TPU, nor do VMS commands like $diff work on them. The errors are of this form:

4007392 byte record too large for user's buffer

This is despite newline characters being read and written to the file. (Tests with smaller files and buffer sizes reveal this)

I guess this is a limitation of C running on VMS.
Will use my original method if there is no other workaround.
