Combined String/File Handling

Posted on 2003-10-29
Last Modified: 2010-04-15
I've prototyped a program in Python that groups files (i.e., a tar replacement), and I'm about to start on the real version, which will be in C. For reference, this is the header for the archives:

0#(creation date)#(creation time)#(creator name)#
(number of files)#(file 1's size)#(file 2's size)#...#(file n's size)#
0#(file 1's path)#(file 2's path)#...#(file n's path)#

Then the raw contents of the files follow, one after another. And the footer is just "## (eof):".
What I need to do is split the archive at that weird line in the header, then take the line with the file sizes and use it to split up the actual data of the archive itself. (For example, if file 1 is 30 bytes long, it reads 30 bytes of the data, gets the filename from line 3 of the header, and writes the file.)

What I need from everyone else is either links to a good site for learning string handling in C, or sample code for string handling. I'll accept C++, but I'm trying to stay with C. I'll put the name of everyone who helps in the credits. :) (I'm calling this mar, and so far it makes smaller, more efficient files than tar.)
Question by: Malevolyn
LVL 22

Accepted Solution

grg99 earned 84 total points
ID: 9641895
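Simple: use strchr() to find the position of a '#', then strncpy() to extract the substring (strncpy() does not add the terminating '\0' when it copies exactly n characters, so add it yourself), then atoi() to convert the numeric text to an integer.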
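As a rough sketch of that approach, the two helpers below pull one field out of a '#'-delimited header line. The names get_field and get_number_field are made up for illustration, memcpy() stands in for strncpy() because the length is already known, and strtol() stands in for atoi() so that bad input can be detected. Fields are assumed to be numbered from 0 and each field is assumed to end with a '#', as in the header shown in the question.

```c
#include <stdlib.h>
#include <string.h>

/* Copy field number `index` (0-based) of a '#'-delimited line into `out`.
   Returns 0 on success, -1 if the field is missing or does not fit. */
int get_field(const char *line, int index, char *out, size_t out_size)
{
    const char *start = line;
    const char *end;
    size_t len;

    /* Skip `index` separators to reach the start of the wanted field. */
    while (index-- > 0) {
        start = strchr(start, '#');
        if (start == NULL)
            return -1;
        start++;                      /* step past the '#' */
    }

    end = strchr(start, '#');         /* every field ends with a '#' */
    if (end == NULL)
        return -1;

    len = (size_t)(end - start);
    if (len >= out_size)              /* leave room for the '\0' */
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';                  /* terminate the copy explicitly */
    return 0;
}

/* Read field `index` as a decimal number; returns -1 on any error. */
long get_number_field(const char *line, int index)
{
    char text[32];
    char *stop;
    long value;

    if (get_field(line, index, text, sizeof text) != 0)
        return -1;
    value = strtol(text, &stop, 10);
    if (stop == text || *stop != '\0')   /* empty or not purely numeric */
        return -1;
    return value;
}
```

With the sizes line "3#30#1200#7#", get_number_field(line, 0) gives the file count and get_number_field(line, 1) gives the size of file 1.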

Also, you might want to think about these issues:

(1)  How are you going to ensure the file's integrity? (Think: checksum.)

(2) How are you going to handle files that may have had their end-of-line codes translated, mangled, or space-trimmed in transmission?
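To make the checksum idea concrete, here is one possible choice, Adler-32, written so it can be fed a file in chunks. Nothing in this thread says which checksum the archiver should use, so treat the algorithm and the name adler32_update as assumptions for illustration. Start with the value 1 and pass the running result back in for each chunk.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u   /* largest prime below 2^16 */

/* Fold `len` bytes of `data` into a running Adler-32 value.
   Pass 1 as `adler` for the first chunk of a file. */
uint32_t adler32_update(uint32_t adler, const unsigned char *data, size_t len)
{
    uint32_t a = adler & 0xFFFFu;          /* low half: sum of bytes */
    uint32_t b = (adler >> 16) & 0xFFFFu;  /* high half: sum of the sums */
    size_t i;

    for (i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```

Storing the result next to each file's size in the header would let the extractor detect a damaged file, though a simple checksum like this cannot repair one.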
LVL 45

Assisted Solution

Kdo earned 83 total points
ID: 9642229

Smaller files than tar, huh?  That's a pretty good motivation!  :)

You'll want to think very carefully about your header structure.  As you've indicated, you can place the directory structure at the beginning of the file, but no matter where you place it, there will be trade-offs.

*  The archive starts with the header/directory as you've indicated.

What happens when you want to mar(1) a lot of files?  You have to scan all of the directory entries, build the header, and then copy the files.  What if some of the files are dynamic?  If /var/log/messages is one of the files, it could easily be a different size when you go to copy the data than when you built the header.  There goes any hope of restoring the file or any other file on the archive that was written after this file.

*  The archive has an archive header, and a file header is written immediately prior to each file.

This solves the length mismatch issue.  But listing the contents of an archive or searching an archive for a particular file becomes very inefficient since you'll have to walk through the archive, potentially reading from disk for every header.

*  The directory is written at the end of the file, after all of the file contents are recorded.

If data integrity is important (and it should be) this is probably the easiest.  It also allows you to scan the directory just as quickly as if it were the first item on the file.

The archive contents could look something like this:

#archive header

dp is the random address (byte offset) of the directory.  To access the directory:

  handle = open(ArchiveName, O_RDONLY);
  lseek(handle, -(off_t)sizeof(long), SEEK_END);

Now read and process it as if it were at the beginning of the file.

Are you going to compress these files as you record them?

LVL 45

Expert Comment

ID: 9642252

Sorry.  Got ahead of myself...

dp is the random address (byte offset) of the directory.  To access the directory:

  handle = open(ArchiveName, O_RDONLY);
  lseek(handle, -(off_t)sizeof(long), SEEK_END);   /* the last sizeof(long) bytes hold dp */
  read (handle, &dp, sizeof (long));
  lseek(handle, dp, SEEK_SET);                     /* jump to the start of the directory */
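A minimal sketch of the same directory-at-the-end layout using stdio instead of open()/read(), which may be easier to keep identical on POSIX and Win32. The function names write_trailer and seek_to_directory are made up for illustration. Storing dp as a raw long assumes the archive is read on the same kind of machine that wrote it, and the file is assumed to be opened in binary mode.

```c
#include <stdio.h>

/* Append the directory text, then its starting offset (dp) as the last
   sizeof(long) bytes of the archive. Call this after all file data has
   been written. Returns 0 on success, -1 on error. */
int write_trailer(FILE *fp, const char *directory, size_t dir_len)
{
    long dp = ftell(fp);              /* the directory starts here */

    if (dp < 0)
        return -1;
    if (fwrite(directory, 1, dir_len, fp) != dir_len)
        return -1;
    if (fwrite(&dp, sizeof dp, 1, fp) != 1)
        return -1;
    return 0;
}

/* Position fp at the start of the directory and report the directory's
   length in bytes. Returns 0 on success, -1 on error. */
int seek_to_directory(FILE *fp, long *dir_len)
{
    long dp, trailer_pos;

    if (fseek(fp, -(long)sizeof(long), SEEK_END) != 0)
        return -1;
    trailer_pos = ftell(fp);          /* where the stored offset begins */
    if (trailer_pos < 0 || fread(&dp, sizeof dp, 1, fp) != 1)
        return -1;
    if (dp < 0 || dp > trailer_pos)   /* reject an obviously bad offset */
        return -1;
    *dir_len = trailer_pos - dp;
    return fseek(fp, dp, SEEK_SET);
}
```

After seek_to_directory() succeeds, the next dir_len bytes read from fp are the directory, which can then be parsed exactly as if it were at the beginning of the file.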




Assisted Solution

Ajar earned 83 total points
ID: 9642418
It seems that you are a good programmer...
Open the header file string.h to see what the C string library provides.

Just as an example, here is some sample code that extracts a substring from a bigger string, as in your case:


#include <stdio.h>
#include <string.h>

int main(void)
{
    char buffer[] = "0#date#date#myfile.txt#";
    char buffer_1[64];

    /* Suppose you know that you want the first filename, i.e. the
       text between the 3rd and 4th '#'. Use the following code. */
    char *temp;
    int separator_count = 0;
    int start = 0, end = 0;

    for (temp = buffer; *temp != '\0'; temp++) {    /* stop at end of string */
        if (*temp == '#') {
            separator_count++;
            if (separator_count == 3) start = temp - buffer;
            if (separator_count == 4) { end = temp - buffer; break; }
        }
    }
    if (separator_count < 4) return 1;              /* field not found */

    /* Now copy the required string into buffer_1: skip the leading '#'
       and take the end - start - 1 characters between the two '#'. */
    memcpy(buffer_1, buffer + start + 1, end - start - 1);

    /* Now terminate buffer_1 with '\0'. */
    buffer_1[end - start - 1] = '\0';

    printf("%s\n", buffer_1);
    return 0;
}


Author Comment

ID: 9657889
Hm...thanks for the help. You see, I'm a big time Python programmer trying to move into compiled languages from interpreted (despite compiling Python and Perl scripts into exe's on a daily basis without embedding manually) and, of course, going from pretty syntax to angry syntax is a killer. What I don't understand is how I can bang out PHP no problem but can't do C at all...

Oh, and I changed the name to a3f.

And about the /var/log/messages thing, wouldn't you have similar problems with other file grouping algorithms? I could always make a3f cache the files before writing them...but that's not an issue anymore. I changed the format of the resulting archives. I was having trouble in the prototype with how it reads a given number of bytes. Files are now separated with the same line that separates the header from the data. So there's no need to worry about having the incorrect file size in the header anymore, but I'm going to keep the code there for a few reasons. I'm too lazy to remove it and it makes the header look cooler. =D

My aim is to make the same exact code work on POSIX and Win32. At least on a prototype level. The C will obviously be different. As I type this, I'm realizing that I'm basically asking this community to write this program for me. Which is explainable in that I don't know C very well. But as I said, everyone will get credited for their help. Hopefully a3f will become a popular grouping format. My POSIX Python module distributions could be printnn-! =D

Author Comment

ID: 9657908
Forgot to answer one question: No, I'm not going to compress them. I might do some slight compression work later (replacing spaces at the beginning of lines in non-binary files), but that's for another day...
LVL 45

Expert Comment

ID: 9658195

If you open the file (with a lock) and then stat() it, you'll be able to copy it intact and get the correct length without fear of other processes changing the file.  Of course, this has its own pitfalls in that you will have changed the file's access timestamp before you stat() it.  Perhaps two stat() calls are required?  Or maybe just the one prior to opening the file is sufficient.  (It's possible for another process to access the file between the stat() and the open().)  Then again, if you count the bytes as you copy them, the file length won't be a problem, huh?

struct stat StatBefore;
stat (FileName, &StatBefore);
handle = open (FileName, O_RDONLY|O_BINARY);   /* O_BINARY is a Win32/DOS flag; POSIX has no such flag */

With perl you can easily change file attributes such as the access time stamps.  With C it's not so easy.  And since you're opening the file for read, you may want to reset the timestamps to their "user" values.
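For what it's worth, POSIX does offer utime() for putting timestamps back, so a rough sketch of "stat first, count the bytes as you copy, then reset the times" could look like the following. The function name copy_and_restore_times is hypothetical, the code assumes a POSIX system (Win32 spells the call _utime and keeps it in a different header), and it does no locking, so another process can still change the file during the copy.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <utime.h>

/* Copy the file at `path` to `out`, counting the bytes actually copied,
   then restore the source file's access and modification times.
   Returns the number of bytes copied, or -1 on error. */
long copy_and_restore_times(const char *path, FILE *out)
{
    struct stat before;
    struct utimbuf times;
    char buf[4096];
    size_t n;
    long total = 0;
    FILE *in;

    if (stat(path, &before) != 0)     /* remember the timestamps first */
        return -1;
    in = fopen(path, "rb");
    if (in == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            fclose(in);
            return -1;
        }
        total += (long)n;             /* trust this count, not st_size */
    }
    if (ferror(in)) {
        fclose(in);
        return -1;
    }
    fclose(in);

    times.actime  = before.st_atime;  /* undo the access-time change */
    times.modtime = before.st_mtime;
    if (utime(path, &times) != 0)
        return -1;
    return total;
}
```

Because the return value is the number of bytes really written, it is the figure that should go into the archive's directory, whatever stat() reported beforehand.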



Question has a verified solution.
