Solved

Combined String/File Handling

Posted on 2003-10-29
Last Modified: 2010-04-15
I've prototyped a program in Python that groups files (i.e., a tar replacement), and I'm about to start on the real version, which will be in C. For reference, this is the header for the archives:

0#(creation date)#(creation time)#(creator name)#
(number of files)#(file 1's size)#(file 2's size)#...#(file n's size)#
0#(file 1's path)#(file 2's path)#...#(file n's path)#
#:#:#:#:#:#

Then it's straight file data, one file after another. The footer is just "## (eof):".
What I need to do is split the archive at that weird line in the header (#:#:#:#:#:#), then take the line with the file sizes and use it to split up the actual data of the archive. (Ex: if file 1 is 30 bytes long, it reads 30 bytes of the data, gets the filename from line 3 of the header, and writes the file.)

What I need from everyone else is either links to a good site for learning string handling in C, or sample code for string handling. I'll accept C++, but I'm trying to stay with C. I'll put the name of everyone who helps in the credits. :) (I'm calling this mar, and so far it makes smaller, more efficient files than tar.)
Question by:Malevolyn

Accepted Solution

by:
grg99 earned 84 total points
ID: 9641895
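Simple: use strchr() to find the position of each #, then strncpy() to extract the substrings, then atoi() to convert the numbers to binary. (Remember that strncpy() does not add a terminating '\0' when the source is longer than the count you give it, so terminate the copy yourself.)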
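A minimal sketch of that approach, assuming the header lines look like the ones in the question (every field ends with a #). The names get_field and get_number are made up for illustration. It swaps in memcpy() plus an explicit terminator for strncpy(), and strtol() for atoi(), so that bad input can be detected:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy field number `index` (0-based) of a '#'-delimited line into `out`.
 * Returns 0 on success, -1 if the field is missing or does not fit. */
int get_field(const char *line, int index, char *out, size_t outsize)
{
    const char *start = line;
    const char *end;
    size_t len;

    while (index-- > 0) {              /* skip the fields before the one we want */
        start = strchr(start, '#');
        if (start == NULL)
            return -1;
        start++;                       /* step past the '#' */
    }
    end = strchr(start, '#');          /* every field is closed by a '#' */
    if (end == NULL)
        return -1;
    len = (size_t)(end - start);
    if (len >= outsize)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Read a field as a non-negative number. Returns -1 if the field is
 * missing, empty, negative, or contains anything other than digits. */
long get_number(const char *line, int index)
{
    char field[32];
    char *stop;
    long value;

    if (get_field(line, index, field, sizeof field) != 0 || field[0] == '\0')
        return -1;
    value = strtol(field, &stop, 10);
    if (*stop != '\0' || value < 0)
        return -1;
    return value;
}
```

With the size line "3#30#1200#7#", get_number(line, 0) gives the file count and get_number(line, 1) gives the size of file 1. The same get_field() call on the path line pulls out the matching filename.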

Also, you might want to think about these issues:

(1) How are you going to ensure the file's integrity? (Think: checksum.)

(2) How are you going to handle files that may have had their end-of-line codes translated, mangled, or space-trimmed in transmission?
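For point (1), here is one possible checksum, sketched under the assumption that a simple Adler-32 is good enough for this format (the thread does not settle on an algorithm, and adler32_update is a made-up name). It can be fed the data in chunks as each file is copied into the archive:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold `len` bytes into a running Adler-32 checksum.
 * Start with sum = 1 and pass the previous result back in for each chunk. */
uint32_t adler32_update(uint32_t sum, const unsigned char *data, size_t len)
{
    uint32_t a = sum & 0xFFFF;           /* low half: running byte sum */
    uint32_t b = (sum >> 16) & 0xFFFF;   /* high half: sum of the running sums */
    size_t i;

    for (i = 0; i < len; i++) {
        a = (a + data[i]) % 65521;       /* 65521 is the largest prime below 2^16 */
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}
```

The result could be stored as one more #-terminated field per file in the header and compared again on extraction. For point (2), opening every file in binary mode ("rb"/"wb" with fopen) keeps the C library from translating line endings.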

Assisted Solution

by:Kdo
Kdo earned 83 total points
ID: 9642229

Smaller files than tar, huh?  That's a pretty good motivation!  :)

You'll want to think very carefully about your header structure.  As you've indicated, you can place the directory structure at the beginning of the file, but no matter where you place it, there will be trade-offs.

*  The archive starts with the header/directory as you've indicated.

What happens when you want to mar(1) a lot of files?  You have to scan all of the directory entries, build the header, and then copy the files.  What if some of the files are dynamic?  If /var/log/messages is one of the files, it could easily be a different size when you go to copy the data than when you built the header.  There goes any hope of restoring the file or any other file on the archive that was written after this file.


*  The archive has an archive header, and a file header is written immediately prior to each file.

This solves the length mismatch issue.  But listing the contents of an archive or searching an archive for a particular file becomes very inefficient since you'll have to walk through the archive, potentially reading from disk for every header.


*  The directory is written at the end of the file, after all of the file contents are recorded.

If data integrity is important (and it should be) this is probably the easiest.  It also allows you to scan the directory just as quickly as if it were the first item on the file.

The archive contents could look something like this:

#archive header
file1
file2
file3
...
filen
#directory
dp

dp is the random address (byte offset) of the directory.  To access the directory:

  handle = open(ArchiveName, O_RDONLY);
  lseek(handle, -(long)sizeof(long), SEEK_END);

Now read and process it as if it were at the beginning of the file.


Are you going to compress these files as you record them?
Kent

 

Expert Comment

by:Kdo
ID: 9642252

Sorry.  Got ahead of myself...


dp is the random address (byte offset) of the directory.  To access the directory:

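  handle = open(ArchiveName, O_RDONLY);
  lseek (handle, -(long)sizeof(long), SEEK_END);  /* back up over dp, the last item in the file */
  read (handle, &dp, sizeof (long));              /* dp is a long */
  lseek (handle, dp, SEEK_SET);                   /* jump to the directory */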
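
A rough sketch of both halves of that layout, writing the directory plus its offset and then finding it again. It uses stdio (fseek/ftell) rather than open/lseek so the same code has a chance of building on POSIX and Win32. The function names are made up, and storing a raw long is an assumption that only holds when the archive is read on the same kind of machine that wrote it:

```c
#include <stdio.h>

/* Append the directory text to the archive, followed by dp, the byte
 * offset where the directory starts. Returns 0 on success, -1 on error. */
int write_directory(FILE *archive, const char *directory)
{
    long dp;

    if (fseek(archive, 0L, SEEK_END) != 0)
        return -1;
    dp = ftell(archive);                 /* the directory will start here */
    if (dp < 0)
        return -1;
    if (fputs(directory, archive) == EOF)
        return -1;
    if (fwrite(&dp, sizeof dp, 1, archive) != 1)
        return -1;
    return 0;
}

/* Leave the stream positioned at the start of the directory.
 * Returns dp on success, -1 on error. */
long seek_to_directory(FILE *archive)
{
    long dp;

    if (fseek(archive, -(long)sizeof(dp), SEEK_END) != 0)
        return -1;                       /* back up over the trailing dp */
    if (fread(&dp, sizeof dp, 1, archive) != 1)
        return -1;
    if (dp < 0 || fseek(archive, dp, SEEK_SET) != 0)
        return -1;
    return dp;
}
```

For archives that must move between machines, writing dp as fixed-width text (or as bytes in a fixed order) would avoid the size and byte-order differences of a raw long.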

Kent


Assisted Solution

by:Ajar
Ajar earned 83 total points
ID: 9642418
It seems that you are a good programmer ...
Open the header file string.h to see what the C string library provides.

As an example, here is some sample code that extracts a substring from a bigger string, as in your case:

// needs #include <string.h> for memcpy()
char  buffer[] = "0#date#date#myfile.txt#";
char  buffer_1[64];

// Suppose you want the first filename, i.e. the text between
// the 3rd and 4th '#'. Use the following code:

char *temp = buffer;
int   separator_count = 0;
int   start = -1, end = -1;     // positions of the 3rd and 4th '#'

while (*temp != '\0')           // stop at the end of the string
{
    if (*temp == '#')
    {
        separator_count++;
        if (separator_count == 3)  start = temp - buffer;
        if (separator_count == 4)  { end = temp - buffer; break; }
    }
    temp++;                     // move on to the next character
}

// Now copy the required string into buffer_1 and terminate it with '\0'.
// The length is end - start - 1 because neither '#' is copied.

if (start >= 0 && end > start && end - start - 1 < (int)sizeof(buffer_1))
{
    memcpy(buffer_1, buffer + start + 1 /* skip the leading # */, end - start - 1);
    buffer_1[end - start - 1] = '\0';
}





 

Author Comment

by:Malevolyn
ID: 9657889
Hm...thanks for the help. You see, I'm a big time Python programmer trying to move into compiled languages from interpreted (despite compiling Python and Perl scripts into exe's on a daily basis without embedding manually) and, of course, going from pretty syntax to angry syntax is a killer. What I don't understand is how I can bang out PHP no problem but can't do C at all...

Oh, and I changed the name to a3f.

And about the /var/log/messages thing, wouldn't you have similar problems with other file grouping algorithms? I could always make a3f cache the files before writing them...but that's not an issue anymore. I changed the format of the resulting archives. I was having trouble in the prototype with how it reads a given number of bytes. Files are now separated by the same line that separates the header from the data. So there's no need to worry about having the incorrect filesize in the header anymore, but I'm going to keep the code there for a few reasons. I'm too lazy to remove it and it makes the header look cooler. =D

My aim is to make the same exact code work on POSIX and Win32. At least on a prototype level. The C will obviously be different. As I type this, I'm realizing that I'm basically asking this community to write this program for me. Which is explainable in that I don't know C very well. But as I said, everyone will get credited for their help. Hopefully a3f will become a popular grouping format. My POSIX Python module distributions could be printnn-2.0.6.5.a3f.bz2! =D

Author Comment

by:Malevolyn
ID: 9657908
Forgot to answer one question: No, I'm not going to compress them. I might do some slight compression work later (replacing spaces at the beginning of lines in non-binary files), but that's for another day...

Expert Comment

by:Kdo
ID: 9658195

If you open the file (with a lock) and then stat() it, you'll be able to copy it intact and get the correct length without fear of other processes changing the file.  Of course, this has its own pitfalls in that you will have changed the file's access timestamp before you stat() it.  Perhaps two stat() calls are required?  Or maybe just the one prior to opening the file is sufficient.  (It's possible for another process to access the file between the stat() and the open().)  Then again, if you count the bytes as you copy them, the file length won't be a problem, huh?

stat (FileName, &StatBefore);                  /* StatBefore is a struct stat */
handle = open (FileName, O_RDONLY|O_BINARY);   /* O_BINARY is a Win32 compiler flag, not part of POSIX */
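
The "count the bytes as you copy them" idea could look something like this. It is a sketch using stdio streams, with copy_counting as a made-up name, and it assumes both streams were opened in binary mode:

```c
#include <stdio.h>

/* Copy everything from `in` to `out`.
 * Returns the number of bytes actually copied, or -1 on a read/write error. */
long copy_counting(FILE *in, FILE *out)
{
    char buf[4096];
    size_t n;
    long total = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;                /* count what was really written */
    }
    return ferror(in) ? -1 : total;
}
```

Because the size is only known after the copy, this fits best with a layout where the directory is written after the file data.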



With perl you can easily change file attributes such as the access time stamps.  With C it's not so easy.  And since you're opening the file for read, you may want to reset the timestamps to their "user" values.
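
On POSIX systems, one way to reset the timestamps from C is utime(), fed with the values from the stat() call made before the file was opened. A small sketch follows. restore_times is a made-up name, and Win32 would need a different call:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <utime.h>

/* Put back the access and modification times recorded in `before`.
 * Returns 0 on success, -1 on failure (after printing the reason). */
int restore_times(const char *path, const struct stat *before)
{
    struct utimbuf times;

    times.actime  = before->st_atime;    /* last access time */
    times.modtime = before->st_mtime;    /* last modification time */
    if (utime(path, &times) != 0) {
        perror(path);
        return -1;
    }
    return 0;
}
```

The intended order would be: stat() the file, open and copy it, close it, then call restore_times() with the saved struct stat.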


Kent