Solved

Extracting a file embedded in another binary file.

Posted on 2003-11-17
Last Modified: 2013-11-15
Hej All,

I'm somewhat of a C++ beginner, although I have quite a lot of experience in other C-type languages.
Anyway, I'm using VC++ 6 at the moment and I'm trying to write a very small, non-MFC console application
to perform a little mundane task.

I have a binary file, some 500MB in size, which contains a number of different file types all concatenated together.
Apparently this file once had a corresponding index file, but that's been lost and I need to get certain files out from
within this larger one.

So far I've written my very basic program and have written a function that will extract part of the larger file
to any other file if I supply the destination filename, offset in the source file and its length in bytes.

However, now comes the problem - and I've tried searching the web using various keywords but haven't gotten
anything close to what I want to do.

The files I want to pull out are standard TGA image files. The method I figured is best is to search the
source file for a TGA header and footer, derive the offset and length from that and extract.

Now, from looking at the file with a HEX editor I know the following - the TGA files start with the HEX string
"0000 2000 0000 0000" and end with the string "TRUEVISION-XFILE" followed by 0x00.  However, the same
HEX string that denotes a TGA header often occurs in the source file in other instances.

My idea was thus.

 - Search the source file for the TGA "footer" and work out at what offset the file ends
 - From this point, search *backwards* through the source file until I hit the TGA header.
 - Note the position of the header and from the derived length, copy out the binary data to a new file.
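
For illustration, here is a rough sketch of those three steps applied to data that is already in memory. The names FindTgaBackwards and Span are made up for this example, and the header and footer bytes are passed in by the caller rather than hard-coded. It assumes the nearest header before a footer is the right one, which may not hold if the header bytes also turn up inside image data.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result type: where a candidate TGA starts and how long it is.
struct Span {
    std::size_t start;
    std::size_t length;
    bool found;
};

// Scans buf[from..len) forwards for the first footer, then walks backwards
// from that footer to the nearest header. Footers with no header before
// them (at or after 'from') are skipped.
Span FindTgaBackwards(const unsigned char* buf, std::size_t len, std::size_t from,
                      const unsigned char* header, std::size_t headerLen,
                      const unsigned char* footer, std::size_t footerLen)
{
    Span none = {0, 0, false};
    if (headerLen == 0 || footerLen == 0) return none;

    for (std::size_t f = from; f + footerLen <= len; ++f) {
        if (std::memcmp(buf + f, footer, footerLen) != 0) continue;

        std::size_t end = f + footerLen;        // one past the last footer byte
        if (f < from + headerLen) continue;     // no room for a header before it

        // h runs from (f - headerLen) down to 'from', so the header always
        // ends at or before the first footer byte.
        for (std::size_t h = f - headerLen + 1; h-- > from; ) {
            if (std::memcmp(buf + h, header, headerLen) == 0) {
                Span s = {h, end - h, true};
                return s;
            }
        }
    }
    return none;
}
```

The returned start and length could then be handed to an extraction routine like the one described above. To look for the next file, call it again with from set to start + length.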

Now, the problem is, ahem, how on *earth* can I do this in C++? Bearing in mind that there are 0 - n
TGA files inside this large binary.

I know about using fseek to move to an offset but I really am stumped on how to go about this!
As I said searching the net for 2 days hasn't unearthed any source doing something similar or
any clues on how to go about it.

Can anyone shed some light or give me some examples?

 - J

Question by:Wuderboy
LVL 49

Expert Comment

by:DanRollins
ID: 9768733
Do you really need a program to do this?  It sounds like a one-time thing.

If so, just open it in VC++ IDE.  Set it to open as a binary file.  Use the Search facility.  Select chunks of data and copy and paste.  There is a trick I've used... I create a dummy empty file and open *it* as binary, then I can paste to it and "save as" once for each chunk I've found.

-- Dan
0
 

Author Comment

by:Wuderboy
ID: 9768765
There are over 16500 TGAs inside this binary file and we get one of these files about once a week, sadly with no indexes. I did the initial file inspection with a hex editor, which is how I figured out what we needed to do to get the files out.

As to what the data is, I'm not allowed to say due to NDAs, but inside the files are TGA image data plus ancillary files which contain data about them.

- J
0
 
LVL 3

Expert Comment

by:RJSoft
ID: 9768970
Sounds simple enough especially since you have knowledge of how to use fseek.

Semi-Pseudo


FILE *fptrIn=fopen("xxx","rb"); // read large file
FILE *fptrOt=fopen("zzz","wb"); // output file

int   FlagBegin=0;
long BeginFseek=0;
long EndFseek=0;
char ch;

// read 1 char at a time

while(fread(&ch,sizeof(char),1,fptrIn))
{
    // the header starts with a zero byte (0x00), not the character '0'
    if(!FlagBegin && ch==0 && IsABeginningString(ch))
    {
    FlagBegin=1;
    }

    // only look for a footer once a header has been seen
    if(FlagBegin && IsAFooter(ch))
    {
      FlagBegin=0;
      CopyFromBeginningToEndAndOutputAsFile();
    }

}//endwhile


// needs <stdio.h> and <string.h>; fptrIn and BeginFseek are the variables declared above
// the header as raw bytes (hex 00 00 20 00 00 00 00 00), not as text
static const unsigned char Header[8] = {0x00,0x00,0x20,0x00,0x00,0x00,0x00,0x00};

int IsABeginningString(char ch)
{
int flag=0;//0==no
long save = ftell(fptrIn); // position just after ch
unsigned char Bytes[8];
Bytes[0]=(unsigned char)ch;
// read the rest of the candidate header in one go and compare with memcmp.
// strcpy/strcat/strcmpi cannot be used here because the data contains zero bytes.
// you could also add other ways to speed up this process. See below
if(fread(Bytes+1,1,sizeof(Bytes)-1,fptrIn)==sizeof(Bytes)-1
   && memcmp(Bytes,Header,sizeof(Bytes))==0)
{
flag=1; //found
BeginFseek=save-1; //Found so store starting position (offset of the first header byte)
}//endif
fseek(fptrIn,save,SEEK_SET); // go back
return flag;
}//endfunc

IsAFooter would be basically the same thing, but storing the EndFseek value (the position just past the end of the footer).

So you read one char at a time. When you have the first condition met (the beginning string) AND the second condition met (the footer string), you use the two values BeginFseek and EndFseek to write an output file.

long Size = EndFseek - BeginFseek;
fseek(fptrIn,BeginFseek,SEEK_SET);
long count=0;
while(fread(&ch,sizeof(char),1,fptrIn))
{
fwrite(&ch,sizeof(char),1,fptrOt);
count++;
if(count==Size)break;
}
// remember to fseek back to EndFseek afterwards so the main loop carries on from there

Then as the main loop (fread) continues, it gets the next file and the next file until there are no more chars to read.

You can cut down on some of the redundancy by checking for the second and third etc.. known chars that make up the beginning string so that it skips much of the double reading process.

Once you have the beginning string found and you continue on reading you might also want to compare for a next string (middle) and if NOT found  would be a signal to return to the next char after the first beginning string.

That way if you have some other files as you say that have a similar beginning string you can avoid outputting them. Personally I would look for another known header item, that would be reliable to include in the beginning search. I would check within so  many chars for a middle string.

What you really want is all the similarities in the header items that you can rely on.

To find similarities in similar image files, simply write a function that reads in two or more similar files at once and, if the char is the same, printfs it out (or whatever output you like), or else sends a blank char.

Send an example to the screen so you can see what number position the chars are at, and then gather a few examples until you know it is reliable. Or build a routine to compare against a giant list of them and see what remains.

If you build a console test app then you can pass file names as parameters through a batch file. Send the output to a text file. This would probably be more efficient.

Also in all of the testing of similarities you could truncate the comparison to a few hundred chars. That should be enough for header information.

while(fread(&ch1,sizeof(char),1,fptr1) && fread(&ch2,sizeof(char),1,fptr2))
{
if(ch1==ch2)printf("%d %d\n",ch1, count); // same byte in both files: print value and position
else printf("  %d\n",count);              // bytes differ: print position only

count++;

getch(); // pause so you can see screen output.
}

Make the app so you can send command line parms and run a batch file adding the names of some files you wish to find consistent headers on.

Note the position number and then you could further speed up the file finding process by fseeking from the beginning position and adding what a known char should be located to the offset of fseek.


Hope this helps.

RJSoft
0
 

Author Comment

by:Wuderboy
ID: 9769083
Thanks for the tips, it's going to take me a while to go through that and digest it all!

Just one question - I've read several examples that say use fgetc() and to try it out I wrote the following which loops through my file and just prints out a list of where it finds the start of each TGA footer string:

void FindTGA2 (const char *infile) {

      printf("Searching...\n\n");
      FILE *pFile;
      long lSize;
      char c;
      int count = 0;
      
      char search[] = "TRUEVISION-XFILE";

      pFile = fopen (infile, "rb" );
      if (pFile==NULL) exit (1);

      // obtain file size.
      fseek (pFile , 0 , SEEK_END);
      lSize = ftell (pFile);
      rewind (pFile);

      do {

            c = getc (pFile);
            if (c == 'T') {

                  long matchstart = (ftell(pFile) - 1);
                  char szStr[17];
                  
                  fseek(pFile, matchstart, SEEK_SET);
                  fgets(szStr, 17, pFile);

                  if (strstr(szStr,"TRUEVISION-XFILE") != NULL) {
                        printf("%s found at offset %ld\n", szStr, matchstart);   // %ld: matchstart is a signed long
                        count++;
                  }
                  
            }
        
      } while (ftell(pFile) < lSize);

      printf("\n%d matches found.\n", count);
      
      // terminate
      fclose (pFile);
      
}

Now this works fine and I've confirmed the number it returns using a search in a hex editor. However, it's *PAINFULLY* slow - searching the 500MB file took 15 minutes and the computer I'm searching on is not underpowered at all!

I searched the C++ documentation and wrote another function to read the entire file into memory:

void FindTGA (const char *infile) {

      FILE *pFile;
      long lSize;
      unsigned long offset;
      char *buffer;

      pFile = fopen (infile, "rb" );
      if (pFile==NULL) exit (1);

      // obtain file size.
      fseek (pFile , 0 , SEEK_END);
      lSize = ftell (pFile);
      rewind (pFile);

      // allocate memory to contain the whole file.
      buffer = (char*) malloc (lSize);
      
      // copy the file into the buffer.
      fread (buffer,1,lSize,pFile);

      // terminate
      fclose (pFile);
      free (buffer);

}

Now this only takes about 5 seconds to load the file, but obviously enough it uses 500MB of memory, so I would need to optimize it to read the file in, say, 10MB chunks, search, and then read the next chunk in with an overlap in case my header crosses a boundary.

But, and this is where me being a C++ novice comes in, what's the best way to search this buffer for the required string, again assuming that the string could appear 0 - n times within it? It seems my concepts of arrays and strings go totally out of the window when it comes to C++ and I'm not sure what's the right thing to do.

- J

0
 
LVL 13

Expert Comment

by:SteH
ID: 9770174
To read the file in chunks of 10 MB you can adapt your second piece of code:

void FindTGA (const char *infile) {

     FILE *pFile;
     unsigned long offset;
     char *buffer;
     const long lSize = 10000000;
     long lSizeRead;

     pFile = fopen (infile, "rb" );
     if (pFile==NULL) exit (1);

     buffer = (char*) malloc (lSize); // just allocate 10 MB
     
     // copy the file into the buffer.
     // fread returns 0 at end of file, which ends the loop
     while ((lSizeRead = (long) fread (buffer,1,lSize,pFile)) > 0) {   // each read gets lSize bytes except the last, which may read fewer.
        Search (buffer, lSizeRead);                    // only search the valid part of the buffer; this matters for the last read.
     }

     fclose (pFile);
     free (buffer);

}

Search can be as before but you might have the termination chars split in 2 buffers. You need to take care of that.
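
One possible way to take care of that, as a sketch only: keep the last (pattern length - 1) bytes of each chunk and put them in front of the next one, so a signature that straddles a chunk boundary is still seen whole. FindAllInFile is a made-up name, and the function assumes the file is positioned at offset 0 and that offsets fit in a long.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Returns the file offset of every occurrence of pat[0..patLen) in 'fp',
// reading chunkSize bytes at a time. Assumes fp is positioned at offset 0.
std::vector<long> FindAllInFile(std::FILE* fp,
                                const unsigned char* pat, std::size_t patLen,
                                std::size_t chunkSize)
{
    std::vector<long> hits;
    if (patLen == 0 || chunkSize < patLen) return hits;

    // Room for one chunk plus the bytes carried over from the previous one.
    std::vector<unsigned char> storage(chunkSize + patLen - 1);
    unsigned char* buf = storage.data();
    std::size_t carry = 0;   // bytes kept from the previous chunk
    long base = 0;           // file offset of buf[0]

    for (;;) {
        std::size_t got = std::fread(buf + carry, 1, chunkSize, fp);
        std::size_t have = carry + got;

        // Plain byte-wise scan; memcmp copes with embedded zero bytes.
        for (std::size_t i = 0; i + patLen <= have; ++i) {
            if (std::memcmp(buf + i, pat, patLen) == 0)
                hits.push_back(base + static_cast<long>(i));
        }

        if (got < chunkSize) break;   // short read: end of file (or an error)

        // Keep the tail: a match cannot fit entirely inside patLen - 1 bytes,
        // so nothing is reported twice.
        std::size_t keep = patLen - 1;
        std::memmove(buf, buf + (have - keep), keep);
        base += static_cast<long>(have - keep);
        carry = keep;
    }
    return hits;
}
```

Called with the footer bytes and a chunk size of a few hundred kilobytes, this would give the same list of offsets as the fgetc version, without an fseek per candidate byte.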
0
 
LVL 3

Expert Comment

by:RJSoft
ID: 9772522
I see what you mean. 500 mb. Come to think of it even 10mb would be slow reading from file. :)

You can use memcpy to copy from your handle into a linked list.

But MFC in VC6.0 has a class that automates the creation and use of a linked list. I am not sure, but I think CList or CArray is what you would want to use.

You may even consider using a handle.

I have a routine where I needed to reverse the bits in a bitmap, and I use GetDIBits to retrieve the bitmap bits (very large data, but not as large as what you're dealing with).

The only point of this is to show the use of memcpy and how it can access a handle and copy to something you can then use pointer arithmetic on to traverse the data (chars). Copy to a char pointer.

In this example lpbvBits is a handle, not a pointer, so you can't ++ to increment to the next element of the memory stream, and you can't dereference it with * to access the value. (No pointer arithmetic.) The handle, being only an address of an object in memory, has no idea of the structure of the memory.

lpbi is a long pointer to a bitmap info header, but that is irrelevant for you.

//reverse the bits
char *CharMem = new char[lpbi->bmiHeader.biSize];
::memcpy(CharMem,lpbvBits,lpbi->bmiHeader.biSize);
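std::reverse(CharMem, CharMem + lpbi->bmiHeader.biSize);//reverse the bytes (needs <algorithm>); _strrev would stop at the first zero byte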
::memcpy(lpbvBits,CharMem,lpbi->bmiHeader.biSize);
//
delete []CharMem;

Before the delete you could

CharMem++; //go to next element
char c = *CharMem; // copy value to c

//combined statement
char c =*(CharMem++);

You would adjust the memcpy size parm to fit your needs.

I would consider less than 10 MB, unless you think a picture close to 10 MB might be embedded in the file in future. I would pick an adequate size to cover your largest file, but do you really need that much?

Come to think of it there is really no need for using that much memory anyway.

You have gone from one extreme (one char at a time) to another (10 meg).

Try coding your routine where you can adjust the size of the read/load memory and still output your files properly.

Read, load mem, construct file, ditch memory
Read, load mem, construct file, ditch memory
Read, load mem, construct file, ditch memory
etc...


I am sure you could then find some reasonable compromise. You don't want your app to be a memory pig anyway, in case you have other apps running in the background; you might make for a really sluggish environment. Also, you can corrupt memory by reading or writing past the end of your buffer. The compiler won't complain about that, so keep track of how many bytes are actually in the buffer.

I can't exactly remember, but I was going to say that a char * might not be able to address a block that large and that you would need a char FAR *. That only applied in the old days of the 16-bit API, where one had to worry about addressing memory past segment boundaries and a pointer could wrap around.

A char is 8 bits, but in the 32-bit environment a char * is a flat 32-bit address, so those segment boundary worries go away and FAR is not needed.

Anyway, you may want to experiment in the 10k to 100k range or so. A plain char * buffer lets you avoid CArray and CList.

Hope this helps

RJSoft

RJSoft
0
 
LVL 49

Accepted Solution

by:
DanRollins earned 43 total points
ID: 9772748
How big are the TGA files?
If you can set an upper limit, there is a simple technique that will avoid some headaches.  Let's say that no TGA is larger than 3MB.

1) Allocate a 3MB buffer
2) Read 3MB from the file
3) Locate the start and end of the first TGA in that buffer.
4) Write it to disk
5) fseek to one byte beyond that and go to step 2

One might think that there is a lot of extra disk reading, but the OS will have cached it so most of the time it will just be a blazing-fast memory transfer.  Also there are many ways to optimize this if you find it takes too long -- E.g., try to locate two or more complete TGAs in each buffer.  But using this technique simplifies the logic -- avoids the chance of a 'signature' being split across a buffer-size boundary.
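
A rough sketch of that loop, with one adaptation that is not part of the steps above: the window is twice the assumed maximum TGA size, so that when no footer is found the window can slide forward without cutting off the start of the next file. LocateTgas and TgaSpan are made-up names, and it uses the "first footer, then nearest header before it" idea from the question, which can be fooled if the header bytes also appear inside image data. It only reports offsets and lengths; writing each file out is left to an existing extract routine.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

struct TgaSpan { long offset; long length; };   // hypothetical result type

static const std::size_t kNone = static_cast<std::size_t>(-1);

// Index of the first occurrence of pat in buf[0..len), or kNone.
static std::size_t FindFirst(const unsigned char* buf, std::size_t len,
                             const unsigned char* pat, std::size_t patLen)
{
    for (std::size_t i = 0; i + patLen <= len; ++i)
        if (std::memcmp(buf + i, pat, patLen) == 0) return i;
    return kNone;
}

// Index of the last occurrence of pat that ends at or before 'limit', or kNone.
static std::size_t FindLastBefore(const unsigned char* buf, std::size_t limit,
                                  const unsigned char* pat, std::size_t patLen)
{
    if (limit < patLen) return kNone;
    for (std::size_t i = limit - patLen + 1; i-- > 0; )
        if (std::memcmp(buf + i, pat, patLen) == 0) return i;
    return kNone;
}

// maxTga is the assumed upper limit on the size of one embedded TGA.
std::vector<TgaSpan> LocateTgas(std::FILE* in, std::size_t maxTga,
                                const unsigned char* hdr, std::size_t hdrLen,
                                const unsigned char* ftr, std::size_t ftrLen)
{
    std::vector<TgaSpan> spans;
    if (hdrLen == 0 || ftrLen == 0 || maxTga < hdrLen + ftrLen) return spans;

    const std::size_t window = 2 * maxTga;
    std::vector<unsigned char> buf(window);
    long pos = 0;                                  // file offset of the window

    for (;;) {
        if (std::fseek(in, pos, SEEK_SET) != 0) break;
        std::size_t got = std::fread(buf.data(), 1, window, in);

        std::size_t f = FindFirst(buf.data(), got, ftr, ftrLen);
        if (f == kNone) {
            if (got < window) break;               // reached end of file
            // Slide on, leaving room for one whole TGA at the window's tail.
            pos += static_cast<long>(window - maxTga);
            continue;
        }

        std::size_t end = f + ftrLen;              // one past the footer
        std::size_t h = FindLastBefore(buf.data(), f, hdr, hdrLen);
        if (h != kNone) {
            TgaSpan s = { pos + static_cast<long>(h), static_cast<long>(end - h) };
            spans.push_back(s);
        }
        pos += static_cast<long>(end);             // step 5: carry on just past it
    }
    return spans;
}
```

Each TgaSpan can then be passed to whatever function copies a given offset and length out to a new file.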

-- Dan
0
 

Author Comment

by:Wuderboy
ID: 9777274
Hmm I'll try some of these out and see what works best.

To answer Dan's question there are approx. 16500 TGA's inside this file between 2 - 250k in size.

Also, the one thing I haven't yet covered is how to search for the hex header of the TGA, which is "0000 2000 0000 0000". What's the best way to get that into a char array so I can use a strstr search on it?

- J
0
 
LVL 13

Assisted Solution

by:SteH
SteH earned 41 total points
ID: 9777324
strstr will fail, since it assumes a 0-terminated string for the search expression. As chars, the header is 0x0 0x0 0x20 0x0 0x0 0x0 0x0 0x0, so the very first entry is a 0 that already terminates the string as far as strstr is concerned.
memcmp (buf1, buf2, size) could be used to compare at a given position. Better would be to search the whole buffer for the first char (for example with memchr) and try memcmp at that position. If that fails, search for the first char again from the next position.
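
As a minimal sketch of that memchr-then-memcmp idea (FindBytes and kTgaHeader are names invented for this example; the header bytes are the ones quoted in the question):

```cpp
#include <cstddef>
#include <cstring>

// Returns a pointer to the first occurrence of pat[0..patLen) inside
// buf[0..bufLen), or NULL if there is none. It works on raw bytes, so
// embedded zero bytes are fine (unlike strstr).
const unsigned char* FindBytes(const unsigned char* buf, std::size_t bufLen,
                               const unsigned char* pat, std::size_t patLen)
{
    if (patLen == 0 || bufLen < patLen) return NULL;

    const unsigned char* p = buf;
    const unsigned char* last = buf + (bufLen - patLen);   // last possible start

    while (p <= last) {
        // Jump to the next byte equal to the first byte of the pattern.
        p = static_cast<const unsigned char*>(
                std::memchr(p, pat[0], static_cast<std::size_t>(last - p) + 1));
        if (p == NULL) return NULL;
        if (std::memcmp(p, pat, patLen) == 0) return p;     // full match
        ++p;                                                // false alarm, move on
    }
    return NULL;
}

// The header from the question as raw bytes (hex 00 00 20 00 00 00 00 00).
const unsigned char kTgaHeader[8] = {0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00};
```

The offset of a match is simply the returned pointer minus buf. Because this particular header starts with a zero byte, which is likely to be common in binary data, memchr may not skip very far here; it tends to pay off more for the footer, whose first byte is 'T'.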
0
 
LVL 3

Assisted Solution

by:RJSoft
RJSoft earned 41 total points
ID: 9780993
Translate what the hex editor is showing to decimal. Then use those values
0
 
LVL 9

Expert Comment

by:tinchos
ID: 10249271
No comment has been added lately, so it's time to clean up this TA.
I will leave the following recommendation for this question in the Cleanup topic area:

Split: DanRollins {http:#9772748} & SteH {http:#9777324} & RJSoft {http:#9780993}

Please leave any comments here within the next seven days.
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

Tinchos
EE Cleanup Volunteer
0
