Solved

Splitting a very large file

Posted on 2003-11-10
35 Comments · 819 Views
Last Modified: 2012-06-21
Hope whoever reads this is up to the challenge; I am currently trying to solve this one but the answer keeps eluding me.

I have a file whose size can be anywhere from 4GB to 36GB and possibly above, but the size is not what is important.

I would like to take that file and chop it up into user-definable sizes; these sizes would range from 4KB to 512KB (I know, very small). Any ideas how to do this?

Once this part is out of the way I will post the second part.
Also, the better the answer, the more points I will spend.
Question by:pilley
35 Comments
 
LVL 45

Accepted Solution

by:
sunnycoder earned 100 total points
ID: 9719986
What language/platform are you working on?

With C, you can:

int i = 0;
size_t n;
char fn[200];
static char buffer[512 * 1024];        /* static: 512KB is a lot to put on the stack */
size_t user_defined_size = 64 * 1024;  /* the user-chosen chunk size, up to 512KB */
FILE *infile;
FILE *outfile;

infile = fopen ( "input_file", "rb" );  /* binary mode -- it matters on Windows */

while ( ( n = fread ( buffer, 1, user_defined_size, infile ) ) != 0 )
{
      sprintf ( fn, "outfile.%d", i );
      outfile = fopen ( fn, "wb" );
      fwrite ( buffer, 1, n, outfile );  /* write what was actually read, so the
                                            last, partial chunk is not padded */
      fclose ( outfile );
      i++;
}

fclose ( infile );
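(As comes up later in the thread, the hard-coded "input_file" and user_defined_size would naturally be read from the command line -- e.g. user_defined_size = atoi(argv[2]) * 1024 -- rather than recompiled; the splitting loop is the part that matters.)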
 
LVL 9

Expert Comment

by:Dang123
ID: 9724002
pilley,
    Would you want to output all the pieces at once, or be able to ask the program for a specific piece? (The small files could take MUCH more disk space than the large one, depending on your OS and how your disk is partitioned.)

Dang123
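(To put a rough number on that overhead: on a FAT32 volume formatted with 32KB clusters, a 4KB piece still occupies a whole 32KB cluster, an 8x blowup; and even where the chunk size divides the cluster size evenly, each of the hundreds of thousands of files still costs a directory entry or inode.)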

 

Author Comment

by:pilley
ID: 9724379
Dang123,

It would be nice to output all the files at once, because the next part includes asking the program for a specific piece from many of these files.

The platform can be Linux/Windows, so basically any programming language will be supported.
 
LVL 5

Expert Comment

by:g0rath
ID: 9724623
It also depends on the data and application... if this is just raw data samples for a stats program, it's trivial... you're attempting to create a manageable index into this large data set.

But if this is, let's say, an MS SQL table in MS format, then you need a different approach.

I think we need a little more information on the type of data to give a clearer answer.
 

Author Comment

by:pilley
ID: 9724948
Okay g0rath,

but this is not going to involve a database *yet*. Here is an example scenario; don't take it word for word, just use it as a guide.

Say I have four files: 1, 2, 3, 4, each 36GB in size.
The data I want lies in the 4KB to 512KB range.
The files by themselves are not, for the purposes of this example, human readable.
Each chunk (in the 4KB to 512KB range) contains data that relates to the same chunk in the other files,

i.e. 1 (4KB) relates to 2 (4KB), which also relates to 3 and 4.

The task set above is to take a file and split it into user-definable chunks (4KB to 512KB range). (This means that there will be a lot of files once the split is complete -- 576,000 if the chunk size is 64KB.) The second part will be putting the chunks back into the order they should be in, but that task can't be set unless the first task is completed.

Hope this answers your question.

Pilley
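(A quick sanity check on that count: 36GB / 64KB = 36*1024*1024 KB / 64KB = 589,824 chunks per file, so 576,000 is the right ballpark -- and with four such files there would be well over two million chunk files in total.)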
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9728360
Pilley .... did you try my suggestion? Any feedback?
 
LVL 3

Assisted Solution

by:RJSoft
RJSoft earned 100 total points
ID: 9728378
The key is to read your files in the smallest addressable unit. A char (8 bits) does perfectly.

The reason is that whatever files you chop into smaller sub-files, you will want to know the exact number of bytes you have addressed.

That way you can avoid truncating a valuable piece of information, and you can also re-assemble your file (any file -- binary, text, etc...) from the sub-files to be exactly what it was before.

So you simply read in one char at a time until you reach your desired length.

I use this all the time in a simple app that I made to chop large files and copy them onto 1.44MB floppies. (This is easier when I want to test something that I built on someone else's machine, rather than upload and download, since most people only have 56k modems.)

Example: make a 100-byte sub-file from the main file.

int x=0, Requested=100;
FILE *fptrIN=NULL;
FILE *fptrOUT=NULL;
fptrIN=fopen("file1.xxx","rb");
fptrOUT=fopen("result.xxx","wb");
if(fptrIN==NULL || fptrOUT==NULL)
{
Error();
return;
}
char ch;
while(fread(&ch,sizeof(char),1,fptrIN))  /* was fptr -- the input stream is fptrIN */
{
fwrite(&ch,sizeof(char),1,fptrOUT);
x++; if(x==Requested) break;
}//endwhile
fclose(fptrIN);
fclose(fptrOUT);


RJSoft

P.S. I need the points.
Thanks.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9728433
>The key is to read your files in the smallest addressable unit. A char (8 bits) does perfectly
Does not sound convincing to me...
If you need to read 1024 bytes, why execute a read statement 1024 times... why not read all 1024 bytes in one go?
 
LVL 1

Assisted Solution

by:meff
meff earned 100 total points
ID: 9749873
Actually for Unix/Linux I think it's easy -- the 'split' command may help.
It takes the size of the pieces.
I don't know exactly its behaviour when the resulting number of files is more than the 26*26 that the default two-letter suffixes allow ('man' only shows examples up to that); you may test this (it's even interesting).

For Windows I don't know -- it's harder. That's one thing Unix/Linux is loved for! There is a big likelihood that you'll find some tiny utility once written by some good guy that will do the whole job for you (maybe you'll need a second one and a script ;-).
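With GNU split the invocation would look something like this (the input name "hugefile" and the prefix "chunk." are made up):

split -b 64k -a 5 hugefile chunk.

-b sets the piece size; -a 5 widens the suffix to five letters, lifting the default 26*26 two-letter limit well past the 576,000-odd pieces discussed above.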
 
LVL 5

Assisted Solution

by:g0rath
g0rath earned 100 total points
ID: 9750346
If you're on a Unix system, splitting into 576,000 files, and then doing it multiple times, is not desirable due to inode issues.

So another way would be to create a data structure to read your chunk and then just use pointer math to access it, without having to create many files.

// 64K
#define CHUNK_SIZE 65536

typedef struct {
    unsigned char data[CHUNK_SIZE];
} CHUNK;

void load_chunks( )
{
     int nChunks=0;

     nChunks = filesize / CHUNK_SIZE;

     // whatever
}

CHUNK *getChunk( int n )
{
     CHUNK *c = malloc( sizeof(CHUNK) );
     // cast to off_t so n * CHUNK_SIZE can't overflow an int on a 36GB file
     lseek( file_fd, (off_t)n * CHUNK_SIZE, SEEK_SET );
     read( file_fd, c->data, CHUNK_SIZE );
     return c;
}

Give me chunk 5700:

myChunk = getChunk(5700);

Do whatever you want with that chunk, reorder chunks and write them back in a different order... the sky's the limit at this point.

Pseudo-C code, so don't try to compile it as-is; unexpected things may occur... :)
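A minimal compilable sketch of how those pieces might be wired together -- the filename, the shared file_fd, and the error handling are all assumptions:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK_SIZE 65536

static int file_fd;   /* shared with getChunk() above */

int main(void)
{
    struct stat st;

    /* compile with -D_FILE_OFFSET_BITS=64 on 32-bit Linux,
       or offsets beyond 2GB will not seek correctly */
    file_fd = open("bigfile.dat", O_RDONLY);
    if (file_fd < 0 || fstat(file_fd, &st) < 0) {
        perror("bigfile.dat");
        return 1;
    }
    printf("%lld chunks of %d bytes\n",
           (long long)(st.st_size / CHUNK_SIZE), CHUNK_SIZE);
    return 0;
}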
 

Expert Comment

by:cgross
ID: 9751235
Don't know if you want to do this programmatically or through a user interface. If GUI, the Power Tools program chop.exe will do what you need.

http://www.powertoolsforum.com/modules/mylinks/singlelink.php?cid=18&lid=281
 

Author Comment

by:pilley
ID: 9760036
g0rath,

I think you are on to something there.

Given that there is going to be more than one file, would it be more efficient to read, say, all the files at once -- i.e. read the first 64KB of each file and output it on the fly?

If so, how would I go about doing this?
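A minimal sketch of what reading the same chunk from each file in lockstep might look like, assuming the four files named 1..4 and the 64KB chunks from the earlier example (everything else is made up):

#include <stdio.h>

#define NFILES 4
#define CHUNK  (64 * 1024)

int main(void)
{
    const char *names[NFILES] = { "1", "2", "3", "4" };  /* assumed names */
    FILE *in[NFILES];
    static char buf[CHUNK];
    size_t n;
    long chunk_no = 0;
    int i, done = 0;

    for (i = 0; i < NFILES; i++)
        if ((in[i] = fopen(names[i], "rb")) == NULL) {
            perror(names[i]);
            return 1;
        }

    while (!done) {
        for (i = 0; i < NFILES; i++) {
            n = fread(buf, 1, CHUNK, in[i]);
            if (n == 0) { done = 1; break; }
            /* chunk chunk_no of file i is now in buf[0..n-1]:
               write it out, compare it, whatever part two needs */
        }
        chunk_no++;
    }

    for (i = 0; i < NFILES; i++)
        fclose(in[i]);
    return 0;
}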
 
LVL 5

Expert Comment

by:g0rath
ID: 9763449
What is the final goal? To read 64KB blocks of data from various sources for human comparison -- as in output to the screen or to another file?

Or are you going to apply some program rules to them, and have anything that doesn't fit the rules marked for human attention?

It would be interesting to know how many files you need to read at once, as you say....

If you have to look at 100 files in 64KB chunks at a time, you would need only 6.25 megs of physical memory, or 62.5 megs for 1000 files in 64KB chunks.

But 1000 files of 4GB each would be around 3.9 terabytes.... whether you would have that many files I don't know....
but with a small memory footprint you could traverse these large files easily.....

Is this all binary data, as in packet traces being searched for specific TCP sequences? That may impact how you want this to work.
 
LVL 9

Expert Comment

by:Dang123
ID: 9763533
Just bumped into this, thought it may help you

http://www.freevbcode.com/ShowCode.Asp?ID=449

 
LVL 45

Expert Comment

by:sunnycoder
ID: 9778808
>The task set above is to take a file and split it into user-definable chunks (4KB to 512KB range). (This means that there will be
>a lot of files once the split is complete -- 576,000 if the chunk size is 64KB.)

Is it all right if you split out only the 4-512KB range and leave the remaining portion untouched? ...
If yes, then would you like to have the original files unaltered, or the split portions removed?
 
LVL 4

Assisted Solution

by:void_main
void_main earned 100 total points
ID: 9831933
Hey, pilley! Try this:

// partly pseudo-code

#include <io.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <iostream.h>
#include <stdlib.h>
#include <stdio.h>
#include <conio.h>      // for getch()
#include <mem.h>

typedef unsigned long int ulong;

bool nextFile(int handle, char *newName, ulong fileSize)
{
       int destHandle;
       int remaining = fileSize;
       int newRead;
       int lastRead;

       destHandle = open(newName, O_CREAT | O_BINARY | O_WRONLY | O_TRUNC, S_IREAD | S_IWRITE | S_IFREG);
       if (destHandle < 1)
          return false;

       char buffer[1024];
       while (remaining > 0)
       {
               newRead = remaining;
               if (newRead > 1024)
                     newRead = 1024;

               // You can add an error routine here...
               read(handle, buffer, newRead);
               write(destHandle, buffer, newRead);

               remaining -= newRead;
       }
       close(destHandle);
       return true;
}

void main()         // no one writes VOID here (that's why I call myself void_main)
{
       char openName[256];     // arrays, not uninitialized char* pointers
       char destName[256];
       char *ext = ".dat";       // whatever you want
       ulong fileNum = 0;      // You want a lot of files. Here I count where we are...
       ulong openSize;
       ulong chunkSize;
       int openHandle;

       cout << "Enter the name of the HUGE file: ";
       cin >> openName;

       openHandle = open(openName, O_BINARY | O_RDONLY);
       if (openHandle < 1)
       {
          cout << "Openfile does not exist!";
          return;
       }

       cout << "What filesize do you want? In Kbytes: ";
       cin >> chunkSize;

       chunkSize *= 1024;

       lseek(openHandle, 0, SEEK_END);        // "lseek", not "seek" -- this finds the file size
       openSize = tell(openHandle);
       lseek(openHandle, 0, SEEK_SET);         // Jump to filestart

       while (tell(openHandle) < openSize)
       {
            sprintf(destName, "%lu%s", fileNum, ext);
            if (!nextFile(openHandle, destName, chunkSize))
            {
                 cout << "Error while creating file " << destName << " !!!!!!!";
                 getch();
                 abort();
            }
       }

       cout << "That's it";
}

// End of code

greetings from
void_main
 
LVL 4

Expert Comment

by:void_main
ID: 9831947
ooooops!!! Please modify the code!!! I FORGOT SOMETHING!!!

change this:
//--------------------------------------------------------------------------------------------
       while (tell(openHandle) < openSize)
       {
            sprintf(destName, "%lu%s", fileNum, ext);
            if (!nextFile(openHandle, destName, chunkSize))
            {
                 cout << "Error while creating file " << destName << " !!!!!!!";
                 getch();
                 abort();
            }
       }
//--------------------------------------------------------------------------------------------

to this:
//--------------------------------------------------------------------------------------------
       while (tell(openHandle) < openSize)
       {
            sprintf(destName, "%lu%s", fileNum, ext);
            if (!nextFile(openHandle, destName, chunkSize))
            {
                 cout << "Error while creating file " << destName << " !!!!!!!";
                 getch();
                 abort();
            }

            fileNum++;      // otherwise every piece gets the same name

       }
       close(openHandle);
//--------------------------------------------------------------------------------------------
(this happens when I am sleepy)
 

Author Comment

by:pilley
ID: 9833326
I would hate to see what would happen when you are not sleepy, void.

I am now testing it, but I have a feeling we have a winner.

 
LVL 45

Expert Comment

by:sunnycoder
ID: 9836967
Pilley,
>I am now testing it but I have a feeling we have a winner
Isn't this solution similar (just that mine is C and this one includes C++) to the solution I posted in my first comment, which you conveniently chose to ignore?

int i = 0;
size_t n;
char fn[200];
static char buffer[512 * 1024];
size_t user_defined_size = 64 * 1024;
FILE *infile;
FILE *outfile;

infile = fopen ( "input_file", "rb" );

while ( ( n = fread ( buffer, 1, user_defined_size, infile ) ) != 0 )
{
     sprintf ( fn, "outfile.%d", i );
     outfile = fopen ( fn, "wb" );
     fwrite ( buffer, 1, n, outfile );
     fclose ( outfile );
     i++;
}

fclose ( infile );
 
LVL 4

Expert Comment

by:void_main
ID: 9848668
@sunnycoder: It is similar, but you only have 999 files available (or FFFh if you use hex).
(Maybe I am mistaken, because I don't know if Windows supports suffixes with more than 3 chars.)
Concerning C++: just replace cout << "bla"; with printf("bla"); (I commonly use C++ at home and I don't know exactly how to do the scanf.)

@pilley: You can modify the code once more (but that's only to see the progress) to, for example, write out a dot for every file, but this fills the screen immediately.
 
LVL 4

Expert Comment

by:void_main
ID: 9848684
Here I am again!
Please modify this:

//----------------------------------------------------------------------------
bool nextFile(int handle, char *newName, ulong fileSize)
{
       int destHandle;
       int remaining = fileSize;
       int newRead;
       int lastRead;
//----------------------------------------------------------------------------
to this
//----------------------------------------------------------------------------
bool nextFile(int handle, char *newName, ulong fileSize)
{
       int destHandle;
       ulong remaining = fileSize;
       ulong newRead;
       ulong lastRead;
//----------------------------------------------------------------------------


void_main
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9848736
>(Maybe I am mistaken, because I don't know if Windows supports suffixes with more than 3 chars.)
I just created a file named asd.asdsdssdsdsd and it works fine ... Even if the suffix did limit the number of files, modifying
sprintf ( fn, "outfile.%d", i );
to
sprintf ( fn, "outfile%d.xtn", i );
is trivial
 
LVL 4

Expert Comment

by:void_main
ID: 9865545
That is correct!
And I do not claim that your code doesn't work, but I think it is not as clear as mine, because you kept it short. And you have to recompile if the chunk size changes or you want to change the filename. (That's not sooo bad, but I wouldn't like it.)
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9865594
>I think it is not as clear as mine, because you kept it short
I would say that it conveys the idea more cleanly and is more efficient ... efficiency counts when you need to execute the same thing over and over again.

>And you have to recompile, if the chunk size changes or you want to change the filename (that's not sooo bad, but I
>wouldn't like it)
No, you can always read it as a command line argument or user input ... The question is how to split .... These issues are trivial and secondary ...

Anyway, I had nothing against you or your method ... I am looking forward to a reply from pilley
 
LVL 4

Expert Comment

by:void_main
ID: 9867068
@sunnycoder

>I had nothing against you or your method
I believe you. I never assumed you did...

>These issues are trivial and secondary
Yes, you're right. But not quite unimportant.

I hope the problem is solved soon! (I am waiting for the 2nd part!)

greets to everyone
 

Author Comment

by:pilley
ID: 9869088
Well,

that's quite a dialogue going on there. Originally all I wanted was some ideas or a direction to help me solve this problem, but then came void and sunnycoder. I hope you two have resolved your differences.

Now as to whose code is better, the main points here are speed, reliability and ease of putting into practice.

At this point I am still testing BOTH sets of code (I was playing with your original code, sunnycoder, at first). Give me some time and I will post the results, as there is various testing and remodding I have to do. Once this is completed I will then accept the answer/s I think is best.

And don't worry, the 2nd part is coming.
 
LVL 3

Expert Comment

by:RJSoft
ID: 9870096
void probably has the better code, because his adjusts the read:

{
               newRead = remaining;
               if (newRead > 1024)
                     newRead = 1024;

But I would also make the chunk size a parameter instead of a constant. User adjustable, but constrained to the available sizes.

In my previous post I thought I had a reason for reading in the smallest addressable unit. But I guess not. Could have sworn there was some reason... "oh well, forgotten".

So when you're done with your split files you can simply concatenate them back: write the first section in append mode and then add your other files onto it.
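A minimal sketch of that re-assembly, assuming the pieces are named outfile.0, outfile.1, ... as in sunnycoder's splitter (everything else here is made up):

#include <stdio.h>

int main(void)
{
    FILE *out = fopen("rejoined.dat", "wb");
    FILE *in;
    char fn[200], buf[8192];
    size_t n;
    int i = 0;

    if (out == NULL)
        return 1;
    for (;;) {
        sprintf(fn, "outfile.%d", i++);
        if ((in = fopen(fn, "rb")) == NULL)
            break;                      /* no more pieces */
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);     /* append this piece to the output */
        fclose(in);
    }
    fclose(out);
    return 0;
}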

I guess for that matter you could simply decide how many times you want to divide the original file, create a string on the heap, and then read into and write from it.

long BeginSize = ftell(...);

long Size = BeginSize / 4; // 4 files

//calculate some for possible remainder

long Rem = (BeginSize % 4) + 4; //(+4 for some slack)

Size += (Rem / 4);

char *St = new char[Size];

//fread returns the amount read, or a lesser count
long AmountRead = 0;

fopen( //one file for read and 4 to write
fopen( ...etc...

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut1);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut2);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut3);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut4);

fclose(..
fclose(.. etc....

delete [] St;

RJ
 
LVL 4

Expert Comment

by:void_main
ID: 9873103
@RJSoft
//----------------------------------------------------------------
AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut1);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut2);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut3);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut4);

fclose(..
fclose(.. etc....
//----------------------------------------------------------------
you should use a loop for this to minimize your typing work!
By the way: is it safe to allocate one quarter of 36 gigs?
But the whole idea is good!!!

Concerning the   remaining > 1024:
of course you can use a variable here, but here are some reasons why I used 1024:
- it is a power of 2
- it is a good divisor of 4KB and the other desired sizes
- I like this number   =;-)
- I was looking for a size that reads neither too few bytes nor too many... this number does it

greetings
 
LVL 3

Expert Comment

by:RJSoft
ID: 9873274
void_main,

The above post lacked a loop because that was easier to post here. You get the idea.

I am wondering how you came up with 1/4 of 36 gig (9 gig).

Are you saying this size limit comes from "new" creating a memory buffer, or from the fread/fwrite (or any other file read/write API), or both?

I would like to see some official documentation.

I have a routine that over-writes a file and then deletes it. It is for security purposes, so that no data recovery team could access the file.

In my routine I also do something similar to what you have done with the 1024, but with if statements.

With an unknown file size I do this:

I write the char ch='z', WriteSize bytes at a time, with fwrite. As the remainder gets smaller I pick the most efficient amount, to increase the speed of writing. The if statements are fast. The last resort is the smallest addressable unit, the (8-bit) char.

Also, I can increase my file reading speed with the reverse logic. But first I need the file size, so I use findfirst and the struct that gives me the file size.

This is where I now question what is needed for reading into a string, as I would like to know just how far I can push the speed barrier. Obviously the faster I read and write, the faster my app runs. So I would like to know what the boundaries are and where you got the info from. In short, I need to know how much fread and fwrite can handle, and possibly "new", or how much a char* can address. I don't want to go beyond the scope of a variable's addressing capability and wind up with junk addressing.

Here is my fwrite routine. It is fairly fast.

struct _finddata_t ptffblk;
long fhandle = _findfirst(FileName,&ptffblk); //getting first file or dir
if(fhandle==-1)
{
MessageBox("Failed to find this directory, check spelling!",FileName,MB_OK);
return;
}
FileSize=ptffblk.size;
_findclose(fhandle);

fptr=fopen(FileName,"wb");
if(fptr==NULL)
{
MessageBox("Error opening file to write","Error");
return;
}//endif

char buf[10000];
memset(buf,'z',sizeof(buf)); //a whole buffer of 'z' -- writing WriteSize bytes from
                             //the address of a single char would read past it

int WriteSize;

while(FileSize > 0)
{

WriteSize =1; //default, also covers FileSize of exactly 5
if(FileSize > 5)   WriteSize =5;
if(FileSize > 10)  WriteSize =10;
if(FileSize > 50)  WriteSize =50;
if(FileSize > 100) WriteSize =100;
if(FileSize > 500) WriteSize =500;
if(FileSize > 1000)WriteSize =1000;
if(FileSize > 5000)WriteSize =5000;
if(FileSize > 10000)WriteSize=10000;

fwrite(buf,sizeof(char),WriteSize,fptr);

FileSize-=WriteSize;

}//endwhile


RJ
 
LVL 4

Expert Comment

by:void_main
ID: 9880591
>Are you saying this size limit comes from "new" creating a memory buffer, or from the fread/fwrite (or any other file read/write API), or both?
- I meant the memory buffer.

You are dividing the "very large file" by 4 and doing a new with this quarter of possibly 36 gigabytes (which is an extreme amount of memory).
// ---------------------------------------------------------------------
long BeginSize = ftell(...);
long Size = BeginSize / 4; // 4 files
//calculate some for possible remainder
long Rem = (BeginSize % 4) + 4; //(+4 for some slack)
Size += (Rem / 4);
char *St = new char[Size];                                             // Here you allocate possibly 9 gigabytes
// ---------------------------------------------------------------------


>In short, I need to know how much fread and fwrite can handle, and possibly "new", or how much a char* can address.
Theoretically you can allocate 2^32 bytes (which is 4 gigabytes) with a 32-bit pointer, but I don't know if Windows will let you, or even the compiler! I have neither official documentation nor information from somewhere else; this is what I figured out for myself... To test the speed with different chunk sizes you can use the micro-timer of the PC while copying a file. The speed will increase between 1KB and somewhere around 1MB-10MB chunks. If the chunk grows even larger you may not notice the speed increasing any further. (It could even slow down.)

And:
Your routine is fast! I am sure it is!
Personally I would never read more than 64KB at once.

regards
 
LVL 3

Expert Comment

by:RJSoft
ID: 9887341
void,

In a way this is important to me, for the performance of some of my applications.

I guess I will start plugging in some larger numbers for my if statements and check the processing speed.

I will have to consider that the amount of free memory will be different on different machines, so I will need a safe, smaller amount. Maybe 64KB as you suggest. Although I need to push the speed boundary because of the design of a few of my apps.

BTW, how do you check your functions' processing speed? I found this link but was wondering your opinion.

http://www.experts-exchange.com/Programming/Programming_Languages/Cplusplus/Q_20788205.html#9684683

RJ
 
LVL 4

Expert Comment

by:void_main
ID: 9895207
@RJSoft

In Borland C++ Builder you could do this:

    unsigned short int Year, Month, Day, Hour, Min, Sec, MSec;

    TDateTime dtPresent = Now();
    DecodeDate(dtPresent, Year, Month, Day);
    DecodeTime(dtPresent, Hour, Min, Sec, MSec);

//---------
But you have to include the correct header files.

MSec is the milliseconds. I don't know how to get the microseconds in Win32...
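For what it's worth, Win32 does expose a high-resolution counter; a minimal sketch using QueryPerformanceCounter (the only assumption is that the hardware provides one, which the calls' return values would tell you):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&t0);

    /* ... the code being timed goes here ... */

    QueryPerformanceCounter(&t1);
    printf("%.1f microseconds\n",
           (t1.QuadPart - t0.QuadPart) * 1000000.0 / (double)freq.QuadPart);
    return 0;
}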
 
LVL 4

Expert Comment

by:void_main
ID: 10763225
I'm waiting for "concatenating a lot of very small files to one large file".....

@Venabili: your suggestion is okay!
