pilley

asked on

Splitting a very large file

I hope whoever reads this is up to the challenge; I am currently trying to solve this one, but the answer keeps eluding me.

I have a file whose size can be anywhere from 4 GB to 36 GB, possibly more, but the size is not what is important.

I would like to take that file and chop it up into user-definable chunks; the sizes would range from 4 KB to 512 KB (I know, very small). Any ideas how to do this?

Once this part is out of the way, I will post the second part.
Also, the better the answer, the more points I will award.
ASKER CERTIFIED SOLUTION
sunnycoder (solution text available to Experts Exchange members only)
Dang123

pilley,
    Would you want to output all the pieces at once, or be able to ask the program for a specific piece? (The small files would take MUCH more disk space than the large one, depending on your OS and how your disk is partitioned, since each small file occupies at least one allocation unit.)

Dang123

pilley (ASKER)

Dang123,

It would be nice to output all the files at once, because the next part includes asking the program for a specific piece from many of these files.

The platform can be Linux or Windows, so basically any programming language will be supported.
It also depends on the data and the application... if this is just raw data samples for a stats program, it's trivial... are you attempting to create a manageable index into this large data set?

But if this is, let's say, an MS SQL table in MS format, then you need a different approach.

I think we need a little more information on the type of data to get a clearer answer.
pilley (ASKER)

Okay G0rath,

but this is not going to involve a database *yet*. Here is an example scenario; don't take it word for word, just use it as a guide.

Say I have four files: 1, 2, 3, 4; each is 36 GB in size.
The data I want lies in the 4 KB to 512 KB range.
The files by themselves are not, for the purpose of this example, human readable.
Each chunk (4 KB to 512 KB range) contains data that relates to the same chunk in the other files,

i.e. a 4 KB chunk of file 1 relates to the corresponding 4 KB chunk of file 2, which also relates to files 3 and 4.

The task set above is to take a file and split it into user-definable chunks (4 KB to 512 KB range). (This means that there will be a lot of files once the split is complete -- 576,000 if the chunk size is 64 KB.) The second part will be putting the chunks back into the order they should be in, but that task can't be set unless the first task is completed.

Hope this answers your question.

Pilley
Pilley ... did you try my suggestion? Any feedback?
SOLUTION (text available to members only)
>The key is to read your files in the smallest addressable unit. A char (8 bit) does perfect
Does not sound convincing to me...
If you need to read in 1024 bytes, why execute a read statement 1024 times... why not read all 1024 bytes in one go?
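
To make the point concrete, here is a minimal C sketch comparing one block read against byte-by-byte reads; the file name and the 1024-byte buffer are only example values, not anything taken from the hidden solution.

//--------------------------------------------------------------------------------------------
#include <stdio.h>

int main(void)
{
    char buf[1024];
    size_t got;
    FILE *f = fopen("input_file", "rb");    /* example file name */

    if (f == NULL)
        return 1;

    /* one call moves the whole block */
    got = fread(buf, 1, sizeof buf, f);

    /* the byte-by-byte alternative would be 1024 calls like this,
       which only adds call overhead:
       for (i = 0; i < 1024; i++) fread(&buf[i], 1, 1, f);          */

    printf("read %lu bytes in one call\n", (unsigned long)got);
    fclose(f);
    return 0;
}
//--------------------------------------------------------------------------------------------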
SOLUTION (text available to members only)

SOLUTION (text available to members only)
I do not know if you want to do this programmatically or through a user interface. If GUI, the Power Tools program chop.exe will do what you need.

http://www.powertoolsforum.com/modules/mylinks/singlelink.php?cid=18&lid=281
pilley (ASKER)

G0rath,

I think you are on to something there.

Given that there is going to be more than one file, would it be more efficient to read, say, all the files at once, i.e. read the first 64 KB of each file and output it on the fly?

If so, how would I go about doing this?
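
For illustration only, here is a minimal C sketch of reading the same-sized chunk from several files in lockstep; the file names ("1" through "4"), the 64 KB chunk size and the fixed file count are assumptions taken from the example scenario, and the per-chunk processing is left as a comment.

//--------------------------------------------------------------------------------------------
#include <stdio.h>

#define NFILES 4
#define CHUNK  (64 * 1024)

int main(void)
{
    const char *names[NFILES] = { "1", "2", "3", "4" };   /* example file names */
    FILE *in[NFILES];
    static char buf[CHUNK];
    size_t n;
    int i, active = NFILES;

    for (i = 0; i < NFILES; i++)
        in[i] = fopen(names[i], "rb");       /* missing files are skipped below */

    while (active > 0) {
        active = 0;
        for (i = 0; i < NFILES; i++) {
            if (in[i] == NULL)
                continue;
            n = fread(buf, 1, CHUNK, in[i]); /* next 64 KB chunk of file i */
            if (n == 0)
                continue;                    /* this file is exhausted */
            active++;
            /* process or write out this chunk of file i here, on the fly */
        }
    }

    for (i = 0; i < NFILES; i++)
        if (in[i] != NULL)
            fclose(in[i]);
    return 0;
}
//--------------------------------------------------------------------------------------------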
What is the final goal? To read 64 KB blocks of data from various sources for human comparison, as in output to the screen or to another file?

Or are you going to apply some program rules to them, and have whatever doesn't fit the rules marked for human assistance?

It would be interesting to know how many files you need to read at once, as you say...

If you have to look at 100 files in 64 KB chunks at a time, you would need only 6.25 MB of physical memory, or 62.5 MB for 1000 files of 64 KB chunks.

But 1000 files of 4 GB would be around 3.9 terabytes... whether you would have that many files I don't know...
but with a smaller memory footprint you could traverse these large files easily...

Is this all binary data, as in packet traces where you look for specific TCP sequences? That may impact how you want this to work.
Just bumped into this, thought it may help you

http://www.freevbcode.com/ShowCode.Asp?ID=449

>The task set above is to take a file and split it into user-definable chunks (4 KB to 512 KB range). (This means that there will be
>a lot of files once the split is complete -- 576,000 if the chunk size is 64 KB.)

Is it all right if you split the files only in the 4-512 KB range and leave the remaining portion untouched? ...
If yes, then would you like to have the original files unaltered, or the split portions removed?
SOLUTION (text available to members only)
ooooops!!! Please modify the code!!! I FORGOT SOMETHING!!!

change this:
//--------------------------------------------------------------------------------------------
       while (tell(openHandle) < openSize)
       {
            sprintf(destName, "%i%s\0", fileNum, ext);
            if (!nextFile(openHandle, destName, chunkSize))            
            {
                 cout << "Error while creating file " << destName << " !!!!!!!";
                 getch();
                 abort();
            }
       }
//--------------------------------------------------------------------------------------------

to this:
//--------------------------------------------------------------------------------------------
       while (tell(openHandle) < openSize)
       {
            sprintf(destName, "%i%s\0", fileNum, ext);
            if (!nextFile(openHandle, destName, chunkSize))            
            {
                 cout << "Error while creating file " << destName << " !!!!!!!";
                 getch();
                 abort();
            }

            fileNum++;                         // advance to the next output file name

       }
       close(openHandle);                      // release the source file once all chunks are written
//--------------------------------------------------------------------------------------------
(this happens when I am sleepy)
Avatar of pilley

ASKER

I would hate to see what would happen when you are not sleepy, void.

I am now testing it, but I have a feeling we have a winner.

Pilley,
>I am now testing it but I have a feeling we have a winner
Isn't this solution similar (just that I used C and this one includes C++) to the solution I posted in my first post, which you conveniently chose to ignore?

int i = 0;
size_t n;
char fn[200];
char *buffer = malloc ( user_defined_size );   /* needs <stdlib.h>; user_defined_size as before */
FILE *infile;
FILE *outfile;

infile = fopen ( "input_file", "rb" );         /* binary mode -- the data is not text */

while ( ( n = fread ( buffer, 1, user_defined_size, infile ) ) != 0 )
{
     sprintf ( fn, "outfile.%d", i );
     outfile = fopen ( fn, "wb" );
     fwrite ( buffer, 1, n, outfile );         /* write only the bytes actually read */
     fclose ( outfile );
     i++;
}

fclose ( infile );
free ( buffer );
@sunnycoder: It is similar, but you only have 999 files available (or FFFh if you use hex).
(Maybe I am missing something, because I don't know if Windows supports suffixes with more than 3 chars.)
Concerning C++: just replace the cout << "bla"; with printf("bla"); (I commonly use C++ at home and I don't know exactly how to do the scanf.)

@pilley: You can modify the code once more (but that's only to see the progress) if you, for example, write out a dot for every file; but this fills the screen immediately.
Here I am again!
Please modify this:

//----------------------------------------------------------------------------
bool nextFile(int handle, char *newName, ulong fileSize)
{
       int destHandle;
       int remaining = fileSize;
       int newRead;
       int lastRead;
//----------------------------------------------------------------------------
to this
//----------------------------------------------------------------------------
bool nextFile(int handle, char *newName, ulong fileSize)
{
       int destHandle;
       ulong remaining = fileSize;       // counters now use the same unsigned type as fileSize
       ulong newRead;
       ulong lastRead;
//----------------------------------------------------------------------------


void_main
>(maybe I am missing, because I don't know if Windows supports suffixes with more than 3 chars)
I just created one with the name asd.asdsdssdsdsd and it works fine ... Even if it did create a limitation on the number of files, modifying
sprintf ( fn, "outfile.%d", i );
to
sprintf ( fn, "outfile%d.xtn", i );
is trivial
that is correct!
And I do not claim that your code doesn't work, but I think it is not as clear as mine, because you kept it short. And you have to recompile if the chunk size changes or you want to change the filename (that's not sooo bad, but I wouldn't like it).
>I think it is not as clear as mine, because you kept it short
I would say that it conveys the idea more cleanly and is more efficient ... efficiency counts when you need to execute the same thing over and over again.

>And you have to recompile, if the chunk size changes or you want to change the filename (that's not sooo bad, but I
> wouldn't like it)
No, you can always read it as a command line argument or user input ... The question is how to split ... These issues are trivial and secondary ...

Anyway, I had nothing against you or your method ... I am looking forward to a reply by pilley
@sunnycoder

>I had nothing against you or your method
I believe you. I have not assumed you would...

>These issues are trivial and secondary
Yes, you're right. But they are not entirely unimportant.

I hope the problem is solved soon! (I am waiting for the 2nd part!)

greets to everyone
pilley (ASKER)

Well,

That's quite a dialogue going on there. Originally all I wanted was some ideas or a direction to help me solve this problem, but then came void and sunnycoder. I hope you two have resolved your differences.

Now as to whose code is better, the main points here are speed, reliability and ease of putting into practice.

At this point I am still testing BOTH sets of code (I was playing with your original code, sunnycoder, at first). Give me some time and I will post the results, as there is various testing and remodding I have to do. Once this is completed I will accept the answer/s I think best.

And don't worry, the 2nd part is coming.
void probably has the better code, because his adjusts the read:

{
               newRead = remaining;
               if (newRead > 1024)
                     newRead = 1024;

But I would also make the chunk size a parameter instead of using a constant. User adjustable, but constrained to the available sizes.

In my previous post I thought I had a reason for reading in the smallest addressable unit, but I guess not. Could have sworn there was some reason... oh well, forgotten.

So when you're done with your split files, you can just concatenate them: open the output in append mode for the first section and then add your other files back.
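
As a sketch of that reassembly (not anyone's posted solution): append the numbered pieces back onto one file. The "outfile.%d" naming and the join_pieces helper name are assumptions carried over from the split snippet earlier in the thread.

//----------------------------------------------------------------
#include <stdio.h>

int join_pieces(const char *dest_name, int piece_count)
{
    char fn[200];
    static char buf[64 * 1024];
    size_t n;
    int i;
    FILE *dest = fopen(dest_name, "wb");

    if (dest == NULL)
        return 0;

    for (i = 0; i < piece_count; i++) {
        sprintf(fn, "outfile.%d", i);        /* same naming scheme as the split */
        FILE *piece = fopen(fn, "rb");
        if (piece == NULL)
            break;                           /* stop at the first missing piece */
        while ((n = fread(buf, 1, sizeof buf, piece)) > 0)
            fwrite(buf, 1, n, dest);         /* copy the piece verbatim */
        fclose(piece);
    }

    fclose(dest);
    return 1;
}
//----------------------------------------------------------------

For the 64 KB example above it would be called as join_pieces("restored.bin", piece_count) with whatever count the split produced; both names are hypothetical.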

I guess for that matter you could simply decide how many times you want to divide the original file, create a string on the heap, and then read into and write from it.

long BeginSize = ftell(...);

long Size = BeginSize / 4; // (4 files)

//calculate some for possible remainder

long Rem = (BeginSize % 4) + 4; //(+1 for some slack)

Size += (Rem/4);

char *St = new char[Size];

//fread returns the amount read or a lesser count
long AmountRead = 0;

fopen(//one file for read and 4 to write
fopen(...etc...

AmountRead = fread(St, 1, Size, fptrIn);   //element size 1, so the return value is a byte count
fwrite(St, 1, AmountRead, fptrOut1);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut2);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut3);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut4);

fclose(..
fclose(.. etc....

delete [] St;

RJ
@RJsoft
//----------------------------------------------------------------
AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut1);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut2);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut3);

AmountRead = fread(St, 1, Size, fptrIn);
fwrite(St, 1, AmountRead, fptrOut4);

fclose(..
fclose(.. etc....
//----------------------------------------------------------------
you should use a loop for this to minimize your typing work (a quick sketch follows below)!
By the way: is it safe to allocate one quarter of 36 gigs?
But the whole idea is good!!!
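
For example, the four pairs could be driven by a loop something like this; a sketch that reuses St, Size, fptrIn and the four output FILE pointers from the quoted snippet, and gathering the outputs into an array is my own assumption:

//----------------------------------------------------------------
FILE *out[4] = { fptrOut1, fptrOut2, fptrOut3, fptrOut4 };
long AmountRead;
int i;

for (i = 0; i < 4; i++) {
    AmountRead = (long)fread(St, 1, Size, fptrIn);   /* bytes actually read */
    if (AmountRead > 0)
        fwrite(St, 1, AmountRead, out[i]);           /* copy this quarter to its own file */
}
//----------------------------------------------------------------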

Concerning the "remaining > 1024" check:
of course you can use a variable here, but here are some reasons why I used 1024:
- it is a power of 2
- it divides 4 KB and the other desired sizes evenly
- I like this number   =;-)
- I was looking for a balance, reading neither too few nor too many bytes at a time... this number does it

greetings
void_main

The lack of a loop in the above post was just to keep it easy to post here. You get the idea.

I am wondering how you came up with 1/4 of 36 gig (9 gig).

Are you saying this size limit comes from the "new" creating a memory buffer, or from fread/fwrite (or any other file read/write API), or both?

I would like to see some official documentation.

I have a routine that overwrites a file and then deletes it. It is for security purposes, so that no data recovery team could access the data.

In my routine I also do something similar to what you have done with the 1024, but with if statements.

With an unknown file size I do this:

I write one char, ch = 'z', WriteSize times with fwrite. So as the remainder gets smaller I pick the most efficient amount, to increase the speed of writing. The if statements are fast. The last case is the smallest addressable unit, a single (8-bit) char.

Also I can speed up my file reading with the reverse logic. But first I need the file size, so I use _findfirst and the struct, which gives me the file size.

This is where I now question what is needed for reading into a string, as I would like to know just how far I can push the speed barrier. Obviously, the faster I read and write, the faster my app runs. So I would like to know what the boundaries are and where you got the info from. In short, I need to know how much fread and fwrite can handle, and possibly "new", or how much a char* can address. I don't want to go beyond the scope of a variable's addressing capability and wind up with junk addressing.

Here is my fwrite routine. It is fairly fast.

struct _finddata_t ptffblk;
long fhandle = _findfirst(FileName,&ptffblk); //getting first file or dir
if(fhandle==-1)
{
MessageBox("Failed to find this directory, check spelling!",FileName,MB_OK);
return;
}
long FileSize = ptffblk.size;
_findclose(fhandle);

FILE *fptr = fopen(FileName,"r+b"); //open for update so the existing bytes are overwritten in place
if(fptr==NULL)
{
MessageBox("Error opening file to write","Error");
return;
}//endif

char FillBuf[10000];
memset(FillBuf,'z',sizeof(FillBuf)); //fwrite needs WriteSize source bytes, so fill a buffer with 'z'

int WriteSize=0;

while(FileSize > 0)
{

if(FileSize <= 5)  WriteSize =1;
if(FileSize > 5)   WriteSize =5;
if(FileSize > 10)  WriteSize =10;
if(FileSize > 50)  WriteSize =50;
if(FileSize > 100) WriteSize =100;
if(FileSize > 500) WriteSize =500;
if(FileSize > 1000)WriteSize =1000;
if(FileSize > 5000)WriteSize =5000;
if(FileSize > 10000)WriteSize=10000;

fwrite(FillBuf,sizeof(char),WriteSize,fptr);

FileSize-=WriteSize;

}//endwhile

fclose(fptr); //flush the overwritten data to disk


RJ
>Are you saying this size is due from the "new" creating a memory buffer or is it the fread/fwrite (or any other file read write api) or both?
- I mentioned the memory buffer.

You are dividing the "very large file" by 4 and doing a new with this quarter of a possibly 36-gigabyte file (which is an extreme amount of memory).
// ---------------------------------------------------------------------
long BeginSize = ftell(...);
long Size = BeginSize / 4; // (4 files)
//calculate some for possible remainder
long Rem = (BeginSize % 4) + 4; //(+1 for some slack)
Size += (Rem/4);
char *St = new char[Size];                                             // Here you allocate possibly 9 gigabytes
// ---------------------------------------------------------------------


>In short I need to know how much fread and fwrite can handle and possibly "new" or how much addressing can a char* address.
Theoretically a char* can address 2^32 bytes (which is 4 gigabytes), but I don't know if Windows or even the compiler will let you allocate that much! (A 32-bit Windows process normally only gets about 2 GB of user address space anyway.) I have no official documentation for this; it is what I have figured out for myself... To test the speed with different chunk sizes you can use the micro-timer of the PC while copying a file. The speed will increase between 1 KB and somewhere around 1 MB to 10 MB. If the chunk grows even larger you may not notice the speed increasing any further. (It could even slow down.)
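
One rough way to run that comparison, as a sketch: copy a file with a chosen buffer size and time it with the standard clock(). The file names and the 64 KB starting size are examples only, and clock() is coarse, so the test file should be big enough to take a few seconds.

//--------------------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t chunk = 64 * 1024;                   /* change and re-run to compare chunk sizes */
    char *buf = malloc(chunk);
    FILE *in  = fopen("testdata.bin", "rb");    /* example input file */
    FILE *out = fopen("copy.bin", "wb");        /* example output file */
    size_t n;
    clock_t t0, t1;

    if (!buf || !in || !out)
        return 1;

    t0 = clock();
    while ((n = fread(buf, 1, chunk, in)) > 0)
        fwrite(buf, 1, n, out);
    t1 = clock();

    printf("chunk %lu bytes: %.2f seconds\n",
           (unsigned long)chunk, (double)(t1 - t0) / CLOCKS_PER_SEC);

    fclose(in);
    fclose(out);
    free(buf);
    return 0;
}
//--------------------------------------------------------------------------------------------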

And:
Your routine is fast! I am sure it is!
Personally, I would never read more than 64 KB at once.

regards
void;

In a way this is important to me for the performance of some of my applications.

I guess I will start plugging in some larger numbers for my if statements and check for processing speed.

I guess I will have to consider that the amount of available memory will be different for different machines, so I will need a safe, smaller amount. Maybe 64 KB as you suggest. Although I need to push the speed boundary because of the design of a few of my apps.

BTW, how do you check your functions' processing speed? I found this link but was wondering about your opinion.

https://www.experts-exchange.com/questions/20788205/need-a-high-resolution-timer.html#9684683

RJ
@RJSoft

In Borland C++ Builder you could do this:

    unsigned short int Year, Month, Day, Hour, Min, Sec, MSec;

    TDateTime dtPresent = Now();
    DecodeDate(dtPresent, Year, Month, Day);
    DecodeTime(dtPresent, Hour, Min, Sec, MSec);

//---------
But you have to include the correct header files.

MSec is the milliseconds field. I don't know how to get microseconds in Win32...
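
For what it is worth, Win32 does expose a high-resolution counter through QueryPerformanceFrequency / QueryPerformanceCounter; here is a minimal sketch (the Sleep(50) call just stands in for whatever code is being timed):

//--------------------------------------------------------------------------------------------
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;

    if (!QueryPerformanceFrequency(&freq))
        return 1;                         /* no high-resolution counter on this machine */

    QueryPerformanceCounter(&start);
    Sleep(50);                            /* the code being timed goes here */
    QueryPerformanceCounter(&stop);

    printf("elapsed: %.1f microseconds\n",
           (double)(stop.QuadPart - start.QuadPart) * 1000000.0 / (double)freq.QuadPart);
    return 0;
}
//--------------------------------------------------------------------------------------------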
I'm waiting for "concatenating a lot of very small files to one large file".....

@Venabili: your suggestion is okay!