I just learned about using buffer space to improve external merge sort, I don't quite understand the algorithm though, but here it is:
Say I want to sort this:
6 4 16 11 7 5 9 2 8 12 10 3 1 13 14 15 18 17 20 19 23 21 22
we have a block each with 3 Â capacity and the buffer has 3 blocks
so first we should put the numbers in blocks:
6 4 16 Â Â Â 11 7 5 Â Â Â Â 9 2 8
what's next?
Given: Unordered file of records; B = # of available buffersPass 0: repeat Fill B buffers with blocks from file Sort the records in the buffers Output the sorted run of records until input file is exhausted // now use B-1 buffers for input and 1 for outputPasses 1-n While the number of sorted runs is > 1: While the input file still has unread runs: Repeat Bring in one block each from the next B-1 runs Merge those blocks into the output buffer (dumping as it fills) until the longest of those B-1 runs is exhausted end while end while
Merge sort algorithm is one of a number of 'divide and conquer' approaches to sorting. Â It is generally implemented as a recursive fashion.
The trick with the merge sort (or really any sort algorithm) is that, if the amount of data you are trying to sort will not fit into memory on the machine you are using, you can break the work up into bite sized chunks, sort each chunk, then bring all the chunks together at the end. Â This works because you can merge two sorted lists together to come up with the consolidated list in sorted order with a minimal amount of effort.
The way merge sort works is to recursively split the list into two until there is only one element in each bucket. Â You then merge the buckets of 1 into buckets of 2 and the 2s into 4s as you unwind the recursion. Â If you are doing this externally because your data set is to large for RAM, you merge sort each segment of the the file (buffer), write them out into separate files, then merge the files together.
How large is your typical file that needs to be sorted and what kind of machine and O/S are you running on?
Probably the best single source of information on sorting is in Knuth's Algorithms, the volume entitled Sorting and Searching. Â Knuth does an excellent job of laying out the history of sorting algorithms over time as the hardware changed. Â He also explains how to calculate the performance of a given algorithm. Â BTW, the best sorting algorithms (including merge sort) complete in nLog(n) time.
Regards,
Bill
kinghalls
ASKER
well I am just trying to solve a written problem using this algorithm... I did not understand how this works though.. I know how a normal mergesort works.. but this new concept of using the buffer space is confusing.. I don't know what to do next after I already put the numbers in the blocks..
grant300
Well you are only going to use one buffer at a time. Â It is based on the maximum amount of heap space you can get in the environment and has to take into account the additional space you need to perform the sort of that buffer.
You read one buffers worth into memory, sort it using a regular merge sort algorithm, then write that buffer out to it's own file. Â You do that until there are no more input records. Â You then have a collection of files that have sorted contents and you merge them. Â You do a balanced merge where you take the first to files and merge them, the second two, etc. Â You wind up with half as many output files. Â Then you merge them two at a time and keep doing this until you wind up with the one final output file.
To be honest, I don't recognize why you would break the input file up into multiple small buffers first.
hmm..that kind of confuses me, can you just do with my examples above
kinghalls
ASKER
do you mean you don't understand the first pass where we took all the numbers and then put it into the buffer and sort them out first?
Infinity08
>> that kind of confuses me, can you just do with my examples above
I'm not entirely sure what the difference is between blocks and records in the algorithm you showed.
But it seems that the first part of the algorithm is just a parallellized sort ... ie. different sections of the file are sorted at the same time. Say you have 3 buffers, then you can fill those three buffers from the file, and sort them separately (in parallel), giving 3 sorted buffers at the end. Those 3 sorted buffers are then written to temporary files, and the buffers are filled again with data from the file, etc. until the whole file has been processed.
It seems that the second part of the algorithm uses a (B-1)-way merge ... ie. (B-1) sorted buffers are merged into 1 buffer. And this process is repeated until only one sorted buffer remains. Note that the buffers are filled and emptied as needed - ie. they never exceed the size of the buffer.
I guess in pass 0 all we do is just to read the file into the buffer and sort them separately right? so,
6 4 16 Â Â Â 11 7 5 Â Â Â Â 9 2 8
becomes
4 6 16 Â Â Â 5 7 11 Â Â Â 2 8 9
then we'll write it back again, but where do we write this to? back to the disk? we only have 9 space in the buffer for the integer and we already used them all
and do this until there is no more file to read
so records here mean that each record can store 3 integers and in one block we can store 3 records, which mean we can bring up to 9 integers at a time from the disk. Do you get what I mean?
Infinity08
>> I guess in pass 0 all we do is just to read the file into the buffer and sort them separately right?
Yes.
>>Â so,
>>Â
>>Â 6 4 16 Â Â Â 11 7 5 Â Â Â Â 9 2 8
>>Â
>>Â becomes
>>Â
>>Â 4 6 16 Â Â Â 5 7 11 Â Â Â 2 8 9
If there are 3 buffers of size 3 each, then yes (at least if my understanding of the algorithm you posted is correct).
>>Â then we'll write it back again, but where do we write this to? back to the disk?
Indeed : back to disk (to one or more temporary files)
>>Â and do this until there is no more file to read
Correct.
>>Â so records here mean that each record can store 3 integers and in one block we can store 3 records, which mean we can bring up to 9 integers at a time from the disk. Do you get what I mean?
I'm not too clear about the definition of record and block as it was used in the algorithm that you posted. Where did you get it from ?
kinghalls
ASKER
I got it from my class actually, my instructor gave me this handout algorithm.. I am not too clear itself of what he means by blocks... he mentions that when we read a number from a disk it gives an extra space that we can use to save other number..
so for example if we read 9 from the disk and want to bring it to the memory, usually the disk gives us more space than what 4bytes (9 is an integer in this case).. so therefore we can allocate those extra space to store another number.. not too clear myself, but I hope you guys can help me out
Ah, so he's talking about disk blocks ... Ok. It's not really vital to the algorithm, so you can just "ignore" it ...
kinghalls
ASKER
yeah disc blocks, so we have a buffer pool with three disc blocks.. Â what I still don't understand is that I am writing the sorted list to one disc block (it you would like to say), so therefore if that block to store the sorted list is full, where do we put it ? write it back to disk?
Infinity08
No, no. A disk block is a unit of storage/transfer in a hard disk. It is the minimum amount that is stored/transferred.
okay I got it now.. so how can we bring those blocks into memory? Blocks is in the hard disks right?
Infinity08
>> so how can we bring those blocks into memory?
Just read the file in chunks of the same size as the block size ...
kinghalls
ASKER
okay, and because there are some extra space in the blocks that can be used to store another number then we just take in another number from the disk and put it into that block?
okay I think I am going to try to sort the below example, please do correct me if I am wrong. I am sorting these with 3 buffer, which each buffer can hold 3 integers.
91 16 3 21 46 18 31 71 63 82 12 85 6 42 8
after pass 0 I have:
3, 16, 91 Â Â Â 18, 21, 46 Â Â Â 31, 63, 71 Â
12, 82, 85 Â Â 6, 8, 42
3 16 18 21 31 46 63 71 91
6 8 12 42 82 85
and I am stuck after this pass
kinghalls
ASKER
is the first pass like this:
3 6 18 Â Â Â 6 8 12 Â Â ______ (this is the 1 buffer for output)
I just merge 3 6 and 18 with 6 8 12 and output one by one to the buffer output?
Infinity08
>> and I am stuck after this pass
Just continue the algorithm. It's a merge sort, so you're merging 2 or more runs into one run. And you're repeating that until there's only one (sorted) run left.
What do you mean ? They go on "While the number of sorted runs is >Â 1:" (again quoted from the algorithm).
kinghalls
ASKER
yeah I don't understand this "While the number of sorted runs is >Â 1"?
In the example above:
3 16 18 21 31 46 63 71 91 this is one sorted run
6 8 12 42 82 85 this is another one..
so there are two sorted runs
I first merge 3 16 18 with 6 8 12 and put it into a an output buffer... so when does this first pass ends??
after 3 16 18 21 31 46 63 71 and 91 is exhausted?
explaining from this example would really help me as it clears me up better, I learn better from examples
Infinity08
>> so when does this first pass ends??
I assume you mean the second pass ?
If you have two sorted runs, then you merge them into one sorted run. And that's the end of the second pass ... You've now got one sorted run, which is the output of the algorithm.
oh you know what I think I just understand this, so basically there are two big pass here pass 0 and the other one is called pass 1-n, not literally pass 2 pass 3 pass 4 but pass 1-n.. right?
Infinity08
>> and the other one is called pass 1-n, not literally pass 2 pass 3 pass 4 but pass 1-n.. right?
You could see it as one big pass, or you could see every iteration of the outside loop ("While the number of sorted runs is >Â 1:") as one pass.
Infinity08
Time for me to get some sleep ... see you tomorrow ;)
so you mean 1 2 3 becomes an input to be merged with 4 5 6?
By the way I made a mistake above:
4 should be merged in with 6 8 therefore the output buffer is only 4 then we pull out another block 5 7 9. all this is in the second phase still? am I right?
Infinity08
You're needlessly complicating things. Just read the algorithm as it is.
The first phase creates sorted runs, and the second phase merges those sorted runs into one big sorted run by repeatedly merging 2 or more runs until no more runs are left (ie. all runs are sorted).
kinghalls
ASKER
Yes I know what you mean.. I am just not quite sure of the line between phases
If I have these sorted runs:
1 3 4 / 5 7 9/ 12 15 16
2 6 8 /10 11 13/ 14 17 18
all I have to do is do a two way merge.. bringing 1 3 4 first and merge them with 2 6 8 until the sorted runs is exhausted.. correct?
I am just not sure when is the first pass, second pass, third pass, and so on..
As far as I know first pass is merging 13 4 with 2 6 8 and then when we write the ouput to the output buffer that's the end of the first pass. When we bring a new block from the sorted run as the input then that's the second pass. I just wan't to clarify this.
>> I am just not sure when is the first pass, second pass, third pass, and so on..
The first pass comes first, then the second. Etc. You continue until you're left with only one sorted run. I really don't understand what you have a problem with ... One pass means merging 2 or more runs into one run.
kinghalls
ASKER
>>One pass means merging 2 or more runs into one run.
is this until we write the output to the output buffer or until we bring in a new block to be merged if one is exhausted?
>> so although we bring one block from a run it's still considered as a phase??
It's considered as joining one block of a run with one block of another run. The phase is only over if the entire runs are merged.
kinghalls
ASKER
hmm..hold on are we talking about pass or phase here?? cause I am talking about pass.. if you said there are only 1 phase until all are merge then I think that's not pass
Infinity08
I'm just trying to use your terminology. Define pass and phase as you understand it, and I'll tell you which word I meant ;)
I don't know what phase means here.. but a pass is considered when it checks the outer loop if the number of sorted runs is >Â 1.. this is considered as one pass right? I think you mentioned this somewhere above...
What I have in mind is this.. the inner loop says while the input file still has unread runs.. the input file always have unread runs.. it doesn't have an unread runs if all the data is sorted right? so therefore the outer loop is only checked once??
sorry if I am stupid asking this over and over again
Infinity08
>> the input file always have unread runs
No it doesn't. If you read a run, it's no longer unread.
kinghalls
ASKER
what is meant by input file here is the sorted runs right? but if there are 2 sorted runs then which one do we consider?
okay now that we got this int the same page.. when the algorithm says t=while the input file still has unread runs, does this hold for all the sorted runs we have?
Infinity08
Of course ... you want to sort everything - that's the point isn't it :)
kinghalls
ASKER
so the input is from each sorted run ew have.. if we have k sorted runs then it will be k-way merge.. Â so when does it goes up to the outer loop? From my point of view.. it will stay inside the inner loop as long as there is a file that needs to be merged.. therefore the outer loop is only executed once?
When you merge the next set of runs ... Note that you only have a limited number of buffers and a limited amount of memory. So you can't merge all runs at the same time. You have to do it in steps ... which is exactly why the loops are there.
kinghalls
ASKER
hmm... so let me get the terms clear first:
1 3 4 / 5 7 9/ 12 15 16 ==> this is one sorted run
2 6 8 /10 11 13/ 14 17 18 Â ==>this is the second sorted run
right?
if say I only have 3 buffer space then one is for the output.. I can bring two of these sorted runs directly into those two buffers? in this case doing a 2 way merge.. and I only need one pass to do this? right?
gotcha man.. I think I am understanding this much better.. I forgot that the memory has a limit, so say if we have 8 sorted runs and the memory only have 3 buffer and one for the input.. we still have to do a 2 way merge.. we can't do a 8 merge at a time..
kinghalls
ASKER
one more question.. after this I fully understand this concept:
>> so say if we have 8 sorted runs and the memory only have 3 buffer and one for the input.. we still have to do a 2 way merge.. we can't do a 8 merge at a time..
Correct. You'd merge the first two runs first. And when that's done, you'd pick two more runs to merge, etc. until there is only one run left, which is then completely sorted.
The trick with the merge sort (or really any sort algorithm) is that, if the amount of data you are trying to sort will not fit into memory on the machine you are using, you can break the work up into bite sized chunks, sort each chunk, then bring all the chunks together at the end. Â This works because you can merge two sorted lists together to come up with the consolidated list in sorted order with a minimal amount of effort.
The way merge sort works is to recursively split the list into two until there is only one element in each bucket. Â You then merge the buckets of 1 into buckets of 2 and the 2s into 4s as you unwind the recursion. Â If you are doing this externally because your data set is to large for RAM, you merge sort each segment of the the file (buffer), write them out into separate files, then merge the files together.
How large is your typical file that needs to be sorted and what kind of machine and O/S are you running on?
Probably the best single source of information on sorting is in Knuth's Algorithms, the volume entitled Sorting and Searching. Â Knuth does an excellent job of laying out the history of sorting algorithms over time as the hardware changed. Â He also explains how to calculate the performance of a given algorithm. Â BTW, the best sorting algorithms (including merge sort) complete in nLog(n) time.
Regards,
Bill