Link to home
Create AccountLog in
Avatar of kinghalls
kinghalls

asked on

improved external merge sort

I just learned about using buffer space to improve external merge sort, I don't quite understand the algorithm though, but here it is:

Say I want to sort this:

6 4 16 11 7 5 9 2 8 12 10 3 1 13 14 15 18 17 20 19 23 21 22
we have a block each with 3  capacity and the buffer has 3 blocks

so first we should put the numbers in blocks:

6 4 16      11 7 5        9 2 8

what's next?
Given: Unordered file of records; B = # of available buffers
 
Pass 0: repeat
           Fill B buffers with blocks from file
           Sort the records in the buffers
           Output the sorted run of records
        until input file is exhausted
        // now use B-1 buffers for input and 1 for output
 
Passes 1-n
        While the number of sorted runs is > 1:
            While the input file still has unread runs:
              Repeat
                Bring in one block each from the next B-1 runs
                Merge those blocks into the output buffer (dumping as it fills)
              until the longest of those B-1 runs is exhausted
           end while
        end while

Open in new window

Avatar of grant300
grant300

Merge sort algorithm is one of a number of 'divide and conquer' approaches to sorting.  It is generally implemented as a recursive fashion.

The trick with the merge sort (or really any sort algorithm) is that, if the amount of data you are trying to sort will not fit into memory on the machine you are using, you can break the work up into bite sized chunks, sort each chunk, then bring all the chunks together at the end.  This works because you can merge two sorted lists together to come up with the consolidated list in sorted order with a minimal amount of effort.

The way merge sort works is to recursively split the list into two until there is only one element in each bucket.  You then merge the buckets of 1 into buckets of 2 and the 2s into 4s as you unwind the recursion.  If you are doing this externally because your data set is to large for RAM, you merge sort each segment of the the file (buffer), write them out into separate files, then merge the files together.

How large is your typical file that needs to be sorted and what kind of machine and O/S are you running on?

Probably the best single source of information on sorting is in Knuth's Algorithms, the volume entitled Sorting and Searching.  Knuth does an excellent job of laying out the history of sorting algorithms over time as the hardware changed.  He also explains how to calculate the performance of a given algorithm.  BTW, the best sorting algorithms (including merge sort) complete in nLog(n) time.

Regards,
Bill
Avatar of kinghalls

ASKER

well I am just trying to solve a written problem using this algorithm... I did not understand how this works though.. I know how a normal mergesort works.. but this new concept of using the buffer space is confusing.. I don't know what to do next after I already put the numbers in the blocks..
Well you are only going to use one buffer at a time.  It is based on the maximum amount of heap space you can get in the environment and has to take into account the additional space you need to perform the sort of that buffer.

You read one buffers worth into memory, sort it using a regular merge sort algorithm, then write that buffer out to it's own file.  You do that until there are no more input records.  You then have a collection of files that have sorted contents and you merge them.  You do a balanced merge where you take the first to files and merge them, the second two, etc.  You wind up with half as many output files.  Then you merge them two at a time and keep doing this until you wind up with the one final output file.

To be honest, I don't recognize why you would break the input file up into multiple small buffers first.

Regards,
Bill
hmm..that kind of confuses me, can you just do with my examples above
do you mean you don't understand the first pass where we took all the numbers and then put it into the buffer and sort them out first?
Avatar of Infinity08
>> that kind of confuses me, can you just do with my examples above

I'm not entirely sure what the difference is between blocks and records in the algorithm you showed.

But it seems that the first part of the algorithm is just a parallellized sort ... ie. different sections of the file are sorted at the same time. Say you have 3 buffers, then you can fill those three buffers from the file, and sort them separately (in parallel), giving 3 sorted buffers at the end. Those 3 sorted buffers are then written to temporary files, and the buffers are filled again with data from the file, etc. until the whole file has been processed.

It seems that the second part of the algorithm uses a (B-1)-way merge ... ie. (B-1) sorted buffers are merged into 1 buffer. And this process is repeated until only one sorted buffer remains. Note that the buffers are filled and emptied as needed - ie. they never exceed the size of the buffer.
I guess in pass 0 all we do is just to read the file into the buffer and sort them separately right? so,

6 4 16      11 7 5        9 2 8

becomes

4 6 16       5 7 11       2 8 9

then we'll write it back again, but where do we write this to? back to the disk? we only have 9 space in the buffer for the integer and we already used them all

and do this until there is no more file to read

so records here mean that each record can store 3 integers and in one block we can store 3 records, which mean we can bring up to 9 integers at a time from the disk. Do you get what I mean?


>> I guess in pass 0 all we do is just to read the file into the buffer and sort them separately right?

Yes.


>> so,
>> 
>> 6 4 16      11 7 5        9 2 8
>> 
>> becomes
>> 
>> 4 6 16       5 7 11       2 8 9

If there are 3 buffers of size 3 each, then yes (at least if my understanding of the algorithm you posted is correct).


>> then we'll write it back again, but where do we write this to? back to the disk?

Indeed : back to disk (to one or more temporary files)


>> and do this until there is no more file to read

Correct.


>> so records here mean that each record can store 3 integers and in one block we can store 3 records, which mean we can bring up to 9 integers at a time from the disk. Do you get what I mean?

I'm not too clear about the definition of record and block as it was used in the algorithm that you posted. Where did you get it from ?
I got it from my class actually, my instructor gave me this handout algorithm.. I am not too clear itself of what he means by blocks... he mentions that when we read a number from a disk it gives an extra space that we can use to save other number..

so for example if we read 9 from the disk and want to bring it to the memory, usually the disk gives us more space than what 4bytes (9 is an integer in this case).. so therefore we can allocate those extra space to store another number.. not too clear myself, but I hope you guys can help me out
Ah, so he's talking about disk blocks ... Ok. It's not really vital to the algorithm, so you can just "ignore" it ...
yeah disc blocks, so we have a buffer pool with three disc blocks..  what I still don't understand is that I am writing the sorted list to one disc block (it you would like to say), so therefore if that block to store the sorted list is full, where do we put it ? write it back to disk?
No, no. A disk block is a unit of storage/transfer in a hard disk. It is the minimum amount that is stored/transferred.
okay.. so?
>> okay.. so?

So, it's not what you understand by it :) Re-read what your instructor said with that knowledge ...
I don't get your explanation by a unit of storage/transfer in a hard disk.. can you explain this more
it's an invalid link
>> it's an invalid link

Just copy the whole line into your address bar (the last ) was dropped in the clickable link)
okay I got it now.. so how can we bring those blocks into memory? Blocks is in the hard disks right?
>> so how can we bring those blocks into memory?

Just read the file in chunks of the same size as the block size ...
okay, and because there are some extra space in the blocks that can be used to store another number then we just take in another number from the disk and put it into that block?
If the block size is larger than the record size, then you'll read more than one record when reading a block ...
yeah that's actually what I mean.. but that does not really matters with the algorithm above..

I am just confused in this part now:

While the number of sorted run is > 1

what is sorted run?
>> but that does not really matters with the algorithm above..

It's not vital for the algorithm no - it's just a way of optimizing the reads/writes from/to disk.


>> what is sorted run?

It's a run that is sorted :) In the first part of the algorithm, you created several sorted runs.
oh so you mean:

6 4 16 is one sorted run and 5 7 11 is another sorted run?
If your runs are of size 3, then yes (except that 6 4 16 isn't really sorted is it ;) )
haha..sorry another silly mistake.. yeah my runs are of size 3.. what is actually runs here? a space allocated in the memory?
Just an amount of data to be processed
another thing:

until input file is exhausted

but on the next step:
While the input file still has unread runs:

the input file is already exhausted after pass 0
Did you read the last line of the first stage (the comment) ?
oh so that means that as long as there is a sorted buffer list?
and those are in a file, so : "as long as there is a run in the input file".
hmm..can you clarify this more?
>> oh so that means that as long as there is a sorted buffer list?

The "sorted buffer lists" as you call them are in file(s), so in the second stage of the algorithm you treat those file(s) as the input file(s).
oh so those sorted buffer lists are still inside the file, not in the memory somewhere?
>> oh so those sorted buffer lists are still inside the file

this is an external merge sort - the intermediary states don't fit in memory - so they will be in files.
which means that files are in the disk right?
>> which means that files are in the disk right?

Probably ... If you use the disk as external medium :)
yeah the disk is the external medium..
okay I think I am going to try to sort the below example, please do correct me if I am wrong. I am sorting these with 3 buffer, which each buffer can hold 3 integers.

91 16 3 21 46 18 31 71 63 82 12 85 6 42 8

after pass 0 I have:

3, 16, 91      18, 21, 46      31, 63, 71  

12, 82, 85     6, 8, 42

3 16 18 21 31 46 63 71 91
6 8 12 42 82 85

and I am stuck after this pass




is the first pass like this:

3 6 18      6 8 12     ______ (this is the 1 buffer for output)

I just merge 3 6 and 18 with 6 8 12 and output one by one to the buffer output?
>> and I am stuck after this pass

Just continue the algorithm. It's a merge sort, so you're merging 2 or more runs into one run. And you're repeating that until there's only one (sorted) run left.
okay the thing I am confused with is when should the 1st pass end??

3 16 18 21 31 46 63 71 91 this is one sorted run
6 8 12 42 82 85 this is another one..

am I right?
>> when should the 1st pass end??

quoted from the algorithm :

        "until input file is exhausted"
isn't that just for pass 0?

what about pass 1-n, that's kind of confuses me
>> isn't that just for pass 0?

Well yes. That's the first pass.


>> what about pass 1-n,

What do you mean ? They go on "While the number of sorted runs is > 1:" (again quoted from the algorithm).
yeah I don't understand this "While the number of sorted runs is > 1"?

In the example above:

3 16 18 21 31 46 63 71 91 this is one sorted run
6 8 12 42 82 85 this is another one..

so there are two sorted runs

I first merge 3 16 18 with 6 8 12 and put it into a an output buffer... so when does this first pass ends??

after 3 16 18 21 31 46 63 71 and 91 is exhausted?

explaining from this example would really help me as it clears me up better, I learn better from examples
>> so when does this first pass ends??

I assume you mean the second pass ?


If you have two sorted runs, then you merge them into one sorted run. And that's the end of the second pass ... You've now got one sorted run, which is the output of the algorithm.
I call the very first pass where I create a sorted runs pass 0, just to clear up the confusion..

pass 0 is the first pass, and pass 1 is the second pass. Do we agree on that ?
yes for the sake of argument let's just use pass 0, pass 1, etc and from this point use the example I provided to explain it much better
okay let's talk about blocks now, the algorithm says to bring one block from the next B-1 runs
merge these blocks into the output buffer:

this is :

3 6 18      6 8 12     ______

right?

and this is pass 1
>> this is :
>> 
>> 3 6 18      6 8 12     ______
>> 
>> right?

And merge them in the output buffer ...
hold on, I should have merge this:

3 16 18

with

6 8 12

so the output buffer has

3 6 8

after this, the first block is left with

16 18

and second block:

12

output buffer is empty again because we dump it as it fills

it becomes 12 16 18
>> it becomes 12 16 18

If there's more than one block in the runs, then moving 12 to the output buffer means that the next block for the first run can be used.
so after writing 12 16 18 to the output buffer, that's the end of pass 1? or not?
>> so after writing 12 16 18 to the output buffer, that's the end of pass 1? or not?

When you've written 12 to the output buffer, you should bring in the next block from the first run, before continuing.
oh yeah silly me, so then it becomes

16 18

and merge with

42 82 85

right??

is this still pass 1??
>> is this still pass 1??

Yes.
oh you know what I think I just understand this, so basically there are two big pass here pass 0 and the other one is called pass 1-n, not literally pass 2 pass 3 pass 4 but pass 1-n.. right?

>> and the other one is called pass 1-n, not literally pass 2 pass 3 pass 4 but pass 1-n.. right?

You could see it as one big pass, or you could see every iteration of the outside loop ("While the number of sorted runs is > 1:") as one pass.
Time for me to get some sleep ... see you tomorrow ;)
okay I'll try it out first
one question,. so say that I have three sorted run when one sorted run is exhausted then that's considered to be one pass?
one more thing my instruction mentions that the output will be the input, how is this the possible?
The output of the first phase is the input of the second phase.
Say I have the following sorted runs with 3 buffers of each can contain 3 elements

1 3 4 / 5 7 9/ 12 15 16
2 6 8 /10 11 13/ 14 17 18

so first pass I merge 1 3 4 with 2 6 8 and then I got 1 2 3, the rest of the buffer becomes

4 and 2 6 8 right? then pass 2 I merge it which then I got 2 4 in the output buffer and I bring in one more buffer

5 7 9 and merge with 8... am I right?

so where's the part where the output becomes the input?
>> so where's the part where the output becomes the input?

Between the first and the second phase. See the algorithm you posted yourself.
I still don't get it.. in the example I posted above.. where's the part where the output becomes an input?
so 1 2 3 now becomes an input to be merged with the other one? is that what you mean?
>> so 1 2 3 now becomes an input to be merged with the other one? is that what you mean?

Well yes. How else would you do it ?
so you mean 1 2 3 becomes an input to be merged with 4 5 6?

By the way I made a mistake above:
4 should be merged in with 6 8 therefore the output buffer is only 4 then we pull out another block 5 7 9. all this is in the second phase still? am I right?
You're needlessly complicating things. Just read the algorithm as it is.

The first phase creates sorted runs, and the second phase merges those sorted runs into one big sorted run by repeatedly merging 2 or more runs until no more runs are left (ie. all runs are sorted).
Yes I know what you mean.. I am just not quite sure of the line between phases

If I have these sorted runs:


1 3 4 / 5 7 9/ 12 15 16
2 6 8 /10 11 13/ 14 17 18

all I have to do is do a two way merge.. bringing 1 3 4 first and merge them with 2 6 8 until the sorted runs is exhausted.. correct?

I am just not sure when is the first pass, second pass, third pass, and so on..

As far as I know first pass is merging 13 4 with 2 6 8 and then when we write the ouput to the output buffer that's the end of the first pass. When we bring a new block from the sorted run as the input then that's the second pass. I just wan't to clarify this.
>> I am just not sure when is the first pass, second pass, third pass, and so on..

The first pass comes first, then the second. Etc. You continue until you're left with only one sorted run. I really don't understand what you have a problem with ... One pass means merging 2 or more runs into one run.
>>One pass means merging 2 or more runs into one run.

is this until we write the output to the output buffer or until we bring in a new block to be merged if one is exhausted?
Two or more ENTIRE runs are merged !
two or more runs are merged??

I am getting more confused.,. can you just please please use the example above to explain.. it would be much clear that way
Just read what I'm saying : In each phase, two or more ENTIRE runs are merged into one run. Not just one block of each run - the ENTIRE runs.
so although we bring one block from a run it's still considered as a phase??
>> so although we bring one block from a run it's still considered as a phase??

It's considered as joining one block of a run with one block of another run. The phase is only over if the entire runs are merged.
hmm..hold on are we talking about pass or phase here?? cause I am talking about pass.. if you said there are only 1 phase until all are merge then I think that's not pass
I'm just trying to use your terminology. Define pass and phase as you understand it, and I'll tell you which word I meant ;)
I don't know what phase means here.. but a pass is considered when it checks the outer loop if the number of sorted runs is > 1.. this is considered as one pass right? I think you mentioned this somewhere above...

What I have in mind is this.. the inner loop says while the input file still has unread runs.. the input file always have unread runs.. it doesn't have an unread runs if all the data is sorted right? so therefore the outer loop is only checked once??

sorry if I am stupid asking this over and over again
>> the input file always have unread runs

No it doesn't. If you read a run, it's no longer unread.
what is meant by input file here is the sorted runs right? but if there are 2 sorted runs then which one do we consider?
I'm not sure what you mean. To merge you need at least two sorted runs.
yes I understand to merge we need at least two sorted runs

the algorithm mentions input file in the pass 1-n., is this the sorted runs as input file? lets deal with this first
>> is this the sorted runs as input file?

Of course. It's an external sort. The sorted runs don't fit in memory, so they're written to files.
okay now that we got this int the same page.. when the algorithm says t=while the input file still has unread runs, does this hold for all the sorted runs we have?
Of course ... you want to sort everything - that's the point isn't it :)
so the input is from each sorted run ew have.. if we have k sorted runs then it will be k-way merge..  so when does it goes up to the outer loop? From my point of view.. it will stay inside the inner loop as long as there is a file that needs to be merged.. therefore the outer loop is only executed once?
>> so when does it goes up to the outer loop?

When you merge the next set of runs ... Note that you only have a limited number of buffers and a limited amount of memory. So you can't merge all runs at the same time. You have to do it in steps ... which is exactly why the loops are there.
hmm... so let me get the terms clear first:

1 3 4 / 5 7 9/ 12 15 16 ==> this is one sorted run
2 6 8 /10 11 13/ 14 17 18  ==>this is the second sorted run

right?

if say I only have 3 buffer space then one is for the output.. I can bring two of these sorted runs directly into those two buffers? in this case doing a 2 way merge.. and I only need one pass to do this? right?

if say I have three then I need to do 2 pass?
That sounds about right, yes.
gotcha man.. I think I am understanding this much better.. I forgot that the memory has a limit, so say if we have 8 sorted runs and the memory only have 3 buffer and one for the input.. we still have to do a 2 way merge.. we can't do a 8 merge at a time..
one more question.. after this I fully understand this concept:

say I merge:

1 3 4 / 5 7 9/ 12 15 16
2 6 8 /10 11 13/ 14 17 18
22 24 26/ 27 28 32/ 39 44 70
75 78 80/ 82 84 87/ 90 92 97
101 104 105/  107 110 111/ 115 120 124

so first I merge  

1 3 4 / 5 7 9/ 12 15 16

and

2 6 8 /10 11 13/ 14 17 18

and

22 24 26/ 27 28 32/ 39 44 70

I got

1 2 3 4 5 6  7 8 9/ 10 11 12 13 14 15 16 17 18/ 22 24 26/ 27 28 32/ 39 44 70

what would be the result when I merge 1 2 3 4 5 6 7 8 9 with

75 78 80

and

101 104 105

as the second pass??

answer this and I will accept a solution to this thread and assign the points..

sorry I mean:

1 2 3 4 5 6  7 8 9/ 10 11 12 13 14 15 16 17 18/ 22 24 26 27 28 32 39 44 70

after the first pass.. just to leave the confusion out
>> so say if we have 8 sorted runs and the memory only have 3 buffer and one for the input.. we still have to do a 2 way merge.. we can't do a 8 merge at a time..

Correct. You'd merge the first two runs first. And when that's done, you'd pick two more runs to merge, etc. until there is only one run left, which is then completely sorted.


>> I got
>> 
>> 1 2 3 4 5 6  7 8 9/ 10 11 12 13 14 15 16 17 18/ 22 24 26/ 27 28 32/ 39 44 70

Your block size seems to have increased for some blocks - the block size stays 3.


>> what would be the result when I merge 1 2 3 4 5 6 7 8 9 with
>> 
>> 75 78 80

You would find that you have to bring in new blocks from the first run until you get a block that contains a value that is higher than 75.
so even though I after the first pass I got 1 2 3 4 5 6 7 8 9 I still have to bring in 3 items at a item and merge it with 75 78  80??

so

1 2 3 with 75 78 80?

I don't get it.. this is a 3 way merge so I have to merge it also with 22 24 26 too
ASKER CERTIFIED SOLUTION
Avatar of Infinity08
Infinity08
Flag of Belgium image

Link to home
membership
Create an account to see this answer
Signing up is free. No credit card required.
Create Account
Just for the info.. I have 4 blocks here, 1 blocks for the output and 3 for the input buffer.. that's why I am doing a 3 way merge
oh okay..I got it, so how many pass is this going to take? 2?
How many passes is what going to take ?
oh never mind.. just one last question to clarify and for sure I am going to close this

merging 1 2 3 with 8 9 K and 4 v k in a three way merge would get me what?

assuming I have four buffer with each capacity can hold 3 items
>> just one last question to clarify and for sure I am going to close this

That sounds familiar heh ;)


>> merging 1 2 3 with 8 9 K and 4 v k in a three way merge would get me what?

Depends on the ordering you use. But the result will be a completely ordered run that contains all 9 items.
so in this case 1 2 3 will be exhausted first and I bring the next block right?

the ordering is based on ASCII