I just learned about using buffer space to improve external merge sort, I don't quite understand the algorithm though, but here it is:
Say I want to sort this:
6 4 16 11 7 5 9 2 8 12 10 3 1 13 14 15 18 17 20 19 23 21 22
we have a block each with 3 capacity and the buffer has 3 blocks
so first we should put the numbers in blocks:
6 4 16 11 7 5 9 2 8
Given: Unordered file of records; B = # of available buffersPass 0: repeat Fill B buffers with blocks from file Sort the records in the buffers Output the sorted run of records until input file is exhausted // now use B-1 buffers for input and 1 for outputPasses 1-n While the number of sorted runs is > 1: While the input file still has unread runs: Repeat Bring in one block each from the next B-1 runs Merge those blocks into the output buffer (dumping as it fills) until the longest of those B-1 runs is exhausted end while end while
The trick with the merge sort (or really any sort algorithm) is that, if the amount of data you are trying to sort will not fit into memory on the machine you are using, you can break the work up into bite sized chunks, sort each chunk, then bring all the chunks together at the end. This works because you can merge two sorted lists together to come up with the consolidated list in sorted order with a minimal amount of effort.
The way merge sort works is to recursively split the list into two until there is only one element in each bucket. You then merge the buckets of 1 into buckets of 2 and the 2s into 4s as you unwind the recursion. If you are doing this externally because your data set is to large for RAM, you merge sort each segment of the the file (buffer), write them out into separate files, then merge the files together.
How large is your typical file that needs to be sorted and what kind of machine and O/S are you running on?
Probably the best single source of information on sorting is in Knuth's Algorithms, the volume entitled Sorting and Searching. Knuth does an excellent job of laying out the history of sorting algorithms over time as the hardware changed. He also explains how to calculate the performance of a given algorithm. BTW, the best sorting algorithms (including merge sort) complete in nLog(n) time.