Which Sorting Algorithm to use for large datasets ?

Posted on 2008-10-07
Last Modified: 2012-05-05
I am writing a C program to read a comma separated csv file ( which will have anywhere between 300k to 1.5 million rows). At the end of the program I will be outputting a csv file which will have double the number of input rows(I  also sort it based on a DateTime field in the file ).

1) I am using Insertion Sort algorithm where in I sort as I create the nodes. This takes around 1.5 hrs for 500k rows. But insertion sort isnt going to work for >1 million rows. It's going to take a really long time!!

2) Will it be any better If I QuickSort ? I know it will be faster but if i use Quicksort, I will have to first create nodes(eg;say for a million) and then again do a sort.

3)Or is C a bad choice for this kind of a problem ?

Question by:dchvic
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 74

Expert Comment

ID: 22659461
merge sort runs in comparable time to quicksort while taking less space.

There are numerous online sources detailing its use

Assisted Solution

lakshmikarle earned 20 total points
ID: 22660467
If you are storing elements in array, try using variations of insertion sort i.e, Shell sort or gapped insertion sort which gives significant improvement over simple insertion sort. If you are using linear search to find appropriate position for new element t be inserted, use binary search instead since the elements are already sorted.
If you are using linked lists, above variation will not give you any significant improvement. So, try using binary tree sort or heap sort.

If you are sorting based on Date and time fields, maintain different sorted lists for different months of year.

Expert Comment

ID: 22660795
Heap sort is the sorting algorithm which competes with the famous quicksort.

Unlike quicksort, heapsort can sort in place and therefore, uses less memory. Since you have an input of a million data, you should probably use an algorithm that saves memory.

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

LVL 18

Accepted Solution

Jose Parrot earned 35 total points
ID: 22663691
If you have a dual or quad core computer to do the task, then MERGE SORT and OpenMP for parallel programming is a very good choice.
For a Core2 Quad based computer (Xeon or Pentium), you can divide the array in 4 sub-arrays of equivalent size, and run in 4 threads. This results in 4 sorted arrays. The next step is to merge them into a single one, by looking each subarray.
It is faster than make two merges, then a final merge, because has less moves. As you have huge arrays, probably your so long time is due disk I/O and this approach requires less I/O operations.

// merge sort of N elements
// create 4 threads
parallel for d=0 to 3
    first = d*N/4
    last  = (d+1)*N/4-1
    sort from first to last by using thread d
iR = 0
i0 = 0, i1 = N/4, i2= N/2, i3 = 3N/4 // indexes for subarrays
serial for d=0 to N/4
   find the smallest from SUBARRAY0[i0],..,SUBARRAY3[i3]
      get the winner element E
      the winner SUBARRAY index is incremented
      put E in RESULT[iR]
      increment iR
      get the next element from winner SUBARRAY
   repeat this until iR == N-1

Open in new window


Assisted Solution

msm1593 earned 20 total points
ID: 22670128
Jose, i agree with everything you said except the part about taking 4 arrays going to straight into 1.

From everything i have learned it would be more efficient to merge those 4 into 2, then into 1.

The following is more of a justification to myself but bear with me :)

4 to 1 would require for each array a comparison of 1 and 2, if 2 is bigger keep one, then compare to 3, then compare to 4. After all that work only one array (of the four) is incremented; meaning all the other arrays will need their values re-read again (possibly thousands of times)

In only comparing two arrays you can just compare those two values, take the minimum and then increment the smaller value's array. Still need to re-read the higher value later but it's only one comparison. To make this faster you could just set HigherValue to a variable and insert all values from the previous lower array while <= HigherValue, so you wouldn't even need to read that higher value every time.

Like i said take this with a grain of salt, and i'd like to have someone present the argument for the 4 > 1 merge as i don't have much background with it.

LVL 18

Expert Comment

by:Jose Parrot
ID: 22672536
msm, good point.
In the proposed parallel algorithm, if we merge 4 to 2, then 2 to 1,
then time complexity is
        O(n log n/4) + O(n)
If we do the last merge by using the "smallest from 4" algorithm, then it is
        O(n + n log n)
which are not so far, because 1 + log n  isn't very different of 1 + log n/4 and anyway, the longer "smallest from 4" is still in the logarithmic order of growth.
The reason of 4->1 is to minimize disk access.
Merge sort is a recursive algorithm, so spending a lot of memory for the author's so large files, not only for sub arrays, but for stack (subroutine return and arguments).
If the available memory was enough to fit the whole csv file and the other arrays, then, for sure, 4 parallel sorting of n/4 subarrays, then 4->2 then 2->1 is the right choice, over any other option, as you said. In this case, WALL time would be very close to CPU time.
By the other hand, as far I understand the case described by the author, if the program takes 1.5 hrs for 500k rows, then it is swapping a lot !!! And, in this case, the disk access times are included in the WALL time. Probably author's CPU time is something like 10 minutes and disk R/W 80 minutes... (my guess).
I'm not sure about correctness of that times relationship, very dependent on the hardware (CPU, memory), size and structure of the rows, and whatever. It would be good to try both approaches and chose the better in the case.
So, in essence, I agree with your point, but if we can do better WALL time with the worst algorithm, probably the author will preffer the faster.

Expert Comment

ID: 22673329
Ah, right- I didn't take into consideration R/W, thanks for that :)

Expert Comment

ID: 23228415
I know this question is closed already but I wanted to point out one other sorting algorithm, the Radix sort.

As the article states this sort works in O(nk) time where k is the average key length and n is the number of elements. That is much faster then a Quicksort. Just convert your dates into timestamps and you are off to the races.

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Have you thought about creating an iPhone application (app), but didn't even know where to get started? Here's how: ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Important pre-programming comments: I’ve never tri…
One of Google's most recent algorithm changes affecting local searches is entitled "The Pigeon Update." This update has dramatically enhanced search inquires for the keyword "Yelp." Google searches with the word "Yelp" included will now yield Yelp a…
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use nested-loops in the C programming language.
The goal of this video is to provide viewers with basic examples to understand and use switch statements in the C programming language.

751 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question