Solved

Which Sorting Algorithm to use for large datasets?

Posted on 2008-10-07
Medium Priority
1,353 Views
Last Modified: 2012-05-05
Experts,
I am writing a C program to read a comma-separated csv file (which will have anywhere between 300k and 1.5 million rows). At the end of the program I output a csv file with double the number of input rows (I also sort it based on a DateTime field in the file).

1) I am using the insertion sort algorithm, where I sort as I create the nodes. This takes around 1.5 hrs for 500k rows, but insertion sort isn't going to work for >1 million rows. It would take a really long time!

2) Will it be any better if I use quicksort? I know it will be faster, but if I use quicksort I will have to first create the nodes (e.g., say a million of them) and then do a separate sort pass.

3) Or is C a bad choice for this kind of problem?

thanks
Question by:dchvic
8 Comments
 
LVL 74

Expert Comment

by:sdstuber
ID: 22659461
Merge sort runs in comparable time to quicksort while taking less space.

There are numerous online sources detailing its use.
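Since the question builds one node per row, merge sort applies directly to the linked list with no extra array. A minimal sketch, assuming a hypothetical Node type that holds a parsed DateTime key (names are illustrative, not from the original program):

#include <stddef.h>

typedef struct Node {
    struct Node *next;
    long datetime;              /* parsed DateTime key for the row (assumed) */
    char *line;                 /* the raw csv line */
} Node;

/* merge two lists that are already sorted by the datetime key */
static Node *merge(Node *a, Node *b)
{
    Node head = {0}, *tail = &head;
    while (a && b) {
        if (a->datetime <= b->datetime) { tail->next = a; a = a->next; }
        else                            { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* split the list in half (slow/fast pointers), sort each half, merge */
Node *merge_sort(Node *list)
{
    if (!list || !list->next)
        return list;
    Node *slow = list, *fast = list->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    Node *second = slow->next;
    slow->next = NULL;          /* cut the list into two halves */
    return merge(merge_sort(list), merge_sort(second));
}

The recursion depth is only about log2(n), roughly 21 levels for 1.5 million nodes, so stack use is not a concern.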
 
LVL 1

Assisted Solution

by:lakshmikarle
lakshmikarle earned 60 total points
ID: 22660467
If you are storing elements in an array, try using variations of insertion sort, i.e., Shell sort or gapped insertion sort, which give a significant improvement over simple insertion sort. If you are using linear search to find the appropriate position for the new element to be inserted, use binary search instead, since the elements are already sorted (see the sketch below).
If you are using linked lists, the above variations will not give you any significant improvement, so try using binary tree sort or heap sort.

If you are sorting based on date and time fields, maintain different sorted lists for different months of the year.
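For illustration, a minimal sketch of the binary search for the insertion point, assuming the keys are kept in a sorted array of numeric DateTime values (the key type and names are placeholders):

#include <stddef.h>

/* return the index where 'key' should be inserted so the array stays sorted */
size_t insertion_point(const long a[], size_t n, long key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] <= key)
            lo = mid + 1;        /* key goes after a[mid] */
        else
            hi = mid;
    }
    return lo;
}

Note that this only cuts the number of comparisons; inserting into an array still shifts O(n) elements per insert, so the total move cost remains quadratic, which is why the heap sort or tree sort suggestion matters at this input size.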
 
LVL 3

Expert Comment

by:wktang83
ID: 22660795
Heap sort is a sorting algorithm which competes with the famous quicksort.

Unlike quicksort, heapsort sorts in place and therefore uses less memory. Since you have an input of around a million rows, you should probably use an algorithm that saves memory.

ref: http://en.wikipedia.org/wiki/Heap_sort
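For illustration, a minimal in-place heapsort sketch over an array of numeric DateTime keys (the key type and its extraction from the rows are assumptions, not from the original program):

#include <stddef.h>

/* restore the max-heap property for the subtree rooted at 'root',
   considering only elements a[0..end) */
static void sift_down(long a[], size_t root, size_t end)
{
    while (2 * root + 1 < end) {
        size_t child = 2 * root + 1;
        if (child + 1 < end && a[child] < a[child + 1])
            child++;                           /* pick the larger child */
        if (a[root] >= a[child])
            return;
        long tmp = a[root]; a[root] = a[child]; a[child] = tmp;
        root = child;
    }
}

void heap_sort(long a[], size_t n)
{
    for (size_t start = n / 2; start-- > 0; )  /* build the max-heap */
        sift_down(a, start, n);
    for (size_t end = n; end-- > 1; ) {        /* move the max to the end, repeat */
        long tmp = a[0]; a[0] = a[end]; a[end] = tmp;
        sift_down(a, 0, end);
    }
}

Sorting the actual rows rather than bare keys would mean swapping row pointers or (key, index) pairs, but the structure is the same.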
 
LVL 18

Accepted Solution

by:Jose Parrot
Jose Parrot earned 105 total points
ID: 22663691
If you have a dual or quad core computer to do the task, then MERGE SORT with OpenMP for parallel programming is a very good choice.
For a Core 2 Quad based computer (Xeon or Pentium), you can divide the array into 4 sub-arrays of roughly equal size and sort them in 4 threads. This results in 4 sorted arrays. The next step is to merge them into a single one by looking at the head of each sub-array.
It is faster than making two merges and then a final merge, because it has fewer moves. As you have huge arrays, your long run time is probably due to disk I/O, and this approach requires fewer I/O operations.
Jose

/* merge sort of N rows: sort 4 sub-arrays in parallel (qsort is used as the
   per-thread sort here), then 4-way merge them. Row and cmp_datetime() are
   placeholder names for the author's record type and DateTime comparison;
   compile with -fopenmp. */
#include <stdlib.h>

typedef struct { long datetime; char *line; } Row;   /* hypothetical record */

static int cmp_datetime(const void *pa, const void *pb)
{
    const Row *a = pa, *b = pb;
    return (a->datetime > b->datetime) - (a->datetime < b->datetime);
}

void merge_sort_4way(Row *A, size_t N, Row *RESULT)
{
    size_t first[4], last[4], i[4];

    /* DIVIDE SECTION: create 4 threads, each sorts one quarter of A[0..N) */
    #pragma omp parallel for num_threads(4)
    for (int d = 0; d < 4; d++) {
        first[d] = (size_t)d * N / 4;
        last[d]  = (size_t)(d + 1) * N / 4;          /* one past the end */
        qsort(A + first[d], last[d] - first[d], sizeof(Row), cmp_datetime);
    }

    /* MERGE SECTION: repeatedly move the smallest of the 4 heads into RESULT */
    for (int d = 0; d < 4; d++)
        i[d] = first[d];                             /* indexes for subarrays */

    for (size_t iR = 0; iR < N; iR++) {
        int winner = -1;
        for (int d = 0; d < 4; d++)                  /* smallest of the 4 heads */
            if (i[d] < last[d] &&
                (winner < 0 || cmp_datetime(&A[i[d]], &A[i[winner]]) < 0))
                winner = d;
        RESULT[iR] = A[i[winner]];                   /* put the winner E in RESULT[iR] */
        i[winner]++;                                 /* the winner subarray index is incremented */
    }
}

 
LVL 1

Assisted Solution

by:msm1593
msm1593 earned 60 total points
ID: 22670128
Jose, I agree with everything you said except the part about taking 4 arrays straight into 1.

From everything I have learned, it would be more efficient to merge those 4 into 2, then into 1.

The following is more of a justification to myself, but bear with me :)

A 4-to-1 merge would require, for each output element, comparing array 1 against array 2, keeping the smaller, then comparing against array 3, then against array 4. After all that work only one array (of the four) is incremented, meaning all the other arrays will need their values re-read again (possibly thousands of times).

When only comparing two arrays, you just compare those two values, take the minimum, and increment the smaller value's array (see the sketch below). You still need to re-read the higher value later, but it's only one comparison. To make this faster you could keep HigherValue in a variable and insert values from the lower array while they are <= HigherValue, so you wouldn't even need to re-read that higher value every time.

Like I said, take this with a grain of salt; I'd like to have someone present the argument for the 4-to-1 merge, as I don't have much background with it.
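For illustration, a minimal sketch of the pairwise merge described above, where every output element costs exactly one comparison of the two current heads (the key type is a placeholder):

#include <stddef.h>

/* merge sorted arrays a[0..na) and b[0..nb) into out[0..na+nb) */
void merge_two(const long *a, size_t na, const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)                 /* one comparison per output */
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];        /* copy whichever input remains */
    while (j < nb) out[k++] = b[j++];
}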




 
LVL 18

Expert Comment

by:Jose Parrot
ID: 22672536
msm, good point.
In the proposed parallel algorithm, if we merge 4 to 2, then 2 to 1, the time complexity is
        O(n log(n/4)) + O(n)
If we do the last merge using the "smallest of 4" algorithm, it is
        O(n + n log n)
These are not far apart, because 1 + log n isn't very different from 1 + log(n/4), and in any case the longer "smallest of 4" pass is still within the logarithmic order of growth.
The reason for 4->1 is to minimize disk access.
Merge sort is a recursive algorithm, so it spends a lot of memory on the author's very large files, not only for the sub-arrays but also for the stack (subroutine returns and arguments).
If the available memory were enough to fit the whole csv file and the other arrays, then, for sure, 4 parallel sorts of n/4 sub-arrays, then 4->2, then 2->1 would be the right choice over any other option, as you said. In this case, wall time would be very close to CPU time.
On the other hand, as far as I understand the case described by the author, if the program takes 1.5 hrs for 500k rows, then it is swapping a lot! And in this case the disk access times are included in the wall time. Probably the author's CPU time is something like 10 minutes and disk R/W 80 minutes... (my guess).
I'm not sure about the correctness of that ratio; it is very dependent on the hardware (CPU, memory) and on the size and structure of the rows. It would be good to try both approaches and choose the better one for this case.
So, in essence, I agree with your point, but if we can get better wall time with the theoretically worse algorithm, the author will probably prefer the faster one.
Jose
 
LVL 1

Expert Comment

by:msm1593
ID: 22673329
Ah, right - I didn't take R/W into consideration, thanks for that :)
 

Expert Comment

by:lynchdigital
ID: 23228415
I know this question is closed already, but I wanted to point out one other sorting algorithm, the radix sort.

http://en.wikipedia.org/wiki/Radix_sort

As the article states, this sort works in O(nk) time, where k is the average key length and n is the number of elements. That is much faster than quicksort. Just convert your dates into timestamps and you are off to the races.
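For illustration, a sketch of an LSD radix sort on 32-bit Unix timestamps, doing one counting pass per byte (so k = 4 passes); the timestamp array is an assumption standing in for keys extracted from the csv rows:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *a);
    if (!tmp)
        return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)          /* histogram of this byte */
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int b = 0; b < 256; b++)           /* prefix sums -> start offsets */
            count[b + 1] += count[b];
        for (size_t i = 0; i < n; i++)          /* stable scatter by this byte */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

To sort whole rows this way you would sort (timestamp, row index) pairs or row pointers instead of bare keys; the stability of each counting pass is what makes the byte-by-byte approach correct.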
