Solved

Which Sorting Algorithm to use for large datasets?

Posted on 2008-10-07
1,256 Views
Last Modified: 2012-05-05
Experts,
I am writing a C program to read a comma-separated CSV file (which will have anywhere between 300k and 1.5 million rows). At the end of the program I output a CSV file with double the number of input rows, sorted on a DateTime field in the file.

1) I am using the insertion sort algorithm, sorting as I create the nodes. This takes around 1.5 hours for 500k rows, so insertion sort isn't going to work for more than 1 million rows; it would take far too long!

2) Would quicksort be any better? I know it will be faster, but if I use quicksort I will have to create the nodes first (say, a million of them) and then do a separate sorting pass.

3) Or is C a bad choice for this kind of problem?

thanks
Question by:dchvic
8 Comments
 
LVL 73

Expert Comment

by:sdstuber
ID: 22659461
Merge sort runs in comparable time to quicksort while taking less space.

There are numerous online sources detailing its use.
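
For reference, a minimal top-down merge sort sketch (an int array version for brevity; with the asker's linked list the temporary array can be avoided, which is where the space saving comes from):

#include <stdlib.h>
#include <string.h>

/* Classic top-down merge sort: sort each half, then merge the two
   sorted halves through a temporary buffer. O(n log n) time. */
static void msort(int *a, int *tmp, size_t lo, size_t hi)  /* sorts [lo, hi) */
{
    if (hi - lo < 2)
        return;
    size_t mid = lo + (hi - lo) / 2;
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);

    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)                /* one comparison per element */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];       /* drain the leftovers */
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

void merge_sort(int *a, size_t n)
{
    int *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;                        /* allocation failed */
    msort(a, tmp, 0, n);
    free(tmp);
}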
 
LVL 1

Assisted Solution

by:lakshmikarle
lakshmikarle earned 20 total points
ID: 22660467
If you are storing elements in an array, try a variation of insertion sort, i.e. Shell sort (gapped insertion sort), which gives a significant improvement over simple insertion sort. And if you are using linear search to find the position at which each new element should be inserted, use binary search instead, since the elements before it are already sorted; a sketch follows.
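
A minimal sketch of that binary-search variant (int keys for brevity; the real program would compare the parsed DateTime field instead):

#include <string.h>

/* Insertion sort that finds each insertion point by binary search.
   Comparisons drop to O(n log n); the memmove still costs O(n^2)
   element moves in the worst case, so this mainly helps when
   comparisons are expensive. */
void binary_insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t lo = 0, hi = i;            /* a[0..i) is already sorted */
        while (lo < hi) {                 /* binary search for key's slot */
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] <= key) lo = mid + 1; else hi = mid;
        }
        memmove(a + lo + 1, a + lo, (i - lo) * sizeof *a);
        a[lo] = key;                      /* drop key into its slot */
    }
}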
If you are using linked lists, the above variations will not give you any significant improvement, so try binary tree sort or heap sort instead.

If you are sorting based on date and time fields, maintain different sorted lists for different months of the year.
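
A hypothetical sketch of that bucketing idea (the struct row layout is invented for illustration, and it assumes the data spans at most one year; otherwise key the buckets by year and month):

#include <stdlib.h>

/* Route each row into one of 12 per-month lists as it is read, so each
   list stays small; sort the buckets independently, then concatenate
   them in month order at the end. */
struct row {
    int month;                 /* 1..12, parsed from the DateTime field */
    /* ... remaining csv fields ... */
    struct row *next;
};

static struct row *buckets[12];

void add_row(struct row *r)
{
    r->next = buckets[r->month - 1];   /* prepend; sort each bucket later */
    buckets[r->month - 1] = r;
}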
 
LVL 3

Expert Comment

by:wktang83
ID: 22660795
Heap sort is a sorting algorithm that competes with the famous quicksort.

Unlike quicksort, heapsort can sort in place and therefore uses less memory. Since you have an input of a million records, you should probably use an algorithm that saves memory.

ref: http://en.wikipedia.org/wiki/Heap_sort
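
For illustration, a minimal in-place heapsort sketch along the lines of that article (int keys for brevity; the function names are mine):

#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at 'root',
   considering only elements below index 'end'. */
static void sift_down(int *a, size_t root, size_t end)
{
    while (2 * root + 1 < end) {
        size_t child = 2 * root + 1;              /* left child */
        if (child + 1 < end && a[child] < a[child + 1])
            child++;                              /* pick the larger child */
        if (a[root] >= a[child])
            return;
        int t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

void heap_sort(int *a, size_t n)
{
    if (n < 2) return;
    for (size_t i = n / 2; i-- > 0; )     /* heapify: sift down each parent */
        sift_down(a, i, n);
    for (size_t i = n - 1; i > 0; i--) {  /* move max to the end, shrink heap */
        int t = a[0]; a[0] = a[i]; a[i] = t;
        sift_down(a, 0, i);
    }
}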
 
LVL 18

Accepted Solution

by:JoseParrot
JoseParrot earned 35 total points
ID: 22663691
If you have a dual- or quad-core computer to do the task, then MERGE SORT with OpenMP for parallel programming is a very good choice.
On a Core 2 Quad based computer (Xeon or Pentium), you can divide the array into 4 sub-arrays of roughly equal size and sort them in 4 threads. This gives you 4 sorted arrays; the next step is to merge them into a single one by repeatedly looking at the head of each subarray.
That is faster than doing two merges and then a final merge, because it makes fewer moves. As you have huge arrays, your long runtime is probably dominated by disk I/O, and this approach requires fewer I/O operations.
Jose

/* Merge sort of N elements: sort 4 subarrays in parallel with OpenMP,
   then merge them into RESULT with a single 4-way pass.
   Compile with -fopenmp (gcc). */
#include <stdlib.h>
#include <omp.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* data: N unsorted elements; result: caller-allocated array of N */
void parallel_merge_sort(int *data, int *result, int N)
{
    int first[4], last[4];          /* each subarray is [first[d], last[d]) */

    /* DIVIDE SECTION: create 4 threads, each one sorts a quarter.
       Any O(n log n) sort works here; qsort for brevity. */
    #pragma omp parallel for num_threads(4)
    for (int d = 0; d < 4; d++) {
        first[d] = d * N / 4;
        last[d]  = (d + 1) * N / 4;
        qsort(data + first[d], last[d] - first[d], sizeof(int), cmp_int);
    }

    /* MERGE SECTION: on every step, find the smallest head element E
       among the 4 subarrays, put E in result[iR], and advance the
       winner subarray's index. Repeat until iR == N-1. */
    for (int iR = 0; iR < N; iR++) {
        int winner = -1;
        for (int d = 0; d < 4; d++)
            if (first[d] < last[d] &&
                (winner < 0 || data[first[d]] < data[first[winner]]))
                winner = d;
        result[iR] = data[first[winner]++];
    }
}


 
LVL 1

Assisted Solution

by:msm1593
msm1593 earned 20 total points
ID: 22670128
Jose, I agree with everything you said except the part about taking 4 arrays straight into 1.

From everything I have learned, it would be more efficient to merge those 4 into 2, then into 1.

The following is more of a justification to myself, but bear with me :)

A 4-to-1 merge requires, for each output element, comparing heads 1 and 2 (if 2 is bigger, keep 1), then comparing the survivor to head 3, then to head 4. After all that work only one of the four arrays is advanced, meaning the other arrays' head values need to be re-read again (possibly thousands of times).

When merging only two arrays, you just compare those two head values, take the minimum, and advance the smaller value's array. You still need to re-read the higher value later, but it is only one comparison. To make this faster you could keep HigherValue in a variable and copy values from the other array while they are <= HigherValue, so you would not even need to re-read that higher value every time.

Like I said, take this with a grain of salt; I'd like someone to present the argument for the 4-to-1 merge, as I don't have much background with it.
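
A minimal sketch of the pairwise merge described above (int arrays for brevity; the name merge2 is mine):

#include <stddef.h>

/* Merge two sorted arrays a[0..na) and b[0..nb) into out[0..na+nb).
   Exactly one comparison decides which head to take on each step. */
void merge2(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t ia = 0, ib = 0, io = 0;
    while (ia < na && ib < nb)
        out[io++] = (a[ia] <= b[ib]) ? a[ia++] : b[ib++];
    while (ia < na) out[io++] = a[ia++];   /* drain the leftovers */
    while (ib < nb) out[io++] = b[ib++];
}

With this, 4->2->1 is two passes of merge2 over the data (about 2N moves, one comparison per move), versus one 4-way pass with N moves but up to three comparisons per move.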




 
LVL 18

Expert Comment

by:JoseParrot
ID: 22672536
msm, good point.
In the proposed parallel algorithm, if we merge 4 to 2, then 2 to 1, the time complexity is roughly
        O(n log(n/4)) + O(2n)
If we do the last merge by using the "smallest of 4" algorithm, it is
        O(n log(n/4)) + O(3n)
These are not so far apart: both merge phases are linear, and the longer "smallest of 4" pass does not change the overall order of growth.
The reason for 4->1 is to minimize disk access.
Merge sort is a recursive algorithm, so it spends a lot of memory on files as large as the author's, not only for the sub-arrays but also for the stack (subroutine returns and arguments).
If the available memory were enough to hold the whole CSV file and the working arrays, then, for sure, 4 parallel sorts of n/4 subarrays, then 4->2, then 2->1 would be the right choice over any other option, as you said. In that case WALL time would be very close to CPU time.
On the other hand, as far as I understand the case described by the author, if the program takes 1.5 hours for 500k rows, then it is swapping a lot! And in that case the disk access times are included in the WALL time. Probably the author's CPU time is something like 10 minutes and disk R/W 80 minutes... (my guess).
I'm not sure those proportions are right; they depend heavily on the hardware (CPU, memory) and on the size and structure of the rows. It would be good to try both approaches and choose the better one for this case.
So, in essence, I agree with your point, but if the "worse" algorithm gives a better WALL time, the author will probably prefer the faster one.
Jose
 
LVL 1

Expert Comment

by:msm1593
ID: 22673329
Ah, right, I didn't take R/W into consideration; thanks for that :)
 

Expert Comment

by:lynchdigital
ID: 23228415
I know this question is closed already, but I wanted to point out one other sorting algorithm: radix sort.

http://en.wikipedia.org/wiki/Radix_sort

As the article states, this sort runs in O(nk) time, where k is the average key length and n is the number of elements. That is much faster than quicksort's O(n log n) when keys are short. Just convert your dates into timestamps and you are off to the races.
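
A minimal LSD radix sort sketch on 32-bit timestamps (the name radix_sort_u32 and the byte-per-pass layout are my choices, not from the article):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Least-significant-digit radix sort, one byte per pass (k = 4 passes
   for 32-bit keys). A counting sort per byte keeps the whole thing
   O(n * k), and each pass is stable, which makes the result correct. */
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;                       /* allocation failed */

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)      /* histogram of this byte */
            count[(a[i] >> shift) & 0xFF]++;
        for (int b = 1; b < 256; b++)       /* prefix sums: end offsets */
            count[b] += count[b - 1];
        for (size_t i = n; i-- > 0; )       /* stable scatter, back to front */
            tmp[--count[(a[i] >> shift) & 0xFF]] = a[i];

        memcpy(a, tmp, n * sizeof *a);      /* pass result back into a */
    }
    free(tmp);
}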
