Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
Solved

# Which Sorting Algorithm to use for large datasets ?

Posted on 2008-10-07
Medium Priority
1,366 Views
Experts,
I am writing a C program to read a comma separated csv file ( which will have anywhere between 300k to 1.5 million rows). At the end of the program I will be outputting a csv file which will have double the number of input rows(I  also sort it based on a DateTime field in the file ).

1) I am using Insertion Sort algorithm where in I sort as I create the nodes. This takes around 1.5 hrs for 500k rows. But insertion sort isnt going to work for >1 million rows. It's going to take a really long time!!

2) Will it be any better If I QuickSort ? I know it will be faster but if i use Quicksort, I will have to first create nodes(eg;say for a million) and then again do a sort.

3)Or is C a bad choice for this kind of a problem ?

thanks
0
Question by:dchvic

LVL 74

Expert Comment

ID: 22659461
merge sort runs in comparable time to quicksort while taking less space.

There are numerous online sources detailing its use
0

LVL 1

Assisted Solution

lakshmikarle earned 60 total points
ID: 22660467
If you are storing elements in array, try using variations of insertion sort i.e, Shell sort or gapped insertion sort which gives significant improvement over simple insertion sort. If you are using linear search to find appropriate position for new element t be inserted, use binary search instead since the elements are already sorted.
If you are using linked lists, above variation will not give you any significant improvement. So, try using binary tree sort or heap sort.

If you are sorting based on Date and time fields, maintain different sorted lists for different months of year.
0

LVL 3

Expert Comment

ID: 22660795
Heap sort is the sorting algorithm which competes with the famous quicksort.

Unlike quicksort, heapsort can sort in place and therefore, uses less memory. Since you have an input of a million data, you should probably use an algorithm that saves memory.

ref: http://en.wikipedia.org/wiki/Heap_sort
0

LVL 18

Accepted Solution

Jose Parrot earned 105 total points
ID: 22663691
If you have a dual or quad core computer to do the task, then MERGE SORT and OpenMP for parallel programming is a very good choice.
For a Core2 Quad based computer (Xeon or Pentium), you can divide the array in 4 sub-arrays of equivalent size, and run in 4 threads. This results in 4 sorted arrays. The next step is to merge them into a single one, by looking each subarray.
It is faster than make two merges, then a final merge, because has less moves. As you have huge arrays, probably your so long time is due disk I/O and this approach requires less I/O operations.
Jose

``````// merge sort of N elements
// DIVIDE SECTION
parallel for d=0 to 3
{
first = d*N/4
last  = (d+1)*N/4-1
sort from first to last by using thread d
}

// MERGE SECTION
RESULT=empty
iR = 0
i0 = 0, i1 = N/4, i2= N/2, i3 = 3N/4 // indexes for subarrays
serial for d=0 to N/4
{
find the smallest from SUBARRAY0[i0],..,SUBARRAY3[i3]
get the winner element E
the winner SUBARRAY index is incremented
put E in RESULT[iR]
increment iR
get the next element from winner SUBARRAY
repeat this until iR == N-1
}
``````
0

LVL 1

Assisted Solution

msm1593 earned 60 total points
ID: 22670128
Jose, i agree with everything you said except the part about taking 4 arrays going to straight into 1.

From everything i have learned it would be more efficient to merge those 4 into 2, then into 1.

The following is more of a justification to myself but bear with me :)

4 to 1 would require for each array a comparison of 1 and 2, if 2 is bigger keep one, then compare to 3, then compare to 4. After all that work only one array (of the four) is incremented; meaning all the other arrays will need their values re-read again (possibly thousands of times)

In only comparing two arrays you can just compare those two values, take the minimum and then increment the smaller value's array. Still need to re-read the higher value later but it's only one comparison. To make this faster you could just set HigherValue to a variable and insert all values from the previous lower array while <= HigherValue, so you wouldn't even need to read that higher value every time.

Like i said take this with a grain of salt, and i'd like to have someone present the argument for the 4 > 1 merge as i don't have much background with it.

0

LVL 18

Expert Comment

ID: 22672536
msm, good point.
Your
In the proposed parallel algorithm, if we merge 4 to 2, then 2 to 1,
then time complexity is
O(n log n/4) + O(n)
If we do the last merge by using the "smallest from 4" algorithm, then it is
O(n + n log n)
which are not so far, because 1 + log n  isn't very different of 1 + log n/4 and anyway, the longer "smallest from 4" is still in the logarithmic order of growth.
The reason of 4->1 is to minimize disk access.
Merge sort is a recursive algorithm, so spending a lot of memory for the author's so large files, not only for sub arrays, but for stack (subroutine return and arguments).
If the available memory was enough to fit the whole csv file and the other arrays, then, for sure, 4 parallel sorting of n/4 subarrays, then 4->2 then 2->1 is the right choice, over any other option, as you said. In this case, WALL time would be very close to CPU time.
By the other hand, as far I understand the case described by the author, if the program takes 1.5 hrs for 500k rows, then it is swapping a lot !!! And, in this case, the disk access times are included in the WALL time. Probably author's CPU time is something like 10 minutes and disk R/W 80 minutes... (my guess).
I'm not sure about correctness of that times relationship, very dependent on the hardware (CPU, memory), size and structure of the rows, and whatever. It would be good to try both approaches and chose the better in the case.
So, in essence, I agree with your point, but if we can do better WALL time with the worst algorithm, probably the author will preffer the faster.
Jose
0

LVL 1

Expert Comment

ID: 22673329
Ah, right- I didn't take into consideration R/W, thanks for that :)
0

Expert Comment

ID: 23228415

As the article states this sort works in O(nk) time where k is the average key length and n is the number of elements. That is much faster then a Quicksort. Just convert your dates into timestamps and you are off to the races.
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Okay. So what exactly is the problem here? How often have we come across situations where we need to know if two strings are 'similar' but not necessarily the same? I have, plenty of times. Until recently, I thought any functionality like that woâ€¦
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see soâ€¦
The goal of this video is to provide viewers with basic examples to understand opening and writing to files in the C programming language.
The goal of this video is to provide viewers with basic examples to understand and use conditional statements in the C programming language.
###### Suggested Courses
Course of the Month11 days, 3 hours left to enroll