
Solved

Sorting a file.

Posted on 1998-10-12
Medium Priority
186 Views
Last Modified: 2010-04-02
Hello people.
I have to sort a huge (about 30MB) file of integers.
Please send me an efficient algorithm or program in C++.
Thanks to ozo I've decided to use qsort. However, I still have a problem: in order to use qsort I have to put my data into some kind of array, but it seems impossible to create an array of that size (it would need to hold at least 800000 integers). What can I do?

Please answer quickly.
Thanks.
Question by:nir_sh
15 Comments
 
LVL 6

Expert Comment

by:snoegler
ID: 1174882
Behind this link:

  http://now.cs.berkeley.edu/NowSort

you can find what may be the fastest algorithm for disk-based sorting. I am not sure if you can get the algorithm itself there, but I think so. I would try it, because the algorithm works even for 6.4 GB (in the benchmark you can see there, sorted in less than an hour with 32 workstations).

Another conventional sorting algorithm for disk files is the bucket sort, but I don't have a good link; try searching for it.
Hope this helped...
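For illustration, here is a minimal sketch of the disk-based bucket sort idea, assuming a binary input file of 32-bit non-negative native ints (negative values would land in the wrong buckets under this partitioning); the file names and bucket count are made up:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int NBUCKETS = 16;                 // partition on the top 4 bits
    FILE* in = fopen("input.dat", "rb");
    FILE* bucket[NBUCKETS];
    for (int i = 0; i < NBUCKETS; ++i) {
        char name[32];
        sprintf(name, "bucket%02d.tmp", i);
        bucket[i] = fopen(name, "w+b");      // read/write temp file
    }

    // Phase 1: scatter each value into the file for its value range.
    int v;
    while (fread(&v, sizeof v, 1, in) == 1)
        fwrite(&v, sizeof v, 1, bucket[(unsigned)v >> 28]);
    fclose(in);

    // Phase 2: the buckets are already ordered relative to each other,
    // so sort each one in memory and append it to the output.
    FILE* out = fopen("sorted.dat", "wb");
    for (int i = 0; i < NBUCKETS; ++i) {
        rewind(bucket[i]);
        std::vector<int> data;
        while (fread(&v, sizeof v, 1, bucket[i]) == 1)
            data.push_back(v);
        std::sort(data.begin(), data.end());
        if (!data.empty())
            fwrite(&data[0], sizeof(int), data.size(), out);
        fclose(bucket[i]);
    }
    fclose(out);
    return 0;
}

Because the buckets partition the value range in order, concatenating the individually sorted buckets yields a fully sorted file, and each bucket is small enough to sort in memory.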
 
LVL 84

Expert Comment

by:ozo
ID: 1174883
30MB isn't that large.
If it fits in memory you might just
#include <stdlib.h>
/* explicit comparisons avoid the overflow that return *a - *b could produce */
int compar(const void *a, const void *b){
      int x = *(const int *)a, y = *(const int *)b;
      return (x > y) - (x < y);
}

qsort(base, nel, sizeof(int), &compar);
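For what it's worth, a complete minimal sketch of this approach, assuming the integers are in a plain text file, one per line, and that the whole set fits in memory (file names are made up, error checking omitted):

#include <cstdio>
#include <cstdlib>

int compar(const void* a, const void* b) {
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);
}

int main() {
    // Read the whole file into a dynamically grown buffer,
    // so no fixed-size array declaration is needed.
    FILE* in = fopen("numbers.txt", "r");
    size_t cap = 1024, nel = 0;
    int* base = (int*)malloc(cap * sizeof(int));
    while (fscanf(in, "%d", &base[nel]) == 1) {
        if (++nel == cap)
            base = (int*)realloc(base, (cap *= 2) * sizeof(int));
    }
    fclose(in);

    qsort(base, nel, sizeof(int), compar);

    FILE* out = fopen("sorted.txt", "w");
    for (size_t i = 0; i < nel; ++i)
        fprintf(out, "%d\n", base[i]);
    fclose(out);
    free(base);
    return 0;
}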


 
LVL 3

Expert Comment

by:elfie
ID: 1174884
What about calling an already existing sort program? On Unix, just call "sort -n < in > out".
 
LVL 6

Expert Comment

by:thresher_shark
ID: 1174885
I would think that 30MB would fit in memory, so if I were you I'd use ozo's method.
 

Author Comment

by:nir_sh
ID: 1174886
Edited text of question
 

Author Comment

by:nir_sh
ID: 1174887
Edited text of question
 

Author Comment

by:nir_sh
ID: 1174888
Adjusted points to 400
 
LVL 84

Expert Comment

by:ozo
ID: 1174889
C++ can easily declare an array containing 800000 integers,
but perhaps your system doesn't have the memory to load it after all.
So, I would either pipe it to sort, if you have it,
or split it into pieces that you can sort in memory, and then merge the pieces.
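A minimal sketch of the "split into pieces" phase, assuming a binary input file of native ints; the chunk size and run-file names are made up, and the merge phase is sketched further down in this thread:

#include <algorithm>
#include <cstdio>
#include <vector>

// Split "input" into sorted runs run000.tmp, run001.tmp, ...
// Returns the number of runs written. Error checking omitted.
int make_runs(const char* input) {
    const size_t CHUNK = 1 << 20;          // 1M ints (~4MB) per piece
    FILE* in = fopen(input, "rb");
    std::vector<int> buf(CHUNK);
    int nruns = 0;
    size_t n;
    while ((n = fread(&buf[0], sizeof(int), CHUNK, in)) > 0) {
        std::sort(buf.begin(), buf.begin() + n);   // sort this piece in memory
        char name[32];
        sprintf(name, "run%03d.tmp", nruns++);
        FILE* out = fopen(name, "wb");
        fwrite(&buf[0], sizeof(int), n, out);
        fclose(out);
    }
    fclose(in);
    return nruns;
}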
 
LVL 3

Expert Comment

by:danny_pav
ID: 1174890
What do you know about the data?
If these are integers, aren't there going to be a lot of duplicates?

Can you then use a map like map<val, count>?  Read it all in, insert into/increment the map as you go, and then traverse the map on output, writing count copies of val.  
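A minimal sketch of this map<val, count> idea, assuming a text file of ints, one per line; it pays off when duplicates are common, since memory use is proportional to the number of distinct values rather than the total count:

#include <cstdio>
#include <map>

int main() {
    FILE* in = fopen("numbers.txt", "r");
    std::map<int, long> counts;
    int v;
    while (fscanf(in, "%d", &v) == 1)
        ++counts[v];                       // insert or increment
    fclose(in);

    FILE* out = fopen("sorted.txt", "w");
    // std::map iterates in ascending key order, so the output is sorted.
    for (std::map<int, long>::iterator it = counts.begin();
         it != counts.end(); ++it)
        for (long i = 0; i < it->second; ++i)
            fprintf(out, "%d\n", it->first);
    fclose(out);
    return 0;
}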

 
LVL 14

Expert Comment

by:AlexVirochovsky
ID: 1174891
Hi, some time ago I had the same problem (but with the data in a file).
My approach is very close to ozo's solution:
1. Determine the number of parts N that will each fit in memory.
2. Loop (for i = 0; i < N; i++):
   read the i-th part of the file, sort it with qsort, and save it to a temp file.
3. Union the temp files. (There is a problem with the maximum number of open files, so I merge 10 files at a time.)
   For the union I open up to 10 files, read the first value from each,
   find the minimum, write it out, read the next value from the file the minimum came from,
   and so on, until the end of all the files.
Hope this helps, Alex
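A minimal sketch of the union step described above: merge up to 10 sorted temp files by repeatedly taking the smallest of the current head values. It assumes binary run files of native ints named run000.tmp, run001.tmp, ..., as produced by the run-generation sketch earlier in the thread:

#include <cstdio>

// Merge nfiles (<= 10) sorted runs into outname. Error checking omitted.
void merge_runs(int nfiles, const char* outname) {
    FILE* f[10];
    int head[10];
    bool live[10];
    for (int i = 0; i < nfiles; ++i) {
        char name[32];
        sprintf(name, "run%03d.tmp", i);
        f[i] = fopen(name, "rb");
        live[i] = fread(&head[i], sizeof(int), 1, f[i]) == 1;
    }
    FILE* out = fopen(outname, "wb");
    for (;;) {
        int min = -1;
        for (int i = 0; i < nfiles; ++i)       // find the smallest head value
            if (live[i] && (min < 0 || head[i] < head[min]))
                min = i;
        if (min < 0) break;                    // all files exhausted
        fwrite(&head[min], sizeof(int), 1, out);
        live[min] = fread(&head[min], sizeof(int), 1, f[min]) == 1;
    }
    for (int i = 0; i < nfiles; ++i) fclose(f[i]);
    fclose(out);
}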
 

Expert Comment

by:demoneye
ID: 1174892
Get a bat and beat it with a stick.
That should make it work.

 
LVL 6

Expert Comment

by:thresher_shark
ID: 1174893
demoneye - If you have nothing of value to add to this site, please refrain from commenting or answering.  Experts Exchange is a professional site.  Wasting room and people's time is usually not accepted very well by the general populace.  If you have other questions, please refer to the FAQ page and HELP page.
 

Author Comment

by:nir_sh
ID: 1174894
I've managed to create a very big array using the farmalloc function. However, the array I'm using needs to be even bigger; my pointer should be huge, not far. This is my last problem; if you solve it for me, everything is solved.
 
LVL 3

Expert Comment

by:elfie
ID: 1174895
When you talk about "huge" pointers, did you check the memory model in which you are compiling?
 
LVL 1

Accepted Solution

by:msa092298 (earned 1200 total points)
ID: 1174896
You don't have to use huge pointers.  You can divide the data into chunks of at most 64K, sort each one alone, and write it to a temp file.  Then read the first 4K of every file (for performance) and insert the first integer from every file into a heap array (if you know about Heap Sort) where the smallest value is always kept at the top.  Each value carries an identifier of the temp file it came from.  Now move the top of the heap to the final sorted file and replace it with its successor from its temp file (when you finish all the values of the 4K buffer for that file, read another 4K buffer).  If the successor is still less than its two children (which are the least values remaining from the other chunks), add it to the final file and repeat the process.  If not, remove it from the top of the heap, replace it with the lesser of its two children, and reinsert it into the heap.  You can greatly reduce the time and space required if you know you have a lot of repeated values: when writing the values to the temp files in the first phase, write each value together with a count of its repetitions.

Conclusion: divide the data into chunks, sort each chunk alone using quicksort, and write it to a temp file, preferably with a repeat count for each repeated value.  Then merge the temp files into a single file in one of two ways:
1. The easy way: merge each two files together and replace them with a single file.  Repeat merging pairs of files until all the files are done.  Delete merged files to save space.
2. The harder way (but it should be faster): merge the n temp files together, using a heap to know which file currently contains the smallest value.  Keep reading from that file into the final file until the least value of another file is smaller.  This way, you don't have to qsort all the values while merging.
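A minimal sketch of the heap-based merge described above, using std::priority_queue as the heap; it assumes binary temp files of native ints and leaves the 4K buffering to stdio rather than managing buffers by hand:

#include <cstdio>
#include <queue>
#include <vector>

struct Head {
    int value;
    int file;                      // which temp file the value came from
};
struct Greater {                   // min-heap ordering: smallest on top
    bool operator()(const Head& a, const Head& b) const {
        return a.value > b.value;
    }
};

void heap_merge(int nfiles, FILE** f, FILE* out) {
    std::priority_queue<Head, std::vector<Head>, Greater> heap;
    Head h;
    for (h.file = 0; h.file < nfiles; ++h.file)
        if (fread(&h.value, sizeof(int), 1, f[h.file]) == 1)
            heap.push(h);          // seed with each file's first value
    while (!heap.empty()) {
        h = heap.top();
        heap.pop();
        fwrite(&h.value, sizeof(int), 1, out);
        // replace the value we consumed with its successor, if any
        if (fread(&h.value, sizeof(int), 1, f[h.file]) == 1)
            heap.push(h);
    }
}

With n temp files, each value costs O(log n) heap work, versus O(n) per value for the linear scan sketched earlier, which is why this version should be faster for many runs.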
