Solved

Sorting data from a huge file (file size bigger than memory)

Posted on 2011-02-25
584 Views
Last Modified: 2012-05-11
What is the optimized way of searching for some data in a huge file, when the file is so big it cannot be loaded into memory? I think it would require breaking the file up. What is the optimized way to break the file and to search across the broken-up pieces?
Question by:gname
11 Comments
 
LVL 4

Expert Comment

by:racastillojr
You can use a program like winzip to beat a file into several different parts.
 
LVL 4

Expert Comment

by:racastillojr
I mean break, not beat.
 
LVL 4

Expert Comment

by:racastillojr
7-zip is a free program that can do it too. You can download it from the link below.

http://download.cnet.com/7-Zip/3000-2250_4-10045185.html?tag=mncol;2
 
LVL 32

Expert Comment

by:phoffric
The title says sorting, but then you say searching.
What are you sorting (or searching, or both)? Fixed-length structures, strings in text, or something else?

What are your OS and compiler names?
How much memory do you have? How large is the file?
 

Author Comment

by:gname
Sorry about the confusion between sorting and searching; the goal is to sort a file that has numbers in it. The assumption is that the file is at least 6 times bigger than the memory.

About the comment on breaking the file: if I break it, I still need to compare a number against all the parts of the file, but all the parts cannot exist together in memory. Do you mean to compare in the zipped version? How would that work?
 
LVL 32

Expert Comment

by:phoffric
"External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). External sorting typically uses a sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file."

Ref:  http://en.wikipedia.org/wiki/External_sorting
 
LVL 37

Expert Comment

by:TommySzalapski
Yes. You need some type of merge sort. I would do something like this:
Open the first 1/10th of the file (or whatever portion can fit).
Sort that and save it somewhere.
Do the same for the next portion of the file, etc.
After you do this 10 times, the whole file is in 10 pieces, each one sorted.
Now connect to each file and load the first element. Push the element that should be first into the output file, and load the next one from the file that element came from.

A more textbook version would be to split the file in half and keep splitting each portion in half until the pieces are small enough to manage, then merge them two at a time using the previously described method.
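
To make the first phase concrete, here is a minimal C++ sketch of the chunking pass (my own illustration, not tested against your setup). It assumes the input is a text file of whitespace-separated integers; the file names and the chunk size are placeholders you would tune to your actual memory.

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Sort one in-memory chunk and write it out as its own temporary run file.
static void writeSortedChunk(std::vector<long long>& chunk, int index) {
    std::sort(chunk.begin(), chunk.end());
    std::ofstream out("chunk" + std::to_string(index) + ".txt");
    for (long long v : chunk) out << v << '\n';
    chunk.clear();
}

int main() {
    // Numbers per chunk; pick this so one chunk fits comfortably in RAM.
    const std::size_t kChunkSize = 10000000;

    std::ifstream in("input.txt");   // placeholder input file name
    std::vector<long long> chunk;
    chunk.reserve(kChunkSize);

    int nChunks = 0;
    long long value;
    while (in >> value) {
        chunk.push_back(value);
        if (chunk.size() == kChunkSize)
            writeSortedChunk(chunk, nChunks++);
    }
    if (!chunk.empty())              // flush the final partial chunk
        writeSortedChunk(chunk, nChunks++);
    return 0;
}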
 
LVL 32

Accepted Solution

by:
phoffric earned 500 total points
Or, from the link:

External mergesort
"One example of external sorting is the external mergesort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to be merged into one single output file.
4. Read the first 10 MB of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.

Additional passes
That example shows a two-pass sort: a sort pass followed by a merge pass. For sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:

1. Run the initial chunk-sorting pass as before.
2. Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
3. Run a second merge pass to merge the 20 larger sorted chunks."
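
For the merge pass, here is a minimal C++ sketch (again my own illustration, not the article's code). It assumes the chunk0.txt, chunk1.txt, ... run files produced by the earlier sketch, uses a min-heap to pick the smallest front element among the runs, and leans on the standard library's stream buffering instead of managing the 10 MB buffers explicitly; "sorted.txt" is a placeholder.

#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

int main() {
    const int nChunks = 9;  // placeholder: number of sorted run files

    std::vector<std::ifstream> runs;
    // Min-heap of (front value, index of the run it came from).
    using Entry = std::pair<long long, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Prime the heap with the first element of each run.
    for (int i = 0; i < nChunks; ++i) {
        runs.emplace_back("chunk" + std::to_string(i) + ".txt");
        long long v;
        if (runs[i] >> v) heap.push({v, i});
    }

    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        auto [v, i] = heap.top();   // overall smallest remaining element
        heap.pop();
        out << v << '\n';
        long long next;
        if (runs[i] >> next)        // refill from the run just consumed
            heap.push({next, i});
    }
    return 0;
}

The heap keeps each pick at O(log k) for a k-way merge instead of scanning all k fronts. For the 50 GB case above you would run this merge twice: first over groups of 25 chunks, then once more over the 20 resulting runs.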
 
LVL 37

Expert Comment

by:TommySzalapski
Ah, that's almost the same thing. I didn't read the link because wikipedia tends to be overly abstract. From a pure time-complexity standpoint the halving-and-merging-two-at-a-time version is just as good (both are O(n log n) comparisons), but with the high cost of disk I/O, I would expect the multi-way merge to be better, since it makes fewer passes over the file.
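
A rough way to quantify that, using the numbers from the quote above (my own back-of-the-envelope): with 500 sorted chunks, a 2-way merge needs ceil(log2 500) = 9 merge passes, and every pass rereads and rewrites the entire 50 GB. A 25-way merge needs only ceil(log25 500) = 2 passes, so it moves roughly 4-5x less data through the disk, at the cost of more seeks per pass to keep the smaller input buffers filled.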
 

Author Comment

by:gname
Thanks a lot, phoffric. I will try this.
 
LVL 32

Expert Comment

by:phoffric
You are welcome. If it goes well, maybe you could post some of the statistics from your run. If the stats are not good enough, then be sure to read the subsequent section on tuning.

You can also google "parallel external sort" to find many performance tips, and even newer ideas such as minimizing cache misses.
