Solved

Sorting data from a huge file (file size bigger than memory)

Posted on 2011-02-25
608 Views
Last Modified: 2012-05-11
What is the optimal way of searching for some data in a huge file when the file is so big it cannot be loaded into memory? I think it would require breaking the file up. What is the best way to break up the file and then search across the resulting pieces?
Question by:gname
11 Comments
 
LVL 4

Expert Comment

by:racastillojr
ID: 34985226
You can use a program like winzip to beat a file into several different parts.
LVL 4

Expert Comment

by:racastillojr
ID: 34985229
I mean break, not beat.
LVL 4

Expert Comment

by:racastillojr
ID: 34985239
7-zip is a free program that can do it too. You can download it from the link below.

http://download.cnet.com/7-Zip/3000-2250_4-10045185.html?tag=mncol;2
LVL 32

Expert Comment

by:phoffric
ID: 34985267
Title says sorting, but then you say searching.
What are you sorting (or searching, or both)? Fixed length structures, strings in text, or ??

What are your OS and compiler?
How much memory do you have? How large is the file?

Author Comment

by:gname
ID: 34987554
Sorry about the confusion between sorting and searching; the goal is to sort a file which has numbers in it. The assumption is that the file is at least 6 times bigger than the memory size.

About the comment on breaking up the file: if I break it, I still need to compare a number against all the parts of the file, but all the parts cannot exist together in memory. Do you mean to compare within the zipped version? How would that work?
LVL 32

Expert Comment

by:phoffric
ID: 34987956
"External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). External sorting typically uses a sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file."

Ref:  http://en.wikipedia.org/wiki/External_sorting
LVL 37

Expert Comment

by:TommySzalapski
ID: 34988872
Yes. You need some type of merge sort. I would do something like this:
Open the first 1/10th of the file (or whatever portion can fit).
Sort that and save it somewhere.
Do the same for the next portion of the file, etc.
After you do this 10 times, the whole file is in 10 pieces, each one sorted.
Now open each piece and load its first element. Push the element that should come first into the output file, and load the next one from the piece it came from.

A more textbook version would be to split the file in half and keep splitting each portion in half until the pieces are small enough to manage, then merge them two at a time using the previously described method.
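The steps above can be sketched in Python. This is a minimal illustration, not production code: the function names, file layout, and the one-integer-per-line format are my own assumptions, and the chunk size is shrunk so it runs on a small example.

```python
import heapq
import os
import tempfile


def sort_run(numbers, run_dir, run_index):
    """Sort one in-memory chunk and write it out as a 'run' file."""
    path = os.path.join(run_dir, f"run_{run_index}.txt")
    with open(path, "w") as f:
        for n in sorted(numbers):
            f.write(f"{n}\n")
    return path


def read_run(path):
    """Yield the numbers from a sorted run file one at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(input_path, output_path, chunk_size):
    """Sort a file of integers (one per line) that is 'too big' for memory."""
    run_dir = tempfile.mkdtemp()
    runs, chunk = [], []
    # Sorting phase: read chunk_size numbers at a time, sort, write a run.
    with open(input_path) as f:
        for line in f:
            chunk.append(int(line))
            if len(chunk) == chunk_size:
                runs.append(sort_run(chunk, run_dir, len(runs)))
                chunk = []
    if chunk:
        runs.append(sort_run(chunk, run_dir, len(runs)))
    # Merge phase: heapq.merge repeatedly picks the smallest head element,
    # holding only one element per run in memory at a time.
    with open(output_path, "w") as out:
        for n in heapq.merge(*(read_run(r) for r in runs)):
            out.write(f"{n}\n")
```

`heapq.merge` does the "push the element that should come first" step, so memory use is proportional to the number of runs, not the file size.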
LVL 32

Accepted Solution

by:
phoffric earned 500 total points
ID: 34988909
Or, from the link:

External mergesort
"One example of external sorting is the external mergesort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to be merged into one single output file.
4. Read the first 10 MB of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.

Additional passes
"That example shows a two-pass sort: a sort pass followed by a merge pass. For sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:

1. Run the initial chunk-sorting pass as before.
2. Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
3. Run a second merge pass to merge the 20 larger sorted chunks."
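The "additional passes" idea can be sketched as well: when there are too many runs to merge at once, merge them a few at a time until only one file remains. This is a hypothetical sketch (file names and the `fan_in` parameter are my own; the final `os.replace` assumes it is fine to move the last intermediate file into place).

```python
import heapq
import os
import tempfile


def merge_runs(run_paths, out_path):
    """k-way merge of sorted run files into one sorted file."""
    def lines_as_ints(path):
        with open(path) as f:
            for line in f:
                yield int(line)

    with open(out_path, "w") as out:
        for n in heapq.merge(*(lines_as_ints(p) for p in run_paths)):
            out.write(f"{n}\n")


def multi_pass_merge(run_paths, output_path, fan_in):
    """Repeatedly merge fan_in runs at a time until one sorted file remains."""
    work_dir = tempfile.mkdtemp()
    gen = 0
    while len(run_paths) > 1:
        next_runs = []
        for i in range(0, len(run_paths), fan_in):
            merged = os.path.join(work_dir, f"gen{gen}_{i}.txt")
            merge_runs(run_paths[i:i + fan_in], merged)
            next_runs.append(merged)
        run_paths, gen = next_runs, gen + 1
    # One run left: move it into place as the final sorted output.
    os.replace(run_paths[0], output_path)
```

With 500 runs and `fan_in=25`, the first pass produces 20 larger runs and the second pass merges those, matching the quoted example.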
LVL 37

Expert Comment

by:TommySzalapski
ID: 34989122
Ah, that's almost the same thing. I didn't read the link because Wikipedia tends to be overly abstract. From a pure time-complexity standpoint, halving and merging two at a time is best, but with the high cost of disk I/O, I would assume the multi-way merge is better.
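To put rough numbers on that trade-off: each merge pass reads and writes the whole file once, and the number of passes is about log base k of the number of runs, where k is how many runs are merged at a time.

```python
import math


def merge_passes(num_runs, fan_in):
    """Approximate number of merge passes over the data."""
    if num_runs <= 1:
        return 0
    return math.ceil(math.log(num_runs) / math.log(fan_in))


# For the 500-run example from the quote above: a 2-way merge needs
# 9 full passes over the data, a 25-way merge needs only 2.
```

So the 2-way merge is asymptotically fine, but every extra pass is another full read and write of a file several times larger than memory, which is why the wide merge wins in practice.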

Author Comment

by:gname
ID: 34993744
Thanks a lot, phoffric. I will try this.
LVL 32

Expert Comment

by:phoffric
ID: 34996882
You are welcome. If you have good luck, maybe you could post some of the statistics from your run. If the stats are not good enough, then be sure to read the subsequent section on tuning.

You can also google "parallel external sort" to find many performance tips, and even newer ideas such as minimizing cache misses.
