Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Search in a large Array

Posted on 2012-03-09
9
Medium Priority
?
469 Views
Last Modified: 2012-05-10
I have sorted array which contains over 50,000 file names let sat this is Array F1. And there is another sorted array which contain the path and the file name let say this is Array P1. I will need to pass the file name from Array F1 to the array P1 to find the path and then store the path in a third array at the same index as of the index of the filename in array F1.

Should I use Binary search or is there any other better search algorithm for search from 50k array elements. I will need to use C++
0
Comment
Question by:javaCaravan0
9 Comments
 
LVL 7

Accepted Solution

by:
ziceva earned 256 total points
ID: 37703049
Hi,

For sorted arrays the best way of searching an element is definitely binary search ... you cannot hope for something better than the log2 complexity of the binary search. For a 50k array (relatively small) the search will end in a maximum of 16 steps.

I once was confronted with a harder problem were I had to perform an array_diff between two arrays of 4 to 6 million elements in php and the php provided function was taking literary hours (because it was not taking into account the fact that the arrays were already sorted).
When taking EACH element from the first array and making a binary search in the second one (so a few million binary searches in an array of a couple of million elemens), the job was done in seconds ... so binary search will perform well in solving problems way harder than yours

As stated before, 16 steps it a very fast search for your 50k array ...
0
 

Author Comment

by:javaCaravan0
ID: 37703189
thank you ... please note, F1 hs 50k filenames, so search will need to be conducted 50k times. Array P1 also has 50k paths, i.e. one for each element of F1. Should I delete the array element from P1 once it has been found, this way, the size of P1 will keep shrinking so there will be less element to search from. What do you think.
0
 
LVL 7

Expert Comment

by:ziceva
ID: 37703212
So there are 50k searches that will need a maximum of 16 steps ... no biggie ... As stated before I've tested something similar but with millions of elements instead of thousands as you have (and I was doing it in php, c++ will be much quicker)...
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:javaCaravan0
ID: 37703229
thank you
0
 
LVL 86

Assisted Solution

by:Mike Tomlinson
Mike Tomlinson earned 248 total points
ID: 37703422
Where did this data come from?...wouldn't it be better to store those pieces of information together in a CLASS (since you're using C++)?  Why the separate arrays?
0
 
LVL 32

Assisted Solution

by:phoffric
phoffric earned 248 total points
ID: 37703948
I agree - why use separate arrays. Why search at all. You can store the P1 array as a map (or multimap) with filename being the key, and the path (excluding the filename) being the folder path.
C++ References:
     http://www.cplusplus.com/reference/stl/map/
     http://www.cplusplus.com/reference/stl/multimap/

Let the map do the searching for you using find:
    http://www.cplusplus.com/reference/stl/map/find/
 
BTW - as you insert the filename and path into the map, it is sorted automatically for you.

But you may want to consider hash algorithm for faster access time. (Haven't used it, but I will if the need arises.)
C++ References:
     http://www.cplusplus.com/reference/stl/unordered_map/
Idea is to hash the filename to come up with a list of paths (only 1 if all filenames are known to be unique). Brings search time closer to linear. (Still have to deal with collisions.)
0
 
LVL 22

Assisted Solution

by:ambience
ambience earned 248 total points
ID: 37762554
>> Should I delete the array element from P1 once it has been found, this way, the size of P1 will keep shrinking so there will be less element to search from. What do you think.

Two things:

- Shrinking is not a good idea because either you can swap with last item and resort the array or you'd have to reallocate the array and copy left over item. In both cases, performance will suffer.

- Since you are willing to modify the array it appears as though these arrays are transient in nature, filled up for just this task and wont be reused. This make it important to consider optimizations at the time of population arrays.

For example,

Array F1 need not be sorted.

AND

Cant you just
- load the second array into a Tree (like the Filesystem) with each directory as a node in the tree and filenames as the leaves.
- Then do an preorder traversal of the tree (left, root, right) and put the path in PATH array and FILENAME in Filenames array.

At the end you get sorted arrays of fienames and corresponding paths. Unless ofcourse if you expecting a case where filenames in F1 array may not exist in P1 array.

Again, depending upon how deep the paths can be, A tree could achieve lesser number of comparisons compared to binary search on a 50K array. The worst case would be the maximum number path segments.

Note that looking up a path for a filename is a classic dictionary lookup and there are algorithms and data-structures that can achieve performance much greater than a binary search. For example, using a Trie variant like PARTICIA. The cost of setting up such structures isnt trivial and in your case a binary tree/search ought to perform well enough.
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The greatest common divisor (gcd) of two positive integers is their largest common divisor. Let's consider two numbers 12 and 20. The divisors of 12 are 1, 2, 3, 4, 6, 12 The divisors of 20 are 1, 2, 4, 5, 10 20 The highest number among the c…
This article will show, step by step, how to integrate R code into a R Sweave document
This theoretical tutorial explains exceptions, reasons for exceptions, different categories of exception and exception hierarchy.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.
Suggested Courses

926 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question