Avatar of HukkaHua
HukkaHua

asked on

Searching duplicates from a huge data set

Hi,
     This question is from one of the popular books.

Given a sequential file containing 4,300,000,000 integers, how can you find one that appears at least once?
a) How to solve it with ample quantities of memory
b) External scratch files are allowed but only a few hundred bytes of main memory is available.

I am aware of the bit map technique, where we map each integer to a bit and so on.
Any solution to the above problem is greatly appreciated.

Thanks,
H
Avatar of sunnycoder
sunnycoder

>how can you find one that appears at least once ?
There would be many such integers ... are you interested in all of them?

>a) How to solve it with ample quantities of memory
Apart from a bitmap, you could build yourself a B tree, or you could use a hash table.

>b) External scratch files are allowed but only few hundred bytes of main memory is available.
Which means that you need to keep your B tree or hash table on disk and read only the relevant nodes/buckets at a time.

Yet another approach is Bloom filters - they are fast and require a constant amount of memory, but the only problem is that they can give you false positives - constant memory comes at a price ;-)
B trees:
http://www.bluerwhite.org/btree/

B trees are also used for indexing databases - millions of entries.
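
For illustration, here is a minimal Bloom filter sketch along those lines. The two integer mixers standing in for hash functions, the constants, and the class name are just placeholders, not from any particular library:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal Bloom filter sketch for 32-bit values: k = 2 hash functions,
    // m bits of storage. False positives are possible, false negatives are not.
    struct BloomFilter {
        std::vector<uint64_t> bits;
        std::size_t m;                                   // number of bits

        explicit BloomFilter(std::size_t mbits) : bits((mbits + 63) / 64), m(mbits) {}

        // two cheap integer mixers used as the hash functions (illustrative only)
        static uint64_t h1(uint32_t x) { return x * 2654435761u; }
        static uint64_t h2(uint32_t x) { return (x ^ 0x9e3779b9u) * 40503u; }

        void set(std::size_t i)        { bits[i / 64] |= (uint64_t{1} << (i % 64)); }
        bool test(std::size_t i) const { return bits[i / 64] & (uint64_t{1} << (i % 64)); }

        // returns true if x was *possibly* seen before, false if definitely new
        bool insert(uint32_t x) {
            std::size_t i = h1(x) % m, j = h2(x) % m;
            bool maybeSeen = test(i) && test(j);
            set(i);
            set(j);
            return maybeSeen;
        }
    };

Any value for which insert() returns true is only a candidate duplicate and still has to be verified, e.g. in a second pass. On choosing the number of hash functions: the usual rule of thumb is k ≈ (m/n)·ln 2 for m bits and n expected insertions; the sketch simply fixes k = 2 for brevity.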
Avatar of HukkaHua
HukkaHua

ASKER

Well let me split it into 2 cases :

1) any one integer will suffice
2) list of all duplicate integers

I can relate to the B tree approach, but I don't see how a hash table will help within main memory?
If it is a better technique, could you please explain?

Also, can you provide me with the code for the external scratch file solution? It would be very helpful.

For Bloom filters, how do you choose the number of hash functions? Also, the problem with Bloom filters is that one can't delete entries, although in this problem I don't foresee needing that.

Thanks,
H
>can you find one that appears at least once ?
>1) any one integer will suffice
Pick the first one!!

>2) list of all duplicate integers
>but don't see how a hash table will help in the main memory ?
Hashing takes a constant amount of time per number. Hash your number and place it in the right bucket. If it is already in the bucket, then it's a duplicate.

B tree on disk implementations
http://en.wikipedia.org/wiki/B-tree
http://swiss.csail.mit.edu/~jaffer/WB
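
A minimal sketch of that hash-table check for case (a), where memory is plentiful. It assumes 32-bit values in a whitespace-separated text file; the filename is only illustrative:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <unordered_set>

    int main() {
        std::ifstream ifs("inputfile");     // whitespace-separated integers
        std::unordered_set<uint32_t> seen;
        uint32_t x;
        while (ifs >> x) {
            // insert() reports whether the value was already present
            if (!seen.insert(x).second) {
                std::cout << x << " appears at least twice\n";
                break;                      // case 1: any one duplicate suffices
            }
        }
        return 0;
    }

With roughly 4.3 billion mostly distinct values the set would need many gigabytes (well beyond the raw 4 bytes per value), so this really is the ample-memory case only.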
So to sum it up, we need to sort the random set of numbers and then scan to check whether duplicates exist or not? Sorting will take O(n log n), and then we need just one scan pass.

I am not actually sure how duplicates are handled in a B tree and how I would scan for duplicates in B trees.

The bit map method will need additional bits, and the problem with it is that if we don't know a priori how many duplicates exist for each number, it is very difficult to implement.

The disk based B tree link seems very good, I am going through it.

Please let me know your thoughts.

Thanks,
H
You don't necessarily have to sort. A good hash function would give you near-linear time.
>I am not actually sure how duplicates are handled in B tree and how do I scan duplicates in Btrees.
When you are generating a B tree, you can find a duplicate at the time of insertion. So by the time you are finished with constructing the B tree, you would already have the duplicate list.

>if we dont know apriori how many duplicates exist for each number, it is very difficult to implement.
Why do you need to know that ... irrespective of the number of duplicates, a number would appear only once in the list of duplicates.

with lots of memory
perl -ne 'print if ++$c{$_} ==2' < file

with limited memory
sort file | uniq -d
@sunny, can you please point me to hash functions which will give me linear time?

Well, it makes sense that we can track the duplicates during creation.
Hmm, but we'd need additional memory to keep a count of the number of times
a number is duplicated.

@ozo, I apologize, I don't know perl, so I can't decipher your solution :(

Thanks,
H
with lots of memory, use a hash table

with limited memory, do an external sort
or an external hash table
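
For what it's worth, the perl one-liner above counts each value in a hash and prints a value the second time it is seen, and the sort | uniq -d pipeline prints every value that occurs more than once in the sorted file. A minimal C++ sketch of that second, limited-memory idea, assuming the file has already been sorted externally (the filename is illustrative):

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    // Equivalent of "uniq -d" for whitespace-separated integers: assumes the
    // input has already been sorted externally, so equal values are adjacent
    // and only O(1) memory is needed.
    int main() {
        std::ifstream ifs("sorted_file");       // hypothetical filename
        uint32_t prev = 0, cur;
        bool havePrev = false, reported = false;
        while (ifs >> cur) {
            if (havePrev && cur == prev && !reported) {
                std::cout << cur << '\n';       // print each duplicated value once
                reported = true;
            }
            if (!havePrev || cur != prev) {
                prev = cur;
                havePrev = true;
                reported = false;
            }
        }
        return 0;
    }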
>>>> You dont necessarily have to sort. A good hashing function would give you near linear time.
I guess you both are speaking of different things. "Hashing" would require that the numbers have already been inserted into a hash table, which costs extra time that isn't linear. And of course looking up one specific number via its hash is quicker than linear time (same as a binary tree). Only if you evaluate all numbers using a hash may you get linear time. If the numbers were in a sorted file, you would have linear time to display duplicates too. Depending on the range the numbers are in, hashing could be very inefficient. Assume the numbers were all between 1 and 10 and the hash was not optimized for that.

>>>> I am aware of the bit map technique,
Are you aware that 4.3 billion integers is more than you can count with a 32-bit integer? Assume the range of integers is all 32-bit integers. Then you need about 500 MB of memory, or a file of the same size, for the bitmap. I would guess that memory is 100 times or more faster than a file.
Excellent point @itsmeandnobodyelse,

I have not yet been able to comprehend the hashing mechanism in this case. Yes, I think one would have to make a hash table/tree for the numbers, but since most of the numbers in this huge data set are going to be unique, it is going to be quite a feat. I am lost here... any input is appreciated.

Well, by the bit map technique I was referring to the technique shown in Programming Pearls, where, say, you allocate 10 bytes, i.e. 80 bits, and hence you can represent 80 numbers: 0 if missing, 1 if present. This technique is quite nice if the numbers are unique.

Thanks,
H  
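
For case (a), here is a minimal sketch of that bitmap idea scaled up to the full 32-bit range - one bit per possible value, about 512 MB, in line with the estimate above. The filename is only illustrative, and it assumes we are looking for a value that appears at least twice:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        // one bit per possible 32-bit value: 2^32 bits = 512 MB
        std::vector<uint64_t> seen(uint64_t{1} << 26, 0);   // 2^26 words of 64 bits
        std::ifstream ifs("inputfile");                     // illustrative filename
        uint32_t x;
        while (ifs >> x) {
            uint64_t word = x >> 6, bit = uint64_t{1} << (x & 63);
            if (seen[word] & bit) {                         // bit already set: duplicate
                std::cout << x << " appears at least twice\n";
                return 0;
            }
            seen[word] |= bit;
        }
        std::cout << "no duplicate found\n";
        return 0;
    }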
Guys, can you please provide working code for the problem? It would be most helpful.

Given a sequential file containing 4,300,000,000 integers, how can you find one that appears at least once?
a) How to solve it with ample quantities of memory
b) External scratch files are allowed but only a few hundred bytes of main memory is available.

I am kind of lost with so many approaches and solutions.

Thanks for all the help.
H
Hi itsmeandnobodyelse,

Maybe we too are talking of different things :)
>"Hashing" would require that the numbers have already been inserted into a hash table, which costs extra time that isn't linear.
>And of course looking up one specific number via its hash is quicker than linear time (same as a binary tree). Only if you
>evaluate all numbers using a hash may you get linear time. If the numbers were in a sorted file, you would have
>linear time to display duplicates too.
Cost of applying a hash function is linear. When you are inserting in a hash table, if you find that entry already exists, you have found a duplicate. This process needs to be applied for each number. i.e. O(n). Only caveat is if your collision resolution is weak and gives you a highly skewed distribution (e.g. as an extreme case, two buckets - odd or even - colliding entries form a linked list - this would give you O(n^2))

Sorting OTOH is at minimum O(n log n). Even if you are able to detect duplicates while sorting, the cost remains the same. A scan after sorting means O(n log n) + O(n), which is effectively O(n log n).


>Depending on the range the numbers are in, hashing could be very inefficient. Assume the numbers were all
>between 1 and 10 and the hash was not optimized for that.
if you are detecting while inserting, it does not matter how many duplicates you have. You simply do not insert a duplicate!!!

Hi HukkaHua,

>I have not yet been able to comprehend the hashing mechanism in this case
I hope the above explanation helps clear up some doubts. In case it is still not clear, please post the exact doubt you have and I will revert with clarifications. Do go through the hashing links I posted earlier. It is indeed a good hashing function, with very well documented source code too.

>Guys can you please provide a working code for the problem ?
I have already posted links to code for B trees and hashing :)
>>>> Cost of applying a hash function is linear.
>>>> This process needs to be applied for each number. i.e. O(n).
That would be O(n^2). But applying a hash is less than linear (if applying a hash were linear, you could just as well iterate in a loop, which obviously isn't the case). So depending on the goodness of the hash you have something between O(n) and O(n^2).

>>>> Given a sequential file containing 4,300,000,000 integers, how can you find one that appears at least once?
I have problems understanding that requirement. You only have to read the first number of the file to find an integer which appears at least once (or did you mean at most?)

   #include <fstream>
   #include <iostream>
   int main() {
       std::ifstream ifs("inputfile");
       int i;
       if (ifs >> i)
           std::cout << i << std::endl;
   }

Note, a file with 4.3 billion integers - not sorted - isn't a realistic scenario (it looks more like an academic homework task). You would normally write the numbers sorted and/or already evaluated, like

1     737
2        3
3      17
...
99    12
110  15
...

where the first number is one of the 4.3 billion numbers and the second is the frequency count. In case you have a maximum of 10,000 different numbers, that would reduce the file size (and/or improve performance) by a factor of about 200k. This kind of storage is even appropriate if you have 1 billion different integers (which is rather unrealistic), because by storing a billion (sorted) numbers with their frequencies you still need only about half the space needed for 4.3 billion integers. If you have 4 billion different integers (which means that - assuming every integer is at most 32 bits - nearly every 32-bit integer is included), you could use negative storage, i.e. you only put integers into the file which have a count different from 1. In any case you shouldn't have the problem of having to deal with 4.3 billion numbers unsorted in a file and too little memory to get them properly sorted.
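
A minimal sketch of writing such a value/count file, assuming the numbers have already been sorted (both filenames are just illustrative):

    #include <cstdint>
    #include <fstream>

    // Collapse an already-sorted stream of integers into "value count" lines,
    // as in the layout sketched above.
    int main() {
        std::ifstream in("sorted_file");
        std::ofstream out("value_counts");
        uint32_t cur, prev = 0;
        uint64_t count = 0;
        while (in >> cur) {
            if (count > 0 && cur != prev) {
                out << prev << ' ' << count << '\n';
                count = 0;
            }
            prev = cur;
            ++count;
        }
        if (count > 0)
            out << prev << ' ' << count << '\n';
        return 0;
    }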

>>Cost of applying a hash function is linear.
Arrgh .. my bad ... I meant constant ..

>That would be O(n^2).
I guess that changes it to linear now. A constant-time operation applied to the entire input is O(n).
ASKER CERTIFIED SOLUTION
Avatar of itsmeandnobodyelse
itsmeandnobodyelse
>Only if the hash always had a hit at the first approach (what isn't very likely)
>you would need to search for the number (e.g. in a linked list) what costs extra time
While it is hard to guarantee that there would be no collisions, it is easy to limit the extra cost of search incurred due to collisions. You can limit the length of the chain to, say, 10 elements, and after that you resize and move things around. For a well designed hash function, simply resizing on a table-wide load factor threshold would suffice.
In such a case the cost of a search may not be exactly that of applying the hash function, but it is guaranteed to be less than a known constant (hash function + say 10 comparisons).
In my experience, I have generally found hashing to be faster than trees for very large data sets.
If range is known in advance (and it is reasonable) then bitmaps, as you illustrated, are indeed the best way to go.
>> >>>> Given a sequential file containing 4,300,000,000 integers, how can you find one that appears at least once?
>> I have problems understanding that requirement. You only have to read the first number of the file to find an integer which appears at least once

That is what I'm wondering about too. Were those the exact words of the question ? Maybe it's a trick question, and what Alex said is the solution ?
This might be the exact question: http://mikedebo.ca/
>> This might be the exact question: http://mikedebo.ca/

There it says at least twice, which makes more sense I guess :)
You can do it in less than 2.0 passes with 128 Kbytes.

Since 4.3 billion > 2^32, by the pigeonhole principle there must be at least one 64k range of numbers with more than 64k representatives.  

So you do the first pass with an array of 64k 16-bit ints, counting the total in each 64k range.  As soon as you find a range with more than 64k entries, you know that there is a duplicated entry in this particular range.  

Seek back to the beginning of the file.  Now you need a 64kbit array (might as well just reuse the array you just had).  Start going through the file again, looking only for numbers in the range of interest. Use the 64kbit array to track the numbers in the range you see.  As soon as you see a duplicate, you're done.
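
A minimal sketch of that two-pass scheme, under the assumption that the file holds more than 2^32 whitespace-separated 32-bit values. It uses 32-bit counters (256 KB) instead of the 16-bit counters described above, just to keep overflow handling out of the way; the filename is illustrative:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        // pass 1: count how many values fall into each of the 64k ranges of size 64k
        std::vector<uint32_t> count(1u << 16, 0);
        std::ifstream ifs("inputfile");
        uint32_t x, target = 0;
        bool found = false;
        while (ifs >> x) {
            if (++count[x >> 16] > (1u << 16)) {    // > 64k values in a 64k range: duplicate inside
                target = x >> 16;
                found = true;
                break;
            }
        }
        if (!found) { std::cout << "no range overflowed\n"; return 0; }

        // pass 2: one bit per value in the target range (64k bits = 8 KB)
        std::vector<uint64_t> bits(1u << 10, 0);
        ifs.clear();
        ifs.seekg(0);
        while (ifs >> x) {
            if ((x >> 16) != target) continue;
            uint32_t low = x & 0xFFFF;
            uint64_t mask = uint64_t{1} << (low & 63);
            if (bits[low >> 6] & mask) {            // seen this value before
                std::cout << x << " appears at least twice\n";
                return 0;
            }
            bits[low >> 6] |= mask;
        }
        return 0;
    }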
NovaDenizen, I don't get that

>counting the total in each 64k range.  
>As soon as you find a range with more than 64k entries, you know that there is a duplicated entry in this particular range.  

If you are looping over a 64k range, how would you get more than 64k entries in that range?

Am I missing something?
I think what NovaDenizen is referring to is to have 64k counters, one for each of the 64k ranges of values in a 32bit int (0-65535, 65536-131071, etc.). Then one pass over the file will increment these counters to count how many values are in each of those ranges. Finally, simply find a range that has a cardinality higher than the length of the range, and do another pass on the file, only checking values in that range for duplicates.
>>>> I think what NovaDenizen is referring to is to have 64k counters,
64k counters could hardly be managed with a few hundred bytes of memory ...
>>>> I think what NovaDenizen is referring to is to have 64k counters,
>>>> 64k counters could hardly be managed with a few hundred bytes of memory ...

The (b) clause in the original question suggests that one should do it with lots of external storage and a few  hundred bytes of memory.

The pigeonhole idea can be adapted to use either multiple passes or lots of external storage (equal to the original file) successfully.

Multiple passes:
Conceptually divide the original data into 16 ranges, each 2^28 in length.  By the pigeonhole principle, at least one of these ranges will contain more than 2^28 entries.  Scan through the file, count the elements in each range (with a 64-byte array of 16 32-bit ints) until you find a range with more than 2^28 entries.

Now you do the same thing, except now we know that at least one of the 16 ranges of size 2^24 within the 2^28 range of interest will contain more than 2^24 entries.  Scan through the whole file until you find the right range of size 2^24.  And so on.

Doing it dividing by 16 each time will result in ceil(32/4) = 8 passes through the original datafile.  Instead dividing by 128 each time (requiring 512 bytes) will result in ceil(32/7) = 5 passes.

Using External Storage:
Do the same as above, except as you scan through you create 16 files, each with the contents of each range.  As soon as a file reaches 2^28+1 entries, stop the scan and delete the other files.  Now do the same recursion into this file, looking for a 2^24 range with at least 2^24+1 entries.

At first it will make a little less than a full pass through the data, and during each pass it writes out as much data as it reads in. Then a second pass through 1/16 of the original data, then 1/256, etc. 1 + 1/16 + 1/256 + ... = 16/15, so assuming the process is I/O bound it will take a little longer than just copying the original file would, and you wouldn't get a significant speedup from breaking it down to 128 ranges.
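
A minimal sketch of the multiple-passes variant for case (b), again assuming the file holds more than 2^32 whitespace-separated 32-bit values. It divides by 16 each pass, so the working memory is just sixteen counters (on the order of a hundred bytes, slightly more than the 64 bytes quoted above because 64-bit counters are used); the filename is illustrative:

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    int main() {
        uint32_t lo = 0;                        // inclusive lower bound of candidate range
        uint64_t width = uint64_t{1} << 32;     // current width of the candidate range
        while (width > 1) {
            uint64_t sub = width / 16;          // width of each of the 16 sub-ranges
            uint64_t count[16] = {0};
            std::ifstream ifs("inputfile");     // re-read the whole file each pass
            uint32_t x;
            while (ifs >> x)
                if (x >= lo && uint64_t{x} < lo + width)
                    ++count[(x - lo) / sub];
            // pigeonhole: some sub-range holds more values than it has distinct
            // members, so it must contain a duplicate; descend into the first such one
            int pick = 0;
            for (int i = 0; i < 16; ++i)
                if (count[i] > sub) { pick = i; break; }
            lo += uint32_t(pick * sub);
            width = sub;
        }
        std::cout << lo << " appears at least twice\n";
        return 0;
    }

After eight passes (ceil(32/4), as described above) the candidate range has shrunk to a single value, which by construction occurs more than once.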
>>>> The pigeonhole idea can be adapted to use either multiple passes or lots of external storage (equal to the original file) successfully.
Fine approach ;-)


A maybe more pragmatic way is to sort the file by means of the operating system and run a simple loop until the first duplicate is found (I know I'm cheating but I don't care).

Thanks for putting in the effort and writing the detailed code.