Link to home
Start Free TrialLog in
Avatar of dc_cypher
dc_cypher

asked on

How can I get the intersection of two very large lists of numbers in C?

I have the following perl script to determine the intersection of two very large lists of numbers.  How can I do the same in C?  I'm hoping that by writing it in a compiled language rather than an interpreted one, I can get a performance boost (please let me know if that's not the case).
# Get first large data file
$data_file = $ARGV[0];
open(DAT, $data_file) || die("Could not open file [$ARGV[0]]!");
@raw_data = <DAT>;
close(DAT);
@a = split(',', join(//, @raw_data));
 
# Get second large data file
$data_file = $ARGV[1];
open(DAT, $data_file) || die("Could not open file [$ARGV[1]]!");
@raw_data = <DAT>;
close(DAT);
@b = split(',', join(//, @raw_data));
 
# Find intersection
%isect = ();
map { $isect{$_} = 1 } @a;
@isect_list = grep { $isect{$_} } @b;
 
# Make unique
undef %saw;
@unique = grep(!$saw{$_}++, @isect_list);
 
# Print results
$count = @unique;
print $count;

Open in new window

Avatar of ozo
ozo
Flag of United States of America image

You can implement a hash in c, and then use the perl algorithm,
or you could sort the lists and scan sequentially
SOLUTION
Avatar of evilrix
evilrix
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Basically, sort the two lists, and then traverse both, retaining only the values that appear in both lists.

Does "very large" mean "too large for memory" ? Can you give an idea of the size ?
Avatar of dc_cypher
dc_cypher

ASKER

They will be lists of several hundreds of thousands of numbers (possibly millions) stored in two separate data files.  Unfortunately, my C/C++ skills are basic at best.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>> Unfortunately, my C/C++ skills are basic at best.
There was a very good example in the link I posted above Did you look at it?
// set_intersection example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
 
int main () {
  int first[] = {5,10,15,20,25};
  int second[] = {50,40,30,20,10};
  vector<int> v(10);                           // 0  0  0  0  0  0  0  0  0  0
  vector<int>::iterator it;
 
  sort (first,first+5);     //  5 10 15 20 25
  sort (second,second+5);   // 10 20 30 40 50
 
  it=set_intersection (first, first+5, second, second+5, v.begin());
                                               // 10 20 0  0  0  0  0  0  0  0
 
  cout << "intersection has " << int(it - v.begin()) << " elements.\n";
 
  return 0;
}

Open in new window

ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
computers with a gigabyte of memory are common, so it may not be too bad to sort in memory
Hashing may be more efficient on average than sorting, but perl hashes can easily waste enough memory so that your numbers no longer fit in ram and have to thrash through swap space, so a C qsort would probably be be better.

many system that have perl and C also have an optimized external sort utility, so you might use that if it's too big to fit in your computers memory
Sorting them in advance will not be a problem, but even so, every millisecond counts.  I've put together the following with my basic C++ abilities, but is it really the fastest solution?
#include <iostream>
#include <fstream>
using namespace std;
 
int main(void) {
  fstream InFile1, InFile2;
  int Num1, Num2;
 
  // Open file 1
  InFile1.open("./bar", ios::in);
  if (InFile1.fail()) {
    cerr << "Could not open file " << "./bar" << endl;
    exit(1);
  }
 
  // Open file 2
  InFile2.open("./baz", ios::in);
  if (InFile2.fail()) {
    cerr << "Could not open file " << "./baz" << endl;
    exit(1);
  }
 
  InFile1 >> Num1;
  InFile2 >> Num2;
 
  // Compare and print if intersection is found
  while (!InFile1.fail() && !InFile2.fail()) {
    if (Num1 == Num2) {
      cout << Num1 << endl;
      InFile1 >> Num1;
      InFile2 >> Num2;
    } else if (Num1 < Num2) {
      InFile1 >> Num1;
    } else {
      InFile2 >> Num2;
    }
  }
 
  InFile1.close();
  InFile2.close();
 
  return 0;
}

Open in new window

evilrix: I looked at the code you provided, but won't adding each number to a vector take up unnecessary time?  I guess the question is, will creating the vectors offset the performance gain (if any) from using set_intersection()?

  // Compare and print if intersection is found
  while (!InFile1.fail() && !InFile2.fail()) {
    if (Num1 == Num2) {
this is probably limited by how fast you can read and write the files
>> I guess the question is, will creating the vectors offset the performance gain (if any) from using set_intersection()?
Well you have to put them somewhere :)
The cost in using an STL container is the heap allocation. Appending to a vector if the memory is already allocated is cheap. If you don't know up front how many entries they'll be then allocate memory in blocks or, for example, 4Kb, as required using the resize() method. If you do know up front then allocate all the memory you need in one code.
>> using the resize() method.

Sorry, typo (it's early here!), I meant reserve() method...

void reserve(size_type n)  If n is less than or equal to capacity(), this call has no effect. Otherwise, it is a request for allocation of additional memory. If the request is successful, then capacity() is greater than or equal to n; otherwise, capacity() is unchanged. In either case, size() is unchanged.
Incidentally, you don't have to pusk_back one at a time. If you read a block of entries from file you can insert the whole block in one go, which will also take care of block allocating heap for you. Something (untested) like below...
#include <vector>
 
int main ()
{
	std::vector<int> vi;
 
	int buffer[4096] = {0}; // This will be used to read from file
	int read_from_file = 0; // This will be the number of bytes read from file
 
	...
 
	vi.insert(vi.end(), buffer, (buffer + read_from_file/sizeof(int)));
	return 0;
}

Open in new window

>> I've put together the following with my basic C++ abilities

For that code to work correctly, the files can't contain doubles (the same value can't appear twice). Will that be the case ?
Infinity08: Yes, that will be the case.

Ok, I think I've gotten what I need from this question.  Now the problem is how to accept solution(s) because there's been a lot of good advice flowing.  Any suggestions?
You would normally select which ever answers you feel helped you resolve your problem. You can accept multiple answers. Divide the points up based upon how useful you found the advice in each accepted answer.

Does that help?
Makes sense.  Thanks again!