asked on

How can I get the intersection of two very large lists of numbers in C?

I have the following perl script to determine the intersection of two very large lists of numbers. How can I do the same in C? I'm hoping that by writing it in a compiled language rather than an interpreted one, I can get a performance boost (please let me know if that's not the case).

# Get first large data file
$data_file = $ARGV[0];
open(DAT, $data_file) || die("Could not open file [$ARGV[0]]!");
@raw_data = <DAT>;
close(DAT);
@a = split(',', join(//, @raw_data));
 
# Get second large data file
$data_file = $ARGV[1];
open(DAT, $data_file) || die("Could not open file [$ARGV[1]]!");
@raw_data = <DAT>;
close(DAT);
@b = split(',', join(//, @raw_data));
 
# Find intersection
%isect = ();
map { $isect{$_} = 1 } @a;
@isect_list = grep { $isect{$_} } @b;
 
# Make unique
undef %saw;
@unique = grep(!$saw{$_}++, @isect_list);
 
# Print results
$count = @unique;
print $count;

Open in new window

ozo

You can implement a hash in c, and then use the perl algorithm,
or you could sort the lists and scan sequentially

SOLUTION

evilrix

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Infinity08

Basically, sort the two lists, and then traverse both, retaining only the values that appear in both lists.

Does "very large" mean "too large for memory" ? Can you give an idea of the size ?

dc_cypher

ASKER

They will be lists of several hundreds of thousands of numbers (possibly millions) stored in two separate data files. Unfortunately, my C/C++ skills are basic at best.

SOLUTION

Adam314

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

evilrix

>> Unfortunately, my C/C++ skills are basic at best.
There was a very good example in the link I posted above Did you look at it?

// set_intersection example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
 
int main () {
  int first[] = {5,10,15,20,25};
  int second[] = {50,40,30,20,10};
  vector<int> v(10);                           // 0  0  0  0  0  0  0  0  0  0
  vector<int>::iterator it;
 
  sort (first,first+5);     //  5 10 15 20 25
  sort (second,second+5);   // 10 20 30 40 50
 
  it=set_intersection (first, first+5, second, second+5, v.begin());
                                               // 10 20 0  0  0  0  0  0  0  0
 
  cout << "intersection has " << int(it - v.begin()) << " elements.\n";
 
  return 0;
}

Open in new window

ASKER CERTIFIED SOLUTION

Infinity08

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

ozo

computers with a gigabyte of memory are common, so it may not be too bad to sort in memory
Hashing may be more efficient on average than sorting, but perl hashes can easily waste enough memory so that your numbers no longer fit in ram and have to thrash through swap space, so a C qsort would probably be be better.

many system that have perl and C also have an optimized external sort utility, so you might use that if it's too big to fit in your computers memory

dc_cypher

ASKER

Sorting them in advance will not be a problem, but even so, every millisecond counts. I've put together the following with my basic C++ abilities, but is it really the fastest solution?

#include <iostream>
#include <fstream>
using namespace std;
 
int main(void) {
  fstream InFile1, InFile2;
  int Num1, Num2;
 
  // Open file 1
  InFile1.open("./bar", ios::in);
  if (InFile1.fail()) {
    cerr << "Could not open file " << "./bar" << endl;
    exit(1);
  }
 
  // Open file 2
  InFile2.open("./baz", ios::in);
  if (InFile2.fail()) {
    cerr << "Could not open file " << "./baz" << endl;
    exit(1);
  }
 
  InFile1 >> Num1;
  InFile2 >> Num2;
 
  // Compare and print if intersection is found
  while (!InFile1.fail() && !InFile2.fail()) {
    if (Num1 == Num2) {
      cout << Num1 << endl;
      InFile1 >> Num1;
      InFile2 >> Num2;
    } else if (Num1 < Num2) {
      InFile1 >> Num1;
    } else {
      InFile2 >> Num2;
    }
  }
 
  InFile1.close();
  InFile2.close();
 
  return 0;
}

Open in new window

dc_cypher

ASKER

evilrix: I looked at the code you provided, but won't adding each number to a vector take up unnecessary time? I guess the question is, will creating the vectors offset the performance gain (if any) from using set_intersection()?

ozo

// Compare and print if intersection is found
while (!InFile1.fail() && !InFile2.fail()) {
if (Num1 == Num2) {
this is probably limited by how fast you can read and write the files

evilrix

>> I guess the question is, will creating the vectors offset the performance gain (if any) from using set_intersection()?
Well you have to put them somewhere :)
The cost in using an STL container is the heap allocation. Appending to a vector if the memory is already allocated is cheap. If you don't know up front how many entries they'll be then allocate memory in blocks or, for example, 4Kb, as required using the resize() method. If you do know up front then allocate all the memory you need in one code.

evilrix

>> using the resize() method.

Sorry, typo (it's early here!), I meant reserve() method...

void reserve(size_type n) If n is less than or equal to capacity(), this call has no effect. Otherwise, it is a request for allocation of additional memory. If the request is successful, then capacity() is greater than or equal to n; otherwise, capacity() is unchanged. In either case, size() is unchanged.

evilrix

Incidentally, you don't have to pusk_back one at a time. If you read a block of entries from file you can insert the whole block in one go, which will also take care of block allocating heap for you. Something (untested) like below...

#include <vector>
 
int main ()
{
	std::vector<int> vi;
 
	int buffer[4096] = {0}; // This will be used to read from file
	int read_from_file = 0; // This will be the number of bytes read from file
 
	...
 
	vi.insert(vi.end(), buffer, (buffer + read_from_file/sizeof(int)));
	return 0;
}

Open in new window

Infinity08

>> I've put together the following with my basic C++ abilities

For that code to work correctly, the files can't contain doubles (the same value can't appear twice). Will that be the case ?

dc_cypher

ASKER

Infinity08: Yes, that will be the case.

Ok, I think I've gotten what I need from this question. Now the problem is how to accept solution(s) because there's been a lot of good advice flowing. Any suggestions?

evilrix

You would normally select which ever answers you feel helped you resolve your problem. You can accept multiple answers. Divide the points up based upon how useful you found the advice in each accepted answer.

Does that help?

dc_cypher

ASKER

Makes sense. Thanks again!