[Webinar] Streamline your web hosting managementRegister Today

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 833
  • Last Modified:

How can I get the intersection of two very large lists of numbers in C?

I have the following perl script to determine the intersection of two very large lists of numbers.  How can I do the same in C?  I'm hoping that by writing it in a compiled language rather than an interpreted one, I can get a performance boost (please let me know if that's not the case).
# Get first large data file
$data_file = $ARGV[0];
open(DAT, $data_file) || die("Could not open file [$ARGV[0]]!");
@raw_data = <DAT>;
close(DAT);
@a = split(',', join(//, @raw_data));
 
# Get second large data file
$data_file = $ARGV[1];
open(DAT, $data_file) || die("Could not open file [$ARGV[1]]!");
@raw_data = <DAT>;
close(DAT);
@b = split(',', join(//, @raw_data));
 
# Find intersection
%isect = ();
map { $isect{$_} = 1 } @a;
@isect_list = grep { $isect{$_} } @b;
 
# Make unique
undef %saw;
@unique = grep(!$saw{$_}++, @isect_list);
 
# Print results
$count = @unique;
print $count;

Open in new window

0
dc_cypher
Asked:
dc_cypher
  • 6
  • 5
  • 3
  • +2
3 Solutions
 
ozoCommented:
You can implement a hash in c, and then use the perl algorithm,
or you could sort the lists and scan sequentially
0
 
evilrixSenior Software Engineer (Avast)Commented:
If you were to do this in C++ it would probably be easier. You could probably use std::set_intersection  

http://www.cplusplus.com/reference/algorithm/set_intersection.html

Does that help you?
0
 
Infinity08Commented:
Basically, sort the two lists, and then traverse both, retaining only the values that appear in both lists.

Does "very large" mean "too large for memory" ? Can you give an idea of the size ?
0
The new generation of project management tools

With monday.com’s project management tool, you can see what everyone on your team is working in a single glance. Its intuitive dashboards are customizable, so you can create systems that work for you.

 
dc_cypherAuthor Commented:
They will be lists of several hundreds of thousands of numbers (possibly millions) stored in two separate data files.  Unfortunately, my C/C++ skills are basic at best.
0
 
Adam314Commented:
If you can sort the files first, then you will be able to read the files one line at a time.
Are you on a unix OS?  If so, sort the files using the sort command.
Then your C program will read 1 line from each file.  If the numbers are equal, the number is in the intersection - and you can output it (either standard output, or to a file).  If the numbers are not equal, read the next line from the file with the lower number.  Repeat this until you have processed both files completly.
0
 
evilrixSenior Software Engineer (Avast)Commented:
>> Unfortunately, my C/C++ skills are basic at best.
There was a very good example in the link I posted above Did you look at it?
// set_intersection example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
 
int main () {
  int first[] = {5,10,15,20,25};
  int second[] = {50,40,30,20,10};
  vector<int> v(10);                           // 0  0  0  0  0  0  0  0  0  0
  vector<int>::iterator it;
 
  sort (first,first+5);     //  5 10 15 20 25
  sort (second,second+5);   // 10 20 30 40 50
 
  it=set_intersection (first, first+5, second, second+5, v.begin());
                                               // 10 20 0  0  0  0  0  0  0  0
 
  cout << "intersection has " << int(it - v.begin()) << " elements.\n";
 
  return 0;
}

Open in new window

0
 
Infinity08Commented:
>> several hundreds of thousands of numbers (possibly millions)

Let's take the worst case then, and assume a 100 million in each file. Each file would correspond to about half a gigabyte in memory, so that's a total of 1 gigabyte. That's a bit much, and you probably don't want to put all of the data in memory at once.

So, the approach would be to :

a) sort both files (in place)
b) read the files line by line, and compare. Write the "doubles" to an output file

Step b) is trivial ... Step a) is the more complicated. Obviously if the files are already sorted (which is the preferred scenario), you don't have to bother about that step. Is that the case ?

Anyway, assuming that it's not the case, and we need to sort the files, you'll need to make use of an external sorting algorithm :

        http://en.wikipedia.org/wiki/External_sort
0
 
ozoCommented:
computers with a gigabyte of memory are common, so it may not be too bad to sort in memory
Hashing may be more efficient on average than sorting, but perl hashes can easily waste enough memory so that your numbers no longer fit in ram and have to thrash through swap space, so a C qsort would probably be be better.

many system that have perl and C also have an optimized external sort utility, so you might use that if it's too big to fit in your computers memory
0
 
dc_cypherAuthor Commented:
Sorting them in advance will not be a problem, but even so, every millisecond counts.  I've put together the following with my basic C++ abilities, but is it really the fastest solution?
#include <iostream>
#include <fstream>
using namespace std;
 
int main(void) {
  fstream InFile1, InFile2;
  int Num1, Num2;
 
  // Open file 1
  InFile1.open("./bar", ios::in);
  if (InFile1.fail()) {
    cerr << "Could not open file " << "./bar" << endl;
    exit(1);
  }
 
  // Open file 2
  InFile2.open("./baz", ios::in);
  if (InFile2.fail()) {
    cerr << "Could not open file " << "./baz" << endl;
    exit(1);
  }
 
  InFile1 >> Num1;
  InFile2 >> Num2;
 
  // Compare and print if intersection is found
  while (!InFile1.fail() && !InFile2.fail()) {
    if (Num1 == Num2) {
      cout << Num1 << endl;
      InFile1 >> Num1;
      InFile2 >> Num2;
    } else if (Num1 < Num2) {
      InFile1 >> Num1;
    } else {
      InFile2 >> Num2;
    }
  }
 
  InFile1.close();
  InFile2.close();
 
  return 0;
}

Open in new window

0
 
dc_cypherAuthor Commented:
evilrix: I looked at the code you provided, but won't adding each number to a vector take up unnecessary time?  I guess the question is, will creating the vectors offset the performance gain (if any) from using set_intersection()?
0
 
ozoCommented:

  // Compare and print if intersection is found
  while (!InFile1.fail() && !InFile2.fail()) {
    if (Num1 == Num2) {
this is probably limited by how fast you can read and write the files
0
 
evilrixSenior Software Engineer (Avast)Commented:
>> I guess the question is, will creating the vectors offset the performance gain (if any) from using set_intersection()?
Well you have to put them somewhere :)
The cost in using an STL container is the heap allocation. Appending to a vector if the memory is already allocated is cheap. If you don't know up front how many entries they'll be then allocate memory in blocks or, for example, 4Kb, as required using the resize() method. If you do know up front then allocate all the memory you need in one code.
0
 
evilrixSenior Software Engineer (Avast)Commented:
>> using the resize() method.

Sorry, typo (it's early here!), I meant reserve() method...

void reserve(size_type n)  If n is less than or equal to capacity(), this call has no effect. Otherwise, it is a request for allocation of additional memory. If the request is successful, then capacity() is greater than or equal to n; otherwise, capacity() is unchanged. In either case, size() is unchanged.
0
 
evilrixSenior Software Engineer (Avast)Commented:
Incidentally, you don't have to pusk_back one at a time. If you read a block of entries from file you can insert the whole block in one go, which will also take care of block allocating heap for you. Something (untested) like below...
#include <vector>
 
int main ()
{
	std::vector<int> vi;
 
	int buffer[4096] = {0}; // This will be used to read from file
	int read_from_file = 0; // This will be the number of bytes read from file
 
	...
 
	vi.insert(vi.end(), buffer, (buffer + read_from_file/sizeof(int)));
	return 0;
}

Open in new window

0
 
Infinity08Commented:
>> I've put together the following with my basic C++ abilities

For that code to work correctly, the files can't contain doubles (the same value can't appear twice). Will that be the case ?
0
 
dc_cypherAuthor Commented:
Infinity08: Yes, that will be the case.

Ok, I think I've gotten what I need from this question.  Now the problem is how to accept solution(s) because there's been a lot of good advice flowing.  Any suggestions?
0
 
evilrixSenior Software Engineer (Avast)Commented:
You would normally select which ever answers you feel helped you resolve your problem. You can accept multiple answers. Divide the points up based upon how useful you found the advice in each accepted answer.

Does that help?
0
 
dc_cypherAuthor Commented:
Makes sense.  Thanks again!
0

Featured Post

Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

  • 6
  • 5
  • 3
  • +2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now