Very large arrays

I have a structure as follows:

typedef struct {
  unsigned char address[4];
  unsigned short port;
  unsigned long traffic;
} IPData;

I am going to be inserting data into this quite a bit... the total number of structures in use will be between 10,000 and 25 million.

I assume a hash function/table of sorts would be needed, since a linked list would take forever to traverse everytime I wanted to read/write an element.  I will pretty much be reading in a cache engine's data file and summarizing all of the data by (1) the source address (2) the destination address (3) the destination port (4) the source port.

I thought the best way to do this would be to read in the data, create two "separate" lists for source IP's and destination IP's... and then querying the lists by address or port, depending on which report I am generating.  The lists will contain a unique entry for every unique address/port pair that the parser encounters.

What I'm looking for is a viable suggestion for handling such a large amount of data.  I doubt I can create an array with 25 million elements, so I am trying to figure out the best alternative.  This isn't going to run on a supercomputer... probably a desktop Linux station with a few hundered MB of memory.  The method for indexing the data should, of course, be as fast as possible.

Thanks for any suggestions.
Who is Participating?
nebekerConnect With a Mentor Commented:
Your IPData structure only has 10 bytes, so 25 million records would be 238 MB of memory -- certainly doable on your average linux box with 256 or 384 MB of memory... Some swap will be used, but it won't kill the system.

What is the "traffic" item used for?

If you keep the source IPs and destination IPs in separate lists, how will you link the two?  It would seem that you need some type of record structure that captures both sides of a transaction, i.e. :

typedef struct {
    IPData source;
    IPData destination;
} IPTransaction;

As for searching this data, you have a couple of choices:  first, you could create a hash table on the address or port;  second, you could simply sort the data as you read it from the cache engine (using an insertion sort to sort it as you read it, or a simple qsort after all the data is read in).  A binary search is quite efficient, no matter how larger the data array.

As far as creating an array of 25 million elements -- as long as you allocated the memory with malloc(), instead of declaring the variable as

IPData myArray[25000000];

you should be ok :-)

As far as I understand from your question you don't need to search the data, only to go all over it. (?)
Assuming this is true, just save it to a file (binary), and read a small buffer of a data (let's say 10Mb) each time, proccess it, free it, and read the next buffer.

Good Luck !
jimstarAuthor Commented:
Thanks... over the next few days I'll be giving this a try.  I will let you know how it goes.
Free tool for managing users' photos in Office 365

Easily upload multiple users’ photos to Office 365. Manage them with an intuitive GUI and use handy built-in cropping and resizing options. Link photos with users based on Azure AD attributes. Free tool!

jimstarAuthor Commented:
Can't seem to make the large arrays work.

#include <stdio.h>

typedef struct {
     unsigned long a;
     unsigned long b;
} foo;

int main(void)
     foo* bar;
     (bar+ (50000*sizeof(foo)) )->a=1;
     return 0;

Gives a segmentation fault.  Running RedHat.  Want to set the 50,000th element... ideas?
jimstarAuthor Commented:
I also put in the free(bar); before return 0; but it gives a segfault right at the line which sets the element.

(bar + 49999)->a = 1;

Or, you could also use this format:

 bar[49999].a  = 1;
jimstarAuthor Commented:
Works great!  Thanks.
I thought I should explain just *why* my answer works, rather than just having you take my word for it.  C, as a language, handles some of the arithmetic for you when dealing with different types.  For example, assume you have the following variables:

char *a = 0;  (stored at memory address 0x1000)
int *b  = 0;  (stored at memory address 0x2000)
long *c = 0;  (stored at memory address 0x3000)

Assume for the sake of argument here that a char is 1 byte, an int is 4 bytes and a long is 8 bytes, and that the memory address above are something I just made up...

When you say "a + 1", a is a char, so the memory pointer will be advanced the SIZEOF a char, which happens to be one byte.  When you say "b + 1", the memory pointer is advanced the SIZEOF an int, or 4 bytes.

So the term "variable + 1" doesn't mean "add one to the memory location", it means "add one * sizeof(variable_type) to the memory location.

Therefore, the memory addresses would be:

 (a + 1) =  0x1001   (0x1000 + 1)
 (b + 1) =  0x2004   (0x2000 + 1 * sizeof(int))
 (c + 1) =  0x3008   (0x3000 + 1 * sizeof(long))

So in your previous code, you had:

 (bar+ (50000*sizeof(foo)) )->a=1;

the "sizeof(foo)" was redundant, since that was taken care of by the compiler.  What you had written above was the same thing aas:

 (bar+ (50000*16) )->a=1;    (which equals)

 (bar + 800000)->a=1;

which means you were trying to set element # 800,000 -- which is way out of the memory range you allocated.  That's why you were getting the access violation and segmentation fault.

Hope that clears it up a little.
Oops - small mistake :)

(bar + 800000) ->a = 1;

sets element number 800,001

Always have to remember that C arrays start at 0 ...
All Courses

From novice to tech pro — start learning today.