Very large arrays

Posted on 2002-03-09
Last Modified: 2010-05-18
I have a structure as follows:

typedef struct {
  unsigned char address[4];
  unsigned short port;
  unsigned long traffic;
} IPData;

I am going to be inserting data into this quite a bit... the total number of structures in use will be between 10,000 and 25 million.

I assume a hash function/table of sorts would be needed, since a linked list would take forever to traverse everytime I wanted to read/write an element.  I will pretty much be reading in a cache engine's data file and summarizing all of the data by (1) the source address (2) the destination address (3) the destination port (4) the source port.

I thought the best way to do this would be to read in the data, create two "separate" lists for source IP's and destination IP's... and then querying the lists by address or port, depending on which report I am generating.  The lists will contain a unique entry for every unique address/port pair that the parser encounters.

What I'm looking for is a viable suggestion for handling such a large amount of data.  I doubt I can create an array with 25 million elements, so I am trying to figure out the best alternative.  This isn't going to run on a supercomputer... probably a desktop Linux station with a few hundered MB of memory.  The method for indexing the data should, of course, be as fast as possible.

Thanks for any suggestions.
Question by:jimstar
  • 5
  • 4

Accepted Solution

nebeker earned 100 total points
ID: 6854160
Your IPData structure only has 10 bytes, so 25 million records would be 238 MB of memory -- certainly doable on your average linux box with 256 or 384 MB of memory... Some swap will be used, but it won't kill the system.

What is the "traffic" item used for?

If you keep the source IPs and destination IPs in separate lists, how will you link the two?  It would seem that you need some type of record structure that captures both sides of a transaction, i.e. :

typedef struct {
    IPData source;
    IPData destination;
} IPTransaction;

As for searching this data, you have a couple of choices:  first, you could create a hash table on the address or port;  second, you could simply sort the data as you read it from the cache engine (using an insertion sort to sort it as you read it, or a simple qsort after all the data is read in).  A binary search is quite efficient, no matter how larger the data array.

As far as creating an array of 25 million elements -- as long as you allocated the memory with malloc(), instead of declaring the variable as

IPData myArray[25000000];

you should be ok :-)


Expert Comment

ID: 6856874
As far as I understand from your question you don't need to search the data, only to go all over it. (?)
Assuming this is true, just save it to a file (binary), and read a small buffer of a data (let's say 10Mb) each time, proccess it, free it, and read the next buffer.

Good Luck !

Author Comment

ID: 6857223
Thanks... over the next few days I'll be giving this a try.  I will let you know how it goes.
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.


Author Comment

ID: 6889830
Can't seem to make the large arrays work.

#include <stdio.h>

typedef struct {
     unsigned long a;
     unsigned long b;
} foo;

int main(void)
     foo* bar;
     (bar+ (50000*sizeof(foo)) )->a=1;
     return 0;

Gives a segmentation fault.  Running RedHat.  Want to set the 50,000th element... ideas?

Author Comment

ID: 6889838
I also put in the free(bar); before return 0; but it gives a segfault right at the line which sets the element.

Expert Comment

ID: 6889942

(bar + 49999)->a = 1;

Expert Comment

ID: 6889951

Or, you could also use this format:

 bar[49999].a  = 1;

Author Comment

ID: 6890438
Works great!  Thanks.

Expert Comment

ID: 6890494
I thought I should explain just *why* my answer works, rather than just having you take my word for it.  C, as a language, handles some of the arithmetic for you when dealing with different types.  For example, assume you have the following variables:

char *a = 0;  (stored at memory address 0x1000)
int *b  = 0;  (stored at memory address 0x2000)
long *c = 0;  (stored at memory address 0x3000)

Assume for the sake of argument here that a char is 1 byte, an int is 4 bytes and a long is 8 bytes, and that the memory address above are something I just made up...

When you say "a + 1", a is a char, so the memory pointer will be advanced the SIZEOF a char, which happens to be one byte.  When you say "b + 1", the memory pointer is advanced the SIZEOF an int, or 4 bytes.

So the term "variable + 1" doesn't mean "add one to the memory location", it means "add one * sizeof(variable_type) to the memory location.

Therefore, the memory addresses would be:

 (a + 1) =  0x1001   (0x1000 + 1)
 (b + 1) =  0x2004   (0x2000 + 1 * sizeof(int))
 (c + 1) =  0x3008   (0x3000 + 1 * sizeof(long))

So in your previous code, you had:

 (bar+ (50000*sizeof(foo)) )->a=1;

the "sizeof(foo)" was redundant, since that was taken care of by the compiler.  What you had written above was the same thing aas:

 (bar+ (50000*16) )->a=1;    (which equals)

 (bar + 800000)->a=1;

which means you were trying to set element # 800,000 -- which is way out of the memory range you allocated.  That's why you were getting the access violation and segmentation fault.

Hope that clears it up a little.

Expert Comment

ID: 6890495
Oops - small mistake :)

(bar + 800000) ->a = 1;

sets element number 800,001

Always have to remember that C arrays start at 0 ...

Featured Post

Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Handling string inputs in C/Linux 23 187
An API detour question 7 93
IIS Log files on Exchange 2013 server 6 168
Loading flat file data in tables 2 40
An Outlet in Cocoa is a persistent reference to a GUI control; it connects a property (a variable) to a control.  For example, it is common to create an Outlet for the text field GUI control and change the text that appears in this field via that Ou…
Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were small…
The goal of this video is to provide viewers with basic examples to understand opening and writing to files in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to use strings and some functions related to them in the C programming language.

860 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question