Very large arrays

Posted on 2002-03-09
Last Modified: 2010-05-18
I have a structure as follows:

typedef struct {
  unsigned char address[4];
  unsigned short port;
  unsigned long traffic;
} IPData;

I am going to be inserting data into this quite a bit... the total number of structures in use will be between 10,000 and 25 million.

I assume a hash function/table of sorts would be needed, since a linked list would take forever to traverse everytime I wanted to read/write an element.  I will pretty much be reading in a cache engine's data file and summarizing all of the data by (1) the source address (2) the destination address (3) the destination port (4) the source port.

I thought the best way to do this would be to read in the data, create two "separate" lists for source IP's and destination IP's... and then querying the lists by address or port, depending on which report I am generating.  The lists will contain a unique entry for every unique address/port pair that the parser encounters.

What I'm looking for is a viable suggestion for handling such a large amount of data.  I doubt I can create an array with 25 million elements, so I am trying to figure out the best alternative.  This isn't going to run on a supercomputer... probably a desktop Linux station with a few hundered MB of memory.  The method for indexing the data should, of course, be as fast as possible.

Thanks for any suggestions.
Question by:jimstar
  • 5
  • 4

Accepted Solution

nebeker earned 100 total points
ID: 6854160
Your IPData structure only has 10 bytes, so 25 million records would be 238 MB of memory -- certainly doable on your average linux box with 256 or 384 MB of memory... Some swap will be used, but it won't kill the system.

What is the "traffic" item used for?

If you keep the source IPs and destination IPs in separate lists, how will you link the two?  It would seem that you need some type of record structure that captures both sides of a transaction, i.e. :

typedef struct {
    IPData source;
    IPData destination;
} IPTransaction;

As for searching this data, you have a couple of choices:  first, you could create a hash table on the address or port;  second, you could simply sort the data as you read it from the cache engine (using an insertion sort to sort it as you read it, or a simple qsort after all the data is read in).  A binary search is quite efficient, no matter how larger the data array.

As far as creating an array of 25 million elements -- as long as you allocated the memory with malloc(), instead of declaring the variable as

IPData myArray[25000000];

you should be ok :-)


Expert Comment

ID: 6856874
As far as I understand from your question you don't need to search the data, only to go all over it. (?)
Assuming this is true, just save it to a file (binary), and read a small buffer of a data (let's say 10Mb) each time, proccess it, free it, and read the next buffer.

Good Luck !

Author Comment

ID: 6857223
Thanks... over the next few days I'll be giving this a try.  I will let you know how it goes.
NAS Cloud Backup Strategies

This article explains backup scenarios when using network storage. We review the so-called “3-2-1 strategy” and summarize the methods you can use to send NAS data to the cloud


Author Comment

ID: 6889830
Can't seem to make the large arrays work.

#include <stdio.h>

typedef struct {
     unsigned long a;
     unsigned long b;
} foo;

int main(void)
     foo* bar;
     (bar+ (50000*sizeof(foo)) )->a=1;
     return 0;

Gives a segmentation fault.  Running RedHat.  Want to set the 50,000th element... ideas?

Author Comment

ID: 6889838
I also put in the free(bar); before return 0; but it gives a segfault right at the line which sets the element.

Expert Comment

ID: 6889942

(bar + 49999)->a = 1;

Expert Comment

ID: 6889951

Or, you could also use this format:

 bar[49999].a  = 1;

Author Comment

ID: 6890438
Works great!  Thanks.

Expert Comment

ID: 6890494
I thought I should explain just *why* my answer works, rather than just having you take my word for it.  C, as a language, handles some of the arithmetic for you when dealing with different types.  For example, assume you have the following variables:

char *a = 0;  (stored at memory address 0x1000)
int *b  = 0;  (stored at memory address 0x2000)
long *c = 0;  (stored at memory address 0x3000)

Assume for the sake of argument here that a char is 1 byte, an int is 4 bytes and a long is 8 bytes, and that the memory address above are something I just made up...

When you say "a + 1", a is a char, so the memory pointer will be advanced the SIZEOF a char, which happens to be one byte.  When you say "b + 1", the memory pointer is advanced the SIZEOF an int, or 4 bytes.

So the term "variable + 1" doesn't mean "add one to the memory location", it means "add one * sizeof(variable_type) to the memory location.

Therefore, the memory addresses would be:

 (a + 1) =  0x1001   (0x1000 + 1)
 (b + 1) =  0x2004   (0x2000 + 1 * sizeof(int))
 (c + 1) =  0x3008   (0x3000 + 1 * sizeof(long))

So in your previous code, you had:

 (bar+ (50000*sizeof(foo)) )->a=1;

the "sizeof(foo)" was redundant, since that was taken care of by the compiler.  What you had written above was the same thing aas:

 (bar+ (50000*16) )->a=1;    (which equals)

 (bar + 800000)->a=1;

which means you were trying to set element # 800,000 -- which is way out of the memory range you allocated.  That's why you were getting the access violation and segmentation fault.

Hope that clears it up a little.

Expert Comment

ID: 6890495
Oops - small mistake :)

(bar + 800000) ->a = 1;

sets element number 800,001

Always have to remember that C arrays start at 0 ...

Featured Post

The Eight Noble Truths of Backup and Recovery

How can IT departments tackle the challenges of a Big Data world? This white paper provides a roadmap to success and helps companies ensure that all their data is safe and secure, no matter if it resides on-premise with physical or virtual machines or in the cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
stack 22 166
How to organize data in excel ? 2 114
Concatenate two strings Last and First Name 10 60
IIS Log files on Exchange 2013 server 6 119
Have you thought about creating an iPhone application (app), but didn't even know where to get started? Here's how: ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Important pre-programming comments: I’ve never tri…
This tutorial is posted by Aaron Wojnowski, administrator at  To view more iPhone tutorials, visit This is a very simple tutorial on finding the user's current location easily. In this tutorial, you will learn ho…
The goal of this video is to provide viewers with basic examples to understand and use pointers in the C programming language.
The goal of this video is to provide viewers with basic examples to understand opening and writing to files in the C programming language.

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question