Solved

Very large arrays

Posted on 2002-03-09
10
178 Views
Last Modified: 2010-05-18
I have a structure as follows:

typedef struct {
  unsigned char address[4];
  unsigned short port;
  unsigned long traffic;
} IPData;

I am going to be inserting data into this quite a bit... the total number of structures in use will be between 10,000 and 25 million.

I assume a hash function/table of sorts would be needed, since a linked list would take forever to traverse everytime I wanted to read/write an element.  I will pretty much be reading in a cache engine's data file and summarizing all of the data by (1) the source address (2) the destination address (3) the destination port (4) the source port.

I thought the best way to do this would be to read in the data, create two "separate" lists for source IP's and destination IP's... and then querying the lists by address or port, depending on which report I am generating.  The lists will contain a unique entry for every unique address/port pair that the parser encounters.

What I'm looking for is a viable suggestion for handling such a large amount of data.  I doubt I can create an array with 25 million elements, so I am trying to figure out the best alternative.  This isn't going to run on a supercomputer... probably a desktop Linux station with a few hundered MB of memory.  The method for indexing the data should, of course, be as fast as possible.

Thanks for any suggestions.
0
Comment
Question by:jimstar
  • 5
  • 4
10 Comments
 
LVL 5

Accepted Solution

by:
nebeker earned 100 total points
ID: 6854160
Your IPData structure only has 10 bytes, so 25 million records would be 238 MB of memory -- certainly doable on your average linux box with 256 or 384 MB of memory... Some swap will be used, but it won't kill the system.

What is the "traffic" item used for?

If you keep the source IPs and destination IPs in separate lists, how will you link the two?  It would seem that you need some type of record structure that captures both sides of a transaction, i.e. :

typedef struct {
    IPData source;
    IPData destination;
} IPTransaction;

As for searching this data, you have a couple of choices:  first, you could create a hash table on the address or port;  second, you could simply sort the data as you read it from the cache engine (using an insertion sort to sort it as you read it, or a simple qsort after all the data is read in).  A binary search is quite efficient, no matter how larger the data array.

As far as creating an array of 25 million elements -- as long as you allocated the memory with malloc(), instead of declaring the variable as

IPData myArray[25000000];

you should be ok :-)

0
 
LVL 3

Expert Comment

by:ygal02
ID: 6856874
As far as I understand from your question you don't need to search the data, only to go all over it. (?)
Assuming this is true, just save it to a file (binary), and read a small buffer of a data (let's say 10Mb) each time, proccess it, free it, and read the next buffer.

Good Luck !
0
 
LVL 4

Author Comment

by:jimstar
ID: 6857223
Thanks... over the next few days I'll be giving this a try.  I will let you know how it goes.
0
 
LVL 4

Author Comment

by:jimstar
ID: 6889830
Can't seem to make the large arrays work.

#include <stdio.h>

typedef struct {
     unsigned long a;
     unsigned long b;
} foo;

int main(void)
{
     foo* bar;
     bar=(foo*)malloc(65000*sizeof(foo));
     (bar+ (50000*sizeof(foo)) )->a=1;
     return 0;
}

Gives a segmentation fault.  Running RedHat.  Want to set the 50,000th element... ideas?
0
 
LVL 4

Author Comment

by:jimstar
ID: 6889838
I also put in the free(bar); before return 0; but it gives a segfault right at the line which sets the element.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 5

Expert Comment

by:nebeker
ID: 6889942

(bar + 49999)->a = 1;
0
 
LVL 5

Expert Comment

by:nebeker
ID: 6889951

Or, you could also use this format:

 bar[49999].a  = 1;
0
 
LVL 4

Author Comment

by:jimstar
ID: 6890438
Works great!  Thanks.
0
 
LVL 5

Expert Comment

by:nebeker
ID: 6890494
I thought I should explain just *why* my answer works, rather than just having you take my word for it.  C, as a language, handles some of the arithmetic for you when dealing with different types.  For example, assume you have the following variables:

char *a = 0;  (stored at memory address 0x1000)
int *b  = 0;  (stored at memory address 0x2000)
long *c = 0;  (stored at memory address 0x3000)

Assume for the sake of argument here that a char is 1 byte, an int is 4 bytes and a long is 8 bytes, and that the memory address above are something I just made up...

When you say "a + 1", a is a char, so the memory pointer will be advanced the SIZEOF a char, which happens to be one byte.  When you say "b + 1", the memory pointer is advanced the SIZEOF an int, or 4 bytes.

So the term "variable + 1" doesn't mean "add one to the memory location", it means "add one * sizeof(variable_type) to the memory location.

Therefore, the memory addresses would be:

 (a + 1) =  0x1001   (0x1000 + 1)
 (b + 1) =  0x2004   (0x2000 + 1 * sizeof(int))
 (c + 1) =  0x3008   (0x3000 + 1 * sizeof(long))


So in your previous code, you had:

 (bar+ (50000*sizeof(foo)) )->a=1;

the "sizeof(foo)" was redundant, since that was taken care of by the compiler.  What you had written above was the same thing aas:

 (bar+ (50000*16) )->a=1;    (which equals)

 (bar + 800000)->a=1;

which means you were trying to set element # 800,000 -- which is way out of the memory range you allocated.  That's why you were getting the access violation and segmentation fault.

Hope that clears it up a little.
0
 
LVL 5

Expert Comment

by:nebeker
ID: 6890495
Oops - small mistake :)

(bar + 800000) ->a = 1;

sets element number 800,001

Always have to remember that C arrays start at 0 ...
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
reading tzdatabase for timezone definitions 5 123
Way to improve it 16 59
Coverting 24 hour time to 12 hour in C++ 15 162
C hashtable library 3 71
An Outlet in Cocoa is a persistent reference to a GUI control; it connects a property (a variable) to a control.  For example, it is common to create an Outlet for the text field GUI control and change the text that appears in this field via that Ou…
Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were small…
The goal of this video is to provide viewers with basic examples to understand opening and writing to files in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now