Solved

Dealing with large amounts of data

Posted on 1999-01-18
11
167 Views
Last Modified: 2010-04-02
Question by:lewin
11 Comments
 
LVL 22

Accepted Solution

by:
nietod earned 200 total points
ID: 1184430
Personally, I would not recommend placing the word list in the program.  But if you feel you must, I would store it as a sorted list of const char pointers, like

const char *WordList[] = {"Apple","Banana".....};

Then you can use a binary search to find items in the list quickly and you can iterate through the list easily.  It uses a minimal amount of "wasted" storage space.  That is probably the best option, if you must have the data in the program.
 
0
 
LVL 22

Expert Comment

by:nietod
ID: 1184431
>>The file will probably be split up by initial letter and number of
>> letters in the word.  Bearing in mind that there will probably
>> be a maximum of about 3000 words in each of these
>> chunks of data, would it make sense to create a big switch
>> statement for each chunk (I could create it automatically so
>> that is not a problem)

I'm not sure what you are talking about here.

>> what I want to avoid is using too much memory and the program being too slow
Using 2M of static storage space for this purpose makes me cringe.  It seems terribly wasteful.  But in all honesty, in this day and age, that is not really an issue.  Most computers have 64M+ of memory, so this is less than 6% of the computer's RAM.  And if things get a little tight, there is always virtual memory.  If it gets paged out to disk, it will slow things down (substantially), but it isn't likely to be much slower than a disk-based algorithm would have been in the first place.

There are some disadvantages to placing the data in the program that have nothing to do with speed, however.  You can't change the word list without recompiling, so fixing small typos may take time.  The end user can't change the word list.  Separate copies of the program would have to be generated for different languages.  And the compile time may be unreasonably large.
0
 

Author Comment

by:lewin
ID: 1184432
OK, what would be the best way of implementing this if I didn't compile the data into the program?  Also bearing in mind that I want the data split up by initial letter and number of letters in the word (for example, all the 4 letter words beginning with a, 6 letter words beginning with g, etc).  I don't particularly want to be able to search the list, I just want to be able to randomly select words.  I realise that having the whole list in memory is a waste which is why I want to avoid doing it, especially as in the future I will possibly want to build in a second word list (a different language).
0
 
LVL 22

Expert Comment

by:nietod
ID: 1184433
Before we discuss how, let's discuss why.  Why do you want it organized by "letter and length"?  What is your ultimate goal?

>> I realise that having the whole list in memory is a waste which
>> is why I want to avoid doing it, especially as in the future I will
>> possibly want to build in a second word list (a different language).
In all honesty, it probably isn't a waste in terms of memory use or processor speed, but, it does make changing the list nearly impossible.


0
 
LVL 84

Expert Comment

by:ozo
ID: 1184434
If you're willing to recompile it shouldn't be hard to automatically rebuild the list.
Depending on what you want to do with the list, you might also consider a hash table or a trie
0
 
LVL 1

Expert Comment

by:shrif
ID: 1184435
I don't recommend putting everything in a sorted static array.  First of all, your compiler may puke.  But the real issue is that a binary search is not the best way to scan 200,000 entries.  The poster of the question is on the right track with the 3000 entries idea, in which case he's only scanning 3000 entries.  However, a switch/case statement will do nothing for you, because you can't have the cases be string literals.

To the poster: I do have to say that I don't understand the part about "not taking too much memory".  Is your concern that it is allocating too much dynamic memory, or what?  If you had that much data, whether it is in program space or data space may not make a difference, since memory is memory.  The only time this is significant is when you are going to be running multiple instances of your program, in which case program memory is shared.

In any case, going back to what you've already done with the many chunks, each containing 3000 entries: in other words, you're using a simple hash to split the list so that instead of working with 200,000 entries, you're only solving the problem for 3000 entries.  At that point, a binary search would be an okay solution, so you can use the first writer's suggestion and create an array, making sure the strings are sorted:
const char *WordList[] = {"Apple","Banana".....};
Then use a binary search.  Worst case is 12 iterations, since 2^12 is 4096.

However, many don't know this, but there's a program called gperf (free from GNU), and now there is a VC++ plug-in that uses it; this program can create a perfect hash table from a given set of keywords -- they say it works well with up to 25,000 symbols.  For your 3000-word chunks, you may want to use this.  What's the advantage?  Well, since it's a perfect hash, you can compute an index for any given word from a formula.  You don't actually search -- you just compute the index and use it to access the entry in the array -- very, very fast.

You could get the source code for gperf from GNU and then use that to build your program.  So you'll have program A that builds program B and program B is what you give to your users.  Program A will link with gperf to build its tables.

Seems to me this is the *fastest* solution.  Give it a shot and see how much memory you're actually taking up.
0
 
LVL 22

Expert Comment

by:nietod
ID: 1184436
>> But the real issue is that a binary search is not the best way to
>> scan for 200,000 entries.  The  poster of the question is on the
>> right track with the 3000 entries idea, in which case he's only
>> scanning 3000 entries
If you are talking about finding an entry (and it is not clear that is what is needed, which is why I asked for clarification), then a binary search of 200,000 entries will involve no more than 18 tests, since 2^18 is 262,144.  If you are proposing a scheme that involves scanning 3000 entries, it is going to be roughly two orders of magnitude slower.

>> If you had that much data, whether it is in program space or
>> data space may not make a difference since memory
>> is memory.
What about disk space?  Using an index or other disk-based solution may involve only a couple hundred bytes of memory.

>> In other words, you're using a simple hash to split the list so that instead of working
>>  with 200,000 entries, you're only solving the problem for 3000 entries.  At that
>> point, a binary search would be an okay solution
Why use a hash table followed by a binary search?  I can't imagine a case where that is appropriate.  One or the other, yes; a hybrid, no.  (Perhaps for collision resolution, yes, but I don't think that is what you are talking about.)

0
 
LVL 1

Expert Comment

by:shrif
ID: 1184437
I *am* proposing a scheme that involves scanning 3000 entries.  But read on and see how it can be done without any scanning -- by using a perfect hash.  This is possible only because the programmer has all the data up front.

Regarding your other comment: why do you say it's not a good idea?  You use a hash to split from 200K to 3K and then use a binary search for the rest.

Perhaps you're right about using a binary search for 200K.  May not be so bad.
0
 
LVL 22

Expert Comment

by:nietod
ID: 1184438
Hashing is usually used in a case where you need to search for particular strings and those strings will change at run-time.  Its advantage is that it allows the strings to be stored quickly in a container where they can be found relatively quickly.  Its disadvantages are that you usually can't do sequential access and that it tends to use significantly more space than other containers (to avoid collisions, which tend to be costly).  Those advantages don't seem needed in this case, and the size disadvantage seems significant.  The disadvantage of a binary search through a table is that it is extremely slow to modify the table at run-time.  But there is no need to modify it at run-time.  Its advantage is that it is extremely fast and has nearly no wasted space.

>> Regarind your other comment, why do you say it's not a good idea.  
>> You use a hash to split from 200K to 3K and then use a binary
>> search for the rest.
Why do you think it is good?  What does that give you?  You know you are not talking about a single 3K block, right?  You would have to have many of them, unless I am misunderstanding you.
0
 
LVL 1

Expert Comment

by:shrif
ID: 1184439
nietod, I'm thinking that this would be good because I can use a "perfect hash" for a 3K chunk but can't for a 200K chunk.

If it weren't for being able to use a perfect hash, I'd agree that you could do a binary search on the entire 200K, even if the data isn't in one contiguous array.

You say that hashing is usually used in a case where you need to search for strings that will change at run-time.  I am not sure what you mean by usually, but there are many uses of hashing where the table is known at compile time, and thus a perfect hash is used.  Take a look at gperf and its readme, which describes applications where that is useful.  One common one is syntax coloring source code.

0
 
LVL 22

Expert Comment

by:nietod
ID: 1184440
What is a PERFECT hash?  And why only on 3K?  And where is gperf?  (Which probably answers the other two.)
0
