
Dealing with large amounts of data

lewinAsked:
 
nietodCommented:
Personally, I would not recommend placing the word list in the program.  But if you feel you must, I would store it as a sorted list of const char pointers, like

const char *WordList[] = {"Apple","Banana".....};

Then you can use a binary search to find items in the list quickly and you can iterate through the list easily.  It uses a minimal amount of "wasted" storage space.  That is probably the best option, if you must have the data in the program.
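
Something along these lines, for example -- just a sketch with a tiny stand-in list, but the same shape works for the full array:

#include <cstring>

// Must be kept sorted in strcmp() order for the binary search to work.
const char *WordList[] = { "Apple", "Banana", "Cherry", "Damson" };
const int WordCount = sizeof(WordList) / sizeof(WordList[0]);

// Classic binary search over the half-open range [lo, hi):
// at most about log2(N) + 1 string comparisons.
bool FindWord(const char *word)
{
    int lo = 0, hi = WordCount;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = std::strcmp(word, WordList[mid]);
        if (cmp == 0)
            return true;            // found it
        if (cmp < 0)
            hi = mid;               // word sorts before the middle entry
        else
            lo = mid + 1;           // word sorts after it
    }
    return false;
}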
 
 
nietodCommented:
>>The file will probably be split up by initial letter and number of
>> letters in the word.  Bearing in mind that there will probably
>> be a maximum of about 3000 words in each of these
>> chunks of data, would it make sense to create a big switch
>> statement for each chunk (I could create it automatically so
>> that is not a problem)

I'm not sure what you are talking about here.

>> what I want to avoid is using too much memory and the program being too slow
Using 2M of static storage space for this purpose makes me cringe.  It seems terribly wasteful.  But in all honesty, in this day and age, that is not really an issue.  Most computers have 64M+ of memory, so this is less than 6% of the computer's RAM.  And if things get a little tight, there is always virtual memory.  If it gets paged out to disk, it will slow things down (substantially), but it isn't likely to slow it down that much more than a disk-based algorithm would have been in the first place.

There are some disadvantages to placing the data in the program that have nothing to do with speed, however.  You can't change the word list without recompiling.  That means fixing small typos may take time.  The end user can't change the word list.  Separate copies of the program would have to be generated for different languages.  And the compile time may be unreasonably large.
 
lewinAuthor Commented:
OK, what would be the best way of implementing this if I didn't compile the data into the program?  Bear in mind that I want the data split up by initial letter and number of letters in the word (for example, all the 4-letter words beginning with a, 6-letter words beginning with g, etc.).  I don't particularly want to be able to search the list, I just want to be able to randomly select words.  I realise that having the whole list in memory is a waste which is why I want to avoid doing it, especially as in the future I will possibly want to build in a second word list (a different language).
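
To illustrate what I mean, something along these lines is roughly what I'm picturing -- the "a4.txt"-style file names are only an example, one file per chunk with one word per line:

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Load one chunk (e.g. all 4-letter words starting with 'a') and pick a
// word at random.  Only that chunk -- about 3000 words at most -- is ever
// held in memory.
std::string RandomWord(char initial, int length)
{
    char name[32];
    std::sprintf(name, "%c%d.txt", initial, length);   // e.g. "a4.txt"

    std::ifstream in(name);
    std::vector<std::string> words;
    std::string w;
    while (in >> w)
        words.push_back(w);

    if (words.empty())
        return "";
    return words[std::rand() % words.size()];
}

So RandomWord('g', 6) would give me a random 6-letter word beginning with g, and a second language would just be another set of chunk files.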

 
nietodCommented:
Before we discuss how, let's discuss why.  Why do you want it organized by "letter and length"?  What is your ultimate goal?

>> I realise that having the whole list in memory is a waste which
>> is why I want to avoid doing it, especially as in the future I will
>> possibly want to build in a second word list (a different language).
In all honesty, it probably isn't a waste in terms of memory use or processor speed, but it does make changing the list nearly impossible.


 
ozoCommented:
If you're willing to recompile, it shouldn't be hard to automatically rebuild the list.
Depending on what you want to do with the list, you might also consider a hash table or a trie.
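
For illustration, a bare-bones trie over lowercase words might look something like this (a sketch, not a tuned implementation):

#include <cctype>
#include <cstring>

// One node per letter position; children[c - 'a'] follows the letter c.
struct TrieNode
{
    TrieNode *children[26];
    bool isWord;
    TrieNode() : isWord(false) { std::memset(children, 0, sizeof(children)); }
};

void Insert(TrieNode *root, const char *word)
{
    for (; *word; ++word)
    {
        int i = std::tolower((unsigned char)*word) - 'a';
        if (!root->children[i])
            root->children[i] = new TrieNode;
        root = root->children[i];
    }
    root->isWord = true;
}

bool Contains(const TrieNode *root, const char *word)
{
    for (; *word; ++word)
    {
        root = root->children[std::tolower((unsigned char)*word) - 'a'];
        if (!root)
            return false;
    }
    return root->isWord;
}

Lookup cost depends on the word's length rather than the list size, and prefix walks are easy, but the 26-pointer array in every node makes this fairly memory-hungry for a 200,000-word list.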
 
shrifCommented:
I don't recommend putting everything in a sorted static array.  First of all, your compiler may puke.  But the real issue is that a binary search is not the best way to scan for 200,000 entries.  The poster of the question is on the right track with the 3000 entries idea, in which case he's only scanning 3000 entries.  However, a switch/case statement will do nothing for you, because you can't have string literals as case labels.

To the poster: I do have to say that I don't understand the part about "not taking too much memory".  Is your concern that it is allocating too much dynamic memory, or what?  If you have that much data, whether it is in program space or data space may not make a difference, since memory is memory.  The only time this is significant is when you are going to be running multiple instances of your program, in which case program memory is shared.

In any case, going back to what you've already done with the many chunks, each containing 3000 entries: in other words, you're using a simple hash to split the list so that instead of working with 200,000 entries, you're only solving the problem for 3000 entries.  At that point a binary search would be an okay solution, so you can use the first writer's suggestion and create an array, and make sure the strings are sorted:
const char *WordList[] = {"Apple","Banana".....};
Then use a binary search.  Worst case is 12 iterations, since 2^12 = 4096.

However, many don't know this but there's a program called gperf (free from GNU) and now there is a VC++ plug-in that uses it; this program can create a perfect hash table from a given set of keywords -- they say it works well with up to 25,000 symbols.  Now, for your 3000 word chunk, you may want to use this.  What's the advantage?  Well, since it's a perfect hash, you are able to compute an index for any given word based on some formula.  You don't actually search -- you just compute the index and use that to access the entry in the array -- very very fast.

You could get the source code for gperf from GNU and then use that to build your program.  So you'll have program A that builds program B and program B is what you give to your users.  Program A will link with gperf to build its tables.

Seems to me this is the *fastest* solution.  Give it a shot and see how much memory you're actually taking up.
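
To make the gperf workflow a little more concrete, here is a rough sketch from memory -- the exact flags, the input-file format, and the generated function's name and signature (in_word_set by default, as far as I recall) should be checked against the gperf documentation, and the file names are just placeholders:

/*
 * a4words.gperf: an (empty) declarations section, a line with "%%", then
 * one keyword per line.  Generate the perfect-hash lookup with something
 * like:
 *
 *     gperf -C a4words.gperf > a4hash.c
 *
 * and compile a4hash.c into program B alongside this file.
 */

#include <cstring>

// Declaration matching the function gperf generates -- verify the exact
// signature in the generated file, since it varies between versions.
extern "C" const char *in_word_set(const char *str, unsigned int len);

// True if 'word' is one of the keywords the table was built from.
// No searching: the index is computed directly, then one string compare.
bool InWordList(const char *word)
{
    return in_word_set(word, (unsigned int)std::strlen(word)) != 0;
}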
 
nietodCommented:
>> But the real issue is that a binary search is not the best way to
>> scan for 200,000 entries.  The  poster of the question is on the
>> right track with the 3000 entries idea, in which case he's only
>> scanning 3000 entries
If you are talking about finding an entry (and it is not clear if that is what is needed, which is why I asked for clarification), then a binary search of 200,000 entries will involve no more than 18 tests.  If you are proposing a scheme that involves scanning 3000 entries, it is going to be orders of magnitude slower.

>> If you had that much data, whether it is in program space or
>> data space may not make a difference since memory
>> is memory.
What about leaving the data on disk?  Using an index or other disk-based solution may involve only a couple hundred bytes of memory.

>> In other words, you're using a simple hash to split the list so that instead of working
>>  with 200,000 entries, you're only solving the problem for 3000 entries.  At that
>> point, a binary search would be an okay solution
Why use a hash table followed by a binary search?  I can't imagine a case where that is appropriate.  One or the other, yes; a hybrid, no.  (Perhaps for collision resolution, yes, but I don't think that is what you are talking about.)

 
shrifCommented:
I *did* propose a scheme that involves the 3000-entry chunks.  Read on and see how the lookup can be done without any scanning -- by using a perfect hash.  This can be done only because the programmer has all the data up front.

Regarding your other comment, why do you say it's not a good idea?  You use a hash to split from 200K to 3K and then use a binary search for the rest.

Perhaps you're right about using a binary search for 200K.  May not be so bad.
 
nietodCommented:
Hashing is usually used in a case where you need to search for particular strings and those strings will change at run-time.  Its advantage is that it allows the strings to be stored quickly in a container that allows them to be found relatively quickly.  Its disadvantages are that you usually can't do sequential access and that it tends to use significantly more space than other containers (to avoid collisions, which tend to be costly).  Those advantages don't seem needed in this case, and the size disadvantage seems significant.  The disadvantage of a binary search through a table is that it is extremely slow to modify the table at run-time.  But there is no need to modify it at run-time.  Its advantage is that it is extremely fast and has nearly no wasted space.

>> Regarding your other comment, why do you say it's not a good idea?
>> You use a hash to split from 200K to 3K and then use a binary
>> search for the rest.
Why do you think it is good?  What does that give you?  You know you are not talking about a single 3K block, right?  You would have to have many of them, unless I am misunderstanding you.
 
shrifCommented:
nietod, I'm thinking that this would be good because I can use a "perfect hash" for a 3K chunk but can't for a 200K chunk.

If it weren't for being able to use a perfect hash, I'd agree that you can do a binary search on the entire 200K even if the data isn't in one contiguous array.

You say that hashing is usually used in a case where the set you search will change at run-time.  I am not sure what you mean by usually, but there are many uses of hashing where the table is known at compile time, and thus a perfect hash can be used.  Take a look at gperf and its readme, which describes applications where that is useful.  One common one is syntax-coloring source code.

 
nietodCommented:
What is a PERFECT hash?  And why only on 3K?  And where is gperf?  (Which probably answers the other two.)
