Solved

Data structure

Posted on 1999-01-21
298 Views
Last Modified: 2011-09-20
I've got a lot of words (millions!), which I'm going to index to gain fast access into a large file. Which data structure could I use? The problem is that I need fast access to single words, AND I need to be able to extract sequentially (sorted alphabetically) starting with a random word.
I was thinking of a B-tree of some sort, but running through the tree to get 100 index words (in-order) would mean reading A LOT of disk blocks, since the next word will usually be in a new block. The B-tree is very good for the single search, though.

Both the space used and the seek time are very important. The index words will probably have to be compressed (I was thinking of Huffman coding), which complicates things further.

Regards,
Pede
Question by:pede
21 Comments
 
LVL 1

Expert Comment

by:Seeker092397
ID: 1184701
Maybe you should try a database approach (MSS or Sybase).
 
LVL 1

Author Comment

by:pede
ID: 1184702
Both size and speed are _that_ important, so it's not possible to use a common database. It has to be much smaller and faster. It's read-only (I didn't mention that), so it is quite possible. Actually the above seems to be the only real problem left (knock-knock).

/Pede

 
LVL 10

Expert Comment

by:rbr
ID: 1184703
Size and search time can't both be reduced at the same time. If that were possible, all database problems would have been solved already. Which will be more important for you: the size or the speed?
 
LVL 1

Author Comment

by:pede
ID: 1184704
It might not be possible to get the ultimate speed and the ultimate size, but since it's a read-only base, it's possible to use data structures which would be awful for updating and inserting.
If I really have to make this choice, I would make it an option to optimize either for size or for speed. This base engine will be used for different purposes, and sometimes size will be most important, sometimes speed.
Actually, the more I think about it, the more I realize that size will - in most cases - be the one to optimize for.
...or maybe a little of both ;)

Regards,
Pede

 
LVL 22

Accepted Solution

by:
nietod earned 50 total points
ID: 1184705
Use an n* tree as described in Knuth. It is the best tree to use for disk-based searches.
 
LVL 1

Author Comment

by:pede
ID: 1184706
Never heard of that one before. As mentioned, the downside of a B-tree is the traversal of the tree. Does N* solve this? Could you provide a few details?

/Pede

 
LVL 22

Expert Comment

by:nietod
ID: 1184707
The tree is always kept balanced, yet there is never a need to perform a balancing operation. The tree is always more than 50% populated (except when it starts out empty). Search operations are faster than in a disk-based binary tree. Next and previous operations are close to instantaneous.

In this tree make your block size the sector size of the hard disk (or a multiple of that size). (There is no point in reading a smaller block, as the hard disk controller will read the whole sector anyway.) Then figure out how many keys will fit into this block (along with the "header" information used to manage the block). Each block branches by this number of keys, i.e. n ways, hence the name n* tree. This is why it is so good for disk-based storage: very few disk reads to get to a key. You have to do more in memory, because once a block is read you need to scan for the key to branch on, and that might involve scanning through 5 or 10 keys, but that is not the bottleneck; disk I/O is.
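As a rough sketch of such a block (the 512-byte sector, the fixed 16-byte keys and all the names here are just for illustration, not a definitive format):

#include <cstdint>

const int SECTOR  = 512;                 // read unit: one disk sector
const int KEY_LEN = 16;                  // fixed-length keys, NUL-padded

struct Entry {
    char     key[KEY_LEN];               // the key itself
    uint32_t child;                      // block number of the subtree/record
};

struct NodeHeader {
    uint8_t  isLeaf;                     // leaf or internal node?
    uint32_t next, prev;                 // neighbours on the same level
    uint16_t count;                      // keys currently stored (may be < max)
};

const int MAX_KEYS =                     // branching factor n per block
    (SECTOR - (int)sizeof(NodeHeader)) / (int)sizeof(Entry);

struct Node {
    NodeHeader hdr;
    Entry      entry[MAX_KEYS];          // kept sorted within the node
};

With these numbers MAX_KEYS comes out around 24, so a million keys are reachable in about five reads.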
 
LVL 22

Expert Comment

by:nietod
ID: 1184708
Sorry for the delay (the details were coming), but EE is acting slow now. I forgot to mention that next and previous are so fast because you buffer the current node, so you are doing next and previous in memory. You only need to go to disk when you go next or previous past the end of the current node, and even then you can read the next or previous node directly; you do not have to traverse the tree. Since the tree is kept balanced without using rotates, you can maintain a "linked list" of nodes at each level. That is, each node contains next and previous offsets to the adjacent nodes on the same level. This allows sequential operations to proceed very fast. The only drawback (and it really isn't one) is the 50% population. That means up to 50% wasted space if your keys exactly fit your block size. I don't think that is a problem.
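As a sketch, using the Node layout above (readNode() is a hypothetical helper that loads one sector):

Node readNode(uint32_t block);           // hypothetical: one sector read

struct Cursor {
    Node node;                           // current node, buffered in memory
    int  pos;                            // index of the current key within it
};

// Advance to the next key; touches the disk only when we run off the
// end of the buffered node, and then follows the same-level link.
bool nextKey(Cursor& c) {
    if (++c.pos < c.node.hdr.count)
        return true;                     // usual case: purely in memory
    if (c.node.hdr.next == 0)            // 0 = no right neighbour (assumed)
        return false;                    // past the last key
    c.node = readNode(c.node.hdr.next);  // one sequential read
    c.pos  = 0;
    return true;
}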
 
LVL 1

Author Comment

by:pede
ID: 1184709
Maybe I'm wrong, but in my book a B-tree is a "balanced tree", not a "binary tree". Your description sounds almost exactly like the solution I've come up with, but I can't follow your next/previous explanation. Say I make a tree (B, N* or whatever) with the letters A-Z, and we assume a block can contain 5 letters (just for example purposes). If I locate the first block, the first letter of this block will be 'A', but the next letter in this block will certainly not be 'B', because then there wouldn't be any branching between 'A' and 'B'. So I would have to read the next block, and this goes for every word (letter) in the tree. Does N* differ from this?
The reason I want to do this is that I want to display an index, so the user can choose a word to search for. Therefore I need all the indexed words, and I have to extract them fast.

Regarding the space issue: 50% waste would be way too much! I would rather make the nodes dynamically sized (and the words in the nodes too!). This will cause boundary crossings in the disk blocks, meaning the operating system will have to do two reads to get my block, even though it could be smaller than one block. This is definitely a speed vs. size tradeoff, but I believe it's well worth it!
 
/pede

 
LVL 22

Expert Comment

by:nietod
ID: 1184710
First of all, if you don't have Knuth's books, get them. The three volumes are the bible of algorithms. I don't remember which volume you need, and mine are in storage, but I believe it is the one called "Sorting and Searching".

In an n* tree there are two types of nodes. (They have the same format, but there is a boolean switch to indicate which type a given node is.) The bottom level of the tree has "leaf" nodes and all the levels above (if there are any) are "internal" nodes. Each node begins with a "header" that records whether it is a leaf or an internal node. It also has the offsets of the next and previous nodes. Finally it records the number of keys currently stored in the node. (That is important, because the node might not be full.)
Following this header is the key info: the key itself and the offset of the node associated with the key. I just store it as an array of key/offset pairs. This array must be at least 3 items long or the algorithm fails. (Your keys have to be short enough, or your nodes large enough, to fit 3.) You keep these keys sorted inside each node.

The empty index starts out with just one node, the head node. (In the header of the index file you must record the offset of the head node, as it will move.) The key count is 0, and it is a leaf node. As keys are added, you place them in the head node and keep it sorted. When a node is full and you need to add a key (any node, not just the head), you split the node in two. This involves adding another node to the immediate right of the current node. (It is stored at the end of the file; its logical position is to the right.) The second half of the keys move to this node on the right and the first half of the keys remain with the original (left) node. Now there is room to add the new key to one or the other node. (Some points: when a node splits, the new node is always the same type as the original. This is also the time when you set the next and previous offsets.) Now the new node has to be added to the tree, so you go up one level and add its first key to the node above. If there is room, great; if not, you split that node too and continue up. If you reach the head node and it is full, you split the head node and add a new head node above the two halves; the index now has a new head. This process keeps the tree perfectly balanced and uses a minimal number of disk accesses for both adding and searching.

I hope that helps. If not, get Knuth.
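The split step in code, as a sketch (same Node layout as in my earlier sketch; allocNode() and writeNode() are hypothetical helpers that append and write one block):

uint32_t allocNode();                        // hypothetical: appends a block
void     writeNode(uint32_t off, const Node& n);

// Split a full node: the second half of its keys moves to a new right
// sibling, and the sibling's first key must then be added one level up.
Entry splitNode(Node& left, uint32_t leftOff) {
    Node right = Node();
    right.hdr.isLeaf = left.hdr.isLeaf;      // new node has the same type
    uint32_t rightOff = allocNode();

    int half = left.hdr.count / 2;           // first half stays on the left
    right.hdr.count = left.hdr.count - half;
    for (int i = 0; i < right.hdr.count; ++i)
        right.entry[i] = left.entry[half + i];
    left.hdr.count = half;

    right.hdr.prev = leftOff;                // stitch the same-level list
    right.hdr.next = left.hdr.next;          // (a real implementation would
    left.hdr.next  = rightOff;               //  also fix the old neighbour's
                                             //  prev link)
    writeNode(leftOff, left);
    writeNode(rightOff, right);

    Entry up = right.entry[0];               // key to insert in the parent,
    up.child = rightOff;                     // pointing at the new node
    return up;
}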
 
LVL 22

Expert Comment

by:nietod
ID: 1184711
Another type of tree that you can use would be a radix tree. This would be much smaller (substantially), but it is slower. The number of disk accesses is exactly the number of letters in the word to be found. (This type of tree allows words of different lengths to be stored, which is one reason why it is so small. Most trees require you to use a key of a specific length: the longest word you have, padding shorter ones with spaces.) I can give you details on this type of tree too, but it is likely to be slower than the n* tree.
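For illustration, an in-memory sketch of the radix idea (my own minimal version; on disk each node would be one block, so each step of the loop below is one read, which is where "one access per letter" comes from):

#include <map>
#include <string>

struct TrieNode {
    bool isWord = false;                 // does a word end at this node?
    std::map<char, TrieNode*> child;     // only the letters that occur
};

bool contains(const TrieNode* root, const std::string& w) {
    const TrieNode* n = root;
    for (char c : w) {                   // one step (one read) per letter
        auto it = n->child.find(c);
        if (it == n->child.end())
            return false;                // no such branch: word not stored
        n = it->second;
    }
    return n->isWord;
}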
 
LVL 1

Author Comment

by:pede
ID: 1184712
I still can't see why extracting a sequence of words should be faster than with a B-tree. See my previous comment.

The radix tree sounds interesting, but one disk access per letter is too slow. I'd rather go for a solution where I load large blocks at a time and have fewer seeks, mainly because the base is likely to be located on CD-ROM.

/Pede
 
LVL 22

Expert Comment

by:nietod
ID: 1184713
>> I still can't see why extracting a sequence of words should be faster than with a B-tree
I'm not sure what you mean here.

>> Mainly because the base is likely to be located on CD-ROM.
If this information is static, that is, words aren't being added or deleted, that is different.
The n* tree can be filled for this case (well, nearly filled, as there may not be enough words to exactly fill every node). This would be a fast way to access the data, but I don't know if it would be the best. n* tree code is quite complex, but definitely worth it when the tree must take endless additions and deletions and still remain balanced with very few operations. You don't care about that, though; you just need to create the file once and stick it on disk. You want it to search fast, but it doesn't need to be alterable.
 
LVL 22

Expert Comment

by:nietod
ID: 1184714
I think for your case I would use a slightly "modified" array to store the words. This will give you close to 100% use of the storage space and binary-search access speeds (slower than an n* tree, but your life will be easier and you'll get smaller storage). Essentially, this solution is an array of sorted words, but with some optimisation for disk access. What I would do is pick a block size, preferably the sector size of a CD. Your file will consist of an array of these blocks. Inside each block store as many words as will fit, in sorted order. Use a NUL character to mark the end of a word; the start of the next word immediately follows. The only wasted space is the NUL characters and the space at the end of the block that doesn't fit a word. You can search this file using a binary search, i.e. you get the middle block of the file. If the word occurs in it, you are done. If not, look to see whether it would occur in the half before the block or the half after it, etc...
This may be more disk accesses than an n* tree, making it slower, but since you get far better packing (even compared with an n* tree made as full as it can be for your case), that reduces the effect. (For small files this method could be faster; eventually there is a critical point at which the n* tree wins.)
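As a sketch (readBlock() and the 2048-byte CD sector size are assumptions of mine):

#include <cstring>
#include <string>
#include <vector>

const int BLOCK = 2048;                       // one CD-ROM sector

std::vector<char> readBlock(long i);          // hypothetical: one disk read

// First word of a block (words are NUL-terminated and sorted).
std::string firstWord(long i) {
    std::vector<char> b = readBlock(i);
    return std::string(b.data());
}

bool findWord(const std::string& w, long nBlocks) {
    // Binary-search for the last block whose first word is <= w;
    // w can only live in that block.
    long lo = 0, hi = nBlocks - 1, cand = 0;
    while (lo <= hi) {
        long mid = (lo + hi) / 2;
        if (firstWord(mid) <= w) { cand = mid; lo = mid + 1; }
        else                       hi = mid - 1;
    }
    // Scan the candidate block's packed, NUL-separated words.
    std::vector<char> b = readBlock(cand);
    for (const char* p = b.data(); p < b.data() + b.size() && *p;
         p += std::strlen(p) + 1)
        if (w == p) return true;
    return false;
}

Each lookup costs about log2(number of blocks) reads for the binary search plus one for the final scan.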
 
LVL 1

Author Comment

by:pede
ID: 1184715
yep :) I DID mention earlier that it was read-only! This is very important, since I can "cheat" wherever possible. If it works, is small, and is fast here and now, it's perfect. I don't have to use any standard IR methods, since they all care about insertions, and that is where it's REALLY hard to optimize for both speed and size.
As an example, I would construct my B-tree by inserting SORTED words. This way I don't have to follow any rules for insertion. I know the number of words, and they are sorted (both a side effect of eliminating duplicates), so I decide how many words to hold in each node, calculate the height of the tree, and insert the words just like I would traverse it to extract sequentially from it.
This all seems fine, BUT now I need to be able to extract the words in order, which is SLOW!! I need a whole new data structure. Probably one with LARGE block sizes, and the words in sorted order!

Regards,
Pede

 
LVL 22

Expert Comment

by:nietod
ID: 1184716
Say you have 1 million words and average only 10 words per block (seems small to me, but I want to give you a worst case). You would be looking at no more than 23 disk accesses to find the word. Not great, but not too bad. (Any binary algorithm will be the same.) Now an n* tree with only 5 words per node (small again) can do the search in only 9 accesses, and the difference between the two becomes larger as the total number of words and the number per node increase. With 10 words per node an n* tree needs only 6 accesses.
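If you want to check those figures yourself, the arithmetic is just ceil(log_n(x)) reads for x words and n-way branching (a binary search is the n = 2 row; this is my rough check, so treat the binary row as a ballpark):

#include <cmath>
#include <cstdio>

int main() {
    const double x = 1e6;                        // one million words
    const int ns[] = { 2, 5, 10 };
    for (int i = 0; i < 3; ++i) {
        // tiny epsilon so exact powers (e.g. 10^6) don't round up a read
        double reads = std::ceil(std::log(x) / std::log((double)ns[i]) - 1e-9);
        std::printf("n = %2d : %2.0f reads\n", ns[i], reads);  // 20, 9, 6
    }
    return 0;
}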

It looks like the choice is between speed and size (as always), with programming convenience thrown in on the size side.
 
LVL 1

Author Comment

by:pede
ID: 1184717
I was thinking of something like that too, but the binary search would be too slow. I was thinking of making an index into this file. A small one, which I could keep in RAM. Maybe something like the first word in each block. This would speed things up, but a quick calculation tells me this index could be ~200K (using some real-world figures). I would like it smaller!
Furthermore I was thinking of coding the blocks with Huffman coding, but I'm worried that it would take too long to unpack. I really don't know its performance.
As you can see, size DOES matter. I want to make the perfect solution for both size and speed, and other issues as well, like the extraction of index words.

I know I'm being a pain in the .... right now, but I'm obsessed with this. I've spent weeks trying to figure out the perfect way to do it (actually not only this, but a whole read-only database system), and I actually find it entertaining! This is one of the last huge problems.
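Roughly what I have in mind for the in-RAM index, as a sketch (readBlock() is a hypothetical one-block read; the names are made up):

#include <algorithm>
#include <string>
#include <vector>

std::vector<char> readBlock(long i);      // hypothetical: one disk read

struct SparseIndex {
    std::vector<std::string> first;       // first word of every block, in RAM

    // The only block that can contain w: the last one whose first word <= w.
    long blockFor(const std::string& w) const {
        auto it = std::upper_bound(first.begin(), first.end(), w);
        return it == first.begin() ? 0 : (it - first.begin()) - 1;
    }
};

With that, a lookup is one disk read after the in-RAM binary search. Shrinking the index below ~200K would mean compressing those first words somehow, e.g. storing only the prefix that distinguishes each block from its neighbour.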

/Pede

 
LVL 22

Expert Comment

by:nietod
ID: 1184718
You know what, for your case there is no rule that says that in an n* tree all nodes have to store the exact same number of keys. You could use an n* algorithm that stores variable-length keys in the nodes. (Again you need a NUL or a length field to indicate the key length.) That, combined with the fact that you can insert in sorted order, allows close to perfect packing. That will be fast and efficient. (It is larger than the "array" method for several reasons, but probably by no more than a factor of 2 at the most.) I would consider that.
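A guess at what such a node could look like (my own layout, not from Knuth): keys packed NUL-terminated, each followed by its child offset, so short words waste nothing:

#include <cstdint>
#include <cstring>

const int VBLOCK = 2048;

struct VarNode {
    uint8_t  isLeaf;
    uint16_t used;                        // bytes of data[] in use
    uint32_t next, prev;                  // same-level neighbours
    char     data[VBLOCK - 16];           // "word\0" + 4-byte child, repeated
};

// Descend: each key here is the first key of its child (as described
// earlier), so we want the child of the last key that is <= w.
uint32_t childFor(const VarNode& n, const char* w) {
    uint32_t best = 0;                    // 0 assumed to mean "none"
    int off = 0;
    while (off < n.used) {
        const char* key = n.data + off;
        int len = (int)std::strlen(key) + 1;         // include the NUL
        if (std::strcmp(key, w) > 0)
            break;                                    // keys are sorted: done
        std::memcpy(&best, n.data + off + len, sizeof best);
        off += len + (int)sizeof best;
    }
    return best;
}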
 
LVL 1

Author Comment

by:pede
ID: 1184719
You keep getting in right before me, messing up the order of the messages! *doh*
Anyway, YES, a tree (any balanced one) with large blocks is the fastest way to search. I was calculating with 3 disk accesses to look up a word, and almost no waste. But as mentioned, I NEED to be able to extract the index words fast. The tree would simply be too slow. There can be no tradeoff here.

I didn't expect THE answer to this question, rather a discussion, which I got. I'm giving you your points now, but if you get any great ideas, I hope you will share them with me!

/Pede
 
LVL 1

Author Comment

by:pede
ID: 1184720
Maybe I should mention that I know of a system which pretty much does all this. Using a hex editor I can see that the index words are close together, only separated by 4-8 bytes (inconsistent for some reason). Probably some pointers and such. This index can be packed to be very small, and it is VERY fast even when packed. Unfortunately I don't know how it's done.

/Pede
 
LVL 22

Expert Comment

by:nietod
ID: 1184721
I'm pretty sure an n* tree where the keys are different lengths is going to give you the best results. The wasted space is pretty minimal (more than some alternatives, still) and the number of disk accesses will be very small. A search takes log_n(x) accesses, where n is the number of branches per node (not constant, but use an average or worst case) and x is the total number of words. As you can see, this is far, far better than a binary search, where the accesses would be log_2(x). By keeping your blocks at the cluster size (and on cluster boundaries) you make accesses fast. (You could save space by removing the unused space at the end of a block, but that would misalign blocks and slow things down by about a factor of 2.) Another thing you can do is organize the tree so that within the file the 1st-level node (1 node, really) comes first, then the 2nd-level nodes, etc. (I have no idea offhand how to do so; the normal tree is not laid out like that.) This makes the search potentially faster, as the read head only has to move in one direction (most likely). Next and previous are going to be very fast: they involve at most one access and often none. Take a look at Knuth and the standard implementation, then let me know if you have questions.

(One thing Knuth doesn't mention is that to pack the thing, you add the words in order, and when you split a node you don't put half the keys in each node. Put all that fit in the left node (that should be exactly the ones that were in the original node) and put the additional key (the one being added) in the right node. You get 100% packing, except that the last nodes of each level might not be full if there aren't enough keys. This still consumes extra space because you have multiple levels, but not that much, because the levels quickly get small as you move up the tree.)
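In code, the leaf-level version of that packed load could look like this (a sketch built on the Node layout and the allocNode()/writeNode() helpers from my earlier sketches; the parent level follows the same rule):

// Appends one entry during a sorted bulk load; left nodes are left 100%
// full instead of half full.
void bulkAppend(Node& leaf, uint32_t& leafOff, const Entry& e) {
    if (leaf.hdr.count < MAX_KEYS) {
        leaf.entry[leaf.hdr.count++] = e;     // sorted input: append at end
        return;
    }
    Node right = Node();                      // the full left node stays full
    right.hdr.isLeaf = 1;
    right.hdr.count  = 1;
    right.entry[0]   = e;                     // new key opens the right node
    uint32_t rightOff = allocNode();
    right.hdr.prev = leafOff;
    leaf.hdr.next  = rightOff;
    writeNode(leafOff, leaf);
    // e (with child = rightOff) would now be appended one level up by the
    // same rule, which is what keeps every level nearly 100% packed.
    leaf    = right;                          // continue filling the new node
    leafOff = rightOff;
}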
