• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1049
  • Last Modified:

b-tree implementation

In my adventures with disk-based data structures in C, I'm looking into understanding b-trees.  

I've read a lot about how they work, but I have a question about how they're actually stored on disk.  A typical b-tree diagram is very structured looking, consisting of multiple levels of nodes and branches, originating from a single root node.  But the actual reality seems to be that a b-tree is just a linear sequence of nodes and offset pointers stored one after the other in a file.  

My question is, when a new node is created because of a split, where is the new node actually stored?  Is it just appended to the end of the file?  If so, wouldn't this cause the b-tree to lose locality of reference after a while?  Although a typical diagram of a b-tree shows sibling nodes all grouped together, in reality any two sibling nodes may be at totally different ends of the file, right?

Also, if a chain of splits happens all the way up to the root node, a new root node needs to be created.  Is this new root node also just appended at the end of the file?  
0
chsalvia
Asked:
chsalvia
  • 5
  • 3
1 Solution
 
sunnycoderCommented:
Hi chsalvia,

1. It is possible to implement trees in array to maintain locality of reference ... e.g. a simple binary tree with children of node i located at 2i+1 and 2i+2

2. Split affects at most 2 levels

Cheers!
sunnycoder
0
 
chsalviaAuthor Commented:
>> 1. It is possible to implement trees in array to maintain locality of reference ... e.g. a simple binary tree with children of node i located at 2i+1 and 2i+2

Yes, but in a standard b-tree, wouldn't it be the case that splits would often destroy locality of reference?  If for example, you have a b-tree with say 1,000 nodes, and then you do an insert.  The insert requires, say, the 12th node to be split.  So a new node has to be written to disk now.  Where do we store the new node?  We have to store it at the end of the file.  So now it is very far from it's sibling node, the 12th node.

Isn't this something that would typically happen with a b-tree?
0
 
sunnycoderCommented:
No .. you would have to re-organize some elements towards end of the array but that would be all ... Array representations do not write to end of array ... Every node has a fixed index depending on its position in the hierarchy, if the node is missing, that index is left blank.

Split in this case would mean some reorg in affected level ... moving around data in arrays is expensive, but it wont be very frequent.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
adg080898Commented:
You should use "file offset" pointers for your tree. Have a root pointer at a known location, then just follow file offsets for walking. This way, the nodes never need to be moved around in bulk.

Locality of reference is not a big deal, because the depth of a b-tree is so shallow, you are not likely to touch a lot of nodes when searching.

0
 
sunnycoderCommented:
>But the actual reality seems to be that a b-tree is just a linear sequence of nodes and offset pointers stored one after the other in a file.  

Range searches on a B-tree can indeed touch a lot of nodes. That is where locality of reference is important.
To search a single node, you need to access only one node per level. This holds true not only for B tree but any tree which supports guided search - even binary search tree. So locality of reference does not really count unless you are thinking of range searches. Btw, B+ trees are better suited for range search.
0
 
chsalviaAuthor Commented:
>> Btw, B+ trees are better suited for range search.

Yes, I've read about that as well.  My understanding is that B+ trees only store keys on the bottom level, and the higher level nodes just contain pointers.

My question is, in a B+ tree, how would you direct the program from one pointer to another, without doing a key comparison?  Since the keys are stored in the bottom nodes, how does the program get from the root to the bottom node in the first place?  What does it compare to decide which node to go to next?
0
 
sunnycoderCommented:
It has some values in the internal nodes for guiding the search. However, those values are not (neccessarily) keys.e.g.
                         25
                      /       \
                   12        30
                 /    \      /    \
                6    14   27    35

Keys are at leaf ... values in the non-leaf node are used only to guide the search. Leaves are connected using a bi-directional liked list.
0
 
chsalviaAuthor Commented:
Hmm...if the non-leaf nodes just contain small values, it would seem that the fixed-size of leaf nodes would be different than the fixed size of non-leaf nodes.  Also, it seems that the non-leaf nodes would be relatively small.  Perhaps all non-leaf nodes could be loaded into memory, so that only leaf-nodes would require disk reads.
0
 
sunnycoderCommented:
Value + couple of pointers for forming a doubly linked list. Typically smaller than non-leaf node.

>Perhaps all non-leaf nodes could be loaded into memory, so that only leaf-nodes would require disk reads.
Number of non-leaf nodes could still be large depending on how much data you have. Loading first few levels - depending of resource capability of your system - might be a better idea.
0

Featured Post

VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

  • 5
  • 3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now