Managing Trees and Hierachies in Access

Introduction

In a relational database, most relations are between two diffent types of objects. A Product is assigned to a Category (a many-to-one relationship), Suppliers are found for Products (a many-to-many relationship), Products are sold to Customers (an indirect many-to-many relationship, via Orders and Details). In all cases, the left and right side are different tables.

Some relationships are so called self-joins, where both sides are in fact the same table. For example, a field "reports to" in the Employees table selects another Employee as supervisor. The supervisor can in turn have his or her own supervisor, creating a hierarchy (and potentially cycles).

Access, or rather the "Jet Engine", has little to offer to manage hierarchies. They are treated just like any other one-to-many relationships, and no special SQL commands are available to navigate hierarchies.

This article will explore what can be done with such data, and show that specialised functions might be needed for large or complex trees. When the tables are large, these functions should be of the "fast table lookup" variety, making use of the index(es) defined on the table(s), or using specially constructed indexes.

Many techniques are easier to understand using an actual database, so a demo file is attached. The article should be readable without it; it was at least written assuming the database wasn't opened in parallel. If it is, small indications in brackets and italics like [qryEmployees] below, designate a specific database object. Anyway, you can safely start reading without it.

A Basic Hierarchy

The table most often used to introduce self-joins in Access is the old Employees table from the "Northwind Traders" demo database. The data, seen in a query that reconstructs the full names [qryEmployees], looks like this:

  PID    Full Name          Reports To
                        ---    -----------------  ----------
                         1     Davolio, Nancy         2
                         2     Fuller, Andrew
                         3     Leverling, Janet       2
                         4     Peacock, Margaret      2
                         5     Buchanan, Steven       2
                         6     Suyama, Michael        5
                         7     King, Robert           5
                         8     Callahan, Laura        2
                         9     Dodsworth, Anne        5

Open in new window

Everybody reports either to Andrew (the boss) or to Steven (a manager). The table validation rules could prohibit an employee to be his or her own manager by forcing the field "reports to" to be different from "PID". But there is no simple way to prevent Nancy from becoming Andrew's supervisor, in turn being already Nancy's supervisor. This would be a cycle, a condition that can cause problems later.

The field "reports to" will often be displayed as a combo box [tblEmployees], translating the number to a full name. A combo box is the most obvious way to display a many-to-one relationship in the interface. Conversely, the Employees form could have a subform showing subordinate employees for a manager, the obvious way to display a one-to-many relationship. When looking at Steven's record, one would see three names in the subform, those for whom his PID is selected in the "reports to" field.

What about queries? If we want the name of the employee and the name of his or her supervisor, we need to add the same table twice. The query below [qryHierarchy] uses the same query four times, using different aliases to distinguish between them:

SELECT E.*, S.FullName, M.FullName, B.FullName
FROM ((
qryEmployees AS E LEFT JOIN
qryEmployees AS S ON E.ReportsTo = S.PID) LEFT JOIN
qryEmployees AS M ON S.ReportsTo = M.PID) LEFT JOIN
qryEmployees AS B ON M.ReportsTo = B.PID;

The letters 'E', 'S', 'M', and 'B' designate four levels of the hiearchy, suggested by "Employee", "Supervisor", "Manager", and "Boss". This is quite arbitrary, and should in fact be replaced by some sort of numbering ("L1", "L2", etc.) because the designation "Boss" doesn't really mean anything. Andrew can be found in any column, depending on how many levels he is "above" the current employee.

  PID    E.FullName         Reports To  S.FullName         M.FullName         B.FullName
                        ---    -----------------  ----------  -----------------  -----------------  -----------------
                         1     Davolio, Nancy         2       Fuller, Andrew
                         2     Fuller, Andrew
                         3     Leverling, Janet       2       Fuller, Andrew
                         4     Peacock, Margaret      2       Fuller, Andrew
                         5     Buchanan, Steven       2       Fuller, Andrew
                         6     Suyama, Michael        5       Buchanan, Steven   Fuller, Andrew
                         7     King, Robert           5       Buchanan, Steven   Fuller, Andrew
                         8     Callahan, Laura        2       Fuller, Andrew
                         9     Dodsworth, Anne        5       Buchanan, Steven   Fuller, Andrew

Open in new window

Using the current data, the last column isn't used. This hierarchy has a depth of three levels only. If we play with the data a little, for example by having Steven report to Margaret, the depth will quickly change.

This points to the main problems of this view.

We need to anticipate the maximal depth of the tree.

The columns have no intrinsic meaning.

It is difficult to use them, for example to build a "path".

We have tried to navigate the hierarchy from "employee" to "supervisor". We can try the opposite direction [qryTree1]: showing the subordinates of each employee.

SELECT E.PID, E.FullName, A.FullName, B.FullName, C.FullName
FROM ((
qryEmployees AS E LEFT JOIN
qryEmployees AS A ON E.PID = A.ReportsTo) LEFT JOIN
qryEmployees AS B ON A.PID = B.ReportsTo) LEFT JOIN
qryEmployees AS C ON B.PID = C.ReportsTo;

Again, aliases are used to distinguish between the instances of the same query. The result is perhaps surprising, in that it shows too many records.

  PID    E.FullName         A.FullName         B.FullName         C.FullName
                        ---    -----------------  -----------------  -----------------  -----------------
                         1     Davolio, Nancy
                         2     Fuller, Andrew     Davolio, Nancy		
                         2     Fuller, Andrew     Leverling, Janet		
                         2     Fuller, Andrew     Peacock, Margaret		
                         2     Fuller, Andrew     Buchanan, Steven   Suyama, Michael	
                         2     Fuller, Andrew     Buchanan, Steven   King, Robert	
                         2     Fuller, Andrew     Buchanan, Steven   Dodsworth, Anne	
                         2     Fuller, Andrew     Callahan, Laura		
                         3     Leverling, Janet
                         4     Peacock, Margaret
                         5     Buchanan, Steven   Suyama, Michael
                         5     Buchanan, Steven   King, Robert
                         5     Buchanan, Steven   Dodsworth, Anne
                         6     Suyama, Michael
                         7     King, Robert
                         8     Callahan, Laura
                         9     Dodsworth, Anne

Open in new window

There are three records ending with Anne: as sub-subordinate, as subordinate, and as employee. The problem is that we didn't specify a starting point for the exploration, namely Andrew. We can add this condition to the query [qryTree2]:

WHERE ReportsTo Is Null

In other words, the first column will only contain employees without a supervisor, hence the Boss. The result is more readable:

  PID    E.FullName         A.FullName         B.FullName         C.FullName
                        ---    -----------------  -----------------  -----------------  -----------------
                         2     Fuller, Andrew     Davolio, Nancy		
                         2     Fuller, Andrew     Leverling, Janet		
                         2     Fuller, Andrew     Peacock, Margaret		
                         2     Fuller, Andrew     Buchanan, Steven   Suyama, Michael	
                         2     Fuller, Andrew     Buchanan, Steven   King, Robert	
                         2     Fuller, Andrew     Buchanan, Steven   Dodsworth, Anne	
                         2     Fuller, Andrew     Callahan, Laura

Open in new window

Looking a bit closer, however, we realise there is no record for Steven. He appears in three records, but he doesn't have his own record, nor does Andrew for that matter... To understand why, we need some vocabulary used to describe hierachies or "trees".

The figure shows the relationship "reports to" as arrows. This builds a tree (well an upside-down tree anyway) in which each PID is a node. Starting from node five (Steven), we find the parent node (Andrew), three child nodes, and four sibling nodes.

A node without children is called a leaf node (using to the tree analogy) in light green in the figure, while those that do are called branch nodes. Finally, a node without parent is simply a root node (we have only one in this example, but if Steven stops reporting to Andrew, we would have two roots and two trees). Note that a tree stops being a tree if a cycle is created (e.g. if Andrew was to report to Margaret).

Returning now to the last query output, we see that we have in fact one record for every leaf. Even before filtering, there was no record for the nodes two and five. Knowing that the records represent leaves, the displayed PID is useless. It would be more meaningful to see the leaf's PID, but leaves are not all in the same column...

The "perfect" output would have columns like these:

  PID    FullName           ReportsTo  Depth  Root  Path
                        ---    -----------------  ---------  -----  ----  ----------------
                         2     Fuller, Andrew	                 0      2   2
                         1     Davolio, Nancy         2        1      2   2/1
                         3     Leverling, Janet       2        1      2   2/3
                         4     Peacock, Margaret      2        1      2   2/4
                         5     Buchanan, Steven       2        1      2   2/5
                         6     Suyama, Michael        5        2      2   2/5/6
                         7     King, Robert           5        2      2   2/5/7
                         9     Dodsworth, Anne        5        2      2   2/5/9
                         8     Callahan, Laura        2        1      2   2/8

Open in new window

The relevant derived information is the depth of the node, the root node, and the path (presented here in numeric form). In this simple case, it was even possible to use the path as sort column, since all PID numbers have only one digit.

One solution for the query can be found in the demo database [quniEmployeesTree]; it is in fact much longer than the list it creates... In essence, it is a UNION query, with one sub-query for every depth. The first simply selects employees without a supervisor; the second selects employees with a supervisor himself or herself without a supervisor; and so forth. The last level (the fourth level in the demo query) should be more generic and return something for all employees below that level: a generic "4+" depth, no known root, and an incomplete path.

Simple Trees in Access

Given the limited tools at hand, complexity grows parabolically with the depth of the tree. The UNION query giving all useful node information for a depth of four contains the base table ten times (1+2+3+4). This means that the solution works only up to about seven levels (due to limitations in the number of tables in a single query).

Said like this, it may look small. But if you stop to think about it, a depth of seven is already quite complex. This covers all possible categories and sub-categories a company might use; all cost centres with sub-cost-centres of any accounting system; chapters, paragraphs, sections, and articles of large documents; continents, subcontinental areas, countries, regions, sub-regions, and communes; etc. It is difficult to find a meaningful hierarchy deeper than seven levels.

In most cases, it is perfectly possible to manage trees in Access, even with the limited built-in resources. The initial queries will not be easy to write, perhaps, and the depth will have a hard-coded limit, but besides that, the only problem is performance.

When the tree is too deep, or when there are too many nodes, other solutions must be found.

Botanical names are one good example (along with other taxonomic trees in biology). The current structure and rules have been created over 250 years ago, and the tree continues to be updated daily by botanists around the world. It comprises millions of names, with varying depth depending on the precision of the classification used. Compare the classification of the dog rose in different databases: The International Plant Names Index, Tropicos.org (Missouri Botanical Garden), African flowering plants database, Wikipedia, Wikispecies

The trick here is that each level has its own name. This makes it possible to skip intermediate classification levels and concentrate on the important ones: the genus and the species, with perhaps the family, and if needed the infra-specific name (for example the variety).

However, even with a depth of four, the number of names in any decent botanical database will prevent using a UNION query for sorting, for example.

An Intermediate Tree: the Zones of Experts-Exchange

Instead of using botanical names, we can use something more familiar to all users of Experts-Exchange: the topic areas or "Zones". If you look at the top of the page, you will see something like this:

Home // Articles // Microsoft // Development // MS Access // (article title)

This is a "path", with "branch nodes" linked to one another, leading ultimately from the "root" (Home) to the current leaf node (the article). The Zone of this article is "Microsoft Access". The parent of that Zone is "Microsoft Development", etc.

In reality, I will not use the articles hierarchy, stemming from a first-level branch of "Home", but the regular Questions zones. The path for an Access question would be:

Home // Microsoft // Development // MS Access

The example is good because there are currently about thousand zones with a maximal depth of seven, for example:

Home // Programming // Languages // Java // J2EE // Frameworks // Velocity

The table [tblZones] looks like this:

ZoneID   Name                                       ParentID  ChildName
                      ------   -----------------------------------------  --------  ----------------------
                          0    Experts Exchange                                     Home
                          2    General Business & Productivity Software       3063  Productivity Apps
                         10    Programming Languages                          5628  Languages
                         11    Web Languages/Standards                        5385
                         14    Email Software                                 3351
                         21    Operating Systems Miscellaneous                5382  Misc OS
                         22    Apple Operating Systems                        5349  Apple OS
                         23    MS DOS                                         5298
                         24    Windows 3.x Operating System                    442  Windows 3.x
                         25    Windows 95 Operating System                     442  Windows 95
                         26    Windows 98 Operating System                    3560  Windows 98
                        ... (selection of a few more records below ID 26:)
                         39    Microsoft Access Database                      5318  MS Access
                        442    Windows 95 and 3.x                             3560  Windows 95 & 3.x
                       3560    Microsoft Windows Operating Systems            5298  Windows
                       5297    Microsoft Zone                                    0  Microsoft
                       5298    Microsoft Operating Systems                    5382  Microsoft
                       5318    Microsoft Development                          5297  Development
                       5382    OS Zone                                           0  OS

Open in new window

Note: This is a reconstruction, assembled from various sources, which has been simplified. The actual zone structure of EE is more complex, and might very well have evolved significantly since the time of writing.

Each zone has a name and also a short name, used only when building a path. It doesn't make sense to repeat the name Microsoft at every level one of the parent nodes already contains it. In this extract, we see two primary branches, with ParentID=0, "Microsoft" and "OS". We can also backtrack the zones "MS Access" or "Windows 3.x" back to the root (or "Home").

You can study the table at leisure in the attached database. Even in the small extract above, the tree is very hard to read. The finger or the eyes get lost, the numbers need to be double checked and then rechecked a few seconds later, it's not comfortable. And that is the reason why deep trees and especially trees with unnamed levels are becoming common only now, since computers read and navigate them for us.

As said earlier, this tree is still in the range of a seven-level UNION query [quniZonesTree], with good performance (sorting thousand records causes only an imperceptible delay). It can be handled by Access with existing methods. However, it also makes a good example for more sophisticated techniques.

Fast Table Lookup Functions

A complete presentation of these functions can be found in a companion article. In short, a table can be opened in a special mode allowing to use any Index directly for the fastest search method available in databases, called Seek. It calls the method used internally by Jet to locate records, for example when joining tables, as in this query:

SELECT Z.*, P.Name As ParentName
FROM tblZones Z LEFT JOIN tblZones P ON Z.ParentID = P.ZoneID

Jet will need to find as quickly as possible the record designated in the field ParentID (this is a gross simplification on how the Engine actually works, useful only in this exact context). When using the Seek method in a function, it could look like this:

SELECT Z.*, ZoneName(Z.ParentID) AS ParentName
FROM tblZones AS Z;

Function ZoneName(ID)
                          
                          With CurrentDb("tblZones").OpenRecordset
                              .Index = "PrimaryKey"
                              .Seek "=", ID
                              If .NoMatch Then ZoneName = Null Else ZoneName = !Name
                          End With
                          
                      End Function

Open in new window

The Seek method is nearly instantaneous, but the calls to CurrentDb and OpenRecordset create an enormous overhead. We need to eliminate this in order to be able to call the function not thousand times but a million time if needed, to build paths for example (the parent of the parent of the parent...).

This is obtained with a function called GetTable(), which maintains a collection of already opened tables. That way, the second and all subsequent calls to ZoneName() will not have to reopen the table. (Note that GetTable() applies the index "PrimaryKey" by default.)

The faster version thus becomes

Function ZoneName(ID)
                          
                          With GetTable("tblZones")
                              .Seek "=", ID
                              If .NoMatch Then ZoneName = Null Else ZoneName = !Name
                          End With
                          
                      End Function

Open in new window

For a full explanation of the technique and implementation considerations, please read the article mentioned above, Fast Table Lookup Functions. In all code examples below, Seek will be used one way or another, so the functions are all part of that family.

Navigating Hierarchies

The elements of information needed to navigate a tree are all "derived data" in the sense that they can be extracted from the table as it is. A single field pointing to the parent node is all that is needed to fully define a tree. This derived data is:

the path

some form of cycle detection

the depth of the current node

the root node

the parent (that item is immediately available)

the children (a count or the first child)

the siblings (next and previous sibling, typically)

The first three are in nature recursive functions, but can be implemented with a simple loop as well. The loop structure makes it easier to see how the function defends against an infinite loop caused by a cyclic relationship.

The Path, Root, and Depth

Function Path(ByVal id As Long)
                          
                          Dim i As Integer
                          
                          Path = Null
                          With GetTable("tblZones")
                              For i = 1 To MAX_DEPTH
                                  .Seek "=", id
                                  If .NoMatch Then Exit Function
                                  
                                  ' append the name to the front of the path
                                  Path = Nz(!ChildName, !Name) & " // " + Path
                                  
                                  ' expected exit point, at the root node!
                                  If IsNull(!ParentID) Then Exit Function
                                  
                                  ' continue with parent node
                                  id = !ParentID
                              Next i
                          End With
                          ' if this line is reached, the For/Next was exhausted
                          Path = "... // " + Path
                          
                      End Function

Open in new window

The function (cf. [qryChildrenAlpha]) will generate the familiar "path" of a zone, using a special separator. A similar function could be used to generate actual disk paths from folders, or "chains of command" from employees.

The minimal requirement to avoid an infinite loop in a cycle is to add a counter limited to a certain depth. Notice that the function exits from inside the loop at line 15 (or exceptionally at line 9). If the loop terminates, the maximum has been reached, and there is at least one more unknown parent. If the function gets caught in a loop, it will exit when the depth is exhausted (and not when the path overflows due to its length).

The cycle itself is not detected. If the tree is unknown or if it is important to catch cycles, the following version, using a different name to avoid confusion, will catch them.

Function Navigation(ByVal id As Long)
                          
                          Dim Visited() As Long
                          Dim Depth As Long
                          Dim i As Long
                          
                          Navigation = Null
                          With GetTable("tblZones")
                              Do   ' infinite loop (until Exit Function)
                                  
                                  .Seek "=", id
                                  If .NoMatch Then Exit Function
                                  
                                  ' building the path
                                  Navigation = Nz(!ChildName, !Name) & " // " + Navigation
                                  If IsNull(!ParentID) Then Exit Function
                                  
                                  ' cycle detection
                                  If Depth = 0 Then
                                      ' initialise array
                                      ReDim Visited(Depth)
                                  Else
                                      ' compare node with previously visited nodes
                                      For i = 0 To Depth - 1
                                          If id = Visited(i) Then
                                              ' cycle detected!
                                              Navigation = "#cyclic // " + Navigation
                                              Exit Function
                                          End If
                                      Next i
                                      ' make room for this node
                                      ReDim Preserve Visited(Depth)
                                  End If
                                  
                                  ' remember node as being visited, increase depth, and loop
                                  Visited(Depth) = id
                                  Depth = Depth + 1
                                  id = !ParentID
                                  
                              Loop
                          End With
                          ' this line is never reached!
                          
                      End Function

Open in new window

Navigation() builds both the path and an array of visited nodes. If any node in the current path is visited again, the function exits with a special error display at the start of the "path". The current data doesn't have cycles, but you can easily create some to test the response [qryZonesPath].

Notice that the function has two exit points, corresponding to the conditions "root found", at line 16, and "cycle detected", at line 28.

The second could be used in a special validation function: when a parent is selected, the new path is calculated. If the function doesn't find a cycle, the edit is accepted. A simplified version could be called IsCycle(), returning a boolean.

The "root found" exit point has access to the root record and name, and to the calculated depth of the node. Using the function above as template, new functions can be written to return the root or the depth instead of the path.

Parent / Children / Siblings

The "parent" isn't derived data, except when it means the "parent name", easily obtained through a query or a combo box, as shown above. About children, there are two questions: are there children at all (or how many children are there)? and what is the first child? If a parent has several children, they are siblings, and we need to be able to navigate to the next or the previous sibling.

These answers can be obtained from a query [qryChildrenIDs]

SELECT
Z.ZoneID,
Z.Name,
Z.ParentID,

( Select Count(*) From tblZones
Where ParentID=Z.ZoneID
) AS NbChildren,

( Select Top 1 ZoneID From tblZones
Where ParentID=Z.ZoneID
Order By ZoneID
) AS FirstChild,

( Select Top 1 ZoneID From tblZones
Where ParentID=Z.ParentID And ZoneID>Z.ZoneID
Order By ZoneID
) AS NextSibling,

( Select Top 1 ZoneID From tblZones
Where ParentID=Z.ParentID And ZoneID<Z.ZoneID
Order By ZoneID Desc
) AS PrevSibling,

PathNum(Z.ZoneID) AS Path

FROM tblZones AS Z
ORDER BY PathNum(ZoneID);

This query isn't nearly as bad as the UNION query cited above. All links are on enforced relationships, with an index, and filtering and ordering can also be optimized by index searches. This works even on large tables, on any type of hiearchy. Since the lookups aren't recursive, it doesn't need to be a tree: cycles will not break the query.

However, this isn't very useful. Notice that "first", "next", and "previous" refer to the numbering of zones, to the ZoneID (the call to PathNum -- a path using numbers -- makes that more obvious). We probably want to sort by name instead. In reality, Experts-Exchange sorts the zones according to yet another field, not reproduced here, different from both the IDs and the names. It creates a logical or rather a thematic order for the branches of a node.

If we want to use a name, it should probably be the short name. This means we need to sort on an expression building it. The number of records is small enough for this not to be a problem. For example, the next sibling becomes [qryChildrenAlpha]:

( Select Top 1 ZoneID From tblZones
Where ParentID=Z.ParentID
And Nz(ChildName, Name) > Nz(Z.ChildName, Z.Name)
Order By Nz(ChildName, Name)
) AS NextSibling,

This fact makes the use of fast lookup functions much less attractive. These functions work best when using a single Seek operation on an index. But since Access does not support calculated indexes -- indexes based on an expression rather than one or several fields -- the functions will quickly become rather complex.

Function NextSibling(ByVal id As Long)
                          
                          Dim strMyName As String
                          Dim strCurrent As String
                          Dim strThis As String
                          Dim lngParent As Long
                          Dim lngSibling As Long
                          
                          NextSiblingAlpha = Null
                          With GetTable("tblZones", "PrimaryKey")
                              
                              ' retrieve information about the current node
                              .Seek "=", id
                              If .NoMatch Then Exit Function
                              If IsNull(!ParentID) Then Exit Function
                              lngParent = !ParentID
                              strMyName = Nz(!ChildName, !Name)
                              
                              ' switch to an alternate index to find all siblings
                              .Index = "Zones_ParentID"
                              .Seek "=", lngParent
                              Do Until .NoMatch Or .EOF
                                  If !ParentID <> lngParent Then Exit Do
                                  ' a sibling exists, examine "This" sibling
                                  strThis = Nz(!ChildName, !Name)
                                  If strThis > strMyName Then
                                      ' it is a younger sibling
                                      If Len(strCurrent) = 0 Or strThis < strCurrent Then
                                          ' but older than the "Current" next sibling; switch:
                                          strCurrent = strThis
                                          lngSibling = !ZoneID
                                      End If
                                  End If
                                  .MoveNext
                              Loop
                          End With
                          
                          ' the "Current" next sibling is the next in line
                          If Len(strCurrent) Then NextSiblingAlpha = lngSibling
                          
                      End Function

Open in new window

The first lines (up to line 17) are initialisation code: finding the name of the zone and the parent (if there is no parent, there can be no sibling). At line 20 the index is changed. (This name refers to the index used for the relationship, which indexes the parent field.)

If there was a match, all records are read until one of two things happen: the end of the table is reached or the field ParentID no longer matches. The test finally looks for a zone name that is both greater (later in alphabetical order) than "MyName" and smaller than the "Current" best choice. The function then returns the ID of the "next sibling".

Technically, it is a "fast table lookup function", in that it uses the Seek method instead of "find first". However, all siblings need to be examined in order to find the next. Jet does exactly the same thing, but it knows several strategies for this problem, and might find a faster solution.

Again, the function above can be used as template to build functions for the number of children, the first child, or the previous sibling.

If the ordering can be represented by an index, then fast table lookup functions become more attractive again (cf [qryChildrenLong]). I added a new index, called "Special" on both fields ParentID and Name, assuming that the full name of the zone will be used instead of the short name.

Function NextSiblingName(ByVal id As Long)
                          
                          Dim strMyName As String
                          Dim lngParent As Long
                          
                          NextSiblingName = Null
                          With GetTable("tblZones", "PrimaryKey")
                              ' retrieve information about the current node
                              .Seek "=", id
                              If .NoMatch Then Exit Function
                              If IsNull(!ParentID) Then Exit Function
                              lngParent = !ParentID
                              strMyName = !Name
                              ' switch to the special index to find the next sibling
                              .Index = "Special"
                              .Seek ">", lngParent, strMyName
                              If .NoMatch Then Exit Function
                              If !ParentID = lngParent Then NextSiblingName = !ZoneID
                          End With
                          
                      End Function

Open in new window

The code is similar to the previous version, except that the loop through all children is replaced by a single Seek, using the greater-than operator. Since the records are ordered by Name within each ParentID, this single line immediately returns the next sibling. Note that if there is not next sibling, the first child of the next parent group is returned, so we need to test for that.

The symmetrical function for the previous sibling is very similar; it differs only in the comparison operator: less-than instead of greater-than. Finding the first child becomes very easy also (see below at the end of the next section).

Sorting Trees

For trees of a reasonable size, simple SQL techniques as well as fast table lookup functions can be used for navigation and to answer all questions about a node. The only real problem is sorting. In the queries up to now, the sort was performed on a calculated "path" field. This is all right for thousand records, not optimal but tolerable.

I don't know of a method to sort any faster dynamically. However, many trees are rather static in nature. The zones of Experts-Exchange do not change often. When they do, it is always included in a roll-out of a new version or something similar.

It makes sense to pre-calculate the sort order, and index that new field.

If you have read the present article this far, you probably don't need help in setting up the field and writing a sort sequence in it, based on a calculated path, for thousand records. So let's assume we have one million records instead of one thousand, and that a global sort must be avoided.

The query showing basic node information (parent, children, and siblings) can be used to navigate the tree, while numbering the records. It isn't trivial programming, but all information is there: go to the first child, else go to the next sibling, else go to the next sibling of the parent, else...

Breaking down the operations is meaningful and even necessary when programming for a "tree-view control", or when building any other navigation tool (for example on a form). Mere sorting can be obtained through a tighter algorithm.

The following procedure will sort the tree recursively. For each node, a list of children is obtained and sorted. Then each child is numbered, and the function is called recursively for the grand-children (and so forth).

Private Sub SortBranch(ByVal plngRoot As Long, ByRef plngI As Long)
                          
                          Dim rec As New ADODB.Recordset
                          
                          ' create a temporary recordset in memory
                          With rec
                              .CursorLocation = adUseClient
                              .LockType = adLockOptimistic
                              .Fields.Append "ID", adInteger
                              .Fields.Append "Name", adVarChar, 100
                              .Open
                              .Fields("Name").Properties("Optimize") = True
                          End With
                          
                          ' build a list of sub-zones for the current branch
                          With GetTable("tblZones", "Zones_ParentID")
                              .Seek "=", plngRoot
                              Do Until .NoMatch Or .EOF
                                  If !ParentID <> plngRoot Then Exit Do
                                  rec.AddNew
                                  rec!ID = !ZoneID
                                  rec!Name = Nz(!ChildName, !Name)
                                  rec.Update
                                  .MoveNext
                              Loop
                          End With
                          
                          ' sort by the calculated name and loop
                          rec.Sort = "Name"
                          Do Until rec.EOF
                              ' number one branch or leaf
                              With GetTable("tblZones", "PrimaryKey")
                                  .Seek "=", rec!ID
                                  .Edit
                                  !Sort = plngI
                                  .Update
                              End With
                              plngI = plngI + 1
                              ' number all sub-branches and leaves
                              SortBranch rec!ID, plngI
                              rec.MoveNext
                          Loop
                          
                      End Sub
                      
                      Sub SortZones()
                          
                          ' remove previous sort to identify orphans
                          CurrentDb.Execute "UPDATE tblZones SET Sort = Null"
                          
                          ' root is ID zero, the first zone:
                          SortBranch 0, 1
                          
                      End Sub

Open in new window

The code structure is typical for recursion: a private recursive routine, called from an initialisation routine. The code uses two types of recordset, DAO and ADO. Many programmers would prefer to use only one or the other, but Access itself isn't that picky. Even when one of the libraries isn't referenced, objects from both interfaces are used and available through many objects (e.g. CurrentDb for DAO and CurrentProject for ADO).

DAO recordsets offer the easiest access to the table's indexes, and a very convenient implementation of Seek. All fast table lookup functions can be rewritten for ADO, but with a slight loss in performance.

ADO on the other hand offers the very useful feature of temporary tables in memory. In a few lines of code, a recordset is created, defined, and opened, ready to be populated. This recordset is used only to sort "ZoneID / short name" pairs. In order to eliminate ADO, other data structures could be used instead (a collection, an array, a dictionary object, etc.)

I find the routine quite readable and in little need of comments. The first block is the ADO recordset creation, with an optimised column, meaning an indexed column. The second is quite similar to the loop in NextSibling(), and captures the calculated name in the indexed column. The last block first sorts the records and then loops through them again: each node is assigned a number, and its children are numbered by a recursive call.

Please note that the active instance of the function doesn't use GetTable() while making the recursive call. This is because each instance of the function will in fact use the same table-type recordset (this is what GetTable() does). It would not be possible to navigate records in one instance while calling the routine recursively.

The resulting new column can then be used whenever sorting or partial sorting is required. Functionally, it replaces a calculated index.

The corresponding version of NextSibling needs a special index on ParentID and Sort; let's call it "TreeSort". As before, the function differs from PrevSibling in only a single character -- the direction of the Seek -- so they can be merged into one function (cf. [qryZonesTree]).

Function Sibling(Dir As String, ByVal id As Long)
                          
                      ' Dir: one of ">" (next) or "<" (previous)
                          
                          Dim lngSort As Long
                          Dim lngParent As Long
                          
                          Sibling = Null
                          With GetTable("tblZones", "PrimaryKey")
                              ' retrieve information about the current node
                              .Seek "=", id
                              If .NoMatch Then Exit Function
                              If IsNull(!ParentID) Then Exit Function
                              lngParent = !ParentID
                              lngSort = !Sort
                              ' switch to the specially constructed index
                              .Index = "TreeSort"
                              .Seek Dir, lngParent, lngSort
                              If .NoMatch Then Exit Function
                              ' return the next/previous sibling according to "Sort"
                              If !ParentID = lngParent Then Sibling = !ZoneID
                          End With
                          
                      End Function

Open in new window

The function to find the first child also becomes pleasantly simple.

Function Child(ByVal ID As Long)
                          
                          With GetTable("tblZones", "TreeSort")
                              .Seek "=", ID
                              If .NoMatch Then Child = Null Else Child = !ZoneID
                          End With
                          
                      End Function

Open in new window

Notice how Seek is being invoked on a two-field index with a single value. Seek guarantees that in this case, the match occurs on the first matching record in the index, exactly what we want.

Dynamic Trees

Provided the column Sort is renumbered after every change in the tree, we now have fast solutions to all problems posed by hiearchies in Access. Botanical names can be managed in Access, with however the inconvenient that added or edited scientific names will not appear in sequence until the "sort column" has been renumbered.

For even more dynamic trees, this can become a real problem.

Sorting is really needed for overviews: all subspecific names for a species, all species in a genus with synonyms from other genuses, etc. These lists must be sorted. If neither the path nor a precalculated sort column can be used for any reason, then temporary extracts are needed.

The sorting procedure shows how a temporary recordset can be created in memory, a temporary table could be used as well. A very similar recursive procedure could be used to collect records instead of numbering them. For example, starting with the species Rosa canina, the procedure can crawl upwards to collect all parents (this is the equivelent of the path), and collect all children recursively, even adding the children's synonyms. The final recordset or temporary table would no longer be a large tree, and could thus be sorted using one of the methods presented above.

In a nutshell, when the tree is really large, you never work with the entire tree. Instead you work on one branch, which you can extract using fast functions and then sort or otherwise manipulate more easily. This will not be developed here.

However, the navigation can be dynamic. Since we have various solutions to navigate in any direction from a node (parent, children, and siblings), the tree can always appear sorted in form view. The [Next] button will simply run the relevant navigation functions and determine the node that is best perceived as "next" in the interface. It can be the first child, the next sibling, the parent's next sibling, etc.

To summarise, when working with a large dynamic tree:

An overall sort order is best obtained using a dedicated sort field, acting like a calculated index in other database systems. The procedure will in reality only sort siblings, and even that only when sorting on a calculated expression.

Local views of the tree are obtained by extracting one or several branches into a temporary recordset or table. The extracted information is a snapshot, but can be treated like a small tree, with more options for sorting.

The mere navigation from node to node can be dynamic at all times, so that the user can experience the tree as a constantly sorted structure. A newly created node will appear to be immediately inserted in the tree.

Note: Computer science studies many different types of trees, and has developed data structures and algorithms to maintain a dynamic hierarchy in more efficient ways. Typically, a different data structure is used to define and sort the tree, and rules are defined for the "insert", "delete", and "move" operations, which maintain the sortability. These techniques will not be developed here, but if you reach the limits of simple trees for a dynamic hiearchy, be aware that other methods exist, most of which can be implemented in Access.

Complex Trees

Hierarchy and file systems, are absolute trees. When a path exists between two nodes, it is always unique. Cyclic relationships can theoretically create an infinite number of paths (each loop through the cycle creates a new longer path), but they are normally prohibited or treated specially -- as an error or a temporary condition.

A genealogy also contains trees, trees of ancestors and trees of decendants. No cycles are possible using realistic data without time travel. However, since each individual has two parents, distant ancestors tend to appear several times in the tree, with different lineage. Somebody's great-great-grand-mother might also be his great-great-grand-aunt, a different "path". The same is true for distant descendants.

When exploring a genealogy, you always use a "local root", the individual you are currently interested in, whether you are looking upwards or downwards. One article on Allen Browne's site, about Self Joins, has a commented illustration of a horses pedigree query: each horse is the root of his own tree. In thoroughbred pedigrees, it is very common to find the same stallion several times as ancestor, which is called inbreeding.

Much more could be said specifically about genealogies, pedigrees, and the measure of inbreeding. Again, fast table lookup functions can be created to navigate and collect information from a genealogy, replacing a fifteen-table query (for three generations only) with a single-table query, but making fourteen function calls. To calculate a nine-generation inbreeding factor requires examining at most thousand ancestors (much less in reality due to inbreeding), something a function using fast lookups can handle easily even if the number of horses in the database is very large.

Let's close this brief mention of genealogies here.

I presented the zones of Experts-Exchange as a simple tree above. This was a simplification. In reality, the zones exist as such, and the tree is only a navigation tool for the users. However, every user is different, and each can be looking for, say, Access by following a different path. One may think first "it's a database", an other "it's part of Microsoft Office", and yet another "I'm developping using a Microsoft platform". Experts-Exchange currently offers nine different paths for this zone:

Home // Database // MS Access
Home // Developer // Database // MS Access
Home // Developer // Office / Productivity // Office Suites // MS Office // MS Access
Home // Developer // Programming // Database // MS Access
Home // Microsoft // Applications // MS Office // MS Access
Home // Microsoft // Development // MS Access
Home // Programming // Database // MS Access
Home // Software // Database // MS Access
Home // Software // Office / Productivity // Office Suites // MS Office // MS Access

Each node has a parent, but can also have a god-parent. "Access" is a child of "Development", but also of "Database" and of "MS Office"; each parent can in turn have alternate god-parents. The reverse paths look like this, with the god-parents in italics.

MS Access \\ Development \\ Microsoft \\ Home
MS Access \\ Database \\ Home
MS Access \\ Database \\ Developer \\ Home
MS Access \\ Database \\ Programming \\ Home
MS Access \\ Database \\ Programming \\ Developer \\ Home
MS Access \\ Database \\ Software \\ Home
MS Access \\ MS Office \\ Office Suites \\ Office / Productivity \\ Software \\ Home
MS Access \\ MS Office \\ Office Suites \\ Office / Productivity \\ Developer \\ Home
MS Access \\ MS Office \\ Applications \\ Microsoft \\ Home

The demo database contains a UNION query showing all paths available [quniZonesAltTree], creating the complete tree of zones at the time of writing, allowing the same node to appear several times. The table [tblZonesAlt] needed to describe it is quite simple, it contains only two fields: ChildID and ParentID.

Technically, this is a many-to-many self join (!). Each zone can have several children, as before, but also several parents. Despite the inherent complexity, if cycles are avoided and if there is at least one reverse path from every zone to "Home", it is still a tree.

The SQL UNION query might be long, but is generates the complete tree. There are 1361 "parent-child" pairs (or records in the link table), generating 4292 full paths.

With this example, we reach a new level of complexity. Simple navigation functions become contextual: the "next sibling" meant before the "next child of the same parent". Now we need to specify the parent, since each node can have several. Navigation functions would become "next child of parent", "next parent of child", and all path manipulations would have to be done using lists.

In practice, this particular tree is presented to the users of Experts-Exchange just like a simple tree. There isn't a single element of the interface that exposes the fact that a zone has several parents, that would only confuse the user. The apparent structure is one tree of over four thousand paths. Some users might wonder if the "Microsoft Access" zone reached through a different path is another synonymous zone, but that question can be answered quickly: the page is identical (except for the navigation bar).

If you want to play with the data, you can use one final element. Most zones are both branches and leaves. You can ask a question in "Office Suites" (it's treated as a leaf) and the zone can have children like "Star / OpenOffice" and "Microsoft Office". However, the first branches of "Home" are only used as branches. You cannot ask a question in "Microsoft" or "Database". The formal definition using the simple tree is easy: children of "Home" (zones having ParentID=0) cannot be used as leaves; they serve only as parent of other nodes. Using the alternate tree, the definition is slightly more complicated: zones for which one of the parents is "Home".

Navigating complex trees

The attached file contains only one other example using the alternate tree, a navigation form [frmZoneNavigator]. It shows, for the current zone:

details about the zone

a list of parents (one of them is selected)

a list of siblings (the current zone is selected)

a list of children

The one problem it tries to solve, a rather artificial but tricky problem, is the management of the multiple paths for each zone during navigation. Using only the current parent isn't sufficient. Looking at the list of paths for Access above, we find that Access is a child of only three zones: "Development", "Database" and "Office". But "Access as a child of Database" doesn't distinguish between the different parents of "Database".

The key of a path is in fact the path itself. I used a "path of numbers", e.g. "0000-5297-5318" or "0000-3062-3063-3093-0035" instead of the readable path. Since all the alternate paths are formally derived data, they do not exist as objects in the database, and do not have a key. The only available "natural key" is one version of the path itself.

The code of the "zone navigator" form isn't reproduced here, but a brief presentation of the implementation choices can be useful.

The list of all alternate paths to the current zone is built using a recursive algorithm. The list itself is stored in a recordset-in-memory, just like for the Sorting procedure presented above. Each call adds one new element to both paths: the human-readable text path, and the "number path" using the ID numbers of the nodes. The short list (at most a dozen paths) is sorted alphabetically before being used.

Navigating to a child node means adding the current ID to the "number path". This maintains the illusion of a simple tree (if one mentally disregards alternate parent paths). Navigating to one of the parent paths means jumping to that parent (remember that each parent can appear several times through alternate paths) and to remove the last number from the "number path". Again, this maintains the illusion of a simple tree.

Selecting another path means that the siblings could change (if another parent is selected, this generates a new list of children). However, the current zone should always be in the list. Similarily, selecting another sibling (changing the current zone), can change the list of paths (because the parents might have changed), but again, the currently selected path should be in the new list. Naturally, the list of children will change.

The Demo Database

The attached file is an Access 2000 format database, containing three tables, several queries, one form, and one module. It isn't an application and is meant only as a testing ground for the various techniques presented in the article. Brief indications in brackets and italics signal the relevant queries and a few other objects troughout the article; a short description has been added to each object in the database.
Hierarchies.mdb

Conclusion

Self-joins appear in the most unexpected places, sometimes quite naturally, sometimes as a tricky solution to a delicate relational problem. Anyone working with databases encounters them, often long before being prepared for them.

I remember setting myself a challenge. My daughter was reading one of these books where every paragraph ends with jumps to one or several other paragraphs, "The Lizard King" if I remember correctly. I entered all paragraphs into a self-joined table and then attempted to answer various questions. Not only "what is the path to the most desirable ending?" and "what are the dead-ends? (litterally)" but also more subtle questions about the formal structure of the narrative, the bottlenecks, and how different choices would still allow the main story line to advance.

It was impossible, and I had to give up. I ended up finishing the exercise in a classical programming language. Then again, I didn't know about fast table lookup functions then, or I would quite possibly have succeeded.

Since I started working with Botanical Taxonomy, self-joins have naturally become totally familar, in Access and in other database systems. I am still surprised on how much you can do in Access, provided both the data structure and the purpose of the current endeavour is clear in one's mind. It is easy to blame failure on the lack of a neat feature, but it is very rewarding to find solutions using a limited toolset.

I hope this article helped you get acquainted with self-joins and with the tools you can use to manage them, or -- if you were already familar with them -- that you discovered a new trick or two.

Markus G Fischer
(°v°)

about unrestricted Access

Comments (5)

Markus Fischer

Author

Commented: 2010-05-24

I would like to thank the page editors, especially aikimark for his review even if I was not able to incorporate his suggestions at the time (I might still do so), and both mark_wills and mbizup for moving things along in my absence.

By coincidence, my current job involves hierarchical queries in Oracle, which has special keywords to navigate hierarchies (CONNECT BY, etc.). In several ways, writing this article has been a perfect preparation: I knew exactly what I expected from a hierarchical query... and I wasn't disappointed!

Small trees (in the thousands) can be managed very well in Access, but large trees (above hundred thousand) really benefit from a database engine that manages them natively.

Thanks again!
Markus -- (°v°)

rdroberts

Commented: 2011-01-02

Nice article, and it targets an area I've been researching.

However, it references a Demo database via "The attached file is an...".

I cannot find a link to the Demo within the article. Is it there and I don't see it, or is it somewhere else?

Thanks,

Ron

telyni19

CERTIFIED EXPERT

Commented: 2011-04-28

Great article; very thorough. I just have one point of feedback on the layout though - I missed the demo file section on my first pass through the article, and when I went to look for it, I looked at the top of the article, where it had originally been mentioned, and at the bottom of the article, where files are usually attached for questions, and didn't see it either place. I temporarily thought the glitch encountered by a previous commenter had returned. I finally found it after scanning slowly through the whole article again.

So I think it would have been more intuitive to place the file either at the top where it was first mentioned, or at the very bottom. The conclusion is long enough that someone looking at the bottom of the article won't necessarily see the demo file section placed just above the conclusion.

Dale Fye

Owner, Dev-Soln LLC

Most Valuable Expert 2014

Top Expert 2010

Commented: 2012-04-05

Markus,

Great article. I used it and your Fast Table lookup article to come up with a solution to this problem.

But that created another problem. When I call the GetTable function and pass it the name of a table that is a linked Access table, I get the following error message:

Run-time error '3251':
Operation is not supported for this type of object.

But the linked table contains an index titled "PrimaryKey", so I don't know why the GetTable function is generating an error.

Commented: 2012-04-06

Disregard previous post. I didn't read far enough through the Fast Table Lookups article.