Categories
Technical

Nested Sets and Storing Hierarchical Data In Databases

I recently wrapped up a small project where I had to store hierarchical data in a database.  Essentially, here’s what I started with in code:

    class node {
        node parent;
        string data;
    }

Simple to store in a database, right?  Just a single table with three fields:

int node_id
int parent_node_id
text data

If node_id is your primary key then this should be nice and simple, and in practice it is.  What happens, though, when you get a situation when you have a child node that is hundred of levels down the tree?  You have to do a lookup for every node at every level of the tree structure.  This can lead to serious performance issues on the lookup side of things.  For example, a tree 4 levels deep with each node having 4 children (240 nodes total) will require 241 separate lookups.

On the other hand, inserting new nodes, moving existing nodes (or even whole branches), and deleting single nodes are computationally cheap.  The only catch you might want to avoid is leaving orphan nodes, i.e.: those nodes with no valid parent.  Deleting whole branches will be expensive because you have to do a lookup on all children before you can delete them without leaving orphan nodes.

Assuming the depth and width (number of nodes at a given level) of your tree isn’t significant, this system works quite well.  It is quick, easy to understand, and easy to implement.

One way to optimize this system would be to use a table to keep track of the contents of each branch of the tree by using a table like this:

int branch_parent_node_id
int node_id

In this way you will have all the child (grand-child, great-grand-child, etc…) nodes for each node.  You can then do an inner-join on your select statement to get all the relevant information about each node and re-assemble the tree structure inside your application code.  This requires one query and, when in the code, only requires a single pass of the results to generate the tree.

The disadvantages to this method are fairly obvious.  If you have even a moderately-sized data set, the table will become quite large.  There is a large computational cost to generate the table and it will be incurred whenever the data set changes.  This is especially significant in systems where updates to the data must be presented in real-time.

Until a few weeks ago, this was all I knew about storing hierarchical structures in databases.  I was reading about the new features of FogBugz, specifically how they added the ability to have sub-cases.  The post referenced an article on storing tree structures in SQL using nested sets.

The quick summary is that each the “left and right sides” of each node is numbered based on their overall position in the tree.  Specifically, “a node’s ‘left’ value is set whenever it is first seen during traversal, and the ‘right’ value is set when walking back up the tree away from the node”.  In order to get all of a node’s children (and grand-children, etc…) you just need to select those nodes with left (or right) values between the left and right values of the parent node.

The tree structure is not explicit but is instead implied by the left and right values of the respective nodes.  The lookup performance is excellent but insertions, deletions, and movement of nodes and branches is computationally expensive as you must recalculate many of the nodes’ right and left values.  For applications with a small number of writes (node creation, movement, deletion, etc…) this is definitely a logical way to go.  It is a bit more complex but very effective.

Here are some more links: