Effective search under a subtree
Almost every blogging system supports assigning articles into categories and subcategories. Listing articles which belong to a subtree of categories may be quite expensive operation especially if the number of articles is very high. In this article I would like to describe one of possible solutions.
Category Tree
Tree of article categories could look something like:
- Music
- Rock
- Pop
- Jazz
- Sport
- Snowboarding
- Skiing
- Cross-country skiing
- Downhill skiing
- Kiteboarding
- Traveling
- Adventure
- Leisure
The most simple way of assigning an article to a category would be to give each category an id (e.g. Music=1, Rock=2, Pop=3, Jazz=4, Sport=5,…) and add categoryId column into Article database table. Then if we wanted to get all articles from category Rock we could do it simply by running something like:
select * from Article where categoryId=2
But getting all articles from all subtree under Music category will be more complicated, we have to make list of category ids from all Music subtree and query something like:
select * from Article where categoryId in (1,2,3,4)
Imagine much larger category tree, the number of categories in a subtree may be high. We wanted something more efficient and still we would like to be able to add new categories into a tree.
Limited tree
In our project the efficiency was really top priority, therefore we were able to accept slight limitations to our category tree which could give us better performance:
- Maximum tree depth is defined once and then constant. Let's name this
parameter
maxLevels
. - Maximum number of children a parent may have is defined once and then
constant. This parameter can be called
maxChildren
.
Limited tree evaluation
Position of every tree node can be described by path to it from the tree root. Look at following tree:
- Music {1}
- Rock {1,1}
- Pop {1,2}
- Jazz {1,3}
- Sport {2}
- Snowboarding {2,1}
- Skiing {2,2}
- Cross-country skiing {2,2,1}
- Downhill skiing {2,2,2}
- Kiteboarding {2,3}
- Traveling {3}
- Adventure {3,1}
- Leisure {3,2}
Path can have maximally maxLevels
items and each item in path is
in range from 1 to maxChildren
. For simplification we filled paths
shorter then maxLevels
by 0s. So for example if maxLevels =
6
then path to Pop node is {1,2,0,0,0,0}
. Then we can
express a path as int
array of size maxLevels
.
Now for each node we can calculate a single integral value using this
equation (the operator ^
means power):
value(path) = path[0]*(maxChildren+1)^(maxLevels-1) + path[1]*(maxChildren+1)^(maxLevels-2) + path[2]*(maxChildren+1)^(maxLevels-3) + ... + path[maxLevels-1]*(maxChildren+1)
Does this equation resemble something to you? Yes you have seen it before, it is equation for conversion a number from numeric system with base (maxLevels+1) into decimal system.
Into Article table we add new column categoryValue
into which we
will put the calculated value based on a category an article belongs to.
Example
Now why we did all this? Let me show you on example. The tree we will use
looks the same as above. At first we have to define maxLevels
and
maxChildren
parameters. Let's define maxLevels=4
and
maxChildren=7
(When evaluating tree nodes we will therefore convert
number from octal system into decimal.)
We have written a new article and we want to assign it to category Skiing
(path is {2,2} which is equal to {2,2,0,0}). So we can calculate the
categoryValue
column value using equation defined above:
value(Skiing {2,2,0,0}) = 2*8^3 + 2*8^2 + 0*8^1 + 0*8^0 = 1024 + 128 = 1152
Then let's say we want to list all articles from category Sport {2} and below. It means everything between Sport {2} inclusive and Traveling {3} (which is next sibling to Sport) exclusive. So we calculate lower and upper limits:
lower = value(Sport {2,0,0,0}) = 2*8^3 + 0*8^2 + 0*8^1 + 0*8^0 = 1024 upper = value(Traveling {3,0,0,0}) = 3*8^3 + 0*8^2 + 0*8^1 + 0*8^0 = 1536
The sql to get all articles from under the Sport node is then:
select * from Article where categoryValue >= :lower and categoryValue < :upper
It looks more complicated then it really is. For testing we use
maxChildren=9
because then paths form direcly decimal numbers. For
example when maxLevels=5
then:
value(2,3,4) = value(2,3,4,0,0) = 23400
Real life
Tree parameters are limited by type we store calculated value in. We used
long
java type which is 64 bits type. We set
maxLevels=6
and maxChildren=10
. It is sufficient for
our needs and very fast.
Nice example of a limited tree implementation ;) indeed looks fast. Have you benchmarked it somehow?
Not yet, there are several techniques of handling trees in relational db, maybe it would be nice to have some overall comparison. However speed of listing articles from under a subtree is not the only thing to measure, there are other factors such as speed of inserting new subcategory, speed of assigning an article to a subcategory, etc. The mechanism described here is probably faster than other in assigning article to a category, adding new category and listing articles from category. But it is because it limits the depth and width of a tree.