## Effective search under a subtree

Almost every blogging system supports assigning articles into categories and subcategories. Listing articles which belong to a subtree of categories may be quite expensive operation especially if the number of articles is very high. In this article I would like to describe one of possible solutions.

### Category Tree

Tree of article categories could look something like:

- Music

- Rock
- Pop
- Jazz
- Sport

- Snowboarding
- Skiing

- Cross-country skiing
- Downhill skiing
- Kiteboarding
- Traveling

- Adventure
- Leisure

The most simple way of assigning an article to a category would be to give each category an id (e.g. Music=1, Rock=2, Pop=3, Jazz=4, Sport=5,…) and add categoryId column into Article database table. Then if we wanted to get all articles from category Rock we could do it simply by running something like:

`select * from Article where categoryId=2`

But getting all articles from all subtree under Music category will be more complicated, we have to make list of category ids from all Music subtree and query something like:

`select * from Article where categoryId in (1,2,3,4)`

Imagine much larger category tree, the number of categories in a subtree may be high. We wanted something more efficient and still we would like to be able to add new categories into a tree.

### Limited tree

In our project the efficiency was really top priority, therefore we were able to accept slight limitations to our category tree which could give us better performance:

- Maximum tree depth is defined once and then constant. Let's name this
parameter
`maxLevels`

. - Maximum number of children a parent may have is defined once and then
constant. This parameter can be called
`maxChildren`

.

### Limited tree evaluation

Position of every tree node can be described by path to it from the tree root. Look at following tree:

- Music {1}

- Rock {1,1}
- Pop {1,2}
- Jazz {1,3}
- Sport {2}

- Snowboarding {2,1}
- Skiing {2,2}

- Cross-country skiing {2,2,1}
- Downhill skiing {2,2,2}
- Kiteboarding {2,3}
- Traveling {3}

- Adventure {3,1}
- Leisure {3,2}

Path can have maximally `maxLevels`

items and each item in path is
in range from 1 to `maxChildren`

. For simplification we filled paths
shorter then `maxLevels`

by 0s. So for example if ```
maxLevels =
6
```

then path to Pop node is `{1,2,0,0,0,0}`

. Then we can
express a path as `int`

array of size `maxLevels`

.

Now for each node we can calculate a single integral value using this
equation (the operator `^`

means power):

value(path) = path[0]*(maxChildren+1)^(maxLevels-1) + path[1]*(maxChildren+1)^(maxLevels-2) + path[2]*(maxChildren+1)^(maxLevels-3) + ... + path[maxLevels-1]*(maxChildren+1)

Does this equation resemble something to you? Yes you have seen it before, it is equation for conversion a number from numeric system with base (maxLevels+1) into decimal system.

Into Article table we add new column `categoryValue`

into which we
will put the calculated value based on a category an article belongs to.

### Example

Now why we did all this? Let me show you on example. The tree we will use
looks the same as above. At first we have to define `maxLevels`

and
`maxChildren`

parameters. Let's define `maxLevels=4`

and
`maxChildren=7`

(When evaluating tree nodes we will therefore convert
number from octal system into decimal.)

We have written a new article and we want to assign it to category Skiing
(path is {2,2} which is equal to {2,2,0,0}). So we can calculate the
`categoryValue`

column value using equation defined above:

value(Skiing {2,2,0,0}) = 2*8^3 + 2*8^2 + 0*8^1 + 0*8^0 = 1024 + 128 = 1152

Then let's say we want to list all articles from category Sport {2} and below. It means everything between Sport {2} inclusive and Traveling {3} (which is next sibling to Sport) exclusive. So we calculate lower and upper limits:

lower = value(Sport {2,0,0,0}) = 2*8^3 + 0*8^2 + 0*8^1 + 0*8^0 = 1024 upper = value(Traveling {3,0,0,0}) = 3*8^3 + 0*8^2 + 0*8^1 + 0*8^0 = 1536

The sql to get all articles from under the Sport node is then:

select * from Article where categoryValue >= :lower and categoryValue < :upper

It looks more complicated then it really is. For testing we use
`maxChildren=9`

because then paths form direcly decimal numbers. For
example when `maxLevels=5`

then:

value(2,3,4) = value(2,3,4,0,0) = 23400

### Real life

Tree parameters are limited by type we store calculated value in. We used
`long`

java type which is 64 bits type. We set
`maxLevels=6`

and `maxChildren=10`

. It is sufficient for
our needs and very fast.

Nice example of a limited tree implementation ;) indeed looks fast. Have you benchmarked it somehow?

Not yet, there are several techniques of handling trees in relational db, maybe it would be nice to have some overall comparison. However speed of listing articles from under a subtree is not the only thing to measure, there are other factors such as speed of inserting new subcategory, speed of assigning an article to a subcategory, etc. The mechanism described here is probably faster than other in assigning article to a category, adding new category and listing articles from category. But it is because it limits the depth and width of a tree.