Archív pro ‘Programming’ Kategorie

Effective search under a subtree

Almost every blogging system supports assigning articles into categories and subcategories. Listing articles which belong to a subtree of categories may be quite expensive operation especially if the number of articles is very high. In this article I would like to describe one of possible solutions.

Category Tree

Tree of article categories could look something like:

  • Music
    • Rock
    • Pop
    • Jazz
  • Sport
    • Snowboarding
    • Skiing
      • Cross-country skiing
      • Downhill skiing
    • Kiteboarding
  • Traveling
    • Adventure
    • Leisure

The most simple way of assigning an article to a category would be to give each category an id (e.g. Music=1, Rock=2, Pop=3, Jazz=4, Sport=5,…) and add categoryId column into Article database table. Then if we wanted to get all articles from category Rock we could do it simply by running something like:

select * from Article where categoryId=2

But getting all articles from all subtree under Music category will be more complicated, we have to make list of category ids from all Music subtree and query something like:

select * from Article where categoryId in (1,2,3,4)

Imagine much larger category tree, the number of categories in a subtree may be high. We wanted something more efficient and still we would like to be able to add new categories into a tree.

Limited tree

In our project the efficiency was really top priority, therefore we were able to accept slight limitations to our category tree which could give us better performance:

  • Maximum tree depth is defined once and then constant. Let's name this parameter maxLevels.
  • Maximum number of children a parent may have is defined once and then constant. This parameter can be called maxChildren.

Limited tree evaluation

Position of every tree node can be described by path to it from the tree root. Look at following tree:

  • Music {1}
    • Rock {1,1}
    • Pop {1,2}
    • Jazz {1,3}
  • Sport {2}
    • Snowboarding {2,1}
    • Skiing {2,2}
      • Cross-country skiing {2,2,1}
      • Downhill skiing {2,2,2}
    • Kiteboarding {2,3}
  • Traveling {3}
    • Adventure {3,1}
    • Leisure {3,2}

Path can have maximally maxLevels items and each item in path is in range from 1 to maxChildren. For simplification we filled paths shorter then maxLevels by 0s. So for example if maxLevels = 6 then path to Pop node is {1,2,0,0,0,0}. Then we can express a path as int array of size maxLevels.

Now for each node we can calculate a single integral value using this equation (the operator ^ means power):

value(path) = path[0]*(maxChildren+1)^(maxLevels-1)
 + path[1]*(maxChildren+1)^(maxLevels-2)
 + path[2]*(maxChildren+1)^(maxLevels-3)
 + ...
 + path[maxLevels-1]*(maxChildren+1)

Does this equation resemble something to you? Yes you have seen it before, it is equation for conversion a number from numeric system with base (maxLevels+1) into decimal system.

Into Article table we add new column categoryValue into which we will put the calculated value based on a category an article belongs to.


Now why we did all this? Let me show you on example. The tree we will use looks the same as above. At first we have to define maxLevels and maxChildren parameters. Let's define maxLevels=4 and maxChildren=7 (When evaluating tree nodes we will therefore convert number from octal system into decimal.)

We have written a new article and we want to assign it to category Skiing (path is {2,2} which is equal to {2,2,0,0}). So we can calculate the categoryValue column value using equation defined above:

value(Skiing {2,2,0,0}) = 2*8^3 + 2*8^2 + 0*8^1 + 0*8^0 = 1024 + 128 = 1152

Then let's say we want to list all articles from category Sport {2} and below. It means everything between Sport {2} inclusive and Traveling {3} (which is next sibling to Sport) exclusive. So we calculate lower and upper limits:

lower = value(Sport {2,0,0,0}) = 2*8^3 + 0*8^2 + 0*8^1 + 0*8^0 = 1024
upper = value(Traveling {3,0,0,0}) = 3*8^3 + 0*8^2 + 0*8^1 + 0*8^0 = 1536

The sql to get all articles from under the Sport node is then:

select * from Article where categoryValue >= :lower and categoryValue < :upper

It looks more complicated then it really is. For testing we use maxChildren=9 because then paths form direcly decimal numbers. For example when maxLevels=5 then:

value(2,3,4) = value(2,3,4,0,0) = 23400

Real life

Tree parameters are limited by type we store calculated value in. We used long java type which is 64 bits type. We set maxLevels=6 and maxChildren=10. It is sufficient for our needs and very fast.

Jan Šmuk Programming