Overview     Usage     Performance Notes     Links tst 0.68 - 23 Jan 2009

Implementation Details

(In the following, "original" and "DDJ" code refers to the article by Bentley/Sedgewick published in Dr Dobb's Journal, and the accompanying C source code - see Links)

In most implementations, a ternary tree node has the following members:

 struct node {
    char splitchar;  // key letter, or 0 at end of key
    node *hikid;     // subtree of keys with higher value than splitchar
    node *eqkid;     // subtree matching splitchar (pointer to mapped value at end-of-key node)
    node *lokid;     // subtree less than splitchar
    node *parent;    // necessary for iteration (not needed for insert/find)
 }; 

This means that each node is 1 char plus three or four pointers size. On many systems, struct member alignment makes the char member consume size of one pointer as well, so we have 4 (or 5) x sizeof(pointer) per node in the tree. With several kinds of dictionaries, the node count ends up at around 0.3-0.5 times total key length (since keys share nodes). This is even more expensive on 64-bit machines.

There are several variation points in the node class:

  1. the DDJ C code designates an invalid value of zero to indicate end-of-string. We want to allow any string as key, so the end-of-string representation should change. We note that on many platforms, C/C++ struct member alignment leaves a "hole" in the binary representation of the node, between the char and the first pointer ("hikid"). On such systems there is no space cost to use another char-sized value to indicate end node. This also works for wchar_t strings on 32- or 64-bit systems.
  2. The original code stores a value for each string in the terminal node's "equal" pointer. The value in DDJ code is always a pointer to the terminated string. This is used to make advanced searches work (they return an array of pointers to strings stored in end-nodes). In reality this means that strings may need to be copied on insertion (not reflected in DDJ timings).
  3. Original DDJ code does not support iterating over strings in the tree. Idiomatic STL-like container style strongly suggests that iteration should be supported. This is fairly simple to implement if a parent pointer is added to the node struct: Because when an end-node is reached, the iterator must backtrack to find the previous branch point.

The parent pointer also makes it possible to recover the inserted string by walking nodes backward from a terminal node to the root. Complexity is key length, plus log(tree.size), but it means inserted keys do not have to be copied to the end node. We opt to cache keys in iterators, at no measurable extra cost in iteration.

Instead of the key, an arbitrary value can be associated with endnodes. However, it should not be allowed to increase node size, since most nodes in the tree are not endnodes. In this library we store the mapped value directly in end-node if it is <= sizeof(void*). Larger objects are allocated on the heap, and a pointer to the copy is stored in end-node (the copy is managed by the tree).

Now for some optimization

We use a vector<node> as pool allocator, and record eq-hi-lo links as vector index instead of pointers. The pool allocation essentially follows original C code insert2() principle. For us, it also simplifies reallocation, since pointers do not have to be rebound; the indices are always valid. This has the following consequences:

(In our binary-cognizant version where zero is a regular char value, this still holds, we just change the end-node test accordingly)

In the final cut, our node struct data members appear roughly like this:

 struct node {
    CharType splitchar;  // key letter, or 0 at end of key (to make sure lokid is never allowed)
    CharType endflag;    // zero on normal nodes, 1 at end nodes, 2 at erased nodes.
    node_index hikid;    // subtree of keys with higher value than splitchar
    node_index lokid;    // subtree less than splitchar
    node_index parent;   // necessary for iteration (not needed for insert/find)
 }; 

where CharType is defined by template Key::value_type, and treated as an unsigned type (so 0 is the lowest value); and node_index is a size_t -like type used by the node storage backend (currently std::vector).

This optimization could also be applied to C version, trimming space requirement in DDJ code to 3-word nodes.


ternary_tree 0.68 -- by rasmus ekman -- Page generated with Doxygen 1.5.6 on 23 Jan 2009