Structured Associative Containers

Ternary Search Tree containers to replace set<string> and map<string, Value>

Introduction
Advanced searches overview
Tutorial
Reference: Structured Container concept
Class structured_set
Class structured_map
Class structured_multiset
Class structured_multimap
Implementation class ternary_tree
Performance notes
Implementation details
Links
Test Suite

Download: Latest version (0.684) http://abc.se/~re/code/tst/ternary_tree.zip

Copyleft: rasmus ekman 2007-2009
Weblink: http://abc.se/~re/code/tst

Introduction

Structured containers are map and set -like containers specialized for strings. They are commonly used for dictionaries.
Structured containers have two major benefits:

They offer near-match searches (wildcard search, partial match etc) that are hard to implement with other containers.
Lookup performance is on a par with hashed containers for many common applications, and 2-5 times faster than standard maps and sets (with string-like keys).

Of course there is a price to pay: structured containers use much more memory than other containers: Around 6-8 bytes per letter inserted (whether char or wchar_t); an English 150 k word dictionary uses eg 7.3 MB to store 1.2 MB words (2.4 MB of wchar_t words).

The container classes in this library can be used as drop-in replacements for set and map (or unordered_set, unordered_map):

structured_set: This stores unique keys and allows structured key searches.
structured_multiset: This stores non-unique keys.
structured_map: This is a Pair Associative Container, as it allows associating a value with each key.
structured_multimap: Technically, a Multiple, Sorted, Pair Associative Container - it allows storing several values with each key.

While the STL standard associative containers are normally backed by a binary tree structure, Structured Containers are backed by a Ternary Search Tree, as presented by Jon Bentley and Robert Sedgewick in [1].

Class ternary_tree<Key, Value, Comp, Alloc> provides the implementation backend. Due to its internals, its interface cannot easily be made to conform with standard STL concepts, so it is used internally by the structured* wrapper classes (much like STL's internal rb_tree class).

Basically, if you have code using sets or maps, you have code to use structured containers. And with 1-3 lines of code, you're ready to make advanced imprecise searches in your dictionaries.
See the usage section for examples of how to use these classes.

Library status

Compatibility: Note that the file tst_concept_checks.cpp is currently broken. Will investigate.

version 0.684: (Jan 2009) Fix standard-breakage in multimap/multiset return from insert(const value_type&).
Added operator-> to iterator wrapper for C++0x compatibility. Thanks to Geoffrey Noel for reports.

version 0.683: (March 2007) Fix portability issues for GCC and non-STLport libraries. Fix longest_match.
Thanks to Arjen Wagenaar for several reports, fixes and encouragement. Thanks also to Michel Tourn for reports.

version 0.68: (Dec 2006) Implement TST_NODE_COUNT_TYPE macro, which can be used to control node size on 64-bit systems. See class ternary_tree

version 0.68 (alpha): Reimplemented node type. Do proper management of value type (was inconsistent, partly unimplemented - duh!)

Library status
Compatibility:	Note that the file tst_concept_checks.cpp is currently broken. Will investigate.
version 0.684: (Jan 2009)	Fix standard-breakage in multimap/multiset return from `insert(const value_type&)`. Added `operator->` to iterator wrapper for C++0x compatibility. Thanks to Geoffrey Noel for reports.
version 0.683: (March 2007)	Fix portability issues for GCC and non-STLport libraries. Fix longest_match. Thanks to Arjen Wagenaar for several reports, fixes and encouragement. Thanks also to Michel Tourn for reports.
version 0.68: (Dec 2006)	Implement TST_NODE_COUNT_TYPE macro, which can be used to control node size on 64-bit systems. See class ternary_tree
version 0.68 (alpha):	Reimplemented node type. Do proper management of value type (was inconsistent, partly unimplemented - duh!)

Sub-key, or Structure Searches

(a new interface for these searches will be specified in the future)

Ternary trees allow searches that match parts of keys and ignores mismatches in other parts.
In the current interface we specify a small number of searches facilitated by the tree structure; the Partial Match and Hamming searches are defined in several other implementations (showcased in Bentley and Sedgewick code). The Levenshtein and combinatorial searches are not found in other ternary trees (that I know of).

Name (function name) Description

Prefix match (prefix_range) Finds keys sharing a common prefix, returns a pair of iterators.

Longest match (longest_match) Finds the longest key that matches beginning of search string. A typical application is to tokenize a string using the ternary tree as dictionary.

Partial match, or wildcard search (partial_match_search) Accepts a search string with wildcard characters that will match any letter, eg "b?nd" would match "band", "bend", "bind", "bond" in an English dictionary.

Search allowing N mismatches, (hamming_search) Accepts a search string and an integer dist indicating how many non-matching letters are allowed, then finds keys matching search string that have at most dist mismatches. This works like a partial match search with all combinations of dist wildcards in the search string.
hamming_search("band", 1) matches the wildcard search plus "bald", "bane" and "wand", etc.
The version here, following DDJ code, extends the strict Hamming search by also allowing shorter and longer strings; a search for "band", dist = 1, also finds "ban" and "bandy" etc.
See also http://wikipedia.org/wiki/Hamming_distance

Levenshtein distance search (levenshtein_search - consider descriptive name)
Hamming search matches characters in fixed position, allowing substitution of dist chars. Levenshtein search also allows shifting parts of the search string by insertion or skipping chars (in dist places). So levenshtein_search("band", 1) extends the hamming_search set with "and" and "bland", etc. A typical application is to match mispelt words.
See also http://wikipedia.org/wiki/Levenshtein_distance

Combinatorial or "scrabble" search (combinatorial_search) Finds all keys using the characters in search string. combinatorial_search("band") finds "ad", "and", "bad", "dab", "nab", etc. A count of wildcards can be added, also allowing nonmatching characters (use with care, values over 10% of average key length may cause the algorithm to traverse a large part of the tree).

Name (function name)	Description
Prefix match (prefix_range)	Finds keys sharing a common prefix, returns a pair of iterators.
Longest match (longest_match)	Finds the longest key that matches beginning of search string. A typical application is to tokenize a string using the ternary tree as dictionary.
Partial match, or wildcard search (partial_match_search)	Accepts a search string with wildcard characters that will match any letter, eg "b?nd" would match "band", "bend", "bind", "bond" in an English dictionary.
Search allowing `N` mismatches, (hamming_search)	Accepts a search string and an integer `dist` indicating how many non-matching letters are allowed, then finds keys matching search string that have at most dist mismatches. This works like a partial match search with all combinations of dist wildcards in the search string. `hamming_search`("band", 1) matches the wildcard search plus "bald", "bane" and "wand", etc. The version here, following DDJ code, extends the strict Hamming search by also allowing shorter and longer strings; a search for "band", dist = 1, also finds "ban" and "bandy" etc. See also http://wikipedia.org/wiki/Hamming_distance
Levenshtein distance search (levenshtein_search - consider descriptive name)	Hamming search matches characters in fixed position, allowing substitution of dist chars. Levenshtein search also allows shifting parts of the search string by insertion or skipping chars (in dist places). So `levenshtein_search("band", 1)` extends the hamming_search set with "and" and "bland", etc. A typical application is to match mispelt words. See also http://wikipedia.org/wiki/Levenshtein_distance
Combinatorial or "scrabble" search (combinatorial_search)	Finds all keys using the characters in search string. `combinatorial_search`("band") finds "ad", "and", "bad", "dab", "nab", etc. A count of wildcards can be added, also allowing nonmatching characters (use with care, values over 10% of average key length may cause the algorithm to traverse a large part of the tree).

See advanced search overview in the tutorial.

These searches are defined for all containers in this library. But they are also marked as deprecated (to be replaced by generic algorithms with same interface). For a relative performance comparison of imprecise searches, see the second table in Performance Notes.

Future directions

The searches currently defined are clearly special cases in a sea of search possibilities. We have only defined searches that are relatively efficient, compared to other combinations of containers and algorithms. But there can be many variations on the available searches: increasing Hamming/Levenshtein distance at the end of words, or matching limited ranges of characters (eg allowing mismatches only in vowels), etc.

The next step for this project is to support a more flexible low-level interface for traversing and filtering tree nodes. The interface for these "structured searches" is open for consideration, but it will basically define sub-key iterators, conversion of full-key from sub-key iterators, and a small collection of algorithms operating on these sub-key iterators.

At least the following operations are needed:

sub-key match: matching a part of a key (prefix, or starting from current char position)
key element range increment: from a sub-key position, match a range of characters in next position (returns a list of sub-key iterators? - or iterator-like operation?)
conversion from sub-key iterator to full-key iterator range (nearest and post-furthest keys in the subtree)
is_key(subkey_iterator pos): true if end-of-key exists at iterator position.
count_elements(subkey_iterator pos): returns number of available key elements at position.
In all predefined algorithms above, either a specific, or any char is matched, we would also support arbitrary char sets (possibly with special case for char ranges).

Structured Associative Containers

Table of contents

Introduction

Sub-key, or Structure Searches

Future directions