Generic indices in GenieDB

Posted on September 27, 2010 by GenieDB

We’re gearing up to release v0.5 of GenieDB right now, and the biggest new feature in v0.5 is the generic indexing framework.

Originally, GenieDB provided indices on fields of records. As we do not mandate a per-table schema, each record in a table could have totally different fields; so when an index is created on the field “foo” of the table “bar”, we index any record in “bar” that has a “foo” field under that value in the index. If the record did not have a “foo” field, then it doesn’t appear in the index at all (whereas records with a NULL value for “foo” are indexed under NULL, of course).

However, expecting future interesting developments, we made sure this part of the code – figuring out how a given record should appear in a given index – would be easy to extend in future. In general, there could be any function from records to zero or more index entries.

For v0.5 we filled in some more of that infrastructure; we’ve not brought it to its logical conclusion yet. But to start with, we’ve extended the type system with a compound type, which basically represents a sequence of typed component values. This means that any field of a record might be a compound of other values – which, in turn, might be compounds; arbitrary tree structures can therefore be stored.

Our next trick was to extend the indexing system to support compound indices. These refer to a list of fields in their specification, and the indexed value is then a compound made by assembling the values of the named fields from the records. Since each element in the compound has to be associated with a particular field, we have to “stuff” missing fields in the record with NULLs.

The consequence of this is that we can have compound indices, as found in SQL. Which is useful, as we’ve used this facility to extend our MySQL interface to support compound indices in MySQL using it.

However, here at GenieDB, we don’t like to do things by half! Rather than tinkering with how indices work in the low-level indexing engine, we’ve just extended the function that extracts a record’s “index image” to create compound values, which are then indexed. In future, we can add new kinds of indexing function. How about one that splits a string field into words, stems them, maps them to synonym root words with a thesaurus, removes any words found in a stoplist, and then indexes the record under that list of words? That would mean that a record with the indexed field containing “GenieDB developers work too hard!” would appear in the index under “geniedb”, “developer”, “work”, “too”, and “hard”. We could expose this type of index as a free-text index under MySQL, and then support free-text queries with relatively little work. Or how about an index that maps two numeric fields representing latitude and longitude (with an optional third for altitude?) into a three-dimensional Cartesian coordinate relative to the center of the planet, and then combines those three numbers into a Z-order curve and indexes the record under the resulting integer? We’ll then have a geospatial index. How about an index that applies the Double Metaphone algorithm to the words of a string field and indexes the record under the metaphone codes as well as the actual words, with different prefix characters to distinguish them? That’ll give us a fuzzy index that we can search for a word by searching for the literal word (with the appropriate prefix character) as well as for the Double Metaphone codes of the word (with the other prefix character), meaning we will find direct matches as well as approximate sounds-like matches.

But adding more and more index-generation functions into the core isn’t what we want to do. That way lies bloat! What will be much more useful is to embed a suitably sandboxed and lightweight embedded language, so that users can provide their own indexing function that does delightful application-specific things. Things like free-text indices will need special support in the MySQL plugin to appear as MySQL free-text indices, but any type of index might be exposed as a virtual read-only column to be searched on to work around MySQL’s limited view of what an index can do; MySQL doesn’t allow indices on general expressions, but we can provide that functionality directly when we interface with databases that do.

GenieDB’s provides cloud MySQL database-as-a-service that improves database performance with increased availability, fast response time, and is elastically scalable. Database administrative tasks are automated so you can focus on your core business and application development.

  • http://www.hybrid-cluster.com/ Luke Marsden

    Seriously cool. Good work!