Indexing and Searching Objects in the ZODB

Author: Sebastian Ware (jhsware)
Version: unknown

Indexing the contents of your objects allows
you to perform fast, complex search operations.

Introduction

Relational databases provide ad hoc search capability by means of SQL queries. To search objects stored in your ZODB you need to create indexes explicitly. These indexes update automatically when an object is modified, provided an IObjectModifiedEvent is fired for it. The upside of this approach is that you will be inclined to create simple, well-designed indexes that in turn scale well.
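
To make the update mechanism concrete, here is a toy sketch in plain Python of an index that re-reads an attribute only when it is told that an object changed. It does not use zope.catalog or the Zope event machinery; ToyFieldIndex, Doc and notify_modified are hypothetical names invented for this illustration.

```python
class ToyFieldIndex:
    """Maps an attribute value to the ids of the objects that hold it."""

    def __init__(self, attribute):
        self.attribute = attribute
        self._forward = {}   # value -> set of object ids
        self._reverse = {}   # object id -> value currently indexed

    def index_doc(self, docid, obj):
        # Drop any stale entry first, then store the current value.
        self.unindex_doc(docid)
        value = getattr(obj, self.attribute)
        self._forward.setdefault(value, set()).add(docid)
        self._reverse[docid] = value

    def unindex_doc(self, docid):
        if docid in self._reverse:
            old = self._reverse.pop(docid)
            self._forward[old].discard(docid)
            if not self._forward[old]:
                del self._forward[old]

    def search(self, value):
        return set(self._forward.get(value, ()))


class Doc:
    def __init__(self, body):
        self.body = body


def notify_modified(index, docid, obj):
    """Stand-in for a subscriber to a 'modified' event: reindex the object."""
    index.index_doc(docid, obj)


index = ToyFieldIndex('body')
doc = Doc('hello')
index.index_doc(1, doc)

doc.body = 'goodbye'             # changing the object does not touch the index
stale = index.search('hello')    # the old value is still indexed

notify_modified(index, 1, doc)   # "firing the event" brings the index up to date
fresh = index.search('goodbye')
```

The point of the sketch is the middle step: until something announces the change, the index keeps answering with the old value.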

Grok supports the vanilla indexing services available in Zope 3 straight out of the box.

  • FieldIndex: search matching an entire field
  • SetIndex: search for keywords in a field
  • TextIndex: full-text searching
  • ValueIndex: search for values using ranges

You won’t be able to perform SQL-style joins to search related objects. Instead, you can index an adapter that exposes calculated properties.
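
As a rough illustration of the adapter idea, the following plain-Python sketch indexes books by the name of their author, a value that lives on a related object. It is not Grok code; Author, Book, BookSearchable and build_index are hypothetical names, and the dictionary stands in for a real index.

```python
class Author:
    def __init__(self, name):
        self.name = name


class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author   # reference to a related object


class BookSearchable:
    """Adapter-like wrapper exposing values computed from related objects."""

    def __init__(self, context):
        self.context = context

    @property
    def author_name(self):
        # The "join" happens here, once, at indexing time.
        return self.context.author.name


def build_index(books, attribute):
    """Index each book by an attribute of its wrapper, not of the book itself."""
    index = {}
    for docid, book in books.items():
        value = getattr(BookSearchable(book), attribute)
        index.setdefault(value, set()).add(docid)
    return index


books = {
    1: Book('Grok in Practice', Author('Alice')),
    2: Book('ZODB Notes', Author('Bob')),
    3: Book('More Grok', Author('Alice')),
}
by_author = build_index(books, 'author_name')
alice_titles = sorted(books[i].title for i in by_author['Alice'])
```

One consequence of this design is that the indexed value is a snapshot: if the related object changes, the book has to be reindexed before the index reflects it.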

Setup

The egg (package) containing the indexing functionality is called [zc.catalog-x.x.x-py2.x.egg]. It is installed by adding “zc.catalog” to the “install_requires” list in [setup.py]:

install_requires=['setuptools',
              'grok',
              'zc.catalog',
              'hurry.query',
              ],

The “hurry.query” package gives you some simple tools to perform advanced searching.

VERSION PROBLEMS: If you are using Grok <1.1 you need to pin an earlier version of hurry.query in your buildout.cfg file. Otherwise you will get this error: ComponentLookupError: (<InterfaceClass zope.app.intid.interfaces.IIntIds>, '')

[buildout]
...
versions = versions

[versions]
hurry.query = 0.9.2

Don’t forget to re-run buildout.

Example

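# interfaces.py
from zope import schema
from zope.interface import Interface

class IProtonObject(Interface):
    """
    This is an interface to the class whose objects I want to index.
    """
    body = schema.Text(title=u'Body', required=False)

# protonobject.py
import grok
from zope import interface
import interfaces  # the interfaces.py module above, assumed to be importable as a sibling

class ProtonObject(grok.Model):
    """
    This is the actual class.
    """
    interface.implements(interfaces.IProtonObject)

    def __init__(self, body):
        self.body = body
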
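# app.py
import grok
from grok import index
from hurry import query
from hurry.query.query import Query, Text
# hurry.query is a simplified search query language that
# allows you to create ANDs and ORs.

import interfaces  # the interfaces.py module shown above
# ProtonCMS, used below, is assumed to be the grok.Application
# subclass defined in this module.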

class ContentIndexes(grok.Indexes):
    """
    This is where I set up my indexes. I have two indexes:
    one full-text index called "text_body",
    one field index called "body".
    """
    grok.site(ProtonCMS)

    grok.context(interfaces.IProtonObject)
    # grok.context() tells Grok that objects implementing
    # the interface IProtonObject should be indexed.

    grok.name('proton_catalog')
    # grok.name() tells Grok what to call the catalog.
    # If you have named the catalog anything but "catalog"
    # you need to specify the name of the catalog in your
    # queries.

    text_body = index.Text(attribute='body')
    body = index.Field(attribute='body')
    # The attribute='body' parameter is actually unnecessary if the attribute to
    # be indexed has the same name as the index.

class Index(grok.View):
    grok.context(ProtonCMS)

    def search_content(self, search_query):
        # The following query does a search on the field index "body".
        # It returns the objects whose body attribute matches the search
        # term exactly, i.e. search_query == body.
        result_a = Query().searchResults(
            query.Eq(('proton_catalog', 'body'), search_query)
        )

        # The following query does a search on the full-text index "text_body".
        # It returns objects that match the search_query. You can use wildcards
        # and boolean operators.
        #
        # Examples:
        # "grok AND zope" returns objects where "body" contains the words
        # "grok" and "zope"
        # "grok OR dev*" returns objects where "body" contains the word "grok"
        # or any word beginning with "dev"
        result_b = Query().searchResults(
            Text(('proton_catalog', 'text_body'), search_query)
        )

        return result_a, result_b

Setting up a value index

You need to import zc.catalog to index values. First you need to create a Grok-compatible index class.

from zc.catalog.catalogindex import ValueIndex
from grok.index import IndexDefinition

class Value(IndexDefinition):
    index_class = ValueIndex

Then you can use this to create your actual value index in your catalog.

class SiteCatalog(grok.Indexes):
    grok.site(Testvalueindex)
    grok.context(MyObject)
    grok.name('my_catalog')

    counter = Value()

This will index the property “counter” on objects of type “MyObject”. The index supports searches such as greater than, less than and between two values. It also supports sorting.
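
To show why a value index makes range searches cheap, here is a small plain-Python sketch that keeps (value, docid) pairs sorted and answers range queries by slicing. It is only a model of the idea and makes no claim about how zc.catalog stores its data; ToyValueIndex and its methods are hypothetical names.

```python
import bisect


class ToyValueIndex:
    """Keeps (value, docid) pairs sorted so that ranges can be sliced out."""

    def __init__(self):
        self._pairs = []

    def index_doc(self, docid, value):
        bisect.insort(self._pairs, (value, docid))

    def between(self, low, high):
        """Docids whose value v satisfies low <= v <= high, in value order."""
        # (low,) sorts before every (low, docid) pair.
        start = bisect.bisect_left(self._pairs, (low,))
        # (high, inf) sorts after every (high, docid) pair.
        end = bisect.bisect_right(self._pairs, (high, float('inf')))
        return [docid for _, docid in self._pairs[start:end]]

    def ge(self, low):
        """Docids whose value is greater than or equal to low."""
        start = bisect.bisect_left(self._pairs, (low,))
        return [docid for _, docid in self._pairs[start:]]


index = ToyValueIndex()
for docid, counter in [(1, 7), (2, 42), (3, 15), (4, 99), (5, 15)]:
    index.index_doc(docid, counter)

in_range = index.between(10, 50)
at_least_42 = index.ge(42)
```

Because the pairs are already ordered by value, the results come back sorted, which is also why this kind of index can be used for sorting.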

Adding an index to an existing application

In the above example, the indexes are only added when a new application is installed. To add an index to an existing application that already has a catalog, do the following:

import grok
from zope.catalog.interfaces import ICatalog
from zope.component import getUtility
from zope.catalog.field import FieldIndex

app = grok.getSite()
catalog = getUtility(ICatalog, context=app)
name = 'new_index_name'
if name not in catalog:
    catalog[name] = FieldIndex(name, IMyObjects)

This finds your catalog and adds a FieldIndex that will index objects implementing the interface IMyObjects.

Querying the Index Using hurry.query

import grok
from zope.component import getUtility
from hurry.query.interfaces import IQuery
from hurry.query import value

class Index(grok.View):
    grok.context(MyApp)

    def render(self):
        # Range limits come from the request form
        mini = int(self.request.form.get('mini', 1))
        maxi = int(self.request.form.get('maxi', 99))

        query = getUtility(IQuery)
        # The tuple is (catalog name, index name)
        q = value.Between(('my_catalog', 'counter'), mini, maxi)
        res = query.searchResults(q)
        outp = [e.counter for e in res]
        return "%s" % outp

This will display a list of values. If you are using hurry.query 1.1.0 or higher, you can pass sorting options to the query method. If not, you need to get the catalog and sort by calling the index directly.

import grok
from zope.component import getUtility
from zope.catalog.interfaces import ICatalog
from hurry.query.interfaces import IQuery
from hurry.query import value

class Dates(grok.View):
    grok.context(MyApp)

    def render(self):
        mini = int(self.request.form.get('mini', 1))
        maxi = int(self.request.form.get('maxi', 12))
        limit = int(self.request.form.get('limit', 10))

        # Perform the query, returning a result set.
        # mini and maxi must use the same scale as the values
        # stored in the "published" index.
        res = self.findMe(mini, maxi)

        # get the catalog
        content_catalog = getUtility(ICatalog, 'my_catalog')

        # sort the result and return limited result set
        tmp = content_catalog['published'].sort(res.uids, limit=limit)

        # create list of objects
        objs = [res.uidutil.getObject(o) for o in tmp]
        return "%s" % [e.counter for e in objs]

    def findMe(self, mini, maxi):
        # The tuple is (catalog name, index name)
        q = value.Between(('my_catalog', 'published'), mini, maxi)
        query = getUtility(IQuery)
        r = query.searchResults(q)
        return r

This also shows how to find a catalog, which is useful if you want to check statistics on the index or need to update (reindex) the index.
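
The sort-and-limit step can be pictured with the following plain-Python sketch: given the value each uid is indexed under, order the uids of a result set by that value and keep only the first few. This is a model of the idea, not the zc.catalog implementation; sort_uids is a hypothetical helper, and skipping uids that are missing from the index is a choice made for this sketch.

```python
import heapq


def sort_uids(indexed_values, uids, limit=None, reverse=False):
    """Order uids by their indexed value, optionally keeping only `limit` of them."""
    known = [uid for uid in uids if uid in indexed_values]
    key = indexed_values.__getitem__
    if limit is None:
        return sorted(known, key=key, reverse=reverse)
    if reverse:
        return heapq.nlargest(limit, known, key=key)
    return heapq.nsmallest(limit, known, key=key)


# uid -> value stored in a hypothetical "published" index
published = {10: 300, 11: 100, 12: 200, 13: 400}
result_uids = [13, 10, 12, 99]   # uids returned by some query; 99 is not indexed

earliest_two = sort_uids(published, result_uids, limit=2)
latest_two = sort_uids(published, result_uids, limit=2, reverse=True)
everything = sort_uids(published, result_uids)
```

Only uids that are part of the query result are considered, so uid 11 never shows up even though it has the smallest value in the index.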

If you want to index datetime properties, there is a datetime normalizer, which I never got to work. Instead I did something like this:

import grok
from zope.interface import Interface
from zope import schema

class IPublished(Interface):
    published = schema.Int(title=u'Normalized datetime')

def _minuteNormalizer(dt):
    # (year, month, day, hour, minute) expressed in UTC
    tmpin = dt.utctimetuple()[:5]
    # Weight of each field in minutes, counting 31 days per month
    # and 12 such months per year
    multi = (535680, 44640, 1440, 60, 1) # Resolution in minutes
    total = sum(i*j for i,j in zip(tmpin, multi))
    return total

class Published(grok.Adapter):
    grok.implements(IPublished)
    grok.context(MyObj)

    def _published(self):
        return _minuteNormalizer(self.context.published)
    published = property(_published)

class SitePublishCatalog(grok.Indexes):
    grok.site(MyApp)
    grok.context(IPublished)
    grok.name('my_catalog')

    # Value is the IndexDefinition subclass created earlier
    published = Value()

The SitePublishCatalog uses the IPublished adapter to convert the datetime property “published” to an integer. To perform a query you need to normalize your parameters in the same way. Don’t forget timezones or you might get unexpected results. I use the “pytz” egg to get preconfigured timezones.

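from datetime import datetime
from pytz import timezone
from zope.component import getUtility
from hurry.query.interfaces import IQuery
from hurry.query import value

cet = timezone('CET')
# localize() is how pytz attaches a zone to a naive datetime
d_mini = cet.localize(datetime(2010, 1, 1))
d_maxi = cet.localize(datetime(2010, 12, 31))
q = value.Between(('my_catalog', 'published'),
                  _minuteNormalizer(d_mini), _minuteNormalizer(d_maxi))
query = getUtility(IQuery)
r = query.searchResults(q)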
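
The effect of timezones on the normalized value can be checked with the standard library alone. The sketch below repeats the arithmetic of _minuteNormalizer under a different name (minute_normalizer) and uses a fixed +01:00 offset as a stand-in for CET in winter; both are assumptions made so that the example runs without pytz.

```python
from datetime import datetime, timedelta, timezone

MULTIPLIERS = (535680, 44640, 1440, 60, 1)  # year, month, day, hour, minute


def minute_normalizer(dt):
    """Same arithmetic as _minuteNormalizer: UTC fields weighted into one int."""
    fields = dt.utctimetuple()[:5]
    return sum(f * m for f, m in zip(fields, MULTIPLIERS))


utc = timezone.utc
plus_one = timezone(timedelta(hours=1))   # fixed offset, standing in for CET in winter

midnight_local = datetime(2010, 1, 1, 0, 0, tzinfo=plus_one)
same_instant_utc = datetime(2009, 12, 31, 23, 0, tzinfo=utc)
naive_midnight = datetime(2010, 1, 1, 0, 0)   # no tzinfo: fields are used as they are

# Two spellings of the same instant normalize to the same integer.
same = minute_normalizer(midnight_local) == minute_normalizer(same_instant_utc)

# Leaving out the timezone shifts the query boundary by the UTC offset.
gap = minute_normalizer(naive_midnight) - minute_normalizer(midnight_local)
```

Here gap is 60, i.e. a naive datetime moves the boundary by one hour compared with the timezone-aware one. Note that the integers preserve ordering but are not true elapsed minutes, since every month is weighted as 31 days.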

Learning More

The “hurry.query” package contains the doctest “query.txt”, which shows how to perform more complex search queries.