Indexing and Searching Objects in the ZODB
==========================================

:Author: Sebastian Ware (jhsware)
:Version: unknown

|
| Indexing the contents of your objects allows
| you to perform fast complex search operations.

Introduction
------------

Relational databases provide ad hoc search capability by means of SQL queries.
In order to perform search operations on objects stored in your ZODB you need
to explicitly create indexes. These indexes will update automatically when an
object is modified, provided an IObjectModifiedEvent is fired for it. The
upside to this approach is that you will be inclined to create simple and well
designed indexes that in turn will scale well.

Grok supports the vanilla indexing services available in Zope 3 straight out of
the box.

* FieldIndex: search matching an entire field
* SetIndex: search for keywords in a field
* TextIndex: full-text searching
* ValueIndex: search for values using ranges (provided by zc.catalog, see
  "Setting up a value index")

You won't be able to perform SQL-style joins to search related objects. Instead
you could index an adapter with calculated properties.

Setup
-----

The egg (package) containing the indexing functionality is called
[zc.catalog-x.x.x-py2.x.egg]. The package is installed by including
"zc.catalog" in the list "install_requires" in [setup.py]:

.. code-block:: python

    install_requires=['setuptools',
                      'grok',
                      'zc.catalog',
                      'hurry.query',
                      ],

The "hurry.query" package gives you some simple tools to perform advanced
searching.

VERSION PROBLEMS: If you are using Grok <1.1 you need to pin down an earlier
version of hurry.query in your buildout.cfg file. The error you will experience
is::

    ComponentLookupError: (,'')

.. code-block:: ini

    [buildout]
    ...
    versions = versions

    [versions]
    hurry.query = 0.9.2

Don't forget to re-run buildout.

Example
-------

.. code-block:: python

    # interfaces.py
    from zope.interface import Interface
    from zope import schema

    class IProtonObject(Interface):
        """
        This is an interface to the class whose objects I want to index.
        """
        body = schema.Text(title=u'Body', required=False)

.. code-block:: python

    # protonobject.py
    import grok
    from zope import interface

    import interfaces  # the interfaces.py module shown above

    class ProtonObject(grok.Model):
        """
        This is the actual class.
        """
        interface.implements(interfaces.IProtonObject)

        def __init__(self, body):
            self.body = body

.. code-block:: python

    # app.py
    import grok
    from grok import index
    from hurry import query
    from hurry.query.query import Query, Text
    # hurry.query is a simplified search query language that
    # allows you to create ANDs and ORs.

    import interfaces  # the interfaces.py module shown above

    # ProtonCMS is the application class (a grok.Application); it is
    # assumed to be defined in this module.

    class ContentIndexes(grok.Indexes):
        """
        This is where I set up my indexes. I have two indexes;
        one full-text index called "text_body",
        one field index called "body".
        """
        grok.site(ProtonCMS)

        grok.context(interfaces.IProtonObject)
        # grok.context() tells Grok that objects implementing
        # the interface IProtonObject should be indexed.

        grok.name('proton_catalog')
        # grok.name() tells Grok what to call the catalog.
        # If you have named the catalog anything but "catalog"
        # you need to specify the name of the catalog in your
        # queries.

        text_body = index.Text(attribute='body')
        body = index.Field(attribute='body')
        # The attribute='body' parameter is actually unnecessary if the
        # attribute to be indexed has the same name as the index.

    class Index(grok.View):
        grok.context(ProtonCMS)

        def search_content(self, search_query):
            # The following query does a search on the field index "body".
            # It will return a list of objects where the entire content of
            # the body attribute matches the search term exactly,
            # i.e. search_query == body
            result_a = Query().searchResults(
                query.Eq(('proton_catalog', 'body'), search_query)
            )

            # The following query does a search on the full-text index
            # "text_body". It will return objects that match the
            # search_query. You can use wildcards and boolean operators.
            #
            # Examples:
            # "grok AND zope" returns objects where "body" contains the
            #     words "grok" and "zope"
            # "grok or dev*" returns objects where "body" contains the
            #     word "grok" or any word beginning with "dev"
            result_b = Query().searchResults(
                Text(('proton_catalog', 'text_body'), search_query)
            )

            return result_a, result_b

Setting up a value index
------------------------

You need to import zc.catalog to index values. First you need to create a Grok
compatible index class.

.. code-block:: python

    from zc.catalog.catalogindex import ValueIndex
    from grok.index import IndexDefinition

    class Value(IndexDefinition):
        index_class = ValueIndex

Then you can use this to create your actual value index in your catalog.

.. code-block:: python

    class SiteCatalog(grok.Indexes):
        grok.site(Testvalueindex)
        grok.context(MyObject)
        grok.name('my_catalog')

        counter = Value()

This will index the property "counter" on objects of type "MyObject". This
index supports searches such as greater than, less than and in between. It also
supports sorting.

Adding an index to an existing application
------------------------------------------

In the above example, the indexes are only added when a new application is
installed. If you want to add an index to an existing application and you
have a catalog this is what you do::

    import grok
    from zope.catalog.interfaces import ICatalog
    from zope.component import getUtility
    from zope.catalog.field import FieldIndex

    app = grok.getSite()
    # If the catalog was registered under a name (for example with
    # grok.name('my_catalog')), pass that name to getUtility as well.
    catalog = getUtility(ICatalog, context=app)
    name = 'new_index_name'
    if name not in catalog:
        catalog[name] = FieldIndex(name, IMyObjects)

This finds your catalog and adds a FieldIndex that will index objects
implementing the interface *IMyObjects*.

Querying the Index Using hurry.query
------------------------------------

.. code-block:: python

    import grok
    from zope.component import getUtility
    from hurry.query.interfaces import IQuery
    from hurry.query import value

    class Index(grok.View):
        grok.context(MyApp)

        def render(self):
            mini = int(self.request.form.get('mini', 1))
            maxi = int(self.request.form.get('maxi', 99))
            query = getUtility(IQuery)
            # The first element of the tuple is the catalog name given
            # with grok.name(), the second is the index name.
            q = value.Between(('my_catalog', 'counter'), mini, maxi)
            res = query.searchResults(q)
            outp = [e.counter for e in res]
            return "%s" % outp

This will display a list of values. If you are using hurry.query 1.1.0 or
higher, you can pass sorting options to the query method. If not, you need to
get the catalog and sort by calling the index directly.

.. code-block:: python

    import grok
    from zope.component import getUtility
    from zope.catalog.interfaces import ICatalog
    from hurry.query.interfaces import IQuery
    from hurry.query import value

    class Dates(grok.View):
        grok.context(MyApp)

        def render(self):
            mini = int(self.request.form.get('mini', 1))
            maxi = int(self.request.form.get('maxi', 12))
            limit = int(self.request.form.get('limit', 10))
            # Perform the query, returning a result set
            res = self.findMe(mini, maxi)
            # get the catalog
            content_catalog = getUtility(ICatalog, 'my_catalog')
            # sort the result and return limited result set
            tmp = content_catalog['published'].sort(res.uids, limit=limit)
            # create list of objects
            objs = [res.uidutil.getObject(o) for o in tmp]
            return "%s" % [e.counter for e in objs]

        def findMe(self, mini, maxi):
            q = value.Between(('my_catalog', 'published'), mini, maxi)
            query = getUtility(IQuery)
            r = query.searchResults(q)
            return r

This also shows how to find a catalog, which is useful if you want to check
statistics on the index or need to update (reindex) the index.

If you want to index datetime properties, there is a datetime normalizer which
I never got to work. Instead I did something like this.

.. code-block:: python

    import grok
    from zope.interface import Interface
    from zope import schema

    class IPublished(Interface):
        published = schema.Int(title=u'Normalized datetime')

    def _minuteNormalizer(dt):
        # (year, month, day, hour, minute) in UTC
        tmpin = dt.utctimetuple()[:5]
        # Minutes per year (12 * 31 days), month (31 days), day, hour, minute
        multi = (535680, 44640, 1440, 60, 1)  # Resolution in minutes
        return sum(i * j for i, j in zip(tmpin, multi))

    class Published(grok.Adapter):
        grok.implements(IPublished)
        grok.context(MyObj)

        def _published(self):
            return _minuteNormalizer(self.context.published)
        published = property(_published)

    class SitePublishCatalog(grok.Indexes):
        grok.site(MyApp)
        grok.context(IPublished)
        grok.name('my_catalog')

        published = Value()

The SitePublishCatalog uses the IPublished() adapter to convert the datetime
property "published" to an integer. The resulting integers are only meant for
comparison and sorting; they are not real elapsed minutes, because every month
is counted as 31 days.

In order to perform a query you will need to normalize your parameters too.
Don't forget timezones or you might get unexpected results. I use the "pytz"
egg to get preconfigured timezones.

.. code-block:: python

    from datetime import datetime
    from pytz import timezone
    from zope.component import getUtility
    from hurry.query.interfaces import IQuery
    from hurry.query import value

    cet = timezone('CET')
    # With pytz, attach the zone using localize() rather than the
    # tzinfo= argument of the datetime constructor.
    d_mini = cet.localize(datetime(2010, 1, 1))
    d_maxi = cet.localize(datetime(2010, 12, 31))
    q = value.Between(('my_catalog', 'published'),
                      _minuteNormalizer(d_mini),
                      _minuteNormalizer(d_maxi))
    query = getUtility(IQuery)
    r = query.searchResults(q)

Learning More
-------------

The "hurry.query" package contains the DocTest "query.txt" that shows how to
perform more complex search queries.
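
Returning to the datetime example: the minute normalizer can be tried on its
own, outside Grok, to see why timezones matter. The following is only a
minimal sketch. It assumes Python 3's standard "datetime.timezone" with a fixed
+01:00 offset as a stand-in for the pytz "CET" zone (so it ignores daylight
saving time), and the names "minute_normalizer", "CET" and "UTC" are
illustrative, not part of Grok or hurry.query.

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    # Same weights as in _minuteNormalizer: minutes per year (12 * 31 days),
    # per month (31 days), per day, per hour and per minute.
    WEIGHTS = (535680, 44640, 1440, 60, 1)

    def minute_normalizer(dt):
        """Map a datetime to a sortable integer with minute resolution."""
        parts = dt.utctimetuple()[:5]  # (year, month, day, hour, minute) in UTC
        return sum(p * w for p, w in zip(parts, WEIGHTS))

    # Assumption: a fixed +01:00 offset stands in for the pytz "CET" zone.
    CET = timezone(timedelta(hours=1), 'CET')
    UTC = timezone.utc

    midnight_utc = datetime(2010, 1, 1, 0, 0, tzinfo=UTC)
    one_am_cet = datetime(2010, 1, 1, 1, 0, tzinfo=CET)
    one_am_naive = datetime(2010, 1, 1, 1, 0)  # no timezone attached

    # The same instant expressed in two zones normalizes to the same integer.
    same_instant = minute_normalizer(midnight_utc) == minute_normalizer(one_am_cet)

    # A naive datetime is treated as if it were already UTC, so it ends up
    # one hour (60 minutes) away from the timezone-aware CET value.
    naive_offset = minute_normalizer(one_am_naive) - minute_normalizer(one_am_cet)

If the indexed objects store timezone-aware datetimes but the query parameters
are naive (or the other way around), a range query can miss or wrongly include
objects near the boundaries by exactly this kind of offset.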