Architecture
plone.pgcatalog replaces ZCatalog’s BTree-based indexes with SQL queries against
PostgreSQL.
Rather than maintaining thousands of BTree objects in ZODB, catalog data
lives in columns on the object_state table – the same table that zodb-pgjsonb
uses to store ZODB object pickles.
This means catalog writes are atomic with object
writes: there is no window where an object is committed but its catalog entry is stale.
This page explains how data flows through the system on writes and reads, how the pieces fit together, and the reasoning behind key design choices.
Key files
| File | Purpose |
|---|---|
| | Query translation: ZCatalog dict -> SQL WHERE + ORDER BY |
| | SQL write operations ( |
| | Thread-local pending store + |
| | Connection pool discovery + request-scoped connection reuse |
| | DDL for catalog columns, functions, and indexes |
| | GenericSetup install: snapshot, replace, restore indexes |
| | Addon compatibility adapters (eea.facetednavigation, DateRangeInRangeIndex) |
Overview
The high-level data flow looks like this:
flowchart LR
A[Plone object] -->|catalog_object / reindexObject| B[PlonePGCatalogTool]
B -->|extract index data| C[set_pending]
C --> D[Thread-local PendingStore]
D -->|tpc_vote| E[CatalogStateProcessor]
E -->|UPDATE object_state| F[(PostgreSQL)]
Plone content changes flow through the catalog tool, which extracts index values,
stashes them in a thread-local store, and marks the object dirty.
When ZODB commits,
the state processor picks up the pending data and writes it as extra columns on the
same row in object_state.
Write path
Step by step
1. catalog_object() or reindexObject() is called. This happens when Plone’s event subscribers fire on content creation, modification, or workflow transitions. The IndexQueue may batch these calls, but the annotation is set immediately (not deferred to before_commit).
2. Index data is extracted from the object. PlonePGCatalogTool._extract_idx() iterates the dynamic IndexRegistry (populated at startup from each Plone site’s ZCatalog indexes). For each index, it reads the configured source_attrs from the plone.indexer-wrapped object. Custom index types not in the registry are handled by IPGIndexTranslator named utilities.
3. Data is stored in the thread-local pending store. set_pending(zoid, data) writes the extracted index dict, path, and searchable text into a threading.local() dict keyed by ZOID. This is the critical design choice; see “Why thread-local?” below.
4. The object is marked _p_changed = True. This tells ZODB that the object needs to be serialized and stored during the next transaction commit. Without this, the state processor would never see the object.
5. During ZODB tpc_vote, CatalogStateProcessor.process() runs. The zodb-pgjsonb storage calls registered state processors for every object being stored. The catalog processor pops the pending data for the object’s ZOID and returns column values (path, idx JSONB, searchable_text tsvector, and any backend-specific columns like BM25 vectors).
6. zodb-pgjsonb writes catalog columns atomically. The ExtraColumn values returned by the processor are included in the same UPDATE object_state statement that writes the object’s pickle. One SQL statement, one transaction, zero consistency gaps.
sequenceDiagram
participant P as Plone
participant C as PlonePGCatalogTool
participant PS as PendingStore
participant SP as CatalogStateProcessor
participant PG as PostgreSQL
P->>C: catalog_object(obj)
C->>C: Extract index data
C->>PS: set_pending(zoid, data)
C->>P: obj._p_changed = True
Note over P,PG: ZODB transaction commit
P->>SP: process(zoid, state)
SP->>PS: pop_pending(zoid)
SP->>PG: UPDATE object_state SET path=, idx=, searchable_text=
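The write path can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plone.pgcatalog implementation: the names set_pending, pop_pending, and process come from the text, but their bodies, the `_MISSING` marker, and the shape of the returned dict are hypothetical.

```python
import threading

# Hypothetical stand-in for the thread-local pending store described above.
_local = threading.local()
_MISSING = object()  # assumed marker meaning "no pending entry for this ZOID"


def _store():
    """Return this thread's pending dict, creating it on first use."""
    if not hasattr(_local, "pending"):
        _local.pending = {}
    return _local.pending


def set_pending(zoid, data):
    """Stash extracted catalog data for an object until commit."""
    _store()[zoid] = data


def pop_pending(zoid):
    """Remove and return the pending entry so it is processed only once."""
    return _store().pop(zoid, _MISSING)


def process(zoid, state):
    """Assumed processor shape: return extra column values, or None."""
    pending = pop_pending(zoid)
    if pending is _MISSING:
        return None  # nothing was cataloged for this object
    return {
        "path": pending["path"],
        "idx": pending["idx"],
        "searchable_text": pending.get("searchable_text"),
    }


set_pending(42, {"path": "/plone/doc", "idx": {"portal_type": "Document"}})
columns = process(42, state={})
second_call = process(42, state={})  # entry was already popped
```

Because the entry is popped rather than read, a second call for the same ZOID finds nothing, which mirrors the "exactly once" behavior the write path relies on.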
Why thread-local instead of object annotations?
The original design stored pending catalog data as an annotation on the persistent
object itself (its __dict__).
This caused a real problem with CMFEditions: when
Plone creates a version snapshot, it clones the object’s state – including any
annotations.
The cloned annotation would then be processed during commit,
producing duplicate or incorrect catalog entries.
Thread-local storage avoids this entirely.
The pending data lives in
threading.local() and is keyed by ZOID.
Only the state processor reads it, and
it pops each entry exactly once during tpc_vote.
Cloned objects get a different
ZOID and have no pending entry.
The PendingDataManager joins the ZODB transaction to participate in savepoints
and cleanup: if the transaction is aborted, pending data is cleared.
If a
savepoint is rolled back, pending data reverts to its snapshot.
Partial reindex
When reindexObject(idxs=["review_state"]) is called with a specific list of
indexes, plone.pgcatalog avoids the overhead of full re-extraction.
Instead:
1. _partial_reindex() extracts only the requested index values.
2. set_partial_pending(zoid, idx_updates) registers a JSONB merge update.
3. The processor’s finalize() method executes these as UPDATE object_state SET idx = idx || %(patch)s::jsonb, a lightweight JSONB merge using the || operator.
4. The object is NOT marked _p_changed. No ZODB serialization happens.
This matters for frequent, targeted reindexes like workflow state changes, where re-serializing the entire object and re-extracting all 30+ indexes would be wasteful.
Special indexes with idx_key=None (SearchableText, effectiveRange, path) cannot be
partially updated because they use dedicated columns, not idx JSONB keys.
When any
requested index is special, _partial_reindex() returns False and the full write path
runs instead.
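As a rough illustration of the decision described here, the following sketch checks whether a requested index list qualifies for the partial path and mimics the top-level JSONB merge with a plain dict. The helper names `can_partial_reindex` and `jsonb_merge` are hypothetical; only the three special index names are taken from the text.

```python
# Index names that, per the text, use dedicated columns (idx_key=None).
SPECIAL_INDEXES = {"SearchableText", "effectiveRange", "path"}


def can_partial_reindex(idxs):
    """True when every requested index maps to a key inside the idx JSONB."""
    return bool(idxs) and not any(name in SPECIAL_INDEXES for name in idxs)


def jsonb_merge(idx, patch):
    """Mimic PostgreSQL's `idx || patch` for JSON objects: a top-level
    merge in which keys from the right-hand side win."""
    return {**idx, **patch}


stored = {"review_state": "private", "portal_type": "Document"}
merged = jsonb_merge(stored, {"review_state": "published"})
```

Only the patched key changes; every other key in the stored document is carried over untouched, which is why the merge is cheap compared with a full re-extraction.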
Interaction with full pending: If a full set_pending() already exists for a zoid
(for example, from a catalog_object call in the same transaction), the partial update merges
into the full pending’s idx dict.
Conversely, a subsequent set_pending() removes any
partial pending for the same zoid – full always supersedes partial.
Savepoint safety: set_partial_pending() uses non-mutating merges ({**old, **new})
because PendingSavepoint snapshots are shallow copies.
Mutating shared dicts would
corrupt rollback state.
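The hazard is easy to reproduce with plain dicts. The sketch below assumes only what the text states (snapshots are shallow copies of the pending mapping); the variable names are illustrative.

```python
import copy

pending = {7: {"review_state": "private"}}

# A savepoint snapshot as described: a shallow copy of the outer dict.
snapshot = copy.copy(pending)

# Non-mutating merge: build a new inner dict and rebind the key.
pending[7] = {**pending[7], **{"review_state": "published"}}
safe_snapshot_value = snapshot[7]["review_state"]

# Mutating merge, shown only as the failure mode: the inner dict is
# shared between the live store and the snapshot.
pending_bad = {7: {"review_state": "private"}}
snapshot_bad = copy.copy(pending_bad)
pending_bad[7].update({"review_state": "published"})
corrupted_snapshot_value = snapshot_bad[7]["review_state"]
```

In the first case the snapshot still holds the pre-savepoint value and a rollback restores it correctly; in the second, the snapshot has silently picked up the later update.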
Uncataloging
When an object is deleted, uncatalog_object() registers a None sentinel in the
pending store.
The state processor sees this sentinel and NULLs all catalog columns
(path, idx, searchable_text, and any backend-specific columns). The base
object_state row is preserved – ZODB still tracks the object’s lifecycle.
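A processor has to tell three cases apart: no pending entry, the None sentinel, and real catalog data. The sketch below shows one way to do that; the `_MISSING` marker, the `columns_for` helper, and the fixed column tuple are assumptions made for illustration (backend-specific columns are left out).

```python
_MISSING = object()  # hypothetical marker: no pending entry at all

CATALOG_COLUMNS = ("path", "idx", "searchable_text")


def columns_for(pending):
    """Map a pending-store value to the catalog column values to write."""
    if pending is _MISSING:
        return None  # object was not (un)cataloged in this transaction
    if pending is None:
        # Uncatalog sentinel: NULL every catalog column, keep the row.
        return {name: None for name in CATALOG_COLUMNS}
    return {name: pending.get(name) for name in CATALOG_COLUMNS}


untouched = columns_for(_MISSING)
uncataloged = columns_for(None)
cataloged = columns_for({"path": "/plone/doc", "idx": {"Title": "Doc"}})
```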
Pending-store lookup for security reindex
unrestrictedSearchResults extends PG results with objects from the thread-local
pending store when the query includes a path filter.
This is needed because
CMFCatalogAware.reindexObjectSecurity searches
catalog.unrestrictedSearchResults(path=path) to find all objects in a subtree and
reindex their allowedRolesAndUsers.
Newly created objects exist only in the pending
store (not yet committed to PG), so without this merging step security indexes would
never be updated for new objects during workflow transitions.
_pending_brains_for_path() scans the pending store, matches paths against the
query, and returns lightweight _PendingBrain instances with just enough interface
(getPath(), _unrestrictedGetObject()) for reindexObjectSecurity to work.
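The path matching can be pictured as a prefix scan over the pending entries. The following is a simplified sketch, not the actual _pending_brains_for_path(): the `PendingBrain` class here exposes only getPath(), and the subtree rule (exact match or `path/` prefix) is an assumption about how matching behaves.

```python
class PendingBrain:
    """Hypothetical minimal brain: just enough for path-based callers."""

    def __init__(self, path):
        self._path = path

    def getPath(self):
        return self._path


def pending_brains_for_path(pending, query_path):
    """Return brains for pending entries at or below query_path."""
    base = query_path.rstrip("/")
    prefix = base + "/"
    brains = []
    for data in pending.values():
        if not data:  # skip uncatalog sentinels (None)
            continue
        path = data.get("path")
        # Require a "/" boundary so /plone/folderish does not match /plone/folder.
        if path and (path == base or path.startswith(prefix)):
            brains.append(PendingBrain(path))
    return brains


pending = {
    1: {"path": "/plone/folder/new"},
    2: {"path": "/plone/folderish"},
    3: None,
    4: {"path": "/plone/folder"},
}
found = sorted(b.getPath() for b in pending_brains_for_path(pending, "/plone/folder"))
```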
Read path
Step by step
1. catalog.searchResults(query_dict) is called. This is the standard Plone catalog search entry point, triggered by collection views, search forms, listing tiles, and REST API endpoints.
2. Security filters are injected. apply_security_filters() adds allowedRolesAndUsers (the current user’s roles) and effectiveRange (the current timestamp for publication date filtering). These are added to the query dict before SQL translation, ensuring every search respects Plone’s security model.
3. build_query() translates the query dict to parameterized SQL. Each key in the query dict is resolved against the IndexRegistry to determine its type (FieldIndex, KeywordIndex, DateIndex, etc.) and the corresponding SQL handler. Unknown indexes fall back to IPGIndexTranslator utilities, then to simple JSONB field queries. The result is a WHERE clause, ORDER BY expression, LIMIT, OFFSET, and a params dict.
4. Each index key is routed to its handler. The _QueryBuilder dispatches to type-specific methods: _handle_field for FieldIndex/GopipIndex, _handle_keyword for KeywordIndex (using JSONB ?| overlap), _handle_date for DateIndex (using pgcatalog_to_timestamptz() expression indexes), _handle_text for ZCTextIndex (delegating to the active search backend), _handle_path for ExtendedPathIndex (with support for subtree, depth, and navtree queries), and so on.
5. SQL is executed via psycopg with prepared statement support. _run_search() assembles the final SELECT and executes it with prepare=True, which saves PG parse overhead when the same query shape is repeated within a connection.
6. Results are wrapped in CatalogSearchResults with PGCatalogBrain objects. Each brain is a lightweight wrapper around a PG row dict. It implements ICatalogBrain for Plone compatibility and supports attribute access into the idx JSONB for catalog metadata. Non-JSON-native metadata (such as Zope DateTime objects) is stored under idx["@meta"] via the Rust codec and decoded on first access with per-brain caching (see Database schema reference for the @meta structure).
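To make the translation step more concrete, here is a deliberately small sketch of turning a query dict into a parameterized WHERE clause. It is not build_query(): the function `build_where`, the string-valued registry, and the exact SQL fragments are assumptions, and only field and keyword lookups are covered.

```python
def build_where(query, registry):
    """Translate a small query dict into (where_sql, params).

    `registry` is a hypothetical name -> kind mapping ("field" or "keyword").
    """
    clauses, params = [], {}
    for n, (key, value) in enumerate(sorted(query.items())):
        if not key.isidentifier():
            # Keys are interpolated into SQL, so accept only plain names.
            raise ValueError(f"unsafe index name: {key!r}")
        name = f"p{n}"
        if registry.get(key, "field") == "keyword":
            # ?| is true when the JSONB array shares any element with the list.
            values = value if isinstance(value, (list, tuple)) else [value]
            clauses.append(f"idx->'{key}' ?| %({name})s")
            params[name] = list(values)
        else:
            clauses.append(f"idx->>'{key}' = %({name})s")
            params[name] = value
    return " AND ".join(clauses), params


registry = {"Subject": "keyword", "portal_type": "field"}
where, params = build_where(
    {"portal_type": "Document", "Subject": ["news", "events"]}, registry
)
```

Values never appear in the SQL text itself; they travel in the params dict, which is what allows the same query shape to be prepared once and reused.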
Lazy loading
The initial catalog query selects only zoid and path – not the full idx
JSONB column.
This is the biggest single performance optimization in the read path.
When brain attribute access first touches a metadata field (for example, brain.Title,
brain.portal_type), the brain delegates to its parent CatalogSearchResults, which
calls _load_idx_batch().
This issues a single SELECT zoid, idx FROM object_state WHERE zoid = ANY(%(zoids)s) for ALL brains in the result set, populating every
brain’s idx in one round-trip.
Why this matters:
- Many search results pages never access brain metadata at all (for example, count-only queries, batched listings where only the first page is rendered).
- When metadata IS accessed, a single batch query is far cheaper than selecting idx for every row in the initial query (JSONB decompression is expensive for wide rows).
- The batch load uses the same PG connection and thus the same REPEATABLE READ snapshot as the original search, guaranteeing consistency.
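The delegation from brain to result set can be sketched as follows. The classes `Results` and `Brain` and the injected `fetch_idx` callable are hypothetical simplifications; in the real code the fetch would be the SELECT shown above rather than a Python function.

```python
class Results:
    """Hypothetical result set that loads idx for all brains in one batch."""

    def __init__(self, zoids, fetch_idx):
        self._fetch_idx = fetch_idx  # callable: list of zoids -> {zoid: idx}
        self._loaded = False
        self.brains = [Brain(z, self) for z in zoids]

    def load_idx_batch(self):
        if self._loaded:
            return
        rows = self._fetch_idx([b.zoid for b in self.brains])
        for brain in self.brains:
            brain.idx = rows.get(brain.zoid, {})
        self._loaded = True


class Brain:
    def __init__(self, zoid, results):
        self.zoid = zoid
        self.idx = None  # not selected by the initial query
        self._results = results

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if self.idx is None:
            self._results.load_idx_batch()
        try:
            return self.idx[name]
        except KeyError:
            raise AttributeError(name) from None


calls = []


def fake_fetch(zoids):
    calls.append(list(zoids))
    return {z: {"Title": f"Doc {z}"} for z in zoids}


results = Results([1, 2, 3], fake_fetch)
calls_before_access = list(calls)
first_title = results.brains[0].Title
last_title = results.brains[2].Title
```

Touching metadata on any one brain triggers a single fetch covering the whole result set; later accesses on other brains are served from memory.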
Connection routing
plone.pgcatalog obtains its PostgreSQL connection through a deliberate preference order:
1. Storage connection (preferred). get_storage_connection() reads context._p_jar._storage.pg_connection, the same connection that the zodb-pgjsonb storage instance uses for ZODB object loads. Since this connection is inside a REPEATABLE READ transaction, catalog queries see exactly the same data snapshot as object traversals. No phantom reads, no inconsistencies.
2. Request-scoped pool connection (fallback). When no storage connection is available (for example, the catalog tool has not yet been traversed through a ZODB connection), get_request_connection() borrows a connection from the pool and stores it in threading.local(). Subsequent catalog queries within the same Zope request reuse this connection, avoiding pool lock overhead. The connection is returned by an IPubEnd subscriber when the request ends.
3. Pool borrow (last resort). For scripts, tests, and maintenance operations that run outside a Zope request, a connection is borrowed from the pool and returned in a context manager.
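The first two tiers of this preference order can be sketched with stand-in objects. Everything below is illustrative: `FakePool`, `pick_connection`, and `release_request_connection` are hypothetical names, the function signatures are assumptions, and connections are represented by plain strings.

```python
import threading
from types import SimpleNamespace

_request_local = threading.local()


class FakePool:
    """Stand-in pool for the sketch; counts borrow and return calls."""

    def __init__(self):
        self.borrowed = 0
        self.returned = 0

    def getconn(self):
        self.borrowed += 1
        return f"pool-conn-{self.borrowed}"

    def putconn(self, conn):
        self.returned += 1


def get_storage_connection(context):
    """Follow context._p_jar._storage.pg_connection, or return None."""
    jar = getattr(context, "_p_jar", None)
    storage = getattr(jar, "_storage", None)
    return getattr(storage, "pg_connection", None)


def get_request_connection(pool):
    """Borrow once per thread and reuse until released."""
    conn = getattr(_request_local, "conn", None)
    if conn is None:
        conn = _request_local.conn = pool.getconn()
    return conn


def release_request_connection(pool):
    """What an end-of-request hook would do in this sketch."""
    conn = getattr(_request_local, "conn", None)
    if conn is not None:
        pool.putconn(conn)
        _request_local.conn = None


def pick_connection(context, pool):
    return get_storage_connection(context) or get_request_connection(pool)


pool = FakePool()
storage = SimpleNamespace(pg_connection="storage-conn")
with_storage = SimpleNamespace(_p_jar=SimpleNamespace(_storage=storage))
without_storage = SimpleNamespace()

first = pick_connection(with_storage, pool)
second = pick_connection(without_storage, pool)
third = pick_connection(without_storage, pool)
release_request_connection(pool)
```

The second and third lookups share one borrowed connection, so the pool is touched once per request rather than once per query.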
Dynamic index registration
The IndexRegistry is the bridge between ZCatalog’s index definitions and
plone.pgcatalog’s SQL query builder.
Here is how it gets populated:
1. At Zope startup, the IDatabaseOpenedWithRoot subscriber fires.
2. _sync_registry_from_db() opens a temporary ZODB connection and traverses the root to find Plone sites.
3. For each portal_catalog, registry.sync_from_catalog(catalog) reads catalog._catalog.indexes and maps each index’s meta_type to an IndexType enum value. The getIndexSourceNames() method provides the attribute names to extract at indexing time.
4. DateRecurringIndex and DateRangeInRangeIndex instances are autodiscovered and registered as IPGIndexTranslator utilities.
5. For each TEXT-type index with a JSONB key (for example, Title, Description, addon ZCTextIndex fields), _ensure_text_indexes() creates GIN expression indexes using to_tsvector('simple', idx->>'{key}').
The registry is a module-level singleton.
Once populated, it is used by both the
write path (_extract_idx()) and the read path (build_query()).
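A stripped-down registry along these lines might look like the sketch below. The names IndexType, META_TYPE_MAP, register(), sync_from_catalog(), and getIndexSourceNames() appear in the text; the enum members, the stored tuple shape, and passing the index mapping directly (instead of a catalog object) are assumptions.

```python
from enum import Enum


class IndexType(Enum):
    # Member names are illustrative, not the project's actual enum.
    FIELD = "field"
    KEYWORD = "keyword"
    DATE = "date"
    TEXT = "text"


META_TYPE_MAP = {
    "FieldIndex": IndexType.FIELD,
    "KeywordIndex": IndexType.KEYWORD,
    "DateIndex": IndexType.DATE,
    "ZCTextIndex": IndexType.TEXT,
}


class IndexRegistry:
    def __init__(self):
        self._indexes = {}

    def register(self, name, index_type, source_attrs):
        self._indexes[name] = (index_type, tuple(source_attrs))

    def sync_from_catalog(self, indexes):
        """`indexes` maps name -> object with meta_type and getIndexSourceNames()."""
        for name, index in indexes.items():
            index_type = META_TYPE_MAP.get(index.meta_type)
            if index_type is None:
                continue  # left to translator utilities in the real system
            self.register(name, index_type, index.getIndexSourceNames())

    def get(self, name):
        return self._indexes.get(name)


class FakeIndex:
    """Test double standing in for a ZCatalog index object."""

    def __init__(self, meta_type, attrs):
        self.meta_type = meta_type
        self._attrs = attrs

    def getIndexSourceNames(self):
        return self._attrs


registry = IndexRegistry()
registry.sync_from_catalog({
    "Subject": FakeIndex("KeywordIndex", ["Subject"]),
    "custom": FakeIndex("SomeAddonIndex", ["custom"]),
})
```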
Runtime registration
Addons can register new indexes after startup, either by calling
catalog.addIndex("my_field", "FieldIndex") directly or, more commonly, via
GenericSetup profile import (catalog.xml).
When addIndex() is called:
1. The index object is created and stored in _catalog.indexes (a PersistentMapping), persisting it across restarts.
2. The index’s meta_type is looked up in META_TYPE_MAP. If found, the in-memory IndexRegistry singleton is updated immediately via registry.register().
Similarly, addColumn() calls registry.add_metadata() inline.
What works immediately (same transaction, no restart needed):
- Queries. The query builder dispatches on the registry, so queries using the new index name work right away for standard types (FieldIndex, KeywordIndex, DateIndex, BooleanIndex, UUIDIndex, GopipIndex).
- New writes. _extract_idx() iterates the registry, so any subsequent catalog_object() or reindexObject() extracts and stores the new index value in the idx JSONB.
What requires a Zope restart:
- GIN expression indexes for TEXT types. _ensure_text_indexes() only runs during the startup subscriber. A new ZCTextIndex added at runtime will not have a GIN index until the next restart. Queries still work (PG falls back to a sequential scan), just without GIN acceleration.
- IPGIndexTranslator utilities for DRI / DRIRI. _register_dri_translators() and _register_driri_translators() only run at startup. A DateRecurringIndex or DateRangeInRangeIndex added at runtime will fall through to the generic JSONB containment fallback, which does not handle range/recurrence semantics.
What requires a manual reindex:
- Existing objects. Objects already in the catalog do not have the new field in their idx JSONB. A clearFindAndRebuild() or refreshCatalog() is needed to backfill. This matches ZCatalog’s behavior: adding a new index never autopopulates it.
No DDL needed for standard indexes. Since all index data lives in the
single idx JSONB column, no ALTER TABLE is required – a key benefit of
the JSONB design.
Base class architecture
PlonePGCatalogTool inherits from UniqueObject + Folder – not from ZCatalog.
This deliberate “clean break” eliminates the deep inheritance chain
(CatalogTool -> ZCatalog -> ObjectManager -> ..., roughly 15 classes) and the
associated overhead from attribute lookups, security checks, and Acquisition
wrapping in the query and write hot paths.
Benchmarks show ~2x improvement in
query latency after the clean break.
Folder provides ObjectManager containment for ZCatalog index objects and
lexicons (needed by PGCatalogIndexes._getOb() and GenericSetup’s
ZCatalogXMLAdapter). UniqueObject provides the standard getId() method.
A _CatalogCompat(Persistent) shim provides _catalog.indexes and
_catalog.schema for backward compatibility with code that reads ZCatalog
internal data structures.
Existing ZODB instances with the old _catalog (a
full Catalog object from before the clean break) continue to work without
migration because the code only reads .indexes and .schema attributes.
ZCatalog compatibility layer
Plone add-ons and core code access ZCatalog internal data structures directly. Since plone.pgcatalog stores no BTree index data in ZODB, these are replaced with PG-backed implementations:
- PGCatalogIndexes replaces the Indexes container. When code accesses catalog.Indexes["UID"], it returns a PGIndex proxy instead of the raw ZCatalog index object.
- PGIndex wraps each ZCatalog index and overrides _index with a _PGIndexMapping that translates _index.get(value) into a PG query on the idx JSONB column. It also overrides uniqueValues() with a SELECT DISTINCT query.
- getpath(rid) / getrid(path) use ZOID as the record ID. ZCatalog assigns sequential integer record IDs; plone.pgcatalog uses the object’s ZODB OID (already an integer primary key in object_state), eliminating the need for a separate mapping table.
- Brain attribute resolution distinguishes known from unknown fields. Known catalog fields (registered indexes or metadata) are resolved first from the idx["@meta"] dict (for non-JSON-native types like DateTime), then from the top-level idx JSONB. Fields missing from both return None, matching ZCatalog’s Missing Value behavior. Unknown fields raise AttributeError, which triggers the getObject() fallback in CatalogContentListingObject.__getattr__().
- Blocked methods: ZCatalog methods that would return wrong/empty data (getAllBrains, searchAll, getobject, etc.) raise NotImplementedError.
- Deprecated proxy: search() emits DeprecationWarning and delegates to searchResults().
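The brain attribute resolution order described above can be expressed as a small function. This is a sketch under assumptions: `resolve_brain_attr` and the `known_fields` set are hypothetical, and the values under "@meta" are treated as already decoded rather than passed through the codec.

```python
def resolve_brain_attr(name, idx, known_fields):
    """Resolve a brain attribute: @meta first, then top-level idx, else None.

    Unknown names raise AttributeError so callers can fall back to the
    real object.
    """
    if name not in known_fields:
        raise AttributeError(name)
    meta = idx.get("@meta") or {}
    if name in meta:
        return meta[name]
    return idx.get(name)  # None when the known field has no stored value


idx = {"Title": "Hello", "@meta": {"modified": "decoded-datetime"}}
known = {"Title", "modified", "Description"}

from_meta = resolve_brain_attr("modified", idx, known)
from_idx = resolve_brain_attr("Title", idx, known)
missing_known = resolve_brain_attr("Description", idx, known)
```

Returning None for a known but empty field while raising for an unknown one is what lets listing code distinguish "no value" from "not a catalog field at all".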