Database schema reference¶
This page documents the PostgreSQL schema extensions, indexes, JSONB structure, and SQL functions installed by plone.pgcatalog.
object_state table extensions¶
plone.pgcatalog extends the object_state table (owned by zodb-pgjsonb)
with the following columns:
Column |
Type |
Purpose |
|---|---|---|
|
|
Physical object path (for example, |
|
|
Parent path (for path queries) |
|
|
Path depth (for depth-limited queries) |
|
|
Index and metadata values (lightweight after extraction) |
|
|
Weighted full-text search vector |
|
|
Non-JSON-native metadata ( |
|
|
Interface-based lookups (extracted from idx) |
|
|
Dedicated security filter column (allowedRolesAndUsers) |
|
|
BM25 fallback column (when BM25 active) |
|
|
Per-language BM25 column (when BM25 active) |
The first eight columns are present after schema DDL has been applied
(automatically on first startup with a registered CatalogStateProcessor).
parent_path and path_depth are derived from path (the parent is
the path with its last segment removed; the depth is the number of
segments). They are stored as separate columns because the most
frequent path queries—direct children (depth=1) and navigation
trees—become simple equality checks (parent_path = X or
parent_path = ANY(...)) instead of LIKE prefix scans combined
with depth filtering.
The search_bm25 and
per-language search_bm25_{lang} columns are created only when
VectorChord-BM25 extensions are detected at startup.
PostgreSQL indexes¶
Core indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree |
|
Path column lookup (brain construction) |
|
GIN |
|
JSONB containment/existence queries |
|
GIN |
|
Full-text search |
Path expression indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree |
|
Path equality queries |
|
B-tree ( |
|
Path LIKE prefix queries |
|
B-tree |
|
Parent path queries (depth=1) |
|
B-tree |
|
Depth-limited queries |
Date expression indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree |
|
Date sorting/filtering |
|
B-tree |
|
Date sorting/filtering |
|
B-tree |
|
Date sorting/filtering |
|
B-tree |
|
Date sorting/filtering |
Field expression indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree |
|
Title sorting |
|
B-tree |
|
Type filtering |
|
B-tree |
|
Workflow state filtering |
|
B-tree |
|
UUID lookup |
Composite indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree |
|
Folder listings, navigation |
|
B-tree |
|
Collections, search |
|
B-tree |
|
Navigation tree |
|
B-tree |
|
Workflow-filtered listings |
Keyword GIN indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
GIN |
|
Security filter (every query) |
|
GIN |
|
Interface-based lookups |
|
GIN |
|
Subject keyword queries |
Partial indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree (partial) |
Navigation fields where |
Navigation listings (only ~1.6% of rows) |
|
B-tree (partial) |
Event fields where |
Calendar/event queries |
Text expression indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
GIN |
|
Title word-level matching |
|
GIN |
|
Description word-level matching |
Additional GIN expression indexes are autocreated at startup for any
addon ZCTextIndex fields discovered in portal_catalog.
These follow
the naming pattern idx_os_cat_{key}_tsv and use the 'simple'
regconfig for word-level matching.
Custom catalog indexes (FieldIndex, DateIndex, BooleanIndex, UUIDIndex,
DateRecurringIndex) also get btree expression indexes automatically at
startup based on the IndexRegistry. Date fields use the
pgcatalog_to_timestamptz() wrapper.
BM25 indexes (optional)¶
When VectorChord-BM25 extensions are installed:
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
BM25 |
|
Fallback BM25 ranking |
|
BM25 |
|
Per-language BM25 ranking |
All indexes use WHERE idx IS NOT NULL or WHERE searchable_text IS NOT NULL
partial index predicates to exclude uncataloged rows.
idx JSONB structure¶
Standard Plone catalog indexes and JSON-native metadata are stored together in a single JSONB document. Heavy or high-cardinality keys are extracted into dedicated columns (see Extracted columns (ExtraIdxColumn) below). Example for a typical Plone Page:
{
"UID": "abc123-def456-789",
"Title": "My Document",
"Description": "A sample document",
"portal_type": "Document",
"review_state": "published",
"Creator": "admin",
"Subject": ["Python", "Plone"],
"created": "2025-01-15T10:00:00+00:00",
"modified": "2025-02-20T14:30:00+00:00",
"effective": "2025-01-15T10:00:00+00:00",
"expires": null,
"is_folderish": false,
"is_default_page": false,
"sortable_title": "my document",
"Language": "en",
"path": "/plone/my-document",
"path_parent": "/plone",
"path_depth": 2,
"getObjPositionInParent": 5
}
Key conventions:
Index values (used for PG queries) are converted to JSON-safe types: dates become ISO 8601 strings, etc. These live at the top level of idx.
Metadata values that are JSON-native (str, int, float, bool, None, and lists/dicts of these) also live at the top level of idx.
Multi-value fields (for example,
Subject) are stored as JSON arrays.Boolean fields are stored as JSON
true/false.nullvalues are stored explicitly where the object has no value for that field.Path fields (
path,path_parent,path_depth) are stored in both the dedicated table columns and the idx JSONB for unified path query support.Fields that are both indexes and metadata (for example
effective) appear in both places: top-level idx holds the converted ISO string (for PG queries), while themetacolumn holds the originalDateTime(for brain attribute access).
Extracted columns (ExtraIdxColumn)¶
To keep idx compact (below the PostgreSQL TOAST threshold of ~2 KB),
heavy or high-cardinality keys are extracted from the idx dict at
write time and stored in dedicated columns.
The extraction is managed by a
generic ExtraIdxColumn registry in columns.py.
idx key |
Column |
Type |
Reason |
|---|---|---|---|
|
|
|
Non-JSON-native metadata ( |
|
|
|
Interface-based lookups; native GIN on |
|
|
|
Security filter in every query; same optimization as |
Extracted keys are popped from the idx dict before it is written to
the idx column.
They do not appear in idx for newly written rows.
Pre-migration rows (written before this feature) may still contain these
keys in idx; the brain falls back to idx["@meta"] when the meta
column is NULL.
The meta column stores codec-encoded non-JSON-native metadata (Zope
DateTime, stdlib datetime, date, image_scales, etc.) via the Rust
codec (zodb-json-codec).
This preserves original Python types so that
brain.effective returns a DateTime object, not a string.
The meta dict uses codec type markers (for example @dt, @cls+@s) and
is decoded once per brain on first access (cached thereafter).
Note
Extracted keys are no longer accessible via brain.object_provides or
brain.allowedRolesAndUsers.
These are query-only index fields — not
catalog metadata — so brain attribute access is not needed.
If a custom addon registers one of these as metadata, the data is still in
the database (in its dedicated column) but would need a brain extension to
be surfaced.
Text Extraction Queue (Optional)¶
Created when PGCATALOG_TIKA_URL is set.
Provides a PostgreSQL-backed
job queue for asynchronous text extraction via Apache Tika.
text_extraction_queue Table¶
Column |
Type |
Default |
Purpose |
|---|---|---|---|
|
|
auto |
Primary key |
|
|
— |
Object zoid (references |
|
|
— |
Transaction ID (identifies the blob version) |
|
|
— |
MIME type (for example, |
|
|
|
Job status: |
|
|
|
Number of processing attempts |
|
|
|
Maximum retry attempts before marking as |
|
|
— |
Error message from the last failed attempt |
|
|
|
Job creation timestamp |
|
|
|
Last status change timestamp |
Constraints: UNIQUE(zoid, tid) prevents duplicate jobs for the same
object version.
Queue Indexes¶
Index Name |
Type |
Expression |
Purpose |
|---|---|---|---|
|
B-tree (partial) |
|
Fast dequeue of pending jobs |
Queue Trigger¶
A NOTIFY trigger fires on every INSERT, sending a
text_extraction_ready notification with the job ID.
This wakes the
extraction worker instantly without polling.
CREATE TRIGGER trg_notify_extraction
AFTER INSERT ON text_extraction_queue
FOR EACH ROW EXECUTE FUNCTION notify_extraction_ready();
See Tika text extraction architecture for the full architecture and Enable Tika text extraction for setup instructions.
SQL functions¶
See SQL functions reference for the full reference of
pgcatalog_to_timestamptz(), pgcatalog_lang_to_regconfig(),
pgcatalog_merge_extracted_text(), and rrule functions.
rrule_plpgsql schema¶
Installed automatically at startup. Provides a pure PL/pgSQL implementation of RFC 5545 RRULE expansion for DateRecurringIndex queries. No C extensions required.
The schema is created idempotently using CREATE SCHEMA IF NOT EXISTS
with exception handling for type definitions.
Functions are installed
from the vendored rrule_schema.sql file in the package.
See SQL functions reference for rrule."between"() and rrule."after"()
function signatures.
Schema installation¶
Schema is applied automatically at startup via the
CatalogStateProcessor.get_schema_sql() method, called by zodb-pgjsonb
when the state processor is registered.
The installation sequence is:
Catalog columns (ALTER TABLE ... ADD COLUMN IF NOT EXISTS)
2.
SQL functions (CREATE OR REPLACE FUNCTION)
3.
Catalog indexes (CREATE INDEX IF NOT EXISTS)
4. rrule schema (idempotent CREATE SCHEMA IF NOT EXISTS)
5.
BM25 extensions and columns (if detected)
6.
Text extraction queue and merge function (if PGCATALOG_TIKA_URL is set)
All DDL is idempotent and safe to re-execute on an existing database.
The install_catalog_schema() function in schema.py executes each
DDL block via conn.execute().