Full-text search deep dive¶
plone.pgcatalog supports four tiers of text search, from basic word matching to probabilistic relevance ranking. Each tier builds on the previous one, and the system automatically selects the best available tier at startup.
This page explains how PostgreSQL full-text search works under the hood, how plone.pgcatalog layers its search tiers on top, and how language-aware indexing and relevance ranking fit together.
How PostgreSQL full-text search works¶
PostgreSQL’s full-text search is built on two data types and a matching operator:
tsvector: A sorted list of normalized tokens (lexemes) with position information. Created from text viato_tsvector(), which applies language-specific rules: lowercasing, stop word removal, and stemming (reducing “running” to “run”).tsquery: A boolean expression of tokens, created from user input viaplainto_tsquery()(which inserts implicit AND between words) orto_tsquery()(which supports explicit boolean operators).@@: The match operator.tsvector @@ tsqueryreturns true if the document contains all required tokens.
A GIN index on a tsvector column makes the match operation fast: PostgreSQL maintains an inverted index mapping each token to the set of rows containing it. A search for “security policy” looks up both tokens in the GIN index and intersects the row sets – no table scan required.
Language configurations (called regconfig in PostgreSQL) control how text is
tokenized and stemmed.
The english configuration knows that “running,” “runs,” and
“ran” are forms of “run.” The simple configuration performs no stemming – it
matches exact word forms only.
Four tiers of text search¶
plone.pgcatalog implements a layered search architecture. Each tier adds capabilities without removing the previous tier’s functionality.
Tier |
Scope |
Mechanism |
Requires |
|---|---|---|---|
1 |
SearchableText |
Dedicated |
PostgreSQL (built-in) |
2 |
SearchableText |
BM25 probabilistic scoring via per-language columns |
VectorChord-BM25 extension |
3 |
Title / Description |
GIN expression indexes on |
PostgreSQL (built-in) |
4 |
Addon ZCTextIndex |
GIN expression indexes, |
PostgreSQL (built-in) |
Tiers 1 and 2 are alternatives for SearchableText – tier 2 upgrades tier 1 when the BM25 extension is available. Tiers 3 and 4 are independent and always active alongside whichever SearchableText tier is in use.
Tier 1: SearchableText (language-aware)¶
This is the primary search mechanism, always available on vanilla PostgreSQL.
Storage: A dedicated searchable_text column of type tsvector on the
object_state table.
This column stores a pre-computed, language-aware token list
with weight labels.
Indexing: At write time, the CatalogStateProcessor builds a weighted tsvector
by combining three sources:
Source |
Weight |
Purpose |
|---|---|---|
Title |
A (highest) |
Title matches should dominate results |
Description |
B |
Description matches are important but secondary |
SearchableText body |
D (lowest) |
Body text provides broad coverage |
The tsvector expression uses setweight() to assign different weights:
setweight(to_tsvector('simple', COALESCE(idx->>'Title', '')), 'A') ||
setweight(to_tsvector('simple', COALESCE(idx->>'Description', '')), 'B') ||
setweight(to_tsvector(
pgcatalog_lang_to_regconfig(idx->>'Language')::regconfig,
searchable_text
), 'D')
Title and Description use the simple configuration (no stemming) so their exact
word forms are preserved.
The body text uses the object’s language for stemming –
a German document’s body is stemmed with the German stemmer, while its title
retains exact words.
Querying: When SearchableText appears in the query dict without an explicit
sort_on, plone.pgcatalog automatically ranks results by relevance using
ts_rank_cd():
ts_rank_cd(
'{0.1, 0.2, 0.4, 1.0}'::float4[],
searchable_text,
plainto_tsquery(regconfig, search_text)
)
The weight array {0.1, 0.2, 0.4, 1.0} maps to weights D, C, B, A.
This means a
token match at weight A (Title) contributes 10x more to the score than a match at
weight D (body).
The ts_rank_cd function uses cover density ranking, which
additionally rewards term proximity – documents where the search terms appear close
together score higher.
Index: A GIN index on the searchable_text column provides fast boolean filtering.
The ranking computation runs only on the rows that pass the @@ filter.
Tier 2: SearchableText with BM25¶
When VectorChord-BM25 is installed, plone.pgcatalog upgrades from ts_rank_cd to
true BM25 scoring.
The tsvector infrastructure is kept for GIN-indexed boolean
pre-filtering; BM25 provides the ranking.
Storage: Per-language search_bm25_{lang} columns of type bm25vector. Each
column uses a language-specific tokenizer (stemmer and optional pre-tokenizer for
CJK scripts).
A fallback search_bm25 column handles unconfigured languages.
Indexing: The BM25Backend.process_search_data() method combines title (repeated
3x for field boosting), description, and body text into a single string, then stores
it in the appropriate language column via tokenize().
Querying: The query path uses a two-stage approach:
GIN pre-filter via searchable_text @@ plainto_tsquery(...) (same as Tier 1).
2.
BM25 ranking via search_bm25_{lang} <&> to_bm25query(index, tokenize(text, tokenizer)).
BM25 scoring considers three factors that ts_rank_cd ignores:
IDF (Inverse Document Frequency): Rare terms are weighted higher. In a corpus where “security” appears in 5,000 documents but “policy” appears in only 200, a match on “policy” contributes more to the score.
Term saturation: A document mentioning “security” 50 times does not score 50x higher than one mentioning it once. BM25’s k1 parameter controls the saturation curve – after a few occurrences, additional matches have diminishing returns.
Length normalization: Shorter documents are boosted relative to longer ones. A 200-word page titled “Security Policy” scores higher than a 10,000-word report that mentions security once, even though both match the query.
Tier 3: Title/Description (word-level)¶
Title and Description have their own search path, independent of SearchableText. This is used when Plone searches specifically for Title or Description content (for example, the “Title” field in a collection criterion).
Storage: Values are stored in the idx JSONB column under the Title and
Description keys.
Indexing: GIN expression indexes are created at schema setup time:
CREATE INDEX idx_os_cat_title_tsv ON object_state
USING gin (to_tsvector('simple'::regconfig, COALESCE(idx->>'Title', '')))
WHERE idx IS NOT NULL;
These use the simple configuration (no stemming, no stop words).
This is a
deliberate choice: Title and Description searches should match exact word forms.
Searching for “Run” should match documents titled “Run” but not “Running” – the
user is looking for a specific word.
SearchableText provides the stemmed search
when broader matching is desired.
Querying: The query builder generates:
to_tsvector('simple'::regconfig, COALESCE(idx->>'Title', ''))
@@ plainto_tsquery('simple'::regconfig, %(search_text)s)
This expression matches the GIN expression index, so PostgreSQL uses an index scan.
Tier 4: addon ZCTextIndex fields¶
Plone add-ons can register additional ZCTextIndex indexes (for example, for custom content types that need field-specific search). plone.pgcatalog autodiscovers these at startup.
Discovery: During _sync_registry_from_db(), the startup subscriber reads all
ZCatalog indexes.
Those with meta_type="ZCTextIndex" and a JSONB key (not
SearchableText) are registered in the IndexRegistry as TEXT-type indexes.
Index creation: _ensure_text_indexes() creates GIN expression indexes for each
discovered field using the simple configuration, following the same pattern as
Title and Description.
Querying: The query builder handles addon text indexes identically to
Title/Description – a to_tsvector('simple', idx->>'{key}') @@ plainto_tsquery()
clause that matches the GIN expression index.
Language support¶
Built-in PostgreSQL configurations¶
PostgreSQL ships with text search configurations for 30 languages. plone.pgcatalog
maps Plone’s ISO 639-1 language codes to PostgreSQL configuration names via the
pgcatalog_lang_to_regconfig() SQL function and its Python mirror
language_to_regconfig():
Languages |
PostgreSQL configs |
|---|---|
Western European |
danish, dutch, english, finnish, french, german, italian, norwegian, portuguese, spanish, swedish |
Eastern European |
estonian, hungarian, lithuanian, romanian, russian, serbian, turkish |
Other |
arabic, armenian, basque, catalan, greek, hindi, indonesian, irish, nepali, tamil, yiddish |
Languages not in the mapping (including all CJK languages at the tsvector level)
fall back to simple – no stemming, basic whitespace tokenization.
CJK support via pg_tokenizer¶
When VectorChord-BM25 is installed, CJK languages get proper word segmentation through pg_tokenizer:
Chinese: Jieba segmenter
Japanese: Lindera segmenter
Korean: Lindera segmenter
These are configured as per-language BM25 columns with dedicated tokenizers.
The
tsvector tier still uses simple for CJK (since PostgreSQL’s built-in tokenizers
do not support CJK segmentation), but the BM25 tier provides proper word-level
matching.
Per-object language selection¶
Each Plone content object has a Language field (typically set by
plone.app.multilingual). plone.pgcatalog reads this field during indexing and
applies the corresponding language configuration:
At tsvector write time:
to_tsvector(pgcatalog_lang_to_regconfig(Language), text)At BM25 write time: routes to the correct
search_bm25_{lang}columnAt query time:
plainto_tsquery(pgcatalog_lang_to_regconfig(Language), search_text)
If the Language field in the query differs from the object’s language, the stemmer
mismatch may produce suboptimal results.
This is inherent to language-specific
stemming – searching for German words with an English stemmer produces poor matches.
The BM25 fallback column (no stemmer, basic tokenization) provides a safety net for
cross-language searches.
Relevance ranking¶
Tsvector ranking (ts_rank_cd)¶
ts_rank_cd implements cover density ranking: it considers both the frequency of
matching terms and their proximity to each other within the document.
The weight multiplier array {0.1, 0.2, 0.4, 1.0} assigns scores:
Weight |
Score multiplier |
Assigned to |
|---|---|---|
D |
0.1 |
Body text (SearchableText) |
C |
0.2 |
(unused) |
B |
0.4 |
Description |
A |
1.0 |
Title |
A match in the Title contributes 10x the score of the same match in the body. Combined with cover density (proximity bonus), this produces good relevance ordering for most searches without any extensions.
Relevance ranking is autoapplied when SearchableText is queried without an
explicit sort_on.
If sort_on is set (for example, sort_on="modified"), the explicit
sort takes priority and relevance ranking is not applied.
BM25 ranking (optional)¶
When the BM25 backend is active, the ranking expression changes from ts_rank_cd
to the <&> operator, which computes a BM25 score.
Lower scores indicate higher
relevance (the operator returns a distance metric).
BM25’s parameters are controlled by the VectorChord-BM25 extension:
k1 (term saturation): Controls how quickly additional term occurrences saturate. Default is 1.2 – after 3-4 occurrences, additional matches contribute very little.
b (length normalization): Controls how much shorter documents are boosted. Default is 0.75 – a document half the average length gets a meaningful boost.
Field-level boosting in BM25 is achieved by repeating the title text 3x in the combined input string. This is crude but effective: BM25 sees the title terms as more frequent relative to the document length, which increases their contribution to the score. Future versions may adopt VectorChord-BM25’s field boosting API if one is added.