Full-text search deep dive

plone.pgcatalog supports four tiers of text search, from basic word matching to probabilistic relevance ranking. Each tier builds on the previous one, and the system automatically selects the best available tier at startup.

This page explains how PostgreSQL full-text search works under the hood, how plone.pgcatalog layers its search tiers on top, and how language-aware indexing and relevance ranking fit together.

How PostgreSQL full-text search works

PostgreSQL’s full-text search is built on two data types and a matching operator:

  • tsvector: A sorted list of normalized tokens (lexemes) with position information. Created from text via to_tsvector(), which applies language-specific rules: lowercasing, stop word removal, and stemming (reducing “running” to “run”).

  • tsquery: A boolean expression of tokens, created from user input via plainto_tsquery() (which inserts implicit AND between words) or to_tsquery() (which supports explicit boolean operators).

  • @@: The match operator. tsvector @@ tsquery returns true if the document contains all required tokens.

A GIN index on a tsvector column makes the match operation fast: PostgreSQL maintains an inverted index mapping each token to the set of rows containing it. A search for “security policy” looks up both tokens in the GIN index and intersects the row sets – no table scan required.
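The inverted-index idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not how PostgreSQL implements GIN: the tokenizer just lowercases and splits on whitespace (no stemming or stop-word removal), and the names tokenize, build_inverted_index, and search_all are made up for the illustration.

```python
from collections import defaultdict


def tokenize(text):
    # Toy normalization: lowercase and split on whitespace only.
    return text.lower().split()


def build_inverted_index(docs):
    # Map each token to the set of row ids whose text contains it.
    index = defaultdict(set)
    for row_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(row_id)
    return index


def search_all(index, query):
    # AND semantics: intersect the row sets of every query token.
    tokens = tokenize(query)
    if not tokens:
        return set()
    row_sets = [index.get(token, set()) for token in tokens]
    return set.intersection(*row_sets)


docs = {
    1: "Security policy for editors",
    2: "Password policy",
    3: "Security audit log",
}
index = build_inverted_index(docs)
matches = search_all(index, "security policy")
print(sorted(matches))
```

Only row 1 contains both tokens, so the intersection of the two row sets is the whole answer and no document text is rescanned at query time.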

Language configurations (values of the regconfig type in PostgreSQL) control how text is tokenized and stemmed. The english configuration reduces regular inflections such as “running” and “runs” to “run”; because its stemmer is rule-based, irregular forms such as “ran” are left unchanged. The simple configuration performs no stemming – it lowercases tokens and matches exact word forms only.

Language support

Built-in PostgreSQL configurations

PostgreSQL ships with text search configurations for roughly 30 languages (the exact set depends on the server version). plone.pgcatalog maps Plone’s ISO 639-1 language codes to PostgreSQL configuration names via the pgcatalog_lang_to_regconfig() SQL function and its Python mirror language_to_regconfig():

Languages          PostgreSQL configs
Western European   danish, dutch, english, finnish, french, german, italian, norwegian, portuguese, spanish, swedish
Eastern European   estonian, hungarian, lithuanian, romanian, russian, serbian, turkish
Other              arabic, armenian, basque, catalan, greek, hindi, indonesian, irish, nepali, tamil, yiddish

Languages not in the mapping (including all CJK languages at the tsvector level) fall back to simple – no stemming, basic whitespace tokenization.
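The lookup-with-fallback behaviour can be sketched in a few lines of Python. This is not the package's language_to_regconfig(); the table below is a small hypothetical subset, the name regconfig_for is invented for the example, and the handling of regional variants such as "en-GB" is an assumption made here, not something this page documents.

```python
# Hypothetical subset of an ISO 639-1 -> regconfig table; the real mapping
# lives in plone.pgcatalog and may differ.
LANG_TO_REGCONFIG = {
    "da": "danish",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "it": "italian",
    "nl": "dutch",
    "ru": "russian",
}


def regconfig_for(language):
    # Missing or empty language: fall back to the stemmer-free config.
    if not language:
        return "simple"
    # Assumption: regional variants ("pt-br", "en_GB") are reduced to
    # their base code before the lookup.
    base = language.strip().lower().replace("_", "-").split("-")[0]
    # Unmapped codes (for example "zh") also fall back to "simple".
    return LANG_TO_REGCONFIG.get(base, "simple")


for code in ("de", "en-GB", "zh", None):
    print(code, "->", regconfig_for(code))
```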

CJK support via pg_tokenizer

When VectorChord-BM25 is installed, CJK languages get proper word segmentation through pg_tokenizer:

  • Chinese: Jieba segmenter

  • Japanese: Lindera segmenter

  • Korean: Lindera segmenter

These are configured as per-language BM25 columns with dedicated tokenizers. The tsvector tier still uses simple for CJK (since PostgreSQL’s built-in tokenizers do not support CJK segmentation), but the BM25 tier provides proper word-level matching.

Per-object language selection

Each Plone content object has a Language field (typically set by plone.app.multilingual). plone.pgcatalog reads this field during indexing and applies the corresponding language configuration:

  • At tsvector write time: to_tsvector(pgcatalog_lang_to_regconfig(Language), text)

  • At BM25 write time: routes to the correct search_bm25_{lang} column

  • At query time: plainto_tsquery(pgcatalog_lang_to_regconfig(Language), search_text)

If the Language field in the query differs from the object’s language, the stemmer mismatch may produce suboptimal results. This is inherent to language-specific stemming – searching for German words with an English stemmer produces poor matches. The BM25 fallback column (no stemmer, basic tokenization) provides a safety net for cross-language searches.
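The BM25 routing step can be pictured as a small lookup. The sketch below is hypothetical throughout: the set of languages with dedicated columns, the fallback column name search_bm25, and the helper bm25_column_for are all assumptions for illustration. Only the search_bm25_{lang} naming pattern comes from the text above.

```python
# Assumption: languages that have a dedicated BM25 column in this example.
BM25_LANGUAGES = frozenset({"en", "de", "zh"})

# Assumption: placeholder name for the fallback column; the real name is
# not given on this page.
FALLBACK_COLUMN = "search_bm25"


def bm25_column_for(language):
    # Normalize the Language value to a bare lowercase base code.
    base = (language or "").strip().lower().replace("_", "-").split("-")[0]
    if base in BM25_LANGUAGES:
        # Dedicated column with a language-specific tokenizer.
        return f"search_bm25_{base}"
    # Unknown or missing language: use the tokenizer-neutral fallback.
    return FALLBACK_COLUMN


for lang in ("de", "zh", "sv", ""):
    print(repr(lang), "->", bm25_column_for(lang))
```

Keeping the routing in one function means the write path and the query path agree on the column, which is the same consistency requirement the stemmer-mismatch paragraph describes for tsvector.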

Relevance ranking

Tsvector ranking (ts_rank_cd)

ts_rank_cd implements cover density ranking: it considers both the frequency of matching terms and their proximity to each other within the document.

The weight array {0.1, 0.2, 0.4, 1.0} (PostgreSQL’s default, listed in the order D, C, B, A) sets a score multiplier for each weight label:

Weight   Score multiplier   Assigned to
D        0.1                Body text (SearchableText)
C        0.2                (unused)
B        0.4                Description
A        1.0                Title

A match in the Title contributes 10x the score of the same match in the body. Combined with cover density (proximity bonus), this produces good relevance ordering for most searches without any extensions.
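The 10x figure follows directly from the multipliers. The toy calculation below only sums occurrences times weight; it deliberately ignores proximity and normalization, so it is not what ts_rank_cd computes, and toy_weighted_score is a name invented for this illustration.

```python
# Weight labels and multipliers from the weight table above.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# Which field carries which weight label.
FIELD_LABEL = {"title": "A", "description": "B", "body": "D"}


def toy_weighted_score(hits_per_field):
    # Sum of (occurrences x weight). This ignores cover density and any
    # normalization, so it only illustrates the relative field weights.
    return sum(
        WEIGHTS[FIELD_LABEL[field]] * count
        for field, count in hits_per_field.items()
    )


title_only = toy_weighted_score({"title": 1})
body_only = toy_weighted_score({"body": 1})
print(title_only / body_only)
```

Under this simplification a single title hit is worth as much as ten body hits, which is why a document that mentions a term only in passing in its body rarely outranks one with the term in its title.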

Relevance ranking is applied automatically when SearchableText is queried without an explicit sort_on. If sort_on is set (for example, sort_on="modified"), the explicit sort takes priority and relevance ranking is not applied.
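The precedence rule can be written as a small decision function. This is a sketch of the rule as stated here, not the package's query builder; choose_ordering and its return values are invented for the example.

```python
def choose_ordering(query):
    # An explicit sort_on always wins over relevance ranking.
    sort_on = query.get("sort_on")
    if sort_on:
        return ("index", sort_on)
    # No explicit sort: rank by relevance only for full-text queries.
    if query.get("SearchableText"):
        return ("relevance", None)
    # Neither a sort nor a text search: no particular ordering.
    return ("unsorted", None)


print(choose_ordering({"SearchableText": "security policy"}))
print(choose_ordering({"SearchableText": "security policy", "sort_on": "modified"}))
print(choose_ordering({"portal_type": "Document"}))
```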

BM25 ranking (optional)

When the BM25 backend is active, the ranking expression changes from ts_rank_cd to the <&> operator, which computes a BM25 score. The operator returns the negated score, so it behaves like a distance: lower (more negative) values indicate higher relevance, and results are sorted in ascending order.

BM25’s parameters are controlled by the VectorChord-BM25 extension:

  • k1 (term saturation): Controls how quickly additional term occurrences saturate. Default is 1.2 – after 3-4 occurrences, additional matches contribute very little.

  • b (length normalization): Controls how much shorter documents are boosted. Default is 0.75 – a document half the average length gets a meaningful boost.
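The effect of both parameters is easiest to see in the textbook BM25 term formula. The sketch below uses that standard form; whether VectorChord-BM25 computes exactly this variant is an assumption, and bm25_term_score is a name chosen for the example.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.2, b=0.75):
    # Length normalization: equals 1.0 for an average-length document,
    # drops below 1.0 for shorter ones and rises above it for longer ones.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term saturation: the fraction approaches (k1 + 1) as tf grows,
    # so each extra occurrence adds less than the previous one.
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)


# Saturation: same document length, increasing term frequency.
for tf in range(1, 7):
    print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=100), 3))

# Length normalization: same term frequency, half the average length.
short_doc = bm25_term_score(2, doc_len=50, avg_doc_len=100)
average_doc = bm25_term_score(2, doc_len=100, avg_doc_len=100)
print(short_doc > average_doc)
```

With idf fixed at 1.0 and an average-length document, the score starts at 1.0 for one occurrence and can never exceed k1 + 1 = 2.2, which is the saturation the k1 bullet describes.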

Field-level boosting in BM25 is achieved by repeating the title text 3x in the combined input string. This is crude but effective: BM25 sees the title terms as more frequent relative to the document length, which increases their contribution to the score. Future versions may adopt VectorChord-BM25’s field boosting API if one is added.
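The repetition trick amounts to a string concatenation. In this minimal sketch the function name build_bm25_input and the title_repeat parameter are hypothetical, and the real combined input may include further fields.

```python
def build_bm25_input(title, body, title_repeat=3):
    # Repeat the title so its terms occur more often in the combined text.
    parts = [title] * title_repeat + [body]
    # Skip empty parts to avoid stray whitespace in the result.
    return " ".join(part for part in parts if part)


combined = build_bm25_input("Security policy", "All editors must rotate passwords.")
print(combined)
```

Because the repeated title also lengthens the document slightly, the boost is smaller than a literal factor of three once term saturation and length normalization are applied.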