Configure multilingual full-text search
How language detection works
plone.pgcatalog reads the Language field from each object during indexing.
This field is set by plone.app.multilingual for multilingual sites, or can be set manually on content types.
The language code is mapped via pgcatalog_lang_to_regconfig() (a PL/pgSQL function) to select the correct PostgreSQL text search configuration (stemmer) at both index time and query time.
Supported languages
32 languages are supported out of the box:
| Language | ISO Code | PG Config | BM25 Tokenizer |
|---|---|---|---|
| Arabic | | | Snowball |
| Armenian | | | Snowball |
| Basque | | | Snowball |
| Catalan | | | Snowball |
| Danish | | | Snowball |
| Dutch | | | Snowball |
| English | | | Snowball (Porter2) |
| Estonian | | | Snowball |
| Finnish | | | Snowball |
| French | | | Snowball |
| German | | | Snowball |
| Greek | | | Snowball |
| Hindi | | | Snowball |
| Hungarian | | | Snowball |
| Indonesian | | | Snowball |
| Irish | | | Snowball |
| Italian | | | Snowball |
| Lithuanian | | | Snowball |
| Nepali | | | Snowball |
| Norwegian | | | Snowball |
| Portuguese | | | Snowball |
| Romanian | | | Snowball |
| Russian | | | Snowball |
| Serbian | | | Snowball |
| Spanish | | | Snowball |
| Swedish | | | Snowball |
| Tamil | | | Snowball |
| Turkish | | | Snowball |
| Yiddish | | | Snowball |
| Chinese | | | jieba |
| Japanese | | | lindera |
| Korean | | | lindera |
Objects with an unmapped or empty Language field fall back to the 'simple' config (no stemming, no stop words).
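The lookup-with-fallback behaviour can be sketched in Python. This is only an illustration of the logic, not the actual pgcatalog_lang_to_regconfig() function, which is written in PL/pgSQL. The dictionary holds a small illustrative subset, and the names LANG_TO_REGCONFIG and lang_to_regconfig are hypothetical.

```python
from typing import Optional

# Illustrative subset only (assumption): the real mapping lives in the
# pgcatalog_lang_to_regconfig() PL/pgSQL function and covers more languages.
LANG_TO_REGCONFIG = {
    "en": "english",
    "de": "german",
    "fr": "french",
}

# Configuration used when no language-specific one applies.
FALLBACK_REGCONFIG = "simple"


def lang_to_regconfig(language: Optional[str]) -> str:
    """Return a text search configuration name for a language code."""
    if not language:
        # Empty or missing Language field: no stemming, no stop words.
        return FALLBACK_REGCONFIG
    # Unmapped codes also fall back to 'simple'.
    return LANG_TO_REGCONFIG.get(language, FALLBACK_REGCONFIG)
```

With this sketch, lang_to_regconfig("de") selects the German configuration, while an empty string, None, or an unknown code all resolve to 'simple'.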
Tsvector (default, no configuration)
Language-aware stemming works automatically for the searchable_text column.
No configuration is needed beyond installing plone.app.multilingual.
The tsvector is built with per-object language detection:
- Title and Description use 'simple' config (no stemming, weight A/B)
- SearchableText body uses the object’s Language field mapped to the appropriate regconfig (weight D)
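To make the weighting concrete, the following sketch composes a SQL expression of that shape using PostgreSQL's standard setweight() and to_tsvector() functions. It is an assumption about how such an expression could look, not the expression plone.pgcatalog actually generates. The function name build_tsvector_sql and the parameter names are hypothetical, and the sketch assumes Title gets weight A and Description gets weight B.

```python
def build_tsvector_sql() -> str:
    """Return a weighted tsvector expression with named placeholders.

    The placeholders follow the %(name)s style used by psycopg;
    'regconfig' is the per-object configuration derived from Language.
    """
    parts = [
        # Title: no stemming, highest weight (assumed to be A).
        "setweight(to_tsvector('simple'::regconfig, "
        "coalesce(%(title)s, '')), 'A')",
        # Description: no stemming, assumed to be weight B.
        "setweight(to_tsvector('simple'::regconfig, "
        "coalesce(%(description)s, '')), 'B')",
        # Body: stemmed with the object's language configuration, weight D.
        "setweight(to_tsvector(%(regconfig)s::regconfig, "
        "coalesce(%(body)s, '')), 'D')",
    ]
    # The || operator concatenates tsvectors in PostgreSQL.
    return " || ".join(parts)
```

Keeping Title and Description on 'simple' means exact words in those fields match regardless of the object's language, while the body benefits from stemming.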
BM25 per-language columns
To enable per-language BM25 ranking:
# Explicit list
export PGCATALOG_BM25_LANGUAGES=en,de,fr,zh
# Auto-detect from site languages
export PGCATALOG_BM25_LANGUAGES=auto
This requires the VectorChord-BM25 extensions. See Enable BM25 ranking.
Each configured language gets a dedicated search_bm25_{lang} column.
At query time, the search language determines which column is used for BM25 scoring.
The fallback search_bm25 column handles unconfigured languages and cross-language search.
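The column selection described above can be sketched as follows. This is a simplified illustration rather than plone.pgcatalog's own code: the function names and the site_languages parameter are hypothetical, and treating an unset or empty value as "no per-language columns" is an assumption.

```python
from typing import List, Sequence


def configured_bm25_languages(
    value: str, site_languages: Sequence[str] = ()
) -> List[str]:
    """Parse a PGCATALOG_BM25_LANGUAGES value into a list of codes."""
    value = value.strip()
    if not value:
        # Assumption: unset or empty means no per-language columns.
        return []
    if value.lower() == "auto":
        # 'auto' takes the languages from the site; here the caller
        # passes them in, since site detection is out of scope.
        return list(site_languages)
    # Explicit comma-separated list, ignoring blank entries.
    return [code.strip() for code in value.split(",") if code.strip()]


def bm25_column(search_language: str, configured: Sequence[str]) -> str:
    """Pick the BM25 column used for scoring a query."""
    if search_language and search_language in configured:
        return f"search_bm25_{search_language}"
    # Unconfigured languages and cross-language search use the fallback.
    return "search_bm25"


configured = configured_bm25_languages("en,de,fr,zh")
german_column = bm25_column("de", configured)
fallback_column = bm25_column("pt", configured)
```

In a real process the value would typically be read with os.environ.get("PGCATALOG_BM25_LANGUAGES", "") before being parsed.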
CJK languages
Chinese, Japanese, and Korean use specialized segmenters (jieba/lindera) instead of Snowball stemmers.
These are provided by pg_tokenizer and work automatically with BM25Backend.
For tsvector (without BM25), CJK languages fall back to 'simple' config, which provides basic whitespace tokenization.
For better CJK search quality, enable BM25.
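A short sketch shows why whitespace tokenization is a weak fit for text written without spaces between words. Python's str.split() stands in here for the whitespace-based behaviour described above. It is not PostgreSQL's parser, and the sample strings are illustrative only.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Split text on runs of whitespace, as a stand-in for 'simple'."""
    return text.split()


# English words are separated by spaces, so each word becomes a token.
english = whitespace_tokens("full text search")

# This Chinese phrase has no spaces, so it stays a single token and
# a query for one word inside it would not match.
chinese = whitespace_tokens("全文搜索引擎")
```

A segmenter such as jieba or lindera splits such text into individual words, which is what makes BM25 the better choice for CJK content.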