Configure multilingual full-text search

How language detection works

plone.pgcatalog reads the Language field from each object during indexing. This field is set by plone.app.multilingual for multilingual sites, or can be set manually on content types.

The language code is mapped via pgcatalog_lang_to_regconfig() (a PL/pgSQL function) to select the correct PostgreSQL text search configuration (stemmer) at both index time and query time.

Supported languages

The following languages are supported out of the box (29 with Snowball stemmers, plus Chinese, Japanese, and Korean):

| Language   | ISO Code | PG Config  | BM25 Tokenizer     |
|------------|----------|------------|--------------------|
| Arabic     | ar       | arabic     | Snowball           |
| Armenian   | hy       | armenian   | Snowball           |
| Basque     | eu       | basque     | Snowball           |
| Catalan    | ca       | catalan    | Snowball           |
| Danish     | da       | danish     | Snowball           |
| Dutch      | nl       | dutch      | Snowball           |
| English    | en       | english    | Snowball (Porter2) |
| Estonian   | et       | estonian   | Snowball           |
| Finnish    | fi       | finnish    | Snowball           |
| French     | fr       | french     | Snowball           |
| German     | de       | german     | Snowball           |
| Greek      | el       | greek      | Snowball           |
| Hindi      | hi       | hindi      | Snowball           |
| Hungarian  | hu       | hungarian  | Snowball           |
| Indonesian | id       | indonesian | Snowball           |
| Irish      | ga       | irish      | Snowball           |
| Italian    | it       | italian    | Snowball           |
| Lithuanian | lt       | lithuanian | Snowball           |
| Nepali     | ne       | nepali     | Snowball           |
| Norwegian  | nb/nn/no | norwegian  | Snowball           |
| Portuguese | pt       | portuguese | Snowball           |
| Romanian   | ro       | romanian   | Snowball           |
| Russian    | ru       | russian    | Snowball           |
| Serbian    | sr       | serbian    | Snowball           |
| Spanish    | es       | spanish    | Snowball           |
| Swedish    | sv       | swedish    | Snowball           |
| Tamil      | ta       | tamil      | Snowball           |
| Turkish    | tr       | turkish    | Snowball           |
| Yiddish    | yi       | yiddish    | Snowball           |
| Chinese    | zh       | simple     | jieba              |
| Japanese   | ja       | simple     | lindera            |
| Korean     | ko       | simple     | lindera            |

Objects whose Language value is unmapped or empty fall back to the 'simple' config (no stemming, no stop words).
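The lookup and its fallback can be mirrored in a few lines of Python. This is only an illustrative sketch: the real mapping lives in the pgcatalog_lang_to_regconfig() PL/pgSQL function, the dictionary below is a hand-copied subset of the table, and the helper name lang_to_regconfig is hypothetical. How region subtags such as en-US are handled is not covered here.

```python
# Hand-copied subset of the language table; the authoritative mapping is the
# pgcatalog_lang_to_regconfig() PL/pgSQL function, not this dictionary.
LANG_TO_REGCONFIG = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "nb": "norwegian",
    "nn": "norwegian",
    "no": "norwegian",
    # The table lists the 'simple' config for the CJK languages.
    "zh": "simple",
    "ja": "simple",
    "ko": "simple",
}


def lang_to_regconfig(language):
    """Return the text search config name for a language code (hypothetical helper)."""
    code = (language or "").strip().lower()
    # Unmapped or empty codes fall back to 'simple'.
    return LANG_TO_REGCONFIG.get(code, "simple")
```

With this sketch, lang_to_regconfig("nb") and lang_to_regconfig("no") both resolve to the norwegian config, while an empty string or an unknown code resolves to simple.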

Tsvector (default, no configuration)

Language-aware stemming works automatically for the searchable_text column. No configuration is needed beyond installing plone.app.multilingual. The tsvector is built with per-object language detection:

  • Title and Description use 'simple' config (no stemming, weight A/B)

  • SearchableText body uses the object’s Language field mapped to the appropriate regconfig (weight D)
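To make the weighting concrete, the following sketch builds a SQL expression of the kind described above, using PostgreSQL's standard setweight() and to_tsvector() functions. It is not the SQL that plone.pgcatalog actually runs. The function name, the psycopg-style named placeholders, and the assumption that Title gets weight A and Description weight B are all illustrative.

```python
def searchable_text_expression():
    """Return a SQL expression for a weighted, language-aware tsvector.

    Hypothetical sketch: placeholders use the %(name)s style, and the
    Title=A / Description=B split is an assumption about "weight A/B".
    """
    return (
        # Title and Description: 'simple' config, so no stemming.
        "setweight(to_tsvector('simple'::regconfig, %(title)s::text), 'A') || "
        "setweight(to_tsvector('simple'::regconfig, %(description)s::text), 'B') || "
        # Body: config chosen per object from its Language field.
        "setweight(to_tsvector(%(regconfig)s::regconfig, %(body)s::text), 'D')"
    )


# Example parameters for a German document.
params = {
    "title": "Willkommen",
    "description": "Eine kurze Beschreibung",
    "body": "Die Häuser stehen am Fluss",
    "regconfig": "german",
}
```

Keeping Title and Description unstemmed means exact title words match regardless of language, while the body benefits from the language-specific stemmer.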

BM25 per-language columns

To enable per-language BM25 ranking, set PGCATALOG_BM25_LANGUAGES in one of two ways:

# Explicit list
export PGCATALOG_BM25_LANGUAGES=en,de,fr,zh

# Auto-detect from site languages
export PGCATALOG_BM25_LANGUAGES=auto

This requires the VectorChord-BM25 extensions; see Enable BM25 ranking.

Each configured language gets a dedicated search_bm25_{lang} column. At query time, the search language determines which column is used for BM25 scoring. The fallback search_bm25 column handles unconfigured languages and cross-language search.
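The column selection described above can be sketched as follows. Both helper names are hypothetical, and the way auto resolves the site languages is reduced to a plain argument here; the actual parsing and lookup inside plone.pgcatalog may differ.

```python
def parse_bm25_languages(value, site_languages=()):
    """Parse a PGCATALOG_BM25_LANGUAGES-style value into a list of codes.

    Hypothetical helper: 'auto' simply returns the site languages passed in.
    """
    value = (value or "").strip().lower()
    if value == "auto":
        return [code.strip().lower() for code in site_languages]
    return [part.strip() for part in value.split(",") if part.strip()]


def bm25_column(search_language, configured):
    """Pick the BM25 column for a query language (hypothetical helper)."""
    code = (search_language or "").strip().lower()
    if code in configured:
        return f"search_bm25_{code}"
    # Unconfigured languages and cross-language search use the fallback column.
    return "search_bm25"


configured = parse_bm25_languages("en,de,fr,zh")
print(bm25_column("de", configured))
print(bm25_column("it", configured))
```

With the explicit list from the example above, a German query is scored against search_bm25_de, while an Italian query uses the fallback search_bm25 column because it is not configured.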

CJK languages

Chinese, Japanese, and Korean use specialized segmenters (jieba/lindera) instead of Snowball stemmers. These are provided by pg_tokenizer and work automatically with BM25Backend.

For tsvector (without BM25), CJK languages fall back to 'simple' config, which provides basic whitespace tokenization. For better CJK search quality, enable BM25.
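A rough illustration of why whitespace tokenization is a poor fit for CJK text is shown below. Python's str.split() is only a stand-in for the 'simple' config here, not an exact model of the PostgreSQL parser.

```python
english = "full text search engine"
japanese = "全文検索エンジン"  # roughly the same phrase, written without spaces

# Whitespace splitting yields one token per English word...
english_tokens = english.split()
# ...but the whole Japanese phrase stays a single token.
japanese_tokens = japanese.split()

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```

Because the Japanese phrase remains one token, a query for only part of it (for example 検索) would not match on a token basis. A segmenter such as jieba or lindera splits such text into words before indexing.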


2026, BlueDynamics Alliance
