Tutorial: set up multilingual search

What you will build

A Plone site with English, German, and Chinese content, where full-text search matches and ranks results using language-specific stemming. German “Vulkan” will match “Vulkane,” English “running” will match “run,” and Chinese text is segmented into words once the optional BM25 backend is enabled.

By the end of this tutorial you will understand how plone.pgcatalog uses per-object language information to choose the right stemmer, and how to verify that stemming is working correctly.

Prerequisites

  • A working plone.pgcatalog installation

  • plone.app.multilingual installed and configured

  • Content in multiple languages

Tip

The Quickstart: run plone.pgcatalog in 5 minutes tutorial sets up a multilingual site with ~800 example articles in three languages. Follow that first if you do not have a multilingual site ready.

Step 1: verify language configuration

In Plone’s Site Setup > Language, ensure your desired languages are enabled. plone.pgcatalog reads the Language field from each content object at index time.

You can verify the configured languages from a zconsole shell:

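from plone.i18n.interfaces import ILanguageSchema
from plone.registry.interfaces import IRegistry
from zope.component import getUtility
from zope.component.hooks import setSite

# Activate the site so that local utilities such as the registry resolve.
# `app` is the root object provided by the zconsole shell; the site id
# "Plone" is an assumption taken from the /Plone/ paths in this tutorial.
setSite(app["Plone"])

registry = getUtility(IRegistry)
settings = registry.forInterface(ILanguageSchema, prefix="plone")
print(f"Available: {settings.available_languages}")
print(f"Default:   {settings.default_language}")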

Every content object carries a Language field. When plone.pgcatalog indexes an object, it uses this field to select the PostgreSQL text search configuration for stemming.

Step 2: understand how stemming works

plone.pgcatalog maps each object’s Language field to a PostgreSQL text search configuration via the pgcatalog_lang_to_regconfig() SQL function. Here are some common mappings:

Language  ISO Code  PG Configuration  What It Does
--------  --------  ----------------  -------------------------------------------------
English   en        english           “running” -> “run,” removes “the,” “is”
German    de        german            “Vulkane” -> “vulkan,” removes “der,” “die”
French    fr        french            “coureurs” -> “coureur,” removes “le,” “la”
Spanish   es        spanish           “corriendo” -> “corr,” removes “el,” “la”
Chinese   zh        simple            Basic tokenization (BM25 adds jieba segmentation)

PostgreSQL ships with built-in stemmers for about 30 languages. The simple configuration splits text into tokens and lowercases them, with no stemming and no stop-word removal. It is used as a fallback for languages without a dedicated stemmer.

This means a German search for “Vulkan” will find articles containing “Vulkane,” “Vulkans,” or “vulkanisch” – the stemmer reduces them all to the same root form.
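The lookup described in this step can be pictured as a dictionary with a fallback. The sketch below is only an illustration in Python: the real mapping lives in the pgcatalog_lang_to_regconfig() SQL function, and the names LANG_TO_REGCONFIG, FALLBACK, and lang_to_regconfig are hypothetical. It covers only the languages listed in the table, and the handling of regional variants such as "de-at" is an assumption.

```python
# Illustrative mirror of the mapping table above. The real lookup is done
# in SQL by pgcatalog_lang_to_regconfig(); all names here are hypothetical.
LANG_TO_REGCONFIG = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "es": "spanish",
}

FALLBACK = "simple"  # no stemming; used when no dedicated stemmer exists


def lang_to_regconfig(language):
    """Return the text search configuration name for an ISO language code."""
    # Assumption: regional variants such as "de-at" share the base stemmer.
    base = (language or "").lower().split("-")[0]
    return LANG_TO_REGCONFIG.get(base, FALLBACK)


print(lang_to_regconfig("de"))  # german
print(lang_to_regconfig("zh"))  # simple
```

Chinese falls through to simple here, which matches the table: without BM25 there is no dedicated segmenter for it.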

Step 3: create test content

If you followed the Quickstart: run plone.pgcatalog in 5 minutes tutorial, you already have multilingual content. Otherwise, create a few test documents to see stemming in action.

In each language folder, create a Document with content that includes different word forms:

  • English (/Plone/en/): Create a Document titled “Volcanic Activity” with body text mentioning “volcano,” “volcanoes,” “volcanic,” and “volcanism.”

  • German (/Plone/de/): Create a Document titled “Vulkanische Aktivitaet” with body text mentioning “Vulkan,” “Vulkane,” “Vulkans,” and “vulkanisch.”

Publish both documents so they appear in search results.

Step 4: test language-aware search

Via the REST API

# English: "volcano" also matches its plural "volcanoes" (both stem to "volcano")
curl -s "http://localhost:8081/Plone/@search?SearchableText=volcano&sort_limit=5" \
  -H "Accept: application/json" -u admin:admin | python -m json.tool

# German: "Vulkan" matches "Vulkane", "Vulkans", "vulkanisch"
curl -s "http://localhost:8081/Plone/de/@search?SearchableText=Vulkan&sort_limit=5" \
  -H "Accept: application/json" -u admin:admin | python -m json.tool

When searching within a language folder (for example, /Plone/de/), Plone automatically restricts results to that path. The Language index is also available as an explicit query parameter.
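To pass the Language index as an explicit query parameter rather than relying on the folder path, the request URL can be assembled as follows. This is a minimal sketch using only the standard library; build_search_url is a hypothetical helper, and the host and port are the ones used in the curl examples above.

```python
from urllib.parse import urlencode


def build_search_url(site_url, text, language=None, limit=5):
    """Build an @search URL, optionally filtered by the Language index."""
    params = {"SearchableText": text, "sort_limit": limit}
    if language:
        params["Language"] = language
    return f"{site_url.rstrip('/')}/@search?{urlencode(params)}"


# Search the whole site, but return German content only.
print(build_search_url("http://localhost:8081/Plone", "Vulkan", language="de"))
```

The resulting URL can be passed to curl with the same Accept header and credentials as in the examples above.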

Via Python

from plone import api

catalog = api.portal.get_tool("portal_catalog")

# Search across all languages
results = catalog(SearchableText="volcano")
print(f"All languages: {len(results)} results")

# Search only German content
results = catalog(SearchableText="Vulkan", Language="de")
print(f"German only: {len(results)} results")

# Verify stemming -- singular and plural should return the same results
singular = catalog(SearchableText="volcano", Language="en")
plural = catalog(SearchableText="volcanoes", Language="en")
print(f"'volcano': {len(singular)}, 'volcanoes': {len(plural)}")

Both “volcano” and “volcanoes” should return the same result set because the English stemmer reduces both to the same root.
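Equal result counts do not prove that the two queries matched the same objects. A stricter check compares the paths of the hits. The helpers below are a sketch: hit_paths and same_hits are hypothetical names, and the code assumes only that the catalog is callable with keyword arguments and returns brains offering getPath().

```python
def hit_paths(catalog, **query):
    """Return the set of paths matched by a catalog query."""
    return {brain.getPath() for brain in catalog(**query)}


def same_hits(catalog, term_a, term_b, **filters):
    """Return True when two search terms match exactly the same objects."""
    paths_a = hit_paths(catalog, SearchableText=term_a, **filters)
    paths_b = hit_paths(catalog, SearchableText=term_b, **filters)
    return paths_a == paths_b
```

With the catalog object from the example above, same_hits(catalog, "volcano", "volcanoes", Language="en") should return True when stemming is active.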

Step 5: enable BM25 for better CJK search (optional)

PostgreSQL’s simple text search configuration provides basic whitespace tokenization for Chinese, Japanese, and Korean. This works for queries where the user types exact character sequences, but it cannot segment continuous text into words.

For proper word segmentation, enable BM25 by setting an environment variable in your zope.conf:

<environment>
    PGCATALOG_BM25_LANGUAGES en,de,zh
</environment>

This tells plone.pgcatalog to create per-language BM25 columns with specialized tokenizers:

Language  BM25 Column     Tokenizer     Segmenter
--------  --------------  ------------  -----------------------
English   search_bm25_en  pgcatalog_en  Porter2 stemmer
German    search_bm25_de  pgcatalog_de  German Snowball stemmer
Chinese   search_bm25_zh  pgcatalog_zh  jieba segmentation
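The column and tokenizer names in the table follow a visible pattern: search_bm25_ plus the language code, and pgcatalog_ plus the language code. The sketch below derives them from the PGCATALOG_BM25_LANGUAGES value. The pattern is inferred from the table only, and bm25_layout is a hypothetical helper, so this is not necessarily how plone.pgcatalog builds the names internally.

```python
def bm25_layout(languages):
    """Derive BM25 column and tokenizer names from a comma-separated list."""
    layout = {}
    for code in languages.split(","):
        code = code.strip().lower()
        if code:  # ignore empty entries such as a trailing comma
            layout[code] = {
                "column": f"search_bm25_{code}",
                "tokenizer": f"pgcatalog_{code}",
            }
    return layout


# Same value as in the zope.conf example above.
for code, names in bm25_layout("en,de,zh").items():
    print(code, names["column"], names["tokenizer"])
```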

After changing the configuration, restart Zope and rebuild the catalog:

import transaction
from plone import api

catalog = api.portal.get_tool("portal_catalog")
catalog.clearFindAndRebuild()
transaction.commit()

Note

BM25 requires the tensorchord/vchord-suite:pg17-latest Docker image (or equivalent PostgreSQL installation with pg_tokenizer and vchord_bm25 extensions). plone.pgcatalog autodetects these extensions at startup and falls back to tsvector ranking when they are not available.
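The fallback described in the note can be sketched as a check against the list of installed extensions. The extension names come from the note, and pg_extension is the PostgreSQL system catalog that lists installed extensions. The decision logic and the names REQUIRED_EXTENSIONS, LIST_EXTENSIONS_SQL, and choose_backend are assumptions for illustration and may differ from what plone.pgcatalog actually does at startup.

```python
# Extension names are taken from the note above. The detection logic is an
# assumption and may differ from plone.pgcatalog's real startup check.
REQUIRED_EXTENSIONS = {"pg_tokenizer", "vchord_bm25"}

# pg_extension is the PostgreSQL system catalog of installed extensions.
LIST_EXTENSIONS_SQL = "SELECT extname FROM pg_extension"


def choose_backend(installed_extensions):
    """Return "bm25" if every required extension is present, else "tsvector"."""
    if REQUIRED_EXTENSIONS <= set(installed_extensions):
        return "bm25"
    return "tsvector"


print(choose_backend(["plpgsql", "pg_tokenizer", "vchord_bm25"]))  # bm25
print(choose_backend(["plpgsql"]))  # tsvector
```

Running LIST_EXTENSIONS_SQL in psql shows which of the two extensions are installed in your database.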

Step 6: verify with SQL

Connect to PostgreSQL and inspect the indexed data directly.

psql -h localhost -p 5433 -U zodb -d zodb

-- Check language distribution
SELECT idx->>'Language' AS lang, COUNT(*)
FROM object_state
WHERE idx IS NOT NULL AND idx->>'portal_type' = 'Document'
GROUP BY idx->>'Language'
ORDER BY COUNT(*) DESC;

-- German stemming in action -- all forms of "Vulkan" match
SELECT path, idx->>'Title' AS title
FROM object_state, plainto_tsquery('german', 'Vulkan') q
WHERE searchable_text @@ q
LIMIT 5;

-- English stemming -- "volcano" and "volcanoes" produce the same query
SELECT plainto_tsquery('english', 'volcano');
SELECT plainto_tsquery('english', 'volcanoes');
-- Both return 'volcano' (the stemmed form)

-- Compare with 'simple' (no stemming)
SELECT plainto_tsquery('simple', 'volcano');
SELECT plainto_tsquery('simple', 'volcanoes');
-- Returns 'volcano' and 'volcanoes' respectively (no stemming)
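The same comparison can be scripted from Python. The sketch below works with any DB-API cursor that uses %s placeholders, as psycopg does; stemmed_queries is a hypothetical helper and is not part of plone.pgcatalog.

```python
def stemmed_queries(cursor, config, words):
    """Map each word to the tsquery text PostgreSQL builds for it."""
    result = {}
    for word in words:
        # %s placeholders assume a psycopg-style DB-API driver.
        cursor.execute(
            "SELECT plainto_tsquery(%s::regconfig, %s)::text",
            (config, word),
        )
        result[word] = cursor.fetchone()[0]
    return result
```

For example, with a cursor opened on the database used in the psql command above, stemmed_queries(cursor, "english", ["volcano", "volcanoes"]) should map both words to the same query text, while the "simple" configuration should keep them distinct.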

What you learned

  • plone.pgcatalog uses each object’s Language field to select a PostgreSQL text search configuration at index time

  • About 30 languages are supported out of the box via PostgreSQL’s built-in stemmers

  • Stemming reduces inflected forms to a common root, so searches match regardless of grammatical form

  • CJK languages benefit from BM25 with specialized word segmenters (jieba for Chinese)

  • Language filtering works via the standard Language query parameter or by searching within a language folder path

Next steps

  • Quickstart: run plone.pgcatalog in 5 minutes to try a full multilingual setup with example content

  • Tutorial: migrate from ZCatalog to plone.pgcatalog to migrate an existing site

2026, BlueDynamics Alliance