Tutorial: set up multilingual search
What you will build
A Plone site with English, German, and Chinese content, where search results are ranked using language-specific stemming. German “Vulkan” will match “Vulkane,” English “running” will match “run,” and Chinese text is properly segmented.
By the end of this tutorial you will understand how plone.pgcatalog uses per-object language information to choose the right stemmer, and how to verify that stemming is working correctly.
Prerequisites
A working plone.pgcatalog installation
plone.app.multilingual installed and configured
Content in multiple languages
Tip
The Quickstart: run plone.pgcatalog in 5 minutes tutorial sets up a multilingual site with ~800 example articles in three languages. Follow that first if you do not have a multilingual site ready.
Step 1: verify language configuration
In Plone’s Site Setup > Language, ensure your desired languages are
enabled. plone.pgcatalog reads the Language field from each content object
at index time.
You can verify the configured languages from a zconsole shell:
from plone.i18n.interfaces import ILanguageSchema
from plone.registry.interfaces import IRegistry
from zope.component import getUtility
registry = getUtility(IRegistry)
settings = registry.forInterface(ILanguageSchema, prefix="plone")
print(f"Available: {settings.available_languages}")
print(f"Default: {settings.default_language}")
Every content object carries a Language field.
When plone.pgcatalog indexes
an object, it uses this field to select the PostgreSQL text search
configuration for stemming.
Step 2: understand how stemming works
plone.pgcatalog maps each object’s Language field to a PostgreSQL text search
configuration via the pgcatalog_lang_to_regconfig() SQL function.
Here are
some common mappings:
| Language | ISO Code | PG Configuration | What It Does |
|---|---|---|---|
| English | en | english | “running” -> “run,” removes “the,” “is” |
| German | de | german | “Vulkane” -> “vulkan,” removes “der,” “die” |
| French | fr | french | “coureurs” -> “coureur,” removes “le,” “la” |
| Spanish | es | spanish | “corriendo” -> “corr,” removes “el,” “la” |
| Chinese | zh | simple | Basic tokenization (BM25 adds jieba segmentation) |
PostgreSQL ships with built-in support for about 30 languages.
The simple
configuration performs basic whitespace tokenization without stemming – used as
a fallback for languages without a dedicated stemmer.
This means a German search for “Vulkan” will find articles containing “Vulkane,” “Vulkans,” or “vulkanisch” – the stemmer reduces them all to the same root form.
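The mapping idea can be sketched in plain Python. This is an illustration only: the real logic lives in the pgcatalog_lang_to_regconfig() SQL function, which this tutorial does not show. The name lang_to_regconfig, the handling of regional variants, and the dictionary contents (limited to the languages discussed above) are assumptions.

```python
# Hypothetical illustration of a language-code to regconfig mapping.
# The real mapping is done by the pgcatalog_lang_to_regconfig() SQL function.
LANG_TO_REGCONFIG = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "es": "spanish",
}


def lang_to_regconfig(language):
    """Return a PostgreSQL text search configuration name for a language code."""
    if not language:
        # No language set: fall back to the non-stemming configuration.
        return "simple"
    # Assumption: regional variants such as "de-at" or "en_US" share the base stemmer.
    base = language.lower().replace("_", "-").split("-")[0]
    return LANG_TO_REGCONFIG.get(base, "simple")


print(lang_to_regconfig("de"))     # german
print(lang_to_regconfig("en-US"))  # english
print(lang_to_regconfig("zh"))     # simple
```

Languages without an entry, Chinese included, fall through to simple, which matches the fallback behavior described above.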
Step 3: create test content
If you followed the Quickstart: run plone.pgcatalog in 5 minutes tutorial, you already have multilingual content. Otherwise, create a few test documents to see stemming in action.
In each language folder, create a Document with content that includes different word forms:
English (/Plone/en/): Create a Document titled “Volcanic Activity” with body text mentioning “volcano,” “volcanoes,” “volcanic,” and “volcanism.”
German (/Plone/de/): Create a Document titled “Vulkanische Aktivitaet” with body text mentioning “Vulkan,” “Vulkane,” “Vulkans,” and “vulkanisch.”
Publish both documents so they appear in search results.
Step 4: test language-aware search
Via the REST API
# English: "volcano" also matches the plural "volcanoes"
curl -s "http://localhost:8081/Plone/@search?SearchableText=volcano&sort_limit=5" \
-H "Accept: application/json" -u admin:admin | python -m json.tool
# German: "Vulkan" matches "Vulkane", "Vulkans", "vulkanisch"
curl -s "http://localhost:8081/Plone/de/@search?SearchableText=Vulkan&sort_limit=5" \
-H "Accept: application/json" -u admin:admin | python -m json.tool
When searching within a language folder (for example, /Plone/de/), Plone
automatically restricts results to that path.
The Language index is also
available as an explicit query parameter.
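As a small sketch of the explicit Language parameter, the following helper builds an @search URL with the standard library. The helper name search_url and its defaults are hypothetical; the base URL is the one used in the curl examples above and may differ on your installation.

```python
from urllib.parse import urlencode

# Assumption: the site runs at the same address used in the curl examples.
BASE_URL = "http://localhost:8081/Plone"


def search_url(text, language=None, limit=5, base_url=BASE_URL):
    """Build an @search URL, optionally filtered by the Language index."""
    params = {"SearchableText": text, "sort_limit": limit}
    if language:
        params["Language"] = language
    # urlencode takes care of escaping non-ASCII search terms.
    return f"{base_url}/@search?{urlencode(params)}"


print(search_url("Vulkan", language="de"))
# http://localhost:8081/Plone/@search?SearchableText=Vulkan&sort_limit=5&Language=de
```

The resulting URL can be passed to curl in place of the folder-scoped URL when you want to filter by language from the site root.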
Via Python
from plone import api
catalog = api.portal.get_tool("portal_catalog")
# Search across all languages
results = catalog(SearchableText="volcano")
print(f"All languages: {len(results)} results")
# Search only German content
results = catalog(SearchableText="Vulkan", Language="de")
print(f"German only: {len(results)} results")
# Verify stemming -- singular and plural should return the same results
singular = catalog(SearchableText="volcano", Language="en")
plural = catalog(SearchableText="volcanoes", Language="en")
print(f"'volcano': {len(singular)}, 'volcanoes': {len(plural)}")
Both “volcano” and “volcanoes” should return the same result set because the English stemmer reduces both to the same root.
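Equal counts do not prove that both searches found the same objects. A stricter check compares the paths of the returned brains. The helper names below are hypothetical; getPath() is the standard method on catalog brains.

```python
def result_paths(brains):
    """Collect the physical paths of catalog brains into a set."""
    return {brain.getPath() for brain in brains}


def same_result_set(first, second):
    """True when both searches returned exactly the same objects."""
    return result_paths(first) == result_paths(second)
```

With the singular and plural results from the snippet above, same_result_set(singular, plural) should be True when stemming is working as described.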
Step 5: enable BM25 for better CJK search (optional)
PostgreSQL’s simple text search configuration provides basic whitespace
tokenization for Chinese, Japanese, and Korean.
This works for queries where
the user types exact character sequences, but it cannot segment continuous text
into words.
For proper word segmentation, enable BM25 by setting an environment variable
in your zope.conf:
<environment>
PGCATALOG_BM25_LANGUAGES en,de,zh
</environment>
This tells plone.pgcatalog to create per-language BM25 columns with specialized tokenizers:
| Language | BM25 Column | Tokenizer | Segmenter |
|---|---|---|---|
| English | | | Porter2 stemmer |
| German | | | German Snowball stemmer |
| Chinese | | | jieba segmentation |
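To double-check which languages a given setting would enable, the comma-separated value can be parsed as in the sketch below. How plone.pgcatalog itself treats whitespace, letter case, or duplicates is not covered in this tutorial, so the normalization here is an assumption, and the helper name bm25_languages is hypothetical.

```python
import os


def bm25_languages(environ=os.environ):
    """Parse PGCATALOG_BM25_LANGUAGES into a list of lowercase language codes."""
    raw = environ.get("PGCATALOG_BM25_LANGUAGES", "")
    # Assumption: split on commas, drop blanks, keep the first occurrence of each code.
    seen = []
    for code in raw.split(","):
        code = code.strip().lower()
        if code and code not in seen:
            seen.append(code)
    return seen


print(bm25_languages({"PGCATALOG_BM25_LANGUAGES": "en,de,zh"}))  # ['en', 'de', 'zh']
```

Called without arguments, the function reads the current process environment and returns an empty list when the variable is unset.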
After changing the configuration, restart Zope and rebuild the catalog:
import transaction
from plone import api
catalog = api.portal.get_tool("portal_catalog")
catalog.clearFindAndRebuild()
transaction.commit()
Note
BM25 requires the tensorchord/vchord-suite:pg17-latest Docker image (or
equivalent PostgreSQL installation with pg_tokenizer and vchord_bm25
extensions). plone.pgcatalog autodetects these extensions at startup and
falls back to tsvector ranking when they are not available.
Step 6: verify with SQL
Connect to PostgreSQL and inspect the indexed data directly.
psql -h localhost -p 5433 -U zodb -d zodb
-- Check language distribution
SELECT idx->>'Language' AS lang, COUNT(*)
FROM object_state
WHERE idx IS NOT NULL AND idx->>'portal_type' = 'Document'
GROUP BY idx->>'Language'
ORDER BY COUNT(*) DESC;
-- German stemming in action -- all forms of "Vulkan" match
SELECT path, idx->>'Title' AS title
FROM object_state, plainto_tsquery('german', 'Vulkan') q
WHERE searchable_text @@ q
LIMIT 5;
-- English stemming -- "volcano" and "volcanoes" produce the same query
SELECT plainto_tsquery('english', 'volcano');
SELECT plainto_tsquery('english', 'volcanoes');
-- Both return 'volcano' (the stemmed form)
-- Compare with 'simple' (no stemming)
SELECT plainto_tsquery('simple', 'volcano');
SELECT plainto_tsquery('simple', 'volcanoes');
-- Returns 'volcano' and 'volcanoes' respectively (no stemming)
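The same comparison can be scripted from Python. The sketch below works with any DB-API cursor that uses the %s parameter style; the helper names are hypothetical, and the choice of driver is left to you.

```python
def tsquery_for(cursor, config, text):
    """Return the text form of plainto_tsquery(config, text) via a DB-API cursor."""
    cursor.execute(
        "SELECT plainto_tsquery(%s::regconfig, %s)::text",
        (config, text),
    )
    return cursor.fetchone()[0]


def stems_match(cursor, config, first, second):
    """True when both words normalize to the same tsquery under config."""
    return tsquery_for(cursor, config, first) == tsquery_for(cursor, config, second)
```

Assuming psycopg is installed, a cursor can be obtained with psycopg.connect("host=localhost port=5433 dbname=zodb user=zodb").cursor(), using the connection settings from the psql command above. stems_match(cursor, "english", "volcano", "volcanoes") should then be True, and the same call with "simple" should be False.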
What you learned
plone.pgcatalog uses each object’s Language field to select a PostgreSQL text search configuration at index time
About 30 languages are supported out of the box via PostgreSQL’s built-in stemmers
Stemming reduces inflected forms to a common root, so searches match regardless of grammatical form
CJK languages benefit from BM25 with specialized word segmenters (jieba for Chinese)
Language filtering works via the standard Language query parameter or by searching within a language folder path
Next steps
Quickstart: run plone.pgcatalog in 5 minutes to try a full multilingual setup with example content
Tutorial: migrate from ZCatalog to plone.pgcatalog to migrate an existing site