Tika text extraction architecture

The problem

Plone’s default search indexes text from text fields: Title, Description, and the rich-text (HTML) body of a Page or News Item. This works because these fields hold plain text, or HTML that Plone can convert to plain text directly.

Binary files (PDFs, Word documents, spreadsheets, images) contain text locked inside proprietary or compressed formats, which Plone cannot extract natively. Without extraction, an uploaded PDF titled “Q4 Financial Report” is findable by its title, but the 50 pages of content inside it are invisible to search.

Elasticsearch solves this with its ingest attachment processor, which is built on Apache Tika. plone.pgcatalog brings the same capability to PostgreSQL.

Design decisions

Why Apache Tika?

Tika extracts text from over 1400 file formats via a single stateless HTTP API. It handles PDFs (including scanned ones via Tesseract OCR), Office documents, OpenDocument formats, images, and more. It is the same technology Elasticsearch uses internally.
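As a rough illustration of that HTTP API, the sketch below builds the kind of PUT /tika request the worker sends. It uses the standard library's urllib as a stand-in for the worker's actual HTTP client, and the helper names (build_tika_request, extract_text) and the example URL are assumptions made for this example, not part of plone.pgcatalog.

```python
import urllib.request


def build_tika_request(tika_url: str, blob: bytes, content_type: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request for a Tika server's /tika endpoint."""
    return urllib.request.Request(
        url=tika_url.rstrip("/") + "/tika",
        data=blob,
        method="PUT",
        headers={
            # Tells Tika what kind of document is in the request body.
            "Content-Type": content_type,
            # Asks Tika to answer with plain text rather than XHTML.
            "Accept": "text/plain",
        },
    )


def extract_text(tika_url: str, blob: bytes, content_type: str, timeout: float = 60.0) -> str:
    """Send the blob to Tika and return the extracted text (needs a running server)."""
    request = build_tika_request(tika_url, blob, content_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Because the request carries everything Tika needs, any worker can talk to any Tika instance without shared state.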

Why PostgreSQL as the job queue?

Redis or RabbitMQ would add operational complexity. Since plone.pgcatalog already depends on PostgreSQL, we use it as the queue too:

Transactional enqueue: Jobs are inserted in the same transaction as the ZODB commit. If the transaction rolls back, the job disappears too. No orphaned jobs.
LISTEN/NOTIFY: PostgreSQL’s built-in pub/sub wakes the worker as soon as a new job is committed, so there is normally no polling delay. The worker’s poll interval only serves as a fallback.
SKIP LOCKED: Multiple workers can dequeue safely without contention. Each worker claims one job at a time; others skip locked rows.
Visibility: Queue state is queryable via standard SQL. No separate monitoring infrastructure needed.
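The dequeue pattern can be sketched as follows. This is a minimal illustration, not the project's actual query: the column names (id, status, attempts) and the 'processing' status value are assumptions, and the cursor is any DB-API style object with execute() and fetchone().

```python
# Claim the oldest pending job. FOR UPDATE is only valid in a SELECT, so the
# row is picked in a subquery; SKIP LOCKED makes concurrent workers pass over
# rows that another transaction has already locked instead of waiting.
CLAIM_JOB_SQL = """
UPDATE text_extraction_queue
   SET status = 'processing',
       attempts = attempts + 1
 WHERE id = (
        SELECT id
          FROM text_extraction_queue
         WHERE status = 'pending'
         ORDER BY id
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id, zoid, tid
"""


def claim_next_job(cur):
    """Claim one pending job, or return None when the queue is empty."""
    cur.execute(CLAIM_JOB_SQL)
    row = cur.fetchone()
    if row is None:
        return None
    job_id, zoid, tid = row
    return {"id": job_id, "zoid": zoid, "tid": tid}
```

Since the queue is an ordinary table, a query such as SELECT status, count(*) FROM text_extraction_queue GROUP BY status is enough to inspect its state.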

Why asynchronous?

Text extraction is slow—a large PDF can take seconds. Running it synchronously during catalog_object() would block the Zope request thread, making content saves unacceptably slow. The asynchronous approach keeps the synchronous path fast (Title/Description/body are indexed immediately) while extraction runs in the background.

Why not store the full extracted text?

The extracted text is not stored as a column. Instead, it is transformed into a tsvector (and optionally BM25 vectors) and merged into the existing searchable_text column. This is more space-efficient and matches how PostgreSQL full-text search works: the search engine operates on tsvectors, not raw text.
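To make the merge concrete, here is a hedged sketch of the kind of statement such a merge could boil down to. The real logic lives in the pgcatalog_merge_extracted_text PL/pgSQL function. The table name object_state, the zoid key column, and the 'simple' text search configuration used here are assumptions for illustration only.

```python
def merge_statement(zoid: int, text: str, config: str = "simple"):
    """Return (sql, params) that append extracted text to a tsvector at weight C.

    Only the derived tsvector is written; the raw text is passed as a
    parameter and never stored in a column of its own.
    """
    sql = (
        "UPDATE object_state "
        "SET searchable_text = coalesce(searchable_text, ''::tsvector) "
        "|| setweight(to_tsvector(%s::regconfig, %s), 'C') "
        "WHERE zoid = %s"
    )
    return sql, (config, text, zoid)
```

The || operator concatenates two tsvectors, and setweight() labels every lexeme of the new vector, so the existing Title and Description entries keep their own weights.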

Data flow

sequenceDiagram
    participant Plone as Plone (catalog_object)
    participant Proc as CatalogStateProcessor
    participant PG as PostgreSQL
    participant Worker as TikaWorker
    participant Tika as Apache Tika

    Plone->>Proc: process(zoid, state)
    Note over Proc: Extract content_type<br/>from primary field
    Proc->>Proc: Accumulate candidate<br/>if extractable type
    Plone->>Proc: finalize(cursor)
    Proc->>PG: SELECT blob_state WHERE zoid IN (...)
    PG-->>Proc: rows with blob data
    Proc->>PG: INSERT INTO text_extraction_queue
    Note over PG: NOTIFY trigger fires

    PG-->>Worker: NOTIFY text_extraction_ready
    Worker->>PG: UPDATE ... FOR UPDATE SKIP LOCKED<br/>RETURNING job
    Worker->>PG: SELECT data FROM blob_state
    PG-->>Worker: blob bytes
    Worker->>Tika: PUT /tika (blob bytes)
    Tika-->>Worker: extracted text
    Worker->>PG: SELECT pgcatalog_merge_extracted_text(zoid, text)
    Worker->>PG: UPDATE status = 'done'

Step-by-step

catalog_object() extracts index data, including the value of Plone’s mime_type FieldIndex. The MIME type is stored in the idx JSONB as part of the pending annotation.
CatalogStateProcessor.process() reads idx["mime_type"] from the pending data and checks whether PGCATALOG_TIKA_URL is set and the MIME type is in the extractable set. If both hold, the zoid is added to self._tika_candidates.
CatalogStateProcessor.finalize() runs in the same PostgreSQL transaction as the ZODB commit. It queries blob_state to find which candidates actually have blobs, then inserts jobs into text_extraction_queue. An ON CONFLICT DO NOTHING clause makes this idempotent.

The NOTIFY trigger on the queue table fires, sending a text_extraction_ready notification with the job ID.

The TikaWorker receives the notification (or wakes up on its poll interval). It dequeues one job with a single UPDATE ... RETURNING statement whose row selection uses FOR UPDATE SKIP LOCKED, which claims the job atomically. Other workers skip the locked row.

The worker fetches the blob from blob_state (PG bytea) or S3 (for S3-tiered blobs above the size threshold).

The worker sends the blob to Tika via PUT /tika with the content type header. Tika returns plain text.

The worker calls pgcatalog_merge_extracted_text(zoid, text), a PL/pgSQL function that appends the extracted text to the existing searchable_text tsvector at weight C. When BM25 is active, the function also rebuilds the BM25 vectors from the combined Title, Description, and extracted text.

The job status is updated to done. On failure, the job returns to pending and is retried until max_attempts is reached; after that it stays in the queue table, where it remains visible for debugging.
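The worker side of these steps (fetch, extract, merge, update status) can be sketched as a single function. This is an illustration under stated assumptions rather than the actual TikaWorker code: the job is modelled as a plain dict, the three callables stand in for the blob lookup, the Tika call, and the merge function, and the 'failed' status name is hypothetical.

```python
def process_job(job, fetch_blob, extract_text, merge_text, max_attempts=3):
    """Run one claimed job and return its new status.

    `job` is a dict with 'zoid', 'content_type' and 'attempts', where
    'attempts' already counts the current try.
    """
    try:
        blob = fetch_blob(job["zoid"])                    # blob_state or S3
        text = extract_text(blob, job["content_type"])    # PUT /tika
        merge_text(job["zoid"], text)                     # merge into tsvector
    except Exception as exc:
        # Keep the error for debugging, then either retry or give up.
        job["last_error"] = str(exc)
        if job["attempts"] < max_attempts:
            job["status"] = "pending"
        else:
            job["status"] = "failed"
    else:
        job["status"] = "done"
    return job["status"]
```

Passing the three operations in as callables keeps the retry logic separate from PostgreSQL and Tika, which also makes it easy to exercise without either service.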

Weight hierarchy

The searchable_text tsvector uses PostgreSQL’s four weight classes to rank content by importance:

Weight	Content	BM25 Boost	Source
A	Title	3x (repeated 3 times)	Synchronous (catalog_object)
B	Description	1x	Synchronous (catalog_object)
C	Extracted blob text	1x	Asynchronous (Tika worker)
D	Rich-text body	1x	Synchronous (catalog_object)

A search for “quantum computing” ranks a document with that phrase in the title higher than one where it only appears in an attached PDF. PostgreSQL’s ts_rank_cd() (and BM25’s scoring) respect these weights automatically.
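For orientation, the snippet below lists PostgreSQL's documented default weights for ts_rank() and ts_rank_cd(), which are given in the order {D, C, B, A} = {0.1, 0.2, 0.4, 1.0}. It is a toy comparison and not the actual ranking formula. It also assumes the defaults apply; whether plone.pgcatalog passes a custom weights array is not stated here. The mapping names are made up for the example.

```python
# PostgreSQL's default per-label weights for ts_rank()/ts_rank_cd().
DEFAULT_WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# Which weight label each kind of content receives, per the table above.
SOURCE_LABEL = {
    "title": "A",
    "description": "B",
    "extracted": "C",  # text extracted from blobs by the Tika worker
    "body": "D",       # rich-text body
}


def strongest_match(sources):
    """Return the heaviest default weight among the places a term matched."""
    return max(DEFAULT_WEIGHTS[SOURCE_LABEL[source]] for source in sources)
```

Under these defaults a title hit outweighs a hit inside an attached PDF by a factor of five, and extracted blob text (C) still counts for more than the rich-text body (D).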

Queue table

The text_extraction_queue table is created when PGCATALOG_TIKA_URL is set. See Database schema reference for the full schema.

Key design choices:

UNIQUE(zoid, tid): Prevents duplicate jobs for the same object version.
Partial index on status = 'pending': Makes dequeue queries fast regardless of how many completed jobs exist.
NOTIFY trigger: Fires on every INSERT, waking the worker instantly.
attempts/max_attempts: Built-in retry with configurable limit (default: 3). Failed jobs stay visible for debugging.
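Two of these choices, the UNIQUE(zoid, tid) constraint and the partial index, can be tried out locally. The sketch below deliberately swaps in an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server. The column list is a guess for illustration and not the real schema, and the NOTIFY trigger is omitted because SQLite has no equivalent.

```python
import sqlite3


def make_demo_queue() -> sqlite3.Connection:
    """Create a simplified stand-in for the queue table in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE text_extraction_queue (
            id           INTEGER PRIMARY KEY,
            zoid         INTEGER NOT NULL,
            tid          INTEGER NOT NULL,
            status       TEXT    NOT NULL DEFAULT 'pending',
            attempts     INTEGER NOT NULL DEFAULT 0,
            max_attempts INTEGER NOT NULL DEFAULT 3,
            UNIQUE (zoid, tid)
        );
        -- Partial index: only pending rows are indexed, so finished jobs
        -- do not slow down the dequeue query.
        CREATE INDEX queue_pending_idx
            ON text_extraction_queue (id)
            WHERE status = 'pending';
        """
    )
    return conn


def enqueue(conn: sqlite3.Connection, zoid: int, tid: int) -> None:
    """Insert a job; a repeated (zoid, tid) pair is silently ignored."""
    conn.execute(
        "INSERT INTO text_extraction_queue (zoid, tid) VALUES (?, ?) "
        "ON CONFLICT (zoid, tid) DO NOTHING",
        (zoid, tid),
    )


def pending_count(conn: sqlite3.Connection) -> int:
    """Count jobs that are still waiting to be processed."""
    row = conn.execute(
        "SELECT count(*) FROM text_extraction_queue WHERE status = 'pending'"
    ).fetchone()
    return row[0]
```

Enqueueing the same object version twice leaves a single row, which is what makes the enqueue step idempotent.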

Worker modes

In-process (development)

When PGCATALOG_TIKA_INPROCESS=true, the worker runs as a daemon thread inside the Zope process. It opens its own PostgreSQL connection and HTTP client—it shares nothing with Zope’s ZODB connections or transaction machinery.

The thread is marked daemon=True, meaning it does not block interpreter shutdown and ends automatically when the Zope process exits. No separate shutdown handling is needed.

This mode is convenient for development and small deployments. The trade-off is that extraction work competes with Zope for CPU and memory.
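The in-process mode can be pictured with a small sketch based on the standard threading module. The function name, the run_once callable, and the optional stop event are hypothetical; they only illustrate how a daemon thread loops in the background without tying into the host process's shutdown.

```python
import threading


def start_inprocess_worker(run_once, stop_event, poll_interval=5.0):
    """Start a background daemon thread that repeatedly calls run_once().

    daemon=True means the thread never keeps the host process alive.
    The stop event is optional extra plumbing, handy mainly for tests.
    """

    def loop():
        while not stop_event.is_set():
            run_once()
            # Sleep until the next poll, or wake early when asked to stop.
            stop_event.wait(poll_interval)

    thread = threading.Thread(target=loop, name="tika-worker", daemon=True)
    thread.start()
    return thread
```

In a real worker, run_once would claim and process a job using its own PostgreSQL connection and HTTP client, independent of Zope's.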

Standalone (production)

The pgcatalog-tika-worker CLI runs as a separate process (or container). It depends only on psycopg and httpx—no Zope, no Plone, no ZODB. This makes it lightweight and easy to deploy.

Multiple workers can run concurrently. The SKIP LOCKED dequeue pattern ensures each job is claimed by exactly one worker at a time, even under concurrent load.

Image indexing

Tika can use Tesseract OCR to extract text from images (JPEG, PNG, TIFF, WebP, GIF), provided Tesseract is installed alongside the Tika server. By default, plone.pgcatalog configures all common image types as extractable.

This means that after enabling Tika:

A photo of a whiteboard becomes searchable by the text on the board
A scanned invoice becomes searchable by its content
An infographic becomes searchable by its labels and annotations

Plone does not make image blobs searchable by default, because it has no extraction mechanism for them. With Tika, this happens automatically for all Image content types that have blobs.
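The check that decides whether an object is queued for extraction can be sketched as below. The set of MIME types is illustrative only and is not the package's actual default list; the function name is likewise made up. Only the PGCATALOG_TIKA_URL variable comes from the text above.

```python
import os

# Illustrative subset; the real extractable set is defined by plone.pgcatalog.
EXTRACTABLE_TYPES = frozenset(
    {
        "application/pdf",
        "application/msword",
        "image/jpeg",
        "image/png",
        "image/tiff",
        "image/webp",
        "image/gif",
    }
)


def is_tika_candidate(mime_type, env=None):
    """Return True when Tika is configured and the MIME type is extractable."""
    env = os.environ if env is None else env
    if not env.get("PGCATALOG_TIKA_URL"):
        return False  # Tika is disabled: nothing is ever queued
    return (mime_type or "").lower() in EXTRACTABLE_TYPES
```

With the variable unset, every object is rejected up front, which matches the promise that sites without Tika see no change in behavior.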

Interaction with existing search

Enabling Tika does not change how existing search works:

Title and Description are still indexed synchronously during catalog_object(), with immediate availability.
Rich-text body (SearchableText from portal_transforms) is still indexed synchronously for non-File content types. For IFile objects, portal_transforms is skipped when Tika is active, which avoids the expensive pdftotext/wv calls and the breadth-first traversal of the transform registry graph. For custom types, see Custom types with blob fields and Tika.
Tika extraction adds to the existing tsvector asynchronously. A brief window (seconds to minutes, depending on queue depth and Tika processing time) exists where the blob content is not yet searchable.

Sites that do not set PGCATALOG_TIKA_URL see no change in behavior, schema, or performance. The queue table is not even created.