Security

The codec processes pickle data from ZODB, which is a trusted internal format generated by the application’s own persistence layer. Unlike Python’s pickle.loads(), the codec does not execute arbitrary code – it parses pickle opcodes into a data-only AST (PickleValue) and never calls REDUCE targets.

That said, defense-in-depth is good practice. Malformed data can arrive through database corruption, storage bugs, or migration errors. The following limits ensure the codec fails gracefully rather than consuming unbounded resources.

These measures were introduced in v1.2.2 following a security review.

CODEC-C1: Non-negative length validation

Opcodes: LONG4, BINSTRING

Problem: These opcodes carry a 4-byte signed length prefix. A negative length value would be interpreted as a very large unsigned value, causing the decoder to attempt reading billions of bytes from the input.

Mitigation: The decoder validates that the length is non-negative before using it as a read size. Negative lengths produce an immediate decode error.
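The check can be sketched as follows; this is an illustrative standalone function, not the codec's actual decoder API, and the helper name is hypothetical:

```rust
use std::convert::TryInto;

/// Sketch of CODEC-C1: read a 4-byte signed length prefix and reject
/// negative values before they are reinterpreted as a huge read size.
fn read_len_i32(buf: &[u8]) -> Result<usize, String> {
    let bytes: [u8; 4] = buf
        .get(..4)
        .ok_or_else(|| "truncated length prefix".to_string())?
        .try_into()
        .map_err(|_| "bad length prefix".to_string())?;
    // BINSTRING/LONG4 length prefixes are little-endian signed 32-bit.
    let len = i32::from_le_bytes(bytes);
    if len < 0 {
        // A negative i32 cast straight to usize would demand a multi-GB read.
        return Err(format!("negative length prefix: {}", len));
    }
    Ok(len as usize)
}

fn main() {
    assert_eq!(read_len_i32(&5i32.to_le_bytes()), Ok(5));
    assert!(read_len_i32(&(-1i32).to_le_bytes()).is_err());
}
```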

CODEC-C2: Memo size cap

Limit: 100,000 entries

Problem: The pickle LONG_BINPUT opcode stores a value in the memo at an arbitrary integer index. A malicious pickle could issue LONG_BINPUT with index 2,000,000,000, causing the memo Vec to allocate gigabytes of memory (Rust vectors are contiguous, so the allocation must cover the full index range).

Mitigation: The decoder rejects any memo index that would bring the total memo size above 100,000 entries. Normal ZODB records use at most a few hundred memo entries, so this limit has no effect on legitimate data.
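A minimal sketch of the cap, assuming a hypothetical `memo_put` helper and a `Vec`-backed memo (the codec's real memo type may differ):

```rust
/// CODEC-C2 cap: memo indices at or above this are rejected outright.
const MEMO_LIMIT: usize = 100_000;

/// Grow the contiguous memo Vec only up to the cap; a huge LONG_BINPUT
/// index fails here instead of triggering a multi-gigabyte allocation.
fn memo_put(memo: &mut Vec<Option<u64>>, idx: usize, val: u64) -> Result<(), String> {
    if idx >= MEMO_LIMIT {
        return Err(format!("memo index {} exceeds cap {}", idx, MEMO_LIMIT));
    }
    if idx >= memo.len() {
        // Contiguous storage: resizing must cover the full range 0..=idx.
        memo.resize(idx + 1, None);
    }
    memo[idx] = Some(val);
    Ok(())
}

fn main() {
    let mut memo = Vec::new();
    assert!(memo_put(&mut memo, 42, 7).is_ok());
    assert_eq!(memo[42], Some(7));
    assert!(memo_put(&mut memo, 2_000_000_000, 7).is_err());
}
```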

CODEC-H1: Recursion depth limit

Limit: 1,000 levels

Applies to: Encoder (encode.rs) and PyObject converter (pyconv.rs)

Problem: Deeply nested Python objects (dicts containing dicts containing dicts…) cause recursive function calls in the encoder. Without a limit, pathological nesting could overflow the thread stack — typically 8 MB for the main thread on Linux and macOS, and only 2 MB by default for threads spawned by Rust's standard library.

Mitigation: Both the encoder and the PyObject converter track recursion depth and return an error if it exceeds 1,000. Normal ZODB objects rarely exceed 10 levels of nesting. The limit is generous enough to handle any legitimate data while preventing stack overflow.
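The depth-tracking pattern looks roughly like this; the `Value` enum and function names here are simplified stand-ins for the codec's real types:

```rust
/// CODEC-H1 limit: recursion beyond this depth is an error, not a crash.
const MAX_DEPTH: usize = 1_000;

enum Value {
    Int(i64),
    List(Vec<Value>),
}

/// Recursive walk that threads the current depth through every call.
fn count_nodes(v: &Value, depth: usize) -> Result<usize, String> {
    if depth > MAX_DEPTH {
        return Err("maximum recursion depth exceeded".to_string());
    }
    match v {
        Value::Int(_) => Ok(1),
        Value::List(items) => {
            let mut n = 1;
            for item in items {
                n += count_nodes(item, depth + 1)?;
            }
            Ok(n)
        }
    }
}

fn main() {
    // Build nesting deeper than the limit iteratively (no recursion needed).
    let mut v = Value::Int(0);
    for _ in 0..2_000 {
        v = Value::List(vec![v]);
    }
    assert!(count_nodes(&v, 0).is_err());
    assert_eq!(count_nodes(&Value::List(vec![Value::Int(1)]), 0), Ok(2));
}
```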

CODEC-H2: Pre-scan dict keys

Problem: When encoding a Python dict to pickle, the encoder must check whether any keys are JSON marker keys (@cls, @dt, @ref, etc.). The original code checked keys one at a time, and for dicts with a mix of marker and non-marker keys, this could lead to quadratic re-processing: the fast path would start, discover a marker partway through, restart on the marker path, and reprocess already-visited keys.

Mitigation: The encoder pre-scans all dict keys in a single pass before choosing a code path, which is O(n) regardless of key distribution. For dicts with no @-prefixed keys (>99% of ZODB state dicts), each key fails the @ prefix check on its first character, so the scan stays cheap.
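The pre-scan amounts to a single short-circuiting pass over the keys. This sketch uses a subset of the marker keys named above; the function name is hypothetical:

```rust
/// Subset of the JSON marker keys mentioned in the text, for illustration.
const MARKERS: [&str; 3] = ["@cls", "@dt", "@ref"];

/// Single O(n) pass over the keys, decided before any encoding work begins.
/// The cheap starts_with('@') test rejects ordinary keys immediately.
fn needs_marker_path(keys: &[&str]) -> bool {
    keys.iter()
        .any(|k| k.starts_with('@') && MARKERS.contains(k))
}

fn main() {
    assert!(!needs_marker_path(&["title", "count"]));
    assert!(needs_marker_path(&["title", "@ref"]));
}
```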

CODEC-M1: LONG opcode text limit

Limit: 10,000 characters

Problem: The LONG opcode (not LONG1/LONG4) represents an integer as a text string like 12345L. Without a limit, a malformed pickle could contain a LONG opcode with millions of digits, causing the big-integer parser to consume excessive CPU and memory.

Mitigation: The decoder rejects LONG text representations exceeding 10,000 characters. A 10,000-digit integer is approximately 33,000 bits – far beyond any integer that appears in ZODB data.
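A sketch of the guard, run before any big-integer parsing; the helper name is hypothetical:

```rust
/// CODEC-M1 limit on the decimal text form of the LONG opcode.
const MAX_LONG_TEXT: usize = 10_000;

/// Validate LONG text length before handing it to a big-integer parser.
fn check_long_text(s: &str) -> Result<&str, String> {
    // The Python 2 text form may carry a trailing 'L', e.g. "12345L".
    let digits = s.strip_suffix('L').unwrap_or(s);
    if digits.len() > MAX_LONG_TEXT {
        return Err(format!("LONG text of {} chars exceeds limit", digits.len()));
    }
    Ok(digits)
}

fn main() {
    assert_eq!(check_long_text("12345L"), Ok("12345"));
    assert!(check_long_text(&"9".repeat(20_000)).is_err());
}
```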

CODEC-M2: BTree bucket validation

Problem: BTree bucket data is stored as a flat list of alternating keys and values: [k1, v1, k2, v2, ...]. An odd-length list would leave a key without a value, which could cause a panic in the chunked iterator or produce silently corrupted output.

Mitigation: The format_flat_data() function in btrees.rs rejects odd-length item lists with an explicit error before processing.
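The shape of that check can be sketched as below (with `i64` items standing in for real pickle values, and a hypothetical function name):

```rust
/// Sketch of the CODEC-M2 guard: a flat [k1, v1, k2, v2, ...] list must
/// have even length, or some key would be left without a value.
fn pair_items(items: &[i64]) -> Result<Vec<(i64, i64)>, String> {
    if items.len() % 2 != 0 {
        return Err(format!("odd bucket item count: {}", items.len()));
    }
    // chunks_exact(2) never yields a short chunk, so c[1] cannot panic here.
    Ok(items.chunks_exact(2).map(|c| (c[0], c[1])).collect())
}

fn main() {
    assert_eq!(pair_items(&[1, 10, 2, 20]), Ok(vec![(1, 10), (2, 20)]));
    assert!(pair_items(&[1, 10, 2]).is_err());
}
```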

CODEC-M3: Large string/bytes allocation cap

Limit: 256 MB

Opcodes: BINUNICODE8, BINBYTES8

Problem: These protocol 4/5 opcodes carry an 8-byte length prefix, allowing lengths up to 2^64 bytes. A malformed length could cause the decoder to attempt allocating terabytes of memory.

Mitigation: The decoder caps the allocation at 256 MB. Any single string or bytes value larger than 256 MB in a ZODB record would be exceptional (blobs are stored separately, not inline in pickle state). This limit prevents unbounded allocation while being generous enough for any legitimate data.
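The 8-byte variant of the length check mirrors CODEC-C1, with the cap applied before allocation; again the helper name is hypothetical:

```rust
use std::convert::TryInto;

/// CODEC-M3 cap: no single string/bytes value may exceed 256 MB.
const MAX_ALLOC: u64 = 256 * 1024 * 1024;

/// Read a BINUNICODE8/BINBYTES8-style 8-byte length and enforce the cap
/// before the value buffer is ever allocated.
fn read_len_u64(buf: &[u8]) -> Result<usize, String> {
    let bytes: [u8; 8] = buf
        .get(..8)
        .ok_or_else(|| "truncated length prefix".to_string())?
        .try_into()
        .map_err(|_| "bad length prefix".to_string())?;
    let len = u64::from_le_bytes(bytes);
    if len > MAX_ALLOC {
        return Err(format!("length {} exceeds 256 MB allocation cap", len));
    }
    Ok(len as usize)
}

fn main() {
    assert_eq!(read_len_u64(&1024u64.to_le_bytes()), Ok(1024));
    assert!(read_len_u64(&u64::MAX.to_le_bytes()).is_err());
}
```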

What the codec does NOT do

For context, here is what the codec intentionally does not guard against:

  • Arbitrary code execution: The codec never executes pickle REDUCE targets. It records them as PickleValue::Instance data structures. This is fundamentally safer than Python’s pickle.loads(), which calls arbitrary callables.

  • Untrusted input: The codec is designed for ZODB data produced by the application itself. It is not hardened for processing pickles from untrusted sources. The limits above are defense-in-depth against corruption, not a sandbox for hostile input.

  • Protocol 4/5 full support: ZODB uses zodbpickle, which supports up to protocol 3. Protocol 4/5 opcodes are partially handled (enough for interoperability) but are not the primary target.