Optimization Journal
This is a chronological record of every significant performance optimization applied to the codec, from the initial v1.0.0 development through v1.5.0. Each entry describes the technique, explains why it helps (the underlying insight), states the measured impact, and notes any lessons learned.
This document exists as a reference for our future selves. When you wonder “why does the encode path skip PickleValue for datetimes?” or “why not use Rc for the decoder stack?”, the answer is here.
Early optimizations (v1.0.0 development)
1. PyO3 boundary crossing
Technique: Replaced try_extract() type dispatch with is_instance_of()
checks. Pre-collected PyList with PyList::new() instead of
empty-plus-append. Eliminated state.clone() by taking ownership.
Why it helps: try_extract() on type mismatch creates and immediately
discards a Python exception object. In a type-dispatch loop that tries string,
then int, then float, etc., most branches fail – and each failure allocates
an exception on the Python heap. is_instance_of() is a cheap pointer
comparison against the type object, with no allocation on failure.
Pre-collecting lists avoids the overhead of repeated PyList::append() calls,
each of which crosses the PyO3 boundary. Taking ownership of state instead
of cloning avoids a deep copy of the entire PickleValue tree.
Impact: ~1.8x faster encode path.
2. Bypass serde_json intermediate
Technique: Added src/pyconv.rs for direct PickleValue to/from
PyObject conversion, eliminating serde_json::Value as an intermediate
layer. Persistent reference compact/expand happens inline during a single tree
walk.
Why it helps: The original path was: pickle bytes -> PickleValue ->
serde_json::Value -> Python dict. The serde_json step allocates a second tree
of heap objects (Value::Object, Value::Array, Value::String) that mirror
the PickleValue tree, only to be immediately consumed by the PyO3 conversion.
By going directly from PickleValue to PyObject, we eliminate an entire tree
allocation and traversal.
Impact: 2-3.5x faster across most categories.
3. Decode path tuning
Technique: Single-pass Dict decode, pre-allocated Vec::with_capacity(),
set/frozenset move semantics (no Vec clone), StackItem removal, @
prefix fast path to skip marker checks on plain dicts.
Why it helps: The decoder’s inner loop runs once per pickle opcode. Each
opcode that pushes to the stack or memo must be as cheap as possible.
Pre-allocating vectors avoids repeated reallocation during growth. Move semantics
for sets avoid cloning the element vector when constructing FrozenSet from
Set. The @ prefix check lets >99% of ZODB dicts (which have plain string
keys like "title", "description") skip all 15+ marker key comparisons.
Impact: Codec competitive with CPython pickle decode.
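The pre-allocation and move-semantics points can be illustrated with a minimal sketch. The Value enum and the build_list and freeze functions are hypothetical stand-ins for illustration, not the codec's actual types.

```rust
/// Hypothetical, simplified stand-in for the codec's value type.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    List(Vec<Value>),
    Set(Vec<Value>),
    FrozenSet(Vec<Value>),
}

/// Build a list whose final length is known up front, so the backing
/// buffer is allocated once instead of growing through reallocations.
fn build_list(len: usize) -> Value {
    let mut items = Vec::with_capacity(len);
    for i in 0..len {
        items.push(Value::Int(i as i64));
    }
    Value::List(items)
}

/// Turn a Set into a FrozenSet by moving the element vector.
/// No element is cloned; the heap buffer only changes owner.
fn freeze(value: Value) -> Value {
    match value {
        Value::Set(items) => Value::FrozenSet(items),
        other => other,
    }
}
```

Because freeze takes its argument by value, the caller gives up the Set and the compiler can hand the same heap buffer to the FrozenSet variant.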
4. Encode marker fast path
Technique: Single-key dicts use direct key extraction plus match instead
of 15 sequential dict.get_item() calls. 2-4 key dicts use a single-pass
@ scan that builds pairs inline. Multi-key dicts check @cls first.
Why it helps: In ZODB records, dicts containing JSON markers are the
norm: every record has {"@cls": ..., "@s": ...}, and state dicts contain
{"@dt": ...}, {"@ref": ...}, etc. The original code called
dict.get_item() for each possible marker key. Each call crosses the PyO3
boundary and does a Python dict lookup. For a simple {"@dt": "..."} dict,
that was 15+ failed lookups plus 1 success.
The fast path reduces this: a single-key dict does 1 key extraction + 1 Rust
string match. A {"@cls": ..., "@s": ...} dict does a single scan for @
prefixed keys and matches on the first character after @.
Impact: Reduced get_item() calls for @cls + @s from ~16 to 2.
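A rough sketch of the prefix check followed by a Rust string match is shown here. The Marker enum and classify_key function are hypothetical names; only marker keys mentioned in this journal are listed, and the real code works on Python dict keys rather than plain &str values.

```rust
/// Marker keys named in this journal. The enum is a hypothetical sketch.
#[derive(Debug, PartialEq)]
enum Marker {
    Cls,
    S,
    Dt,
    Date,
    Time,
    Td,
    Dec,
    Ref,
}

/// Classify a dict key with one prefix check plus one string match,
/// instead of probing the dict once per possible marker.
fn classify_key(key: &str) -> Option<Marker> {
    // Fast path: plain keys such as "title" are rejected here.
    let rest = key.strip_prefix('@')?;
    match rest {
        "cls" => Some(Marker::Cls),
        "s" => Some(Marker::S),
        "dt" => Some(Marker::Dt),
        "date" => Some(Marker::Date),
        "time" => Some(Marker::Time),
        "td" => Some(Marker::Td),
        "dec" => Some(Marker::Dec),
        "ref" => Some(Marker::Ref),
        _ => None,
    }
}
```

The match runs entirely in Rust on an already extracted key, so a miss costs a few byte comparisons rather than a failed Python dict lookup.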
5. Direct PyObject encoder
Technique: pyconv.rs encode path writes pickle opcodes straight from
Python objects, bypassing the PickleValue tree allocation. Handles common
types directly, falls back for complex markers.
Why it helps: The original encode path was: Python dict -> PickleValue tree -> pickle bytes. The PickleValue tree is a complete copy of the data structure in Rust heap memory, allocated node by node, only to be immediately serialized and dropped. By writing opcodes directly from the Python objects, we eliminate the entire intermediate tree.
Impact: Encode competitive with CPython C pickler.
v1.2.0
7. GIL release
Technique: Call py.detach() during pure-Rust phases: pickle parsing,
class extraction, reference collection. No Python API calls happen in these
phases.
Why it helps: Zope/Plone deployments typically run multiple threads. While the codec is parsing pickle bytes (pure Rust computation), holding the GIL blocks all other Python threads. Releasing it lets the web server handle other requests concurrently.
Impact: Other Python threads can run during CPU-bound parse phases. No single-thread speedup, but significant throughput improvement in multi-threaded deployments.
8. Single-pass PG decode
Technique: decode_zodb_record_for_pg() combines decode + persistent
reference extraction + null-byte sanitization in one function, eliminating two
separate Python-level tree walks.
Why it helps: The PostgreSQL storage backend previously called three functions: decode the record (tree walk 1), extract persistent references (tree walk 2), and sanitize null bytes for JSONB compatibility (tree walk 3). Each tree walk crosses the PyO3 boundary for every node. Combining them into a single Rust function that does all three operations in one pass eliminates two full traversals.
Impact: 2.5x faster PG path (0.35ms to 0.14ms per 100 objects).
v1.3.0
9. Box Instance variant
Technique: Changed Instance(InstanceData) to Instance(Box<InstanceData>)
in the PickleValue enum.
Why it helps: Rust enums are sized to their largest variant. InstanceData
contains a Global (two Strings) plus a PickleValue for state, making it
the largest variant at 56 bytes. Since every PickleValue node pays this size
cost regardless of which variant it actually is, the 56-byte enum wastes space
for the common scalar variants (Int, Float, Bool, None, String).
Boxing the Instance data behind a pointer reduces the enum to 48 bytes. The Instance variant itself becomes a single pointer (8 bytes). This is a net win because Instance values are rare compared to scalars, so the per-node savings (8 bytes * thousands of nodes) far exceed the occasional extra indirection.
Smaller enum -> fewer cache misses, less stack/memo memory, better CPU cache utilization.
Impact: 13% improvement on the weighted benchmark.
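The effect of boxing the largest variant can be checked with std::mem::size_of. The two enums below are simplified, hypothetical stand-ins, so the exact sizes depend on the variants and the target and will not necessarily match the 56 and 48 bytes quoted above. The sketch only shows that the boxed layout is smaller.

```rust
use std::mem::size_of;

/// Instance payload stored inline: two Strings plus a boxed state.
struct InstanceData {
    module: String,
    name: String,
    state: Box<Inline>,
}

/// Every value is as large as the Instance variant.
enum Inline {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    List(Vec<Inline>),
    Instance(InstanceData),
}

/// Same payload, but it lives behind a pointer in the enum.
struct BoxedInstanceData {
    module: String,
    name: String,
    state: Boxed,
}

/// The Instance variant is now a single pointer.
enum Boxed {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    List(Vec<Boxed>),
    Instance(Box<BoxedInstanceData>),
}

/// Returns (inline size, boxed size) in bytes for the current target.
fn enum_sizes() -> (usize, usize) {
    (size_of::<Inline>(), size_of::<Boxed>())
}
```

Printing the result of enum_sizes() on the build target is a quick way to confirm that a layout change like this actually shrinks the enum before running benchmarks.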
v1.3.1
10. Thin LTO + codegen-units=1
Technique: Set lto = "thin" and codegen-units = 1 in the Cargo release
profile.
Why it helps: By default, cargo splits a crate into multiple codegen
units for parallel compilation. This prevents the optimizer from inlining
across unit boundaries. Setting codegen-units = 1 gives the optimizer
visibility into the entire crate. Thin LTO extends this across crate
boundaries (e.g., into serde, ryu, base64), enabling cross-crate
inlining without the full-program compile time of fat LTO.
This requires no code changes and yields a measurable improvement.
Impact: Free 6-9% on both decode and encode.
Lesson: Always enable thin LTO for release builds. The compile time increase is modest and the performance gain is consistent.
v1.4.0
11. Direct pickle-to-JSON string path
Technique: decode_zodb_record_for_pg_json() converts ZODB pickle records
entirely in Rust, producing a JSON string directly. The GIL is released for
the entire conversion.
Why it helps: The previous PG path was: pickle bytes -> Rust AST ->
Python dict (GIL held) -> json.dumps() (GIL held) -> JSON string. This
crossed the PyO3 boundary for every value (building the Python dict) and then
did a second full traversal in Python’s json.dumps().
The new path is: pickle bytes -> Rust AST -> JSON string (all in Rust, GIL released). No Python objects are allocated. No PyO3 boundary crossings. No second traversal. The JSON string is returned to Python as a single string allocation.
Impact: 1.3x faster full pipeline on real-world data, plus improved multi-threaded throughput from GIL release.
v1.5.0 – Encode Performance Rounds 1-4
Round 1
12. BigInt elimination
Technique: Direct i64 LONG1 encoding from i64::to_le_bytes() with
sign-extension trimming. Buffer reserve() before multi-part writes.
O(1) @cls hash lookup replaces O(n) key scan.
Why it helps: The pickle LONG1 opcode stores arbitrary-precision integers
as little-endian bytes with a length prefix. The original code used the
num-bigint crate, which heap-allocates a BigInt for every integer, even
though 99%+ of ZODB integers fit in i64. Direct i64 encoding avoids the
heap allocation entirely: convert to bytes, trim trailing sign-extension bytes,
write.
Buffer reserve() before writing multi-byte sequences (opcode + length +
data) eliminates mid-write Vec reallocations. The reallocation cost is not
just the memcpy but also the allocator overhead of finding a larger block.
The @cls hash lookup replaces a linear scan over all possible marker keys
when encoding dicts with many keys. For typical 20-50 key ZODB state dicts,
this replaces ~20 string comparisons with a single hash probe.
Impact: Measurable improvement across all encode benchmarks, especially for integer-heavy and large-dict payloads.
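The i64 path can be sketched as follows. The function name write_long1 is hypothetical; the opcode byte 0x8a and the empty payload for zero follow the pickle protocol 2 definition of LONG1. A real encoder would typically pick shorter integer opcodes for small values first, which this sketch ignores.

```rust
/// Pickle protocol 2 LONG1 opcode: one length byte, then that many
/// little-endian two's-complement bytes.
const LONG1: u8 = 0x8a;

/// Append `value` as a LONG1 record without any heap-allocated bigint.
fn write_long1(buf: &mut Vec<u8>, value: i64) {
    let bytes = value.to_le_bytes();
    let mut len = bytes.len();

    // Trim high-order bytes that only repeat the sign bit of the byte below.
    while len > 1 {
        let top = bytes[len - 1];
        let next_is_negative = bytes[len - 2] & 0x80 != 0;
        let redundant =
            (top == 0x00 && !next_is_negative) || (top == 0xFF && next_is_negative);
        if !redundant {
            break;
        }
        len -= 1;
    }

    // Pickle encodes zero as an empty payload.
    if value == 0 {
        len = 0;
    }

    // Reserve once so opcode + length + data never reallocate mid-write.
    buf.reserve(2 + len);
    buf.push(LONG1);
    buf.push(len as u8);
    buf.extend_from_slice(&bytes[..len]);
}
```

For example, 255 needs two payload bytes (0xff, 0x00) because a lone 0xff would read back as -1, while -1 itself trims down to a single 0xff byte.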
13. Profile-Guided Optimization (PGO)
Technique: Two-pass instrumented build. First pass collects execution profiles using real FileStorage data (5 MB Wikipedia database, 1,692 records) plus synthetic benchmarks. Second pass uses the profiles to optimize branch prediction, function layout, and inlining decisions.
Why it helps: The compiler normally makes generic optimization decisions based on heuristics. PGO tells it which branches are actually taken, which functions are actually hot, and which call patterns actually occur. For a codec that has many dispatch branches (opcode dispatch, type dispatch, marker dispatch), this information is especially valuable.
Using real FileStorage data alongside synthetic data ensures the profiles reflect actual ZODB usage patterns (dominated by PersistentMapping and OOBucket records with mixed string/int/datetime fields).
Impact: Additional 5-15% on top of code optimizations. Integrated into CI release workflow.
Lesson: PGO with real data profiles gives better results than synthetic-only. The synthetic benchmarks exercise all code paths equally, but real data has strong skew (70% PersistentMapping, 20% OOBucket) that PGO can exploit.
Round 2
14. Direct known-type encoding
Technique: For @dt, @date, @time, @td, @dec markers: write
pickle opcodes directly to the output buffer instead of allocating
PickleValue intermediate nodes. Inline the timezone REDUCE chain for
@dt.
Why it helps: A datetime encode previously built a PickleValue::Instance
containing a PickleValue::Global and a PickleValue::Tuple of arguments –
at least 6 heap allocations – only to immediately serialize them to opcodes
and drop them. Direct encoding writes the same opcodes without any
intermediate allocation.
The timezone REDUCE chain (which encodes pytz.timezone("UTC") or similar)
is particularly expensive to build as PickleValue nodes because it nests
multiple GLOBAL + REDUCE operations. Writing the opcodes inline is a single
sequence of buffer writes.
Impact: special_types 9.2x faster than Python pickle.
15. Thread-local buffer reuse
Technique: ENCODE_BUF thread-local Vec<u8> retains capacity across
calls. On each encode, the buffer is cleared (length reset to 0) but the
allocated memory is kept. The same pattern was later applied to the JSON
writer path.
Why it helps: Each ZODB transaction encodes dozens to thousands of objects.
Without buffer reuse, each encode allocates a fresh Vec, grows it through
several reallocations as data is written, and frees it. With reuse, the first
encode in a transaction grows the buffer to its final size, and all subsequent
encodes reuse that capacity with zero allocation.
For a typical Plone page save that writes 20 objects, this eliminates ~20 allocations + ~40 reallocations (assuming 2 reallocations per encode on average).
Impact: Significant improvement on repeated encode calls. The benefit compounds with transaction size.
Lesson: Thread-local buffers are a big win for hot paths called thousands of times per transaction. The pattern is simple, safe (no cross-thread sharing), and the memory cost is bounded (one buffer per thread).
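A minimal sketch of the pattern, assuming a RefCell-wrapped Vec<u8> in a thread_local! block. ENCODE_BUF is the name used above; encode_with_reuse and encode_buf_capacity are hypothetical helpers, and the payload copy stands in for the real opcode writes.

```rust
use std::cell::RefCell;

thread_local! {
    /// One scratch buffer per thread; its capacity survives between calls.
    static ENCODE_BUF: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

/// Stand-in for an encode call: writes into the reused buffer and
/// returns a copy of the result (the real codec would hand the bytes
/// to Python instead).
fn encode_with_reuse(payload: &[u8]) -> Vec<u8> {
    ENCODE_BUF.with(|cell| {
        let mut buf = cell.borrow_mut();
        buf.clear(); // length becomes 0, allocated capacity is kept
        buf.extend_from_slice(payload);
        buf.to_vec()
    })
}

/// Current capacity of this thread's buffer, for inspection.
fn encode_buf_capacity() -> usize {
    ENCODE_BUF.with(|cell| cell.borrow().capacity())
}
```

After one large encode, later smaller encodes on the same thread write into the already grown buffer, so the capacity reported by encode_buf_capacity() stays unchanged.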
Round 3
16. Direct JSON string writer
Technique: json_writer.rs writes JSON tokens directly from the
PickleValue AST to a String buffer. Fast-path string escaping (scan for
characters that need escaping, fast-copy runs of safe characters). ryu crate
for float-to-string conversion.
Why it helps: The serde path was: PickleValue -> serde_json::Value (heap tree) -> serde_json::to_string (traversal + formatting). The direct writer does: PickleValue -> JSON tokens written to String (single traversal, no intermediate tree).
The serde_json::Value tree mirrors the PickleValue tree in structure but
uses different types (Value::Object(Map<String, Value>) vs
PickleValue::Dict(Vec<(PickleValue, PickleValue)>)). Eliminating it removes
an entire tree of heap allocations.
Fast-path string escaping checks whether a string contains any characters
that need JSON escaping (", \, control characters). For the common case
(ASCII strings with no special characters), it writes the string with a single
memcpy instead of character-by-character escaping. The ryu crate converts
floats to strings without going through format!(), avoiding a temporary
String allocation.
Impact: wide_dict -55%, PG path 1.4x faster.
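The fast-path escaping idea can be sketched like this. The function names are hypothetical and the sketch covers only string escaping, not the float path through ryu. Runs of safe bytes are copied in bulk, and only bytes that need escaping are handled individually.

```rust
use std::fmt::Write;

/// True for bytes that JSON requires to be escaped inside a string.
fn needs_escape(b: u8) -> bool {
    b == b'"' || b == b'\\' || b < 0x20
}

/// Append `s` to `out` as a quoted JSON string.
fn write_json_string(out: &mut String, s: &str) {
    out.push('"');
    let mut run_start = 0;
    for (i, &b) in s.as_bytes().iter().enumerate() {
        if !needs_escape(b) {
            continue;
        }
        // Flush the run of safe bytes before this one in a single copy.
        // Escaped bytes are ASCII, so `i` is always a char boundary.
        out.push_str(&s[run_start..i]);
        match b {
            b'"' => out.push_str("\\\""),
            b'\\' => out.push_str("\\\\"),
            b'\n' => out.push_str("\\n"),
            b'\r' => out.push_str("\\r"),
            b'\t' => out.push_str("\\t"),
            _ => {
                // Remaining control characters use the \u00XX form.
                let _ = write!(out, "\\u{:04x}", b);
            }
        }
        run_start = i + 1;
    }
    // Common case: nothing needed escaping, so this copies the whole string.
    out.push_str(&s[run_start..]);
    out.push('"');
}
```

For a key such as "title" the loop never enters the escape branch, and the whole string is appended by the final push_str call.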
Round 4
17. Class pickle cache
Technique: Thread-local cache per (module, name) pair. Linear search
over ~6 entries (faster than HashMap for small sets). First call builds and
caches ~50 bytes of pickle opcodes, subsequent calls do a single memcpy.
Why it helps: Every ZODB record starts with a class pickle that encodes
the object’s Python class. In a typical Plone database, 70%+ of records are
persistent.mapping.PersistentMapping, meaning the same ~50-byte class pickle
is regenerated thousands of times. The class pickle is deterministic (same
module + name always produces the same bytes), so caching it is safe.
Linear search over 6 entries is faster than HashMap because there is no
hashing overhead and the small array stays resident in L1 cache. The typical
ZODB database has only 5-10 distinct classes, so the cache is both small and
highly effective.
Impact: ~2-4% encode improvement, 99.6% cache hit rate on real data.
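A sketch of such a cache, under stated assumptions: CLASS_CACHE, CacheEntry, and both functions are hypothetical names, and build_class_pickle emits a bare GLOBAL opcode as a stand-in for whatever bytes the real class pickle contains.

```rust
use std::cell::RefCell;

/// One cached class pickle, keyed by (module, name).
struct CacheEntry {
    module: String,
    name: String,
    bytes: Vec<u8>,
}

thread_local! {
    /// Small per-thread cache, searched linearly.
    static CLASS_CACHE: RefCell<Vec<CacheEntry>> = RefCell::new(Vec::new());
}

/// Stand-in builder: GLOBAL opcode ('c') followed by "module\nname\n".
fn build_class_pickle(module: &str, name: &str) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(module.len() + name.len() + 3);
    bytes.push(b'c');
    bytes.extend_from_slice(module.as_bytes());
    bytes.push(b'\n');
    bytes.extend_from_slice(name.as_bytes());
    bytes.push(b'\n');
    bytes
}

/// Append the class pickle to `out`. Returns true on a cache hit.
fn write_class_pickle(out: &mut Vec<u8>, module: &str, name: &str) -> bool {
    CLASS_CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        // Linear scan: no hashing, just a few string comparisons.
        if let Some(entry) = cache
            .iter()
            .find(|e| e.module == module && e.name == name)
        {
            out.extend_from_slice(&entry.bytes);
            return true;
        }
        // Miss: build once, emit, and remember for later calls.
        let bytes = build_class_pickle(module, name);
        out.extend_from_slice(&bytes);
        cache.push(CacheEntry {
            module: module.to_owned(),
            name: name.to_owned(),
            bytes,
        });
        false
    })
}
```

The sketch has no eviction, which is only reasonable because the number of distinct classes is assumed to stay small, as described above.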
Cumulative result
| Operation | vs CPython pickle |
|---|---|
| Encode (synthetic) | 1.7-9.2x faster |
| Encode (real FileStorage) | 3-5x faster |
| Decode (synthetic) | 1.0-2.3x faster |
| Decode (real FileStorage) | Near parity |
| PG JSON path | 1.3-3.3x faster, GIL-free |
Lessons learned
The following insights emerged from the optimization work and should inform future changes:
- Rc<PickleValue> for stack+memo was a regression.
  Early experiments tried Rc<PickleValue> to share values between the stack and memo without cloning. The result was slower because Rc::new() per stack push adds a heap allocation that exceeds the savings from avoiding memo clones. Most values in a pickle stack machine are created once and consumed once – they are never actually shared. Value semantics beat reference counting when sharing is rare.
- The decode bottleneck is string allocation, not memo cloning.
  Profiling showed that string allocation (both Rust String and Python str via PyO3) dominates decode time. Optimizing memo handling or stack operations yields diminishing returns once the string path is fast.
- Thin LTO is essentially free performance.
  lto = "thin" adds modest compile time but consistently yields 6-9% runtime improvement. There is no reason not to enable it for release builds.
- PGO with real data is better than synthetic-only.
  Synthetic benchmarks exercise all code paths equally. Real ZODB data has strong type skew (70% PersistentMapping, 20% OOBucket) that PGO exploits for better branch prediction. Always include real data in PGO profiles.
- Thread-local buffers compound across transactions.
  The first encode in a transaction pays the allocation cost; all subsequent encodes are nearly free. For large transactions (bulk imports, catalog rebuilds), this is a significant win.
- Small-set linear search beats HashMap.
  For the class pickle cache (5-10 entries), linear search over a Vec is faster than HashMap because there is no hash computation and the entire array fits in L1 cache. Only switch to HashMap if the working set exceeds ~20 entries.