From ad47ee05bd5b28a357f8c70708ca2c4acfc9c795 Mon Sep 17 00:00:00 2001 From: Jared Miller Date: Tue, 10 Feb 2026 15:11:35 -0500 Subject: [PATCH] Document z-machine performance analysis and optimization roadmap --- docs/how/if-journey.rst | 2 + docs/how/zmachine-performance.rst | 120 ++++++++++++++++++++++++++++++ 2 files changed, 122 insertions(+) create mode 100644 docs/how/zmachine-performance.rst diff --git a/docs/how/if-journey.rst b/docs/how/if-journey.rst index 1c63cce..8b07e3d 100644 --- a/docs/how/if-journey.rst +++ b/docs/how/if-journey.rst @@ -338,6 +338,8 @@ What was needed: Lost Pig trace: 101K instructions, 61 unique opcodes, full gameplay loop working (room descriptions, parser, object manipulation). ``scripts/trace_lostpig.py`` for tracing. +Performance profiling and optimization work is documented in ``docs/how/zmachine-performance.rst``. Baseline: ~30K opcodes/command at 60 ops/ms (500ms steady-state). Optimization brought cold-start from 4720ms to 786ms (6x improvement). + What this enables: - modern IF games compiled with Inform 6/7 to Z-machine V5/V8 format are now playable diff --git a/docs/how/zmachine-performance.rst b/docs/how/zmachine-performance.rst new file mode 100644 index 0000000..8618e5e --- /dev/null +++ b/docs/how/zmachine-performance.rst @@ -0,0 +1,120 @@ +Z-Machine Interpreter Performance +================================== + +The Problem +----------- + +Lost Pig (V8, ~30K opcodes per command) was taking ~500ms steady-state and +~4700ms cold-start. Zork (V3) felt fast because it uses dfrotz (C subprocess), +but Lost Pig uses the embedded Python interpreter. Pure Python was interpreting +~30K opcodes at 60 ops/ms. + + +Profiling Results (Baseline) +----------------------------- + +103K steps for 3 commands, 9.6M function calls total. + +Top time consumers from cProfile: + +- Logging/tracing: ~1.14s (22%) + + log(), debug(), isEnabledFor(), log_disasm(), str.join for trace strings. 
+  Even with logging disabled, there are still 1M+ calls to log() that check
+  isEnabledFor() and return.
+
+- BitField: ~0.67s (13%)
+
+  878K __getitem__ calls, 304K __init__ allocations, 1.8M isinstance() calls.
+
+- Memory bounds checking: ~0.55s (11%)
+
+  512K _check_bounds() calls on every memory read.
+
+Total overhead: ~2.36s out of 5.1s interpreter time (46%).
+
+
+What We Optimized
+-----------------
+
+- step_fast(): hot loop without trace/log string building. Main run() uses
+  this by default.
+
+- Pre-resolved dispatch table: opcode handlers resolved once at init, no
+  per-instruction isinstance/version checks.
+
+- Inline bit operations: replaced BitField object allocation + slice ops with
+  direct & and >> in the opcode decoder, branch offsets, _make_signed, and
+  type byte parsing.
+
+Results: cold start 4720ms -> 786ms (6x), steady state ~500ms -> ~460ms (~8%).
+
+
+Future Work (Prioritized)
+--------------------------
+
+1. Strip log() calls from opcode decoder internals (zopdecoder.py)
+
+   974K calls still executing in the fast path. These are inside
+   _parse_operand, get_branch_offset, _parse_opcode_*, _get_pc, _write_result,
+   etc. Even with logging disabled, the function call + isEnabledFor check
+   costs ~0.5s.
+
+2. Remove memory bounds checking on internal access
+
+   _check_bounds in zmemory.__getitem__ — 512K calls, ~0.28s. Trust the story
+   file for the internal decode path; keep bounds checks only at the API
+   boundary.
+
+3. Local variable caching in hot methods
+
+   Cache self._memory and self._stack as locals in tight loops to avoid
+   repeated attribute lookups.
+
+4. BitField still used in zobjectparser.py
+
+   get_all_properties, get_prop_addr_len, get_attribute — 200K+ calls remain.
+
+5. Operand parsing local bindings
+
+   Cache pop_stack, get_local_variable, and read_global as locals.
+
+6. Consider bytearray/memoryview for memory access
+
+   Use these for raw byte reads instead of the custom __getitem__.
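To make the "inline bit operations" change concrete, here is a minimal sketch of decoding a VAR-form operand type byte with direct ``&`` and ``>>`` instead of allocating a BitField object per instruction. The function name and structure are illustrative, not the interpreter's actual code; the 2-bit operand-type encoding (with ``0b11`` meaning "omitted") follows the Z-Machine Standards Document.

```python
# Hypothetical sketch: decode a VAR-form operand type byte with plain
# integer bit ops, avoiding per-instruction object allocation.

# Z-machine operand types, two bits each, packed high-to-low in one byte
LARGE_CONST, SMALL_CONST, VARIABLE, OMITTED = 0, 1, 2, 3

def parse_type_byte(b):
    """Return the 2-bit operand type fields, stopping at the first OMITTED."""
    types = []
    for shift in (6, 4, 2, 0):
        t = (b >> shift) & 0x3  # extract one 2-bit field
        if t == OMITTED:
            break
        types.append(t)
    return types

# 0b00_01_10_11 -> large const, small const, variable, then omitted
print(parse_type_byte(0b00011011))  # [0, 1, 2]
```

The same pattern (shift, mask, compare) replaces slice-style BitField access in branch-offset and signed-value handling without any temporary objects.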
+ + +Measurement Tools +----------------- + +- scripts/time_lostpig.py + + Per-command wall clock + opcode count, shows ops/ms rate. + +- scripts/profile_lostpig.py + + cProfile wrapper, shows top functions by cumtime and tottime. + +- scripts/trace_lostpig.py + + Step-by-step opcode tracer with frequency counter. + + +Architecture Notes +------------------ + +- step() (debug) vs step_fast() (production) + + step() still exists with full trace/logging for debugging. + +- _build_dispatch_table() runs at init + + Pre-resolves version-specific handlers. + +- Dispatch table structure + + Dict of lists: dispatch[opcode_class][opcode_number] = (implemented, func) + or None. + +- Threading overhead + + The embedded interpreter runs in a thread; async plumbing (Event.wait) adds + negligible overhead — the bottleneck is pure z-machine execution time.
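The dict-of-lists dispatch shape described above can be sketched as follows. The handler and builder names here are hypothetical stand-ins, not the interpreter's real API; only the ``(implemented, func)``-or-``None`` entry shape and the one-time resolution at init mirror these notes. The opcode number for ``add`` (2OP:20, i.e. 0x14) and the 16-bit wraparound arithmetic follow the Z-Machine Standards Document.

```python
# Hypothetical sketch of the pre-resolved dispatch table: a dict of lists,
# where dispatch[opcode_class][opcode_number] = (implemented, func) or None.

def op_add(a, b):
    # Z-machine arithmetic is 16-bit; mask to the stored representation
    return (a + b) & 0xFFFF

def build_dispatch_table():
    # Resolved once at init, so the hot loop does no isinstance/version checks
    dispatch = {
        "2OP": [None] * 32,
        "1OP": [None] * 16,
        "0OP": [None] * 16,
        "VAR": [None] * 32,
    }
    dispatch["2OP"][0x14] = (True, op_add)  # 2OP:20 is add in the Z-spec
    return dispatch

dispatch = build_dispatch_table()

# Hot-loop lookup: one dict access, one list index, one tuple unpack
entry = dispatch["2OP"][0x14]
if entry is not None and entry[0]:
    implemented, func = entry
    print(func(0xFFFF, 2))  # wraps around to 1
```

Unimplemented or illegal opcodes stay ``None``, so the fast path can distinguish "not an opcode" from "known but unimplemented" with a single indexed lookup.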