From ad47ee05bd5b28a357f8c70708ca2c4acfc9c795 Mon Sep 17 00:00:00 2001 From: Jared Miller Date: Tue, 10 Feb 2026 15:11:35 -0500 Subject: [PATCH] Document z-machine performance analysis and optimization roadmap --- docs/how/if-journey.rst | 2 + docs/how/zmachine-performance.rst | 120 ++++++++++++++++++++++++++++++ 2 files changed, 122 insertions(+) create mode 100644 docs/how/zmachine-performance.rst diff --git a/docs/how/if-journey.rst b/docs/how/if-journey.rst index 1c63cce..8b07e3d 100644 --- a/docs/how/if-journey.rst +++ b/docs/how/if-journey.rst @@ -338,6 +338,8 @@ What was needed: Lost Pig trace: 101K instructions, 61 unique opcodes, full gameplay loop working (room descriptions, parser, object manipulation). ``scripts/trace_lostpig.py`` for tracing. +Performance profiling and optimization work is documented in ``docs/how/zmachine-performance.rst``. Baseline: ~30K opcodes/command at 60 ops/ms (500ms steady-state). Optimization brought cold-start from 4720ms to 786ms (6x improvement). + What this enables: - modern IF games compiled with Inform 6/7 to Z-machine V5/V8 format are now playable diff --git a/docs/how/zmachine-performance.rst b/docs/how/zmachine-performance.rst new file mode 100644 index 0000000..8618e5e --- /dev/null +++ b/docs/how/zmachine-performance.rst @@ -0,0 +1,120 @@ +Z-Machine Interpreter Performance +================================== + +The Problem +----------- + +Lost Pig (V8, ~30K opcodes per command) was taking ~500ms steady-state and +~4700ms cold-start. Zork (V3) felt fast because it uses dfrotz (C subprocess), +but Lost Pig uses the embedded Python interpreter. Pure Python was interpreting +~30K opcodes at 60 ops/ms. + + +Profiling Results (Baseline) +----------------------------- + +103K steps for 3 commands, 9.6M function calls total. + +Top time consumers from cProfile: + +- Logging/tracing: ~1.14s (22%) + + log(), debug(), isEnabledFor(), log_disasm(), str.join for trace strings. 
+  Even with logging disabled, there are still 1M+ calls to log() that check
+  isEnabledFor() and return.
+
+- BitField: ~0.67s (13%)
+
+  878K __getitem__ calls, 304K __init__ allocations, 1.8M isinstance() calls.
+
+- Memory bounds checking: ~0.55s (11%)
+
+  512K _check_bounds() calls on every memory read.
+
+Total overhead: ~2.36s out of 5.1s interpreter time (46%).
+
+
+What We Optimized
+-----------------
+
+- step_fast(): hot loop without trace/log string building. Main run() uses
+  this by default.
+
+- Pre-resolved dispatch table: opcode handlers resolved once at init, no
+  per-instruction isinstance/version checks.
+
+- Inline bit operations: replaced BitField object allocation + slice ops with
+  direct & and >> in the opcode decoder, branch offsets, _make_signed, and
+  type byte parsing.
+
+Results: cold start 4720ms -> 786ms (6x), steady state ~500ms -> ~460ms (~8%).
+
+
+Future Work (Prioritized)
+--------------------------
+
+1. Strip log() calls from opcode decoder internals (zopdecoder.py)
+
+   974K calls still executing in the fast path. These are inside
+   _parse_operand, get_branch_offset, _parse_opcode_*, _get_pc, _write_result,
+   etc. Even with logging disabled, the function call + isEnabledFor check
+   costs ~0.5s.
+
+2. Remove memory bounds checking on internal access
+
+   _check_bounds in zmemory.__getitem__ — 512K calls, ~0.28s. Trust the story
+   file for the internal decode path; keep bounds checks only at the API
+   boundary.
+
+3. Local variable caching in hot methods
+
+   Cache self._memory and self._stack as locals in tight loops to avoid
+   repeated attribute lookups.
+
+4. BitField still used in zobjectparser.py
+
+   get_all_properties, get_prop_addr_len, get_attribute — 200K+ calls remain.
+
+5. Operand parsing local bindings
+
+   Cache pop_stack, get_local_variable, and read_global as locals.
+
+6. Consider bytearray/memoryview for memory access
+
+   Use these for raw byte reads instead of the custom __getitem__.
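To make the "inline bit operations" change concrete, here is a minimal sketch of decoding a VAR-form operand type byte with direct ``&`` and ``>>`` instead of allocating a BitField object per instruction. The function name and structure are illustrative, not the interpreter's actual code; the 2-bit operand-type encoding (with ``0b11`` meaning "omitted") follows the Z-Machine Standards Document.

```python
# Hypothetical sketch: decode a VAR-form operand type byte with plain
# integer bit ops, avoiding per-instruction object allocation.

# Z-machine operand types, two bits each, packed high-to-low in one byte
LARGE_CONST, SMALL_CONST, VARIABLE, OMITTED = 0, 1, 2, 3

def parse_type_byte(b):
    """Return the 2-bit operand type fields, stopping at the first OMITTED."""
    types = []
    for shift in (6, 4, 2, 0):
        t = (b >> shift) & 0x3  # extract one 2-bit field
        if t == OMITTED:
            break
        types.append(t)
    return types

# 0b00_01_10_11 -> large const, small const, variable, then omitted
print(parse_type_byte(0b00011011))  # [0, 1, 2]
```

The same pattern (shift, mask, compare) replaces slice-style BitField access in branch-offset and signed-value handling without any temporary objects.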
+ + +Measurement Tools +----------------- + +- scripts/time_lostpig.py + + Per-command wall clock + opcode count, shows ops/ms rate. + +- scripts/profile_lostpig.py + + cProfile wrapper, shows top functions by cumtime and tottime. + +- scripts/trace_lostpig.py + + Step-by-step opcode tracer with frequency counter. + + +Architecture Notes +------------------ + +- step() (debug) vs step_fast() (production) + + step() still exists with full trace/logging for debugging. + +- _build_dispatch_table() runs at init + + Pre-resolves version-specific handlers. + +- Dispatch table structure + + Dict of lists: dispatch[opcode_class][opcode_number] = (implemented, func) + or None. + +- Threading overhead + + The embedded interpreter runs in a thread; async plumbing (Event.wait) adds + negligible overhead — the bottleneck is pure z-machine execution time.
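The dict-of-lists dispatch shape described above can be sketched as follows. The handler and builder names here are hypothetical stand-ins, not the interpreter's real API; only the ``(implemented, func)``-or-``None`` entry shape and the one-time resolution at init mirror these notes. The opcode number for ``add`` (2OP:20, i.e. 0x14) and the 16-bit wraparound arithmetic follow the Z-Machine Standards Document.

```python
# Hypothetical sketch of the pre-resolved dispatch table: a dict of lists,
# where dispatch[opcode_class][opcode_number] = (implemented, func) or None.

def op_add(a, b):
    # Z-machine arithmetic is 16-bit; mask to the stored representation
    return (a + b) & 0xFFFF

def build_dispatch_table():
    # Resolved once at init, so the hot loop does no isinstance/version checks
    dispatch = {
        "2OP": [None] * 32,
        "1OP": [None] * 16,
        "0OP": [None] * 16,
        "VAR": [None] * 32,
    }
    dispatch["2OP"][0x14] = (True, op_add)  # 2OP:20 is add in the Z-spec
    return dispatch

dispatch = build_dispatch_table()

# Hot-loop lookup: one dict access, one list index, one tuple unpack
entry = dispatch["2OP"][0x14]
if entry is not None and entry[0]:
    implemented, func = entry
    print(func(0xFFFF, 2))  # wraps around to 1
```

Unimplemented or illegal opcodes stay ``None``, so the fast path can distinguish "not an opcode" from "known but unimplemented" with a single indexed lookup.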