Z-Machine Interpreter Performance
=================================

The Problem
-----------

Lost Pig (V8, ~30K opcodes per command) was taking ~500ms steady-state and
~4700ms cold-start. Zork (V3) felt fast because it uses dfrotz (a C
subprocess), but Lost Pig uses the embedded Python interpreter. Pure Python
was interpreting ~30K opcodes at 60 ops/ms.

Profiling Results (Baseline)
----------------------------

103K steps for 3 commands, 9.6M function calls total. Top time consumers from
cProfile:

- Logging/tracing: ~1.14s (22%)
  log(), debug(), isEnabledFor(), log_disasm(), str.join for trace strings.
  Even with logging disabled, 1M+ calls to log() that check isEnabledFor()
  and return.
- BitField: ~0.67s (13%)
  878K __getitem__ calls, 304K __init__ allocations, 1.8M isinstance() calls.
- Memory bounds checking: ~0.55s (11%)
  512K _check_bounds() calls on every memory read.

Total overhead: ~2.36s out of 5.1s interpreter time (46%).

What We Optimized
-----------------

- step_fast(): hot loop without trace/log string building. Main run() uses
  this by default.
- Pre-resolved dispatch table: opcode handlers resolved once at init; no
  per-instruction isinstance/version checks.
- Inline bit operations: replaced BitField object allocation + slice ops with
  direct & and >> in the opcode decoder, branch offset, _make_signed, and
  type byte parsing.

Results: cold start 4720ms -> 786ms (6x); steady state ~500ms -> ~460ms (~8%).

Future Work (Prioritized)
-------------------------

1. Strip log() calls from opcode decoder internals (zopdecoder.py)
   974K calls still executing in the fast path. These are inside
   _parse_operand, get_branch_offset, _parse_opcode_*, _get_pc,
   _write_result, etc. Even with logging disabled, the function call +
   isEnabledFor check costs ~0.5s.
2. Remove memory bounds checking on internal access
   _check_bounds in zmemory.__getitem__ — 512K calls, ~0.28s. Trust the
   story file for the internal decode path; keep bounds checks only at the
   API boundary.
3. Local variable caching in hot methods
   Cache self._memory and self._stack as locals in tight loops to avoid
   repeated attribute lookups.
4. BitField still used in zobjectparser.py
   get_all_properties, get_prop_addr_len, get_attribute — 200K+ calls remain.
5. Operand parsing local bindings
   Cache pop_stack, get_local_variable, and read_global as locals.
6. Consider bytearray/memoryview for memory access
   Instead of a custom __getitem__.

Measurement Tools
-----------------

- scripts/time_lostpig.py
  Per-command wall clock + opcode count; shows ops/ms rate.
- scripts/profile_lostpig.py
  cProfile wrapper; shows top functions by cumtime and tottime.
- scripts/trace_lostpig.py
  Step-by-step opcode tracer with frequency counter.

Architecture Notes
------------------

- step() (debug) vs step_fast() (production)
  step() still exists with full trace/logging for debugging.
- _build_dispatch_table() runs at init
  Pre-resolves version-specific handlers.
- Dispatch table structure
  Dict of lists: dispatch[opcode_class][opcode_number] = (implemented, func)
  or None.
- Threading overhead
  The embedded interpreter runs in a thread; async plumbing (Event.wait) adds
  negligible overhead — the bottleneck is pure z-machine execution time.
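The dispatch-table and local-variable-caching patterns above can be sketched with a toy VM. This is an illustrative sketch only: the class and opcode names here are hypothetical, not the project's actual zmachine code. It shows the two ideas together: handlers resolved once at init into an `(implemented, func)` table, and hot-loop attribute access hoisted into locals.

```python
class TinyVM:
    """Toy 2-opcode machine: 0 = ADD (pop two, push sum), 1 = HALT.

    Illustrative only — not the real Z-machine opcode set.
    """

    def __init__(self, code):
        self._memory = bytearray(code)
        self._stack = []
        self._pc = 0
        # Resolve handlers once at init, so the hot loop does a plain
        # list index + tuple unpack instead of per-instruction
        # isinstance/version checks.
        self._dispatch = self._build_dispatch_table()

    def _build_dispatch_table(self):
        # dispatch[opcode] = (implemented, bound_handler) or None,
        # mirroring the dispatch[opcode_class][opcode_number] layout.
        return [(True, self._op_add), (True, self._op_halt)]

    def _op_add(self):
        b = self._stack.pop()
        a = self._stack.pop()
        self._stack.append(a + b)
        return True   # keep running

    def _op_halt(self):
        return False  # stop the run loop

    def step_fast(self):
        # Local-variable caching: bind attributes once outside the loop
        # to avoid repeated self.* lookups per instruction.
        memory, dispatch = self._memory, self._dispatch
        pc, running = self._pc, True
        while running:
            entry = dispatch[memory[pc]]
            pc += 1
            if entry is None:
                raise NotImplementedError(f"opcode {memory[pc - 1]}")
            _implemented, handler = entry
            running = handler()
        self._pc = pc  # write the cached pc back once, at loop exit
```

Running `TinyVM([0, 1])` with `[2, 3]` pre-pushed executes ADD then HALT, leaving `[5]` on the stack. The same shape lets the real `step_fast()` skip trace-string building entirely, while the slower `step()` can keep full logging for debugging.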