Document z-machine performance analysis and optimization roadmap
This commit is contained in:
parent
4f570ae1af
commit
ad47ee05bd
2 changed files with 122 additions and 0 deletions
@@ -338,6 +338,8 @@ What was needed:
Lost Pig trace: 101K instructions, 61 unique opcodes, full gameplay loop working (room descriptions, parser, object manipulation). ``scripts/trace_lostpig.py`` for tracing.

Performance profiling and optimization work is documented in ``docs/how/zmachine-performance.rst``. Baseline: ~30K opcodes/command at 60 ops/ms (500ms steady-state). Optimization brought cold-start from 4720ms to 786ms (6x improvement).

What this enables:

- modern IF games compiled with Inform 6/7 to Z-machine V5/V8 format are now playable
120
docs/how/zmachine-performance.rst
Normal file
@@ -0,0 +1,120 @@

Z-Machine Interpreter Performance
==================================

The Problem
-----------

Lost Pig (V8, ~30K opcodes per command) was taking ~500ms steady-state and
~4700ms cold-start. Zork (V3) felt fast because it uses dfrotz (C subprocess),
but Lost Pig uses the embedded Python interpreter. Pure Python was interpreting
~30K opcodes at 60 ops/ms.
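These figures are self-consistent; a quick arithmetic check of the steady-state number:

```python
# ~30K opcodes per command at a throughput of 60 opcodes per millisecond
opcodes_per_command = 30_000
throughput_ops_per_ms = 60

latency_ms = opcodes_per_command / throughput_ops_per_ms
print(latency_ms)  # 500.0 — matches the ~500ms steady-state figure
```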

Profiling Results (Baseline)
-----------------------------

103K steps for 3 commands, 9.6M function calls total.

Top time consumers from cProfile:

- Logging/tracing: ~1.14s (22%)

  log(), debug(), isEnabledFor(), log_disasm(), and str.join for trace
  strings. Even with logging disabled, 1M+ calls to log() still check
  isEnabledFor() and return.

- BitField: ~0.67s (13%)

  878K __getitem__ calls, 304K __init__ allocations, 1.8M isinstance() calls.

- Memory bounds checking: ~0.55s (11%)

  512K _check_bounds() calls on every memory read.

Total overhead: ~2.36s out of 5.1s interpreter time (46%).
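The logging line item is mostly call overhead: a disabled logger still pays for the function call and the isEnabledFor() check on every invocation. A minimal standalone demonstration of that pattern (logger name and loop body are illustrative; absolute timings vary by machine):

```python
import logging
import timeit

log = logging.getLogger("zmachine.sketch")
log.setLevel(logging.WARNING)  # DEBUG is disabled

def with_logger(n):
    for i in range(n):
        log.debug("pc=%04x", i)  # call still happens; isEnabledFor() says no

def with_guard(n, enabled=False):
    for i in range(n):
        if enabled:              # plain local check, no function call
            log.debug("pc=%04x", i)

# The disabled log.debug() call is typically several times slower than the
# guarded no-op; skipping the calls entirely is what a fast path exploits.
t_logger = timeit.timeit(lambda: with_logger(100_000), number=1)
t_guard = timeit.timeit(lambda: with_guard(100_000), number=1)
print(f"logger: {t_logger:.3f}s  guard: {t_guard:.3f}s")
```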

What We Optimized
-----------------

- step_fast(): hot loop without trace/log string building. The main run()
  loop uses this by default.

- Pre-resolved dispatch table: opcode handlers resolved once at init, no
  per-instruction isinstance/version checks.

- Inline bit operations: replaced BitField object allocation + slice ops with
  direct & and >> in the opcode decoder, branch offset handling, _make_signed,
  and type byte parsing.

Results: cold start 4720ms -> 786ms (6x), steady state ~500ms -> ~460ms (~8%).
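The inline-bit-operations change can be illustrated on two of the decode paths mentioned above. The constants follow the Z-machine specification (four 2-bit operand type fields per type byte, 16-bit signed words); the function names are illustrative, not the project's actual identifiers.

```python
def parse_type_byte(type_byte):
    """Decode a variable-form operand type byte: four 2-bit fields, most
    significant pair first; 0b11 means 'no more operands'."""
    types = []
    for shift in (6, 4, 2, 0):
        field = (type_byte >> shift) & 0b11  # direct shift/mask, no BitField object
        if field == 0b11:
            break
        types.append(field)
    return types

def make_signed(word):
    """Interpret a 16-bit word as a signed value, again with plain arithmetic."""
    return word - 0x10000 if word > 0x7FFF else word

print(parse_type_byte(0b00_01_10_11))  # [0, 1, 2] — large, small, variable, stop
print(make_signed(0xFFFF))             # -1
```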

Future Work (Prioritized)
--------------------------

1. Strip log() calls from opcode decoder internals (zopdecoder.py)

   974K calls still execute in the fast path, inside _parse_operand,
   get_branch_offset, _parse_opcode_*, _get_pc, _write_result, etc. Even with
   logging disabled, the function call + isEnabledFor check costs ~0.5s.

2. Remove memory bounds checking on internal access

   _check_bounds in zmemory.__getitem__ — 512K calls, ~0.28s. Trust the story
   file on the internal decode path; keep bounds checks only at the API
   boundary.

3. Local variable caching in hot methods

   Cache self._memory and self._stack as locals in tight loops to avoid
   repeated attribute lookups.

4. BitField still used in zobjectparser.py

   get_all_properties, get_prop_addr_len, get_attribute — 200K+ calls remain.

5. Operand parsing local bindings

   Cache pop_stack, get_local_variable, and read_global as locals.

6. Consider bytearray/memoryview for memory access

   Replace the custom __getitem__ with direct bytearray indexing or a
   memoryview over the story file.
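Items 3, 5, and 6 combine naturally in the hot read path. A sketch of what that could look like, under the assumption of a plain bytearray backing store; the class and method names are illustrative, not the actual zmemory API:

```python
import struct

class ZMemorySketch:
    """Illustrative only — not the project's actual zmemory class."""

    def __init__(self, story_bytes):
        # Plain bytearray instead of a custom __getitem__ with _check_bounds
        # (future-work item 6).
        self._memory = bytearray(story_bytes)

    def checksum_words(self, steps):
        # Cache attribute lookups and bound methods as locals once, before
        # the tight loop (future-work items 3 and 5).
        mem = self._memory
        read_word = struct.Struct(">H").unpack_from  # big-endian 16-bit word
        limit = len(mem) - 2
        pc = 0
        checksum = 0
        for _ in range(steps):
            (word,) = read_word(mem, pc)  # no per-read bounds check
            checksum = (checksum + word) & 0xFFFF
            pc = (pc + 2) % limit if limit > 0 else 0
        return checksum
```

For example, `ZMemorySketch(bytes(8)).checksum_words(5)` sums five all-zero words and returns 0.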

Measurement Tools
-----------------

- scripts/time_lostpig.py

  Per-command wall clock + opcode count; shows the ops/ms rate.

- scripts/profile_lostpig.py

  cProfile wrapper; shows top functions by cumtime and tottime.

- scripts/trace_lostpig.py

  Step-by-step opcode tracer with a frequency counter.

Architecture Notes
------------------

- step() (debug) vs step_fast() (production)

  step() still exists with full trace/logging for debugging.

- _build_dispatch_table() runs at init

  Pre-resolves version-specific handlers.

- Dispatch table structure

  Dict of lists: dispatch[opcode_class][opcode_number] = (implemented, func)
  or None.

- Threading overhead

  The embedded interpreter runs in a thread; async plumbing (Event.wait) adds
  negligible overhead — the bottleneck is pure z-machine execution time.
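The dispatch structure described above can be sketched as follows. The opcode class keys, handlers, and version logic are illustrative (2OP:0x14 is add and VAR:0x00 is call in the Z-machine spec), not the project's actual code:

```python
class InterpreterSketch:
    VERSION = 5

    def _op_add(self, a, b):
        # Z-machine arithmetic is 16-bit, so results wrap modulo 0x10000.
        return (a + b) & 0xFFFF

    def _op_call_v5(self, *args):
        return "call (v5 handler)"

    def _build_dispatch_table(self):
        # Dict of lists: dispatch[opcode_class][opcode_number] =
        # (implemented, func) or None.  Version-specific handlers are
        # resolved once here, so the per-instruction path is a plain double
        # index with no isinstance/version checks.
        dispatch = {"2OP": [None] * 32, "VAR": [None] * 32}
        dispatch["2OP"][0x14] = (True, self._op_add)
        if self.VERSION >= 5:
            dispatch["VAR"][0x00] = (True, self._op_call_v5)
        return dispatch

interp = InterpreterSketch()
table = interp._build_dispatch_table()
implemented, handler = table["2OP"][0x14]
print(handler(0xFFFF, 2))  # prints 1 (wraps in 16-bit arithmetic)
```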