mud/docs/how/zmachine-garbled-output-investigation.rst

Z-Machine Garbled Output Investigation
========================================

Context
-------

Branch: ``zmachine-zork1-integration``

The hybrid z-machine interpreter runs Zork 1 but produces garbled output:

- Room name "West of House" prints correctly
- Room description shows "er " instead of full text
- After "look" command, output starts correctly then becomes garbage with ? characters
- "There is a small mailbox here." prints correctly

Investigation
-------------

The z-string decoder (ZStringTranslator, ZCharTranslator, ZsciiTranslator in
zstring.py) is correct. The abbreviation table loads correctly. All 96
abbreviations decode to the right text.

The diagnostic script (``scripts/debug_zstrings.py``) revealed the problem:

- ``op_print_paddr paddr=0x0051 (byte=0x00a2)`` decodes to "er "
- Address 0x00a2 is in the abbreviation table area, not where the room
  description should be
- The z-string at 0x00a2 really is "er " - the decoder is right, the address
  is wrong

Root cause: bugs in ``zobjectparser.py`` property parsing.

Bugs Found
----------

All bugs are in ``zobjectparser.py``.

Bug 1: Off-by-one in shortname skip (ROOT CAUSE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``get_prop_addr_len()`` line 392::

    addr += 2 * self._memory[addr]

This skips the shortname data but not the length byte. Compare with
``get_all_properties()`` which does it correctly::

    shortname_length = self._memory[addr]
    addr += 1                      # skip length byte
    addr += 2 * shortname_length   # skip shortname data

The off-by-one means the property table scan starts 1 byte too early, reading
the last byte of the shortname z-string as if it were a property header. This
corrupts every subsequent property lookup.

Bug 2: BitField slice wrong width for V3 property number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``get_prop_addr_len()`` line 399 and ``get_all_properties()`` line 446::

    pnum = bf[4:0]

BitField slice ``[4:0]`` has start > stop, so it swaps to ``[0:4]`` = 4 bits.
But V3 property numbers are 5 bits (bits 0-4). Should be ``bf[0:5]``.

This means property numbers 16-31 are read as 0-15 (bit 4 is lost).

Bug 3: BitField slice wrong width for V3 property size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``get_prop_addr_len()`` line 400, ``get_all_properties()`` line 447, and
``get_property_length()`` line 555::

    size = bf[7:5] + 1

BitField slice ``[7:5]`` swaps to ``[5:7]`` = 2 bits. But V3 property sizes are
3 bits (bits 5-7). Should be ``bf[5:8]``.

This means properties with size > 4 bytes are read with the wrong size (e.g.,
size 5 reads as size 1), causing the property iterator to skip too few bytes.

Bug 4: Same BitField slice issue in V5 path
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``get_prop_addr_len()`` lines 409, 413::

    pnum = bf[5:0]    # swaps to [0:5] = 5 bits, should be [0:6] = 6 bits
    size = bf2[5:0]   # swaps to [0:5] = 5 bits, should be [0:6] = 6 bits

Compare with ``get_all_properties()`` V5 path which correctly uses ``bf[0:6]``.

Also in ``get_property_length()`` line 563: same ``bf2[5:0]`` issue.

How to Fix
----------

All fixes are in ``zobjectparser.py``:

1. Line 392: change ``addr += 2 * self._memory[addr]`` to::

       addr += 1 + 2 * self._memory[addr]

2. Lines 399, 446: change ``bf[4:0]`` to ``bf[0:5]``

3. Lines 400, 447, 555: change ``bf[7:5]`` to ``bf[5:8]``

4. Lines 409, 413, 563: change ``bf[5:0]`` to ``bf[0:6]``

The pattern: BitField uses Python slice semantics ``[start:stop)`` where the
number of bits extracted is ``stop - start``. The buggy code used ``[high:low]``
notation thinking it would extract bits high through low inclusive, but the swap
logic makes it extract fewer bits than intended.

This is the same class of bug as the write_global fix in commit e5329d6
(BitField slice extracting 7 bits not 8).

Diagnostic Tools
----------------

``scripts/debug_zstrings.py`` traces all z-string decoding with caller info.

Run with::

    echo -e "look\nquit\ny" | python3 scripts/debug_zstrings.py 2>/tmp/trace.log

Trace goes to stderr, game output to stdout.

Session 2: Additional Bugs Found and Fixed
-------------------------------------------

Bug 5: Same off-by-one in get_next_property
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``get_next_property()`` had the same shortname skip bug as Bug 1::

    addr += 2 * self._memory[addr]

Same fix: ``addr += 1 + 2 * self._memory[addr]``

Also had the same BitField slice bugs (Bugs 2, 4) for the V3/V5
property number return values.

Bug 6: get_property_length reads wrong byte for V5 two-byte headers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the V5 two-byte property header path, the code read the size from
the first header byte (which contains the property number) instead of
the second header byte (which contains the size)::

    bf2 = BitField(self._memory[data_address - 2])  # WRONG: first byte
    size = bf2[0:6]

The second byte was already read into ``bf`` at ``data_address - 1``.
Fix: use ``bf[0:6]`` directly instead of creating ``bf2``.

Bug 7: op_dec_chk and op_inc_chk use unsigned comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``zcpu.py``, both ``op_dec_chk`` and ``op_inc_chk`` compared 16-bit
values as unsigned integers instead of signed. The Z-machine spec
requires signed comparison.

Effect: when a counter decremented from 0 to 65535 (wrapping), the check
``65535 < 0`` evaluated false (unsigned), causing infinite loops in Z-code
that counted down past zero. This made Zork 1's word-printing loop run
forever, dumping raw memory bytes to the screen as characters.

Fix: wrap both operands with ``_make_signed()`` before comparing.

What's Fixed Now
-----------------

After all fixes:

- Room descriptions display correctly on startup
- The infinite garbage output loop is fixed (dec_chk now works with signed)
- ``just check`` passes (516 tests)

What's Still Broken
--------------------

Commands like "look", "open mailbox", "quit" all produce::

    You used the word "look" in a way that I don't understand.

The parser finds words in the dictionary (dict_addr is correct, e.g.
"look" at 0x44f2) and stores them in the parse buffer correctly. But the
game's Z-code parser can't match them to valid verb syntax.

Instruction tracing after ``op_sread`` shows the game's parser:

1. Reads parse buffer correctly (loadw(0x2551, 1) -> dict_addr 17650)
2. Checks dict_addr against specific word addresses (je comparisons) -
   these are special-word checks, none match
3. Calls a function with the dict_addr that reads ``loadb(17650, 4)`` to
   get the word's flags byte (0x41) from the dictionary entry
4. Tests individual flag bits (16, 64, 8, 32, 128, 4) against 0x41
5. Bit 6 (64) matches - the function extracts data from byte 5 (0xc2)
6. Despite finding the word type, the parser still rejects the command

The bug is likely in how the parser uses the extracted word-type data to
look up verb syntax. Possible areas:

- The ``and`` / ``je`` sequence that extracts a sub-field from the flags
  byte may produce the wrong value
- The verb number extracted from byte 5 (0xc2 = 194) may need signed
  interpretation or bit masking
- Another opcode (``op_jl``, ``op_jg``, ``op_sub``) may have a similar
  unsigned-vs-signed bug
- The ``call`` opcode may have incorrect return value handling

Debugging approach for next session:

1. Extend the instruction trace past 100 steps to see what happens after
   the word-type function returns
2. Compare the execution trace against dfrotz instruction trace (add
   ``-t`` flag to dfrotz for tracing)
3. Check ``op_jl`` and ``op_jg`` for the same unsigned comparison bug
4. Check ``op_sub`` and ``op_mul`` for signed arithmetic issues
5. Look at how the ``and`` result (step 21 in trace) flows through to
   the verb syntax lookup

Diagnostic Tools
~~~~~~~~~~~~~~~~

``scripts/debug_zstrings.py`` traces all z-string decoding with caller info.

Run with::

    echo -e "look\nquit\ny" | python3 scripts/debug_zstrings.py 2>/tmp/trace.log

Trace goes to stderr, game output to stdout.

To add instruction tracing after sread, add to ``zcpu.py``::

    # In step(), after decoding:
    print(f"PC={pc:#06x} {handler_name}({', '.join(str(a) for a in args)})",
          file=sys.stderr)

Session 3: Parser Fixed - Interpreter Now Playable
---------------------------------------------------

Bug 8: op_jl uses unsigned comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``zcpu.py``, ``op_jl`` (jump if less-than) compared raw 16-bit values
without ``_make_signed()``, the same class of bug as Bug 7 (op_dec_chk
and op_inc_chk).

Additionally, ``op_jl`` had a non-standard signature::

    def op_jl(self, a, *others):
        for b in others:
            if a < b:
                ...

The Z-machine spec says JL takes exactly 2 operands. Compare with
``op_jg`` which correctly uses ``def op_jg(self, a, b)``.

Fix: add ``_make_signed()`` calls to both operands, normalize signature
to ``(self, a, b)`` matching ``op_jg``.

This alone did not fix the parser - needed in combination with Bug 9.

Bug 9: finish_routine return value storage - THE PARSER BUG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Two sub-bugs in ``finish_routine()`` in ``zstackmanager.py``:

**Range check used decimal instead of hex**::

    if result_variable < 10:  # WRONG: should be 0x10 (16)
        self.set_local_variable(result_variable, result)

Z-machine variable numbering: 0 = stack push, 1-15 = locals, 16+ = globals.
The check ``< 10`` meant local variables 10-15 were written as globals.

**Missing index adjustment for 1-based variable numbering**::

    self.set_local_variable(result_variable, result)  # WRONG

Z-machine variables are 1-based (variable 1 = first local), but
``set_local_variable()`` uses 0-based indexing (index 0 = first local).
Should be ``set_local_variable(result_variable - 1, result)``.

Both ``_read_variable()`` and ``_write_result()`` in ``zcpu.py``
correctly used ``addr - 1`` and ``< 0x10``, but ``finish_routine()``
in ``zstackmanager.py`` did not.

Effect: when a function returned a value to a local variable, it went
to the wrong slot. The caller read the correct slot and got stale data
(0 or the initialization value from the function call).

This is why every verb was recognized by the dictionary but rejected by
the parser. The word-type checker function returned the correct type,
but the return value landed in the wrong local variable. The parser saw
0 and could not match any syntax pattern.

Room descriptions worked because those code paths used stack returns
(variable 0) or global returns, not local variable returns.

Fix: change ``< 10`` to ``< 0x10``, add ``- 1`` to the
``set_local_variable`` call.

Bug 10: zlexer does not truncate words for dictionary lookup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``zlexer.py``, V3 dictionary entries store words truncated to 6
characters (4 bytes of Z-string = 2 Z-words). V4+ uses 9 characters.

``parse_input()`` looked up full-length input words, so "mailbox" (7
characters) did not match "mailbo" (6 characters) in the dictionary.

Fix: truncate lookup key to ``6 if version <= 3 else 9`` before
``dict.get()``. The original word is preserved in the return list for
correct parse buffer positions.

What's Fixed Now
~~~~~~~~~~~~~~~~

After all fixes:

- All Zork 1 commands work: look, open mailbox, read leaflet, go north,
  inventory, quit, take, drop
- Navigation, object manipulation, multi-word commands, game logic all
  functional
- The interpreter is playable

Diagnostic Tools
----------------

``scripts/debug_zstrings.py`` traces all z-string decoding with caller info.

Run with::

    echo -e "look\nquit\ny" | python3 scripts/debug_zstrings.py 2>/tmp/trace.log

Trace goes to stderr, game output to stdout.

To add instruction tracing after sread, add to ``zcpu.py``::

    # In step(), after decoding:
    print(f"PC={pc:#06x} {handler_name}({', '.join(str(a) for a in args)})",
          file=sys.stderr)