utf-8's ascii bias and what it means for muds
==============================================

the short version
-----------------

utf-8 was designed to be backwards compatible with ascii (a 1963 american
standard). that compatibility is baked into the bit structure of every byte.
english text passes through at 1 byte per character with zero overhead; every
other language pays extra::

    ascii (english letters, digits)   1 byte
    latin accented chars (é, ö, ñ)    2 bytes
    cjk (chinese, japanese, korean)   3 bytes
    emoji, historical scripts         4 bytes

compare native cjk encodings like big5 or gbk, where those same characters
are 2 bytes: utf-8 makes cjk text ~50% larger than its native encoding. the
entire single-byte space (0xxxxxxx, high bit clear) is reserved for the 128
ascii characters, which are overwhelmingly english/american.

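the per-character costs above are easy to verify directly (python shown here;
any language with explicit utf-8 encoding works the same):

```python
# byte cost per character class under utf-8
samples = [
    ("ascii 'a'", "a"),
    ("latin 'é'", "\u00e9"),
    ("cjk '中'", "\u4e2d"),
    ("emoji '😀'", "\U0001f600"),
]
for label, ch in samples:
    print(label, "->", len(ch.encode("utf-8")), "byte(s)")
```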
why this matters for muds
-------------------------

old muds from the 80s/90s used latin-1 (iso 8859-1), which encodes accented
characters (é, å, ç, ö) as single bytes in the 128-255 range. latin-1 worked
fine when every terminal also spoke latin-1.

utf-8 replaced latin-1 as the default everywhere, but utf-8 is only backwards
compatible with the ascii subset (bytes 0-127). the upper half of latin-1 is
encoded differently in utf-8 (as 2-byte sequences). when a utf-8 terminal
connects to a latin-1 mud, the accented characters come through as mojibake.

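the breakage runs in both directions, and both are two lines of python to
reproduce:

```python
# a latin-1 mud sends 'é' as the single byte 0xe9; a utf-8
# terminal can't decode that lone byte and shows U+FFFD
latin1 = "café".encode("latin-1")                  # b'caf\xe9'
print(latin1.decode("utf-8", errors="replace"))    # caf�

# the reverse: utf-8's 2-byte sequence 0xc3 0xa9 read by a
# latin-1 client splits into two wrong characters
utf8 = "café".encode("utf-8")                      # b'caf\xc3\xa9'
print(utf8.decode("latin-1"))                      # cafÃ©
```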
nobody goes back to update ancient mud codebases, so the accented characters
just broke. the path of least resistance is to strip out accents and write in
pure ascii. swedish scene folks had a term for it: "dumb swedish" -- writing
swedish without accented characters, like someone who doesn't know the language
properly. the same thing happens with portuguese, french, german, any language
that relied on latin-1's upper range.

these muds aren't limited to ascii by design; they used latin-1 just fine for
years. the problem is that utf-8 broke compatibility with those bytes, and since
muds write to raw sockets, there are several layers working against getting
printf("swedish") to come out correctly on the other end.

the history
-----------

utf-8 was designed by ken thompson on a placemat in a new jersey diner one
evening in september 1992. rob pike was there cheering him on. they went back
to bell labs after dinner, and by the following monday they had plan 9 running
(and only running) utf-8. the full system conversion took less than a week.

the key design criterion that distinguished their version from the competing
fss-utf proposal was #6: "it should be possible to find the start of a character
efficiently starting from an arbitrary location in a byte stream." the original
proposal lacked self-synchronization. thompson and pike's version has it -- any
byte that doesn't match 10xxxxxx is the start of a character.

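criterion #6 in practice is a short backward scan; a minimal sketch:

```python
def char_start(buf: bytes, i: int) -> int:
    """back up from an arbitrary offset to the start of the utf-8
    sequence containing it: continuation bytes are 10xxxxxx, and
    any other byte begins a character."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = "naïve".encode("utf-8")    # b'na\xc3\xafve'
# offset 3 lands on the continuation byte 0xaf; the character
# ('ï') starts one byte earlier, at offset 2
assert char_start(buf, 3) == 2
```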
the bit packing::

    0xxxxxxx                             1 byte,  7 free bits
    110xxxxx 10xxxxxx                    2 bytes, 11 free bits
    1110xxxx 10xxxxxx 10xxxxxx           3 bytes, 16 free bits
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes, 21 free bits

the number of leading 1s in the first byte tells you how many bytes are in the
sequence. simple, self-synchronizing, ascii-compatible. it won because of that
ascii compatibility -- pragmatic adoption beat fairness to non-latin scripts.

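reading the length from the leading 1s is a few lines (a sketch that assumes a
valid start byte -- no validation of malformed input):

```python
def seq_len(first_byte: int) -> int:
    """sequence length from the count of leading 1 bits in the
    first byte. no leading 1s means a 1-byte ascii character."""
    if first_byte < 0x80:               # 0xxxxxxx
        return 1
    n = 0
    while first_byte & (0x80 >> n):     # count leading 1 bits
        n += 1
    return n                            # 110.. -> 2, 1110.. -> 3, 11110.. -> 4

assert seq_len("a".encode("utf-8")[0]) == 1
assert seq_len("é".encode("utf-8")[0]) == 2
assert seq_len("中".encode("utf-8")[0]) == 3
assert seq_len("😀".encode("utf-8")[0]) == 4
```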
the alternative was utf-16 (2 bytes for most characters, fairer to cjk), but
it's not ascii-compatible at all. pragmatism won.

source: rob pike's email from april 2003, correcting the record that ibm
designed utf-8. "UTF-8 was designed, in front of my eyes, on a placemat in a
New Jersey diner one night in September or so 1992."

the encoding design space
-------------------------

utf-8 couldn't have done the fairness thing without giving up the one property
that made it win. the entire first-byte design (0xxxxxxx = ascii) is what makes
it backwards compatible. give that up to make room for 2-byte cjk and you
basically reinvent utf-16 but worse.

the three real options:

utf-8: english wins, everyone else pays more. but ascii-compatible, so every
existing unix tool, every C string function, every file path, every
null-terminated string just works. that's why it won.

utf-16: roughly fair across living languages. the entire basic multilingual
plane (latin, cyrillic, arabic, hebrew, greek, cjk -- basically everything
people actually type) is 2 bytes flat. supplementary stuff (emoji, historical
scripts) goes to 4 bytes via surrogate pairs. cjk goes from 3 bytes (utf-8)
down to 2, english goes from 1 byte up to 2. java and windows chose this
internally. but it breaks every C string assumption (null bytes everywhere in
english text), has byte-order issues (big endian? little endian? here's a BOM
to sort it out), and it's STILL variable-length because of surrogate pairs, so
you don't even get O(1) indexing.

utf-32: perfectly fair, perfectly wasteful. everything is 4 bytes. dead simple
to index into. but english text is 4x larger, and cjk is 2x larger than its
native encodings. nobody wants that for storage or transmission.

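the per-character trade-off across the three options can be printed directly
(the -le variants avoid counting the BOM that python's plain utf-16/utf-32
codecs prepend):

```python
# bytes per character under each of the three encodings
for ch in ("a", "é", "中", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1 / 2 / 3 / 4
          len(ch.encode("utf-16-le")),  # 2 / 2 / 2 / 4 (surrogate pair)
          len(ch.encode("utf-32-le")))  # always 4
```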
thompson and pike's design criteria were about unix filesystem safety and ascii
compatibility. fairness across scripts wasn't on the list -- criterion #1 was
"don't break /" and criterion #2 was "no ascii bytes hiding inside multibyte
sequences." the encoding is optimized for a world where the existing
infrastructure was ascii, and the goal was to extend it without breaking
anything.

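criterion #2 is checkable in a loop: every byte of a multibyte sequence, lead
and continuation alike, has the high bit set, so an ascii byte like '/' (0x2f)
can never appear inside one:

```python
# no ascii bytes hide inside multibyte sequences: all of their
# bytes are >= 0x80, so '/' (0x2f) and NUL (0x00) stay unambiguous
for ch in ("é", "中", "😀"):
    encoded = ch.encode("utf-8")
    assert all(b >= 0x80 for b in encoded)
    assert 0x2F not in encoded
```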
the irony is that utf-16 was supposed to be the "real" unicode encoding (it was
originally fixed-width at 2 bytes, when unicode only had 65536 codepoints), and
utf-8 was supposed to be the filesystem-safe hack. but utf-8's unix
compatibility made it take over the web, and utf-16 got stuck as an internal
representation in java and windows.

what we do about it
-------------------

telnetlib3 handles charset negotiation (see charset-vs-mtts.rst). we default to
utf-8 and detect client capabilities via mtts. for clients that don't negotiate,
we should assume utf-8 and accept that legacy latin-1 clients will see garbage
for anything outside ascii. that's the world we live in.

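one wrinkle when assuming utf-8 on a raw socket: a multibyte sequence can be
split across tcp reads, so decoding has to be incremental. a minimal sketch
using the stdlib (this illustrates the principle, not telnetlib3's actual
internals):

```python
import codecs

# buffers partial sequences between reads; errors="replace"
# turns genuinely invalid input (e.g. latin-1 bytes) into
# U+FFFD instead of killing the session
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

raw = "héllo".encode("utf-8")       # b'h\xc3\xa9llo'
first, second = raw[:2], raw[2:]    # split mid-'é' (0xc3 | 0xa9)
out = decoder.decode(first) + decoder.decode(second)
assert out == "héllo"               # survives the split
```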
see also
--------

- charset-vs-mtts.rst in this directory
- rob pike's utf-8 history email (2003): search "UTF-8 history rob pike"
- the original fss-utf proposal from ken thompson's archives (sep 2 1992)
- unicode.org utf-8 spec