Add a doc on utf8 design

Jared Miller 2026-02-09 15:37:21 -05:00
parent 6fd19d769b
commit 4b52051bed
Signed by: shmup
GPG key ID: 22B5C6D66A38B06C

utf-8's ascii bias and what it means for muds
==============================================
the short version
-----------------
utf-8 was designed to be backwards compatible with ascii (a 1963 american
standard). that compatibility is baked into the bit structure of every byte.
english text passes through at 1 byte per character with zero overhead. every
other language pays extra:
ascii (english letters, digits)      1 byte
latin accented chars (é, ö, ñ)       2 bytes
cjk (chinese, japanese, korean)      3 bytes
emoji, historical scripts            4 bytes
compare to native cjk encodings like big5 or gbk where those same characters
are 2 bytes. utf-8 makes cjk text ~50% larger than its native encoding. the
entire single-byte range (0xxxxxxx) is reserved for those 128 ascii
characters, which are overwhelmingly english/american.
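the cost table above is easy to verify with python's stdlib codecs (gbk ships
in the standard codec registry); a quick sketch:

```python
# byte cost per character in utf-8, one sample from each tier
samples = [
    ("ascii", "a"),
    ("latin accented", "\u00e9"),   # é
    ("cjk", "\u4e2d"),              # 中
    ("emoji", "\U0001f600"),        # 😀
]
for label, ch in samples:
    print(f"{label}: {len(ch.encode('utf-8'))} bytes")

# the same cjk character in a native encoding is 2 bytes
print("gbk:", len("\u4e2d".encode("gbk")), "bytes")
```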
why this matters for muds
-------------------------
old muds from the 80s/90s used latin-1 (iso 8859-1), which encodes accented
characters (é, à, ç, ö) as single bytes in the 128-255 range. latin-1 worked
fine when every terminal also spoke latin-1.
utf-8 replaced latin-1 as the default everywhere, but utf-8 is only backwards
compatible with the ascii subset (bytes 0-127). the upper half of latin-1 is
encoded differently in utf-8 (as 2-byte sequences). when a utf-8 terminal
connects to a latin-1 mud, the accented characters come through as mojibake.
nobody goes back to update ancient mud codebases, so the accented characters
just broke. the path of least resistance is to strip out accents and write in
pure ascii. swedish scene folks had a term for it: "dumb swedish" -- writing
swedish without accented characters, like someone who doesn't know the language
properly. same thing happens with portuguese, french, german, any language that
relied on latin-1's upper range.
these muds aren't limited to ascii by design. they used latin-1 just fine for
years. the problem is that utf-8 broke compatibility with those bytes, and since
muds write to raw sockets, there are several layers working against getting
printf("swedish") to come out correctly on the other end.
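both failure modes are easy to reproduce in python -- latin-1 bytes read as
utf-8 (what a modern client sees from an old mud), and utf-8 bytes read as
latin-1:

```python
text = "smörgåsbord"

# latin-1 mud -> utf-8 terminal: bytes 0x80-0xff aren't valid utf-8
# on their own, so clients substitute replacement characters
wire = text.encode("latin-1")
print(wire.decode("utf-8", errors="replace"))   # sm�rg�sbord

# utf-8 mud -> latin-1 terminal: each 2-byte sequence shows up as two
# latin-1 characters, the classic mojibake
print(text.encode("utf-8").decode("latin-1"))   # smÃ¶rgÃ¥sbord
```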
the history
-----------
utf-8 was designed by ken thompson on a placemat in a new jersey diner one
evening in september 1992. rob pike was there cheering him on. they went back
to bell labs after dinner, and by the following monday they had plan 9 running
(and only running) utf-8. the full system conversion took less than a week.
the key design criterion that distinguished their version from the competing
fss-utf proposal was #6: "it should be possible to find the start of a character
efficiently starting from an arbitrary location in a byte stream." the original
proposal lacked self-synchronization. thompson and pike's version has it -- any
byte that doesn't start with 10xxxxxx is the start of a character.
the bit packing:
0xxxxxxx                             1 byte,  7 free bits
110xxxxx 10xxxxxx                    2 bytes, 11 free bits
1110xxxx 10xxxxxx 10xxxxxx           3 bytes, 16 free bits
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes, 21 free bits
the number of leading 1s in the first byte tells you how many bytes are in the
sequence. simple, self-synchronizing, ascii-compatible. the alternative was
utf-16 (2 bytes for most characters, fairer to cjk), but it's not
ascii-compatible at all. utf-8 won on that compatibility -- pragmatic
adoption beat fairness to non-latin scripts.
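a minimal sketch of both properties -- sequence length from the leading 1s of
the first byte, and a backward scan for self-synchronization (the helper names
here are mine, not from any library):

```python
def seq_len(lead: int) -> int:
    """bytes in the sequence, from the leading 1s of a (valid) lead byte."""
    if lead < 0b1000_0000:                     # 0xxxxxxx
        return 1
    if lead < 0b1111_0000:
        return 2 if lead < 0b1110_0000 else 3  # 110xxxxx / 1110xxxx
    return 4                                   # 11110xxx

def char_start(buf: bytes, i: int) -> int:
    """self-synchronization: back up past 10xxxxxx continuation bytes."""
    while i > 0 and (buf[i] & 0b1100_0000) == 0b1000_0000:
        i -= 1
    return i

buf = "naïve 中 😀".encode("utf-8")
i = char_start(buf, 13)       # offset 13 lands inside the emoji
print(i, seq_len(buf[i]))     # 11 4
```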
source: rob pike's email from april 2003, correcting the record that ibm
designed utf-8. "UTF-8 was designed, in front of my eyes, on a placemat in a
New Jersey diner one night in September or so 1992."
the encoding design space
-------------------------
utf-8 couldn't have done the fairness thing without giving up the one property
that made it win. the entire first byte design (0xxxxxxx = ascii) is what makes
it backwards compatible. give that up to make room for 2-byte cjk and you
basically reinvent utf-16 but worse.
the three real options:
utf-8: english wins, everyone else pays more. but ascii-compatible, so every
existing unix tool, every C string function, every file path, every
null-terminated string just works. that's why it won.
utf-16: roughly fair across living languages. the entire basic multilingual
plane (latin, cyrillic, arabic, hebrew, greek, cjk -- basically everything
people actually type) is 2 bytes flat. supplementary stuff (emoji, historical
scripts) goes to 4 bytes via surrogate pairs. cjk goes from 3 bytes (utf-8)
down to 2, english goes from 1 byte up to 2. java and windows chose this
internally. but it breaks every C string assumption (null bytes everywhere in
english text), has byte-order issues (big endian? little endian? here's a BOM
to sort it out), and it's STILL variable-length because of surrogate pairs, so
you don't even get O(1) indexing.
utf-32: perfectly fair, perfectly wasteful. everything is 4 bytes. dead simple
to index into. but english text is 4x larger, cjk is 2x larger than their
native encodings. nobody wants that for storage or transmission.
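the trade-off shows up directly if you encode the same strings all three ways
(the -le variants skip the BOM python would otherwise prepend):

```python
for s in ("hello", "中文測試", "😀"):
    print(s,
          len(s.encode("utf-8")),
          len(s.encode("utf-16-le")),
          len(s.encode("utf-32-le")))
# english:  utf-8 wins  (5 vs 10 vs 20)
# cjk:      utf-16 wins (12 vs 8 vs 16)
# emoji:    4 bytes everywhere -- a surrogate pair in utf-16
```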
thompson and pike's design criteria were about unix filesystem safety and ascii
compatibility. fairness across scripts wasn't on the list -- criterion #1 was
"don't break /" and criterion #2 was "no ascii bytes hiding inside multibyte
sequences." the encoding is optimized for a world where the existing
infrastructure was ascii, and the goal was to extend it without breaking
anything.
the irony is that utf-16 was supposed to be the "real" unicode encoding (it was
originally fixed-width at 2 bytes when unicode only had 65536 codepoints), and
utf-8 was supposed to be the filesystem-safe hack. but utf-8's unix
compatibility made it take over the web, and utf-16 got stuck as an internal
representation in java and windows.
what we do about it
-------------------
telnetlib3 handles charset negotiation (see charset-vs-mtts.rst). we default to
utf-8 and detect client capabilities via mtts. for clients that don't negotiate,
we should assume utf-8 and accept that legacy latin-1 clients will see garbage
for anything outside ascii. that's the world we live in.
see also
--------
- charset-vs-mtts.rst in this directory
- rob pike's utf-8 history email (2003): search "UTF-8 history rob pike"
- the original fss-utf proposal from ken thompson's archives (sep 2 1992)
- unicode.org utf-8 spec