utf-8's ascii bias and what it means for muds
==============================================

the short version
-----------------

utf-8 was designed to be backwards compatible with ascii (a 1963 american
standard). that compatibility is baked into the bit structure of every byte.
english text passes through at 1 byte per character with zero overhead; every
other language pays extra:

ascii (english letters, digits)    1 byte
latin accented chars (é, ö, ñ)     2 bytes
cjk (chinese, japanese, korean)    3 bytes
emoji, historical scripts          4 bytes

compare to native cjk encodings like big5 or gbk, where those same characters
are 2 bytes: utf-8 makes cjk text ~50% larger than its native encoding. the
entire single-byte space (0xxxxxxx, high bit clear) is reserved for the 128
ascii characters, which are overwhelmingly english/american.
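
the byte costs above are easy to verify from any python shell (the character
choices here are just illustrative):

```python
# byte cost per character in utf-8, one sample from each tier.
# len() of the encoded bytes is the storage cost.
samples = {
    "ascii": "a",   # english letter
    "latin": "é",   # accented latin
    "cjk":   "中",  # chinese character
    "emoji": "🎉",  # outside the basic multilingual plane
}

for tier, ch in samples.items():
    print(f"{tier:6} {ch!r} -> {len(ch.encode('utf-8'))} byte(s)")

# the same cjk character in its native encoding is only 2 bytes
print(len("中".encode("big5")))  # 2
```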

why this matters for muds
-------------------------

old muds from the 80s/90s used latin-1 (iso 8859-1), which encodes accented
characters (é, à, ç, ô) as single bytes in the 128-255 range. latin-1 worked
fine when every terminal also spoke latin-1.

utf-8 replaced latin-1 as the default everywhere, but utf-8 is only backwards
compatible with the ascii subset (bytes 0-127). the upper half of latin-1 is
encoded differently in utf-8 (as 2-byte sequences), so when a utf-8 terminal
connects to a latin-1 mud, the accented characters come through as mojibake.
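
concretely, a quick python sketch of the failure mode (the word choice is just
illustrative):

```python
# a latin-1 mud sends 'smörgås' as raw latin-1 bytes; ö and å become
# single bytes in the 128-255 range, which are invalid standalone utf-8.
wire = "smörgås".encode("latin-1")   # b'sm\xf6rg\xe5s'
print(wire)

# a utf-8 terminal decoding those bytes either errors out...
try:
    wire.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 decode failed:", e.reason)

# ...or shows replacement characters if it decodes leniently.
print(wire.decode("utf-8", errors="replace"))   # sm�rg�s
```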

nobody goes back to update ancient mud codebases, so the accented characters
just broke. the path of least resistance is to strip the accents and write in
pure ascii. swedish scene folks had a term for it: "dumb swedish" -- writing
swedish without accented characters, like someone who doesn't know the language
properly. the same thing happens with portuguese, french, german, and any other
language that relied on latin-1's upper range.

these muds aren't limited to ascii by design; they used latin-1 just fine for
years. the problem is that utf-8 broke compatibility with those bytes, and since
muds write to raw sockets, there are several layers working against getting
printf("swedish") to come out correctly on the other end.

the history
-----------

utf-8 was designed by ken thompson on a placemat in a new jersey diner one
evening in september 1992, with rob pike there cheering him on. they went back
to bell labs after dinner, and by the following monday they had plan 9 running
(and only running) utf-8. the full system conversion took less than a week.

the key design criterion that distinguished their version from the competing
fss-utf proposal was #6: "it should be possible to find the start of a character
efficiently starting from an arbitrary location in a byte stream." the original
proposal lacked self-synchronization. thompson and pike's version has it: any
byte that doesn't match the pattern 10xxxxxx is the start of a character.
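
that property is a few lines to exploit; a sketch in python (the function name
is ours, not from any library):

```python
# self-synchronization: drop into a utf-8 byte stream at any offset and
# back up to the nearest character boundary. continuation bytes all match
# 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
def char_start(buf: bytes, i: int) -> int:
    """back up from offset i to the start of the character containing it."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = "aö中".encode("utf-8")   # 1-byte + 2-byte + 3-byte characters
# byte offsets: a=0, ö=1..2, 中=3..5
print([char_start(buf, i) for i in range(len(buf))])  # [0, 1, 1, 3, 3, 3]
```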

the bit packing:

0xxxxxxx                              1 byte, 7 free bits
110xxxxx 10xxxxxx                     2 bytes, 11 free bits
1110xxxx 10xxxxxx 10xxxxxx            3 bytes, 16 free bits
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   4 bytes, 21 free bits

the number of leading 1s in the first byte tells you how many bytes are in the
sequence. simple, self-synchronizing, ascii-compatible. it won because of that
ascii compatibility -- pragmatic adoption beat fairness to non-latin scripts.
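
the table translates almost mechanically into code. a minimal encoder sketch,
checked against python's built-in codec:

```python
# hand-rolled utf-8 encoder following the bit-packing table above --
# a sketch, not a replacement for str.encode("utf-8").
def encode_codepoint(cp: int) -> bytes:
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "aé中🎉":
    assert encode_codepoint(ord(ch)) == ch.encode("utf-8")
print("matches python's built-in utf-8 encoder")
```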

the alternative was utf-16 (2 bytes for most characters, fairer to cjk), but
it's not ascii-compatible at all. pragmatism won.

source: rob pike's email from april 2003, correcting the record that ibm
designed utf-8: "UTF-8 was designed, in front of my eyes, on a placemat in a
New Jersey diner one night in September or so 1992."

the encoding design space
-------------------------

utf-8 couldn't have done the fairness thing without giving up the one property
that made it win. the entire first-byte design (0xxxxxxx = ascii) is what makes
it backwards compatible. give that up to make room for 2-byte cjk and you
basically reinvent utf-16, but worse.

the three real options:

utf-8: english wins, everyone else pays more. but it's ascii-compatible, so
every existing unix tool, every C string function, every file path, every
null-terminated string just works. that's why it won.

utf-16: roughly fair across living languages. the entire basic multilingual
plane (latin, cyrillic, arabic, hebrew, greek, cjk -- basically everything
people actually type) is 2 bytes flat. supplementary characters (emoji,
historical scripts) go to 4 bytes via surrogate pairs. cjk drops from 3 bytes
(utf-8) to 2, while english goes from 1 byte up to 2. java and windows chose it
internally. but it breaks every C string assumption (null bytes everywhere in
english text), has byte-order issues (big endian? little endian? here's a BOM
to sort it out), and it's STILL variable-length because of surrogate pairs, so
you don't even get O(1) indexing.
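
python's stdlib codecs make those trade-offs easy to poke at (character choices
are illustrative; the BOM bytes shown assume a little-endian machine):

```python
# utf-16 in practice: BOM, null bytes in english text, surrogate pairs.
print("ab".encode("utf-16"))     # BOM + a null byte per english character
print("中".encode("utf-16-le"))  # 2 bytes, vs 3 in utf-8
print("🎉".encode("utf-16-le"))  # 4 bytes: a surrogate pair

# python counts codepoints, so this is 1 -- but the same string is two
# utf-16 code units, which is why java and javascript report length 2.
print(len("🎉"))
```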

utf-32: perfectly fair, perfectly wasteful. everything is 4 bytes, and it's
dead simple to index into. but english text is 4x larger and cjk is 2x larger
than their native encodings; nobody wants that for storage or transmission.
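
a quick sketch of the one thing utf-32 buys you, fixed-width indexing:

```python
import struct

# in utf-32, character n lives at byte offset 4*n (with an explicit
# endianness suffix there is no BOM to skip).
text = "a中🎉"
raw = text.encode("utf-32-le")
print(len(raw))                  # 12 bytes for 3 characters

# index directly into the byte array -- impossible in utf-8 or utf-16
cp = struct.unpack_from("<I", raw, 4 * 1)[0]
print(chr(cp))                   # 中
```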

thompson and pike's design criteria were about unix filesystem safety and ascii
compatibility. fairness across scripts wasn't on the list -- criterion #1 was
"don't break /" and criterion #2 was "no ascii bytes hiding inside multibyte
sequences." the encoding is optimized for a world where the existing
infrastructure was ascii, and the goal was to extend it without breaking
anything.

the irony is that utf-16 was supposed to be the "real" unicode encoding (it was
originally fixed-width at 2 bytes, back when unicode had only 65536 codepoints),
and utf-8 was supposed to be the filesystem-safe hack. but utf-8's unix
compatibility made it take over the web, and utf-16 got stuck as an internal
representation in java and windows.

what we do about it
-------------------

telnetlib3 handles charset negotiation (see charset-vs-mtts.rst). we default to
utf-8 and detect client capabilities via mtts. for clients that don't negotiate,
we should assume utf-8 and accept that legacy latin-1 clients will see garbage
for anything outside ascii. that's the world we live in.
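
for the non-negotiating case, the lenient decode is one line; a sketch (the
function name is hypothetical, not telnetlib3 api):

```python
# decode whatever the client sent as utf-8, substituting U+FFFD instead
# of crashing on stray latin-1 bytes. illustrative helper, not library code.
def decode_client_input(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")

print(decode_client_input(b"sm\xf6rg\xe5s"))    # latin-1 client: sm�rg�s
print(decode_client_input("smörgås".encode()))  # utf-8 client: smörgås
```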

see also
--------

- charset-vs-mtts.rst in this directory
- rob pike's utf-8 history email (2003): search "UTF-8 history rob pike"
- the original fss-utf proposal from ken thompson's archives (sep 2 1992)
- unicode.org utf-8 spec