diff --git a/docs/how/utf8-design-lesson.rst b/docs/how/utf8-design-lesson.rst
new file mode 100644
index 0000000..7ac0fe5
--- /dev/null
+++ b/docs/how/utf8-design-lesson.rst
@@ -0,0 +1,133 @@

utf-8's ascii bias and what it means for muds
==============================================

the short version
-----------------

utf-8 was designed to be backwards compatible with ascii (a 1963 american
standard). that compatibility is baked into the bit structure of every byte.
english text passes through at 1 byte per character with zero overhead. every
other language pays extra::

    ascii (english letters, digits)    1 byte
    latin accented chars (é, ö, ñ)     2 bytes
    cjk (chinese, japanese, korean)    3 bytes
    emoji, historical scripts          4 bytes

compare to native cjk encodings like big5 or gbk, where those same characters
are 2 bytes: utf-8 makes cjk text ~50% larger than its native encoding. the
entire single-byte range (0xxxxxxx, high bit clear) is reserved for the 128
ascii characters, which are overwhelmingly english/american.

why this matters for muds
-------------------------

old muds from the 80s/90s used latin-1 (iso 8859-1), which encodes accented
characters (é, à, ç, ö) as single bytes in the 128-255 range. latin-1 worked
fine when every terminal also spoke latin-1.

utf-8 replaced latin-1 as the default everywhere, but utf-8 is only backwards
compatible with the ascii subset (bytes 0-127). the upper half of latin-1 is
encoded differently in utf-8 (as 2-byte sequences), so when a utf-8 terminal
connects to a latin-1 mud, the accented characters come through as mojibake.

nobody goes back to update ancient mud codebases, so the accented characters
just broke. the path of least resistance is to strip out accents and write in
pure ascii. swedish scene folks had a term for it: "dumb swedish" -- writing
swedish without accented characters, like someone who doesn't know the
language properly.
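the mojibake failure mode is easy to reproduce in python. a minimal sketch --
the swedish word is just an arbitrary example string -- showing both
directions of the mismatch:

```python
# a latin-1 mud sends accented characters as single high bytes (0x80-0xff)
wire = "räksmörgås".encode("latin-1")

# a utf-8 terminal can't parse lone high bytes as valid sequences,
# so each accented character becomes a replacement glyph
print(wire.decode("utf-8", errors="replace"))  # r�ksm�rg�s

# the reverse direction: utf-8 bytes read by a latin-1 terminal turn
# each 2-byte sequence into two junk characters
print("räksmörgås".encode("utf-8").decode("latin-1"))  # rÃ¤ksmÃ¶rgÃ¥s
```

without `errors="replace"`, the first decode raises UnicodeDecodeError
outright -- which is why a strict utf-8 stack can't quietly tolerate a
latin-1 peer.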
the same thing happens with portuguese, french, german -- any language that
relied on latin-1's upper range.

these muds aren't limited to ascii by design. they used latin-1 just fine for
years. the problem is that utf-8 broke compatibility with those bytes, and
since muds write to raw sockets, there are several layers working against
getting printf("swedish") to come out correctly on the other end.

the history
-----------

utf-8 was designed by ken thompson on a placemat in a new jersey diner one
evening in september 1992. rob pike was there cheering him on. they went back
to bell labs after dinner, and by the following monday they had plan 9 running
(and only running) utf-8. the full system conversion took less than a week.

the key design criterion that distinguished their version from the earlier
fss-utf proposal was #6: "it should be possible to find the start of a
character efficiently starting from an arbitrary location in a byte stream."
the original proposal lacked self-synchronization. thompson and pike's version
has it -- any byte that doesn't match the continuation pattern 10xxxxxx is
the start of a character.

the bit packing::

    0xxxxxxx                             1 byte,  7 free bits
    110xxxxx 10xxxxxx                    2 bytes, 11 free bits
    1110xxxx 10xxxxxx 10xxxxxx           3 bytes, 16 free bits
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes, 21 free bits

for multibyte sequences, the number of leading 1s in the first byte tells you
how many bytes are in the sequence. simple, self-synchronizing,
ascii-compatible. it won because of that ascii compatibility -- pragmatic
adoption beat fairness to non-latin scripts.

the main alternative was utf-16 (2 bytes for most characters, fairer to cjk),
but it's not ascii-compatible at all. pragmatism won.

source: rob pike's email from april 2003, correcting the widespread claim that
ibm designed utf-8. "UTF-8 was designed, in front of my eyes, on a placemat in
a New Jersey diner one night in September or so 1992."
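the bit packing and the self-synchronization property can both be checked
directly. a small python sketch (the helper names are mine, not from any
library): sequence length read off the leading 1 bits of the first byte, and
resynchronization by walking backwards past continuation bytes:

```python
def seq_len(first_byte: int) -> int:
    """byte length of the sequence starting with this byte; 0 means
    this is a continuation byte, not the start of a character."""
    if first_byte < 0b10000000:   # 0xxxxxxx: ascii, 1 byte
        return 1
    if first_byte < 0b11000000:   # 10xxxxxx: continuation byte
        return 0
    if first_byte < 0b11100000:   # 110xxxxx: 2-byte sequence
        return 2
    if first_byte < 0b11110000:   # 1110xxxx: 3-byte sequence
        return 3
    return 4                      # 11110xxx: 4-byte sequence

def sync_back(buf: bytes, i: int) -> int:
    """self-synchronization: from an arbitrary offset, step backwards
    over continuation bytes (10xxxxxx) to the start of a character."""
    while buf[i] & 0b11000000 == 0b10000000:
        i -= 1
    return i

data = "a中😀".encode("utf-8")      # 1 + 3 + 4 = 8 bytes
print([seq_len(b) for b in data])   # [1, 3, 0, 0, 4, 0, 0, 0]
print(sync_back(data, 6))           # 4: start of the emoji's sequence
```

landing at offset 6 (mid-emoji) and recovering offset 4 in two steps is
exactly the property criterion #6 asks for -- no need to rewind to the start
of the stream.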
the encoding design space
-------------------------

utf-8 couldn't have done the fairness thing without giving up the one property
that made it win. the entire first-byte design (0xxxxxxx = ascii) is what
makes it backwards compatible. give that up to make room for 2-byte cjk and
you basically reinvent utf-16, but worse.

the three real options:

utf-8: english wins, everyone else pays more. but it's ascii-compatible, so
every existing unix tool, every C string function, every file path, every
null-terminated string just works. that's why it won.

utf-16: roughly fair across living languages. the entire basic multilingual
plane (latin, cyrillic, arabic, hebrew, greek, cjk -- basically everything
people actually type) is 2 bytes flat. supplementary stuff (emoji, historical
scripts) goes to 4 bytes via surrogate pairs. cjk goes from 3 bytes (utf-8)
down to 2, english goes from 1 byte up to 2. java and windows chose this
internally. but it breaks every C string assumption (null bytes everywhere in
english text), has byte-order issues (big endian? little endian? here's a BOM
to sort it out), and it's *still* variable-length because of surrogate pairs,
so you don't even get O(1) indexing.

utf-32: perfectly fair, perfectly wasteful. everything is 4 bytes. dead simple
to index into. but english text is 4x larger and cjk is 2x larger than their
native encodings. nobody wants that for storage or transmission.

thompson and pike's design criteria were about unix filesystem safety and
ascii compatibility. fairness across scripts wasn't on the list -- criterion
#1 was "don't break /" and criterion #2 was "no ascii bytes hiding inside
multibyte sequences." the encoding is optimized for a world where the existing
infrastructure was ascii, and the goal was to extend it without breaking
anything.
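the size trade-offs among the three options can be measured directly in
python. a quick sketch -- the sample strings are arbitrary, and utf-16/utf-32
are encoded little-endian so the byte-order mark doesn't pad the counts:

```python
samples = {
    "english": "hello world",            # 11 chars, pure ascii
    "chinese": "你好世界你好世界你好世",  # 11 chars, all bmp cjk
}

for name, text in samples.items():
    sizes = {
        # -le variants: no BOM prepended, so len() is the raw payload
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }
    print(name, sizes)
# english: utf-8 = 11, utf-16 = 22, utf-32 = 44
# chinese: utf-8 = 33, utf-16 = 22, utf-32 = 44
```

the crossover is visible in the numbers: utf-8 wins by 2x for english and
loses by 1.5x for cjk, while utf-32 is the worst choice for both.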
the irony is that utf-16 was supposed to be the "real" unicode encoding (its
ancestor ucs-2 was fixed-width at 2 bytes, back when unicode only had 65,536
codepoints), and utf-8 was supposed to be the filesystem-safe hack. but
utf-8's unix compatibility made it take over the web, and utf-16 got stuck as
an internal representation in java and windows.

what we do about it
-------------------

telnetlib3 handles charset negotiation (see charset-vs-mtts.rst). we default
to utf-8 and detect client capabilities via mtts. for clients that don't
negotiate, we should assume utf-8 and accept that legacy latin-1 clients will
see garbage for anything outside ascii. that's the world we live in.

see also
--------

- charset-vs-mtts.rst in this directory
- rob pike's utf-8 history email (2003): search "UTF-8 history rob pike"
- the original fss-utf proposal from ken thompson's archives (sep 2 1992)
- unicode.org utf-8 spec