utf-8's ascii bias and what it means for muds
==============================================

the short version
-----------------

utf-8 was designed to be backwards compatible with ascii (a 1963 american
standard). that compatibility is baked into the bit structure of every byte.
english text passes through at 1 byte per character with zero overhead. every
other language pays extra::

    ascii (english letters, digits)      1 byte
    latin accented chars (é, ö, ñ)       2 bytes
    cjk (chinese, japanese, korean)      3 bytes
    emoji, historical scripts            4 bytes

compare to native cjk encodings like big5 or gbk, where those same characters
are 2 bytes. utf-8 makes cjk text ~50% larger than its native encoding. every
byte with the high bit clear (0xxxxxxx), half of all possible byte values, is
reserved for the 128 ascii characters, which are overwhelmingly
english/american.

why this matters for muds
-------------------------

old muds from the 80s/90s used latin-1 (iso 8859-1), which encodes accented
characters (é, å, ç, ö) as single bytes in the 128-255 range. latin-1 worked
fine when every terminal also spoke latin-1.

utf-8 replaced latin-1 as the default everywhere, but utf-8 is only backwards
compatible with the ascii subset (bytes 0-127). the upper half of latin-1 is
encoded differently in utf-8 (as 2-byte sequences). when a utf-8 terminal
connects to a latin-1 mud, the accented characters come through as mojibake.

nobody goes back to update ancient mud codebases, so the accented characters
just broke. the path of least resistance is to strip out accents and write in
pure ascii. swedish scene folks had a term for it: "dumb swedish" -- writing
swedish without accented characters, like someone who doesn't know the
language properly. the same thing happens with portuguese, french, german, and
any other language that relied on latin-1's upper range.

these muds aren't limited to ascii by design. they used latin-1 just fine for
years.
the problem is that utf-8 broke compatibility with those bytes, and since muds
write to raw sockets, there are several layers working against getting
printf("swedish") to come out correctly on the other end.

the history
-----------

utf-8 was designed by ken thompson on a placemat in a new jersey diner one
evening in september 1992. rob pike was there cheering him on. they went back
to bell labs after dinner, promised a running system by the following monday,
and by friday had plan 9 running (and only running) utf-8. the full system
conversion took less than a week.

the key design criterion that distinguished their version from the original
x/open fss-utf proposal was #6: "it should be possible to find the start of a
character efficiently starting from an arbitrary location in a byte stream."
the original proposal lacked self-synchronization. thompson and pike's version
has it -- any byte that doesn't match 10xxxxxx is the start of a character.

the bit packing::

    0xxxxxxx                                1 byte,  7 free bits
    110xxxxx 10xxxxxx                       2 bytes, 11 free bits
    1110xxxx 10xxxxxx 10xxxxxx              3 bytes, 16 free bits
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx     4 bytes, 21 free bits

for multibyte sequences, the number of leading 1s in the first byte tells you
how many bytes are in the sequence. simple, self-synchronizing,
ascii-compatible.

it won because of that ascii compatibility -- pragmatic adoption beat fairness
to non-latin scripts. the alternative was utf-16 (2 bytes for most characters,
fairer to cjk), but it's not ascii-compatible at all.

source: rob pike's email from april 2003, correcting the claim that ibm
designed utf-8. "UTF-8 was designed, in front of my eyes, on a placemat in a
New Jersey diner one night in September or so 1992."

the encoding design space
-------------------------

utf-8 couldn't have done the fairness thing without giving up the one property
that made it win. the entire first byte design (0xxxxxxx = ascii) is what
makes it backwards compatible.
give that up to make room for 2-byte cjk and you basically reinvent utf-16 but
worse.

the three real options:

utf-8: english wins, everyone else pays more. but it's ascii-compatible, so
every existing unix tool, every C string function, every file path, every
null-terminated string just works. that's why it won.

utf-16: roughly fair across living languages. the entire basic multilingual
plane (latin, cyrillic, arabic, hebrew, greek, cjk -- basically everything
people actually type) is 2 bytes flat. supplementary stuff (emoji, historical
scripts) goes to 4 bytes via surrogate pairs. cjk goes from 3 bytes (utf-8)
down to 2, english goes from 1 byte up to 2. java and windows chose this
internally. but it breaks every C string assumption (null bytes everywhere in
english text), has byte-order issues (big endian? little endian? here's a BOM
to sort it out), and it's STILL variable-length because of surrogate pairs, so
you don't even get O(1) indexing.

utf-32: perfectly fair, perfectly wasteful. everything is 4 bytes. dead simple
to index into. but english text is 4x larger than ascii, and cjk text is 2x
larger than in its native encodings. nobody wants that for storage or
transmission.

thompson and pike's design criteria were about unix filesystem safety and
ascii compatibility. fairness across scripts wasn't on the list -- criterion
#1 was "don't break /" and criterion #2 was "no ascii bytes hiding inside
multibyte sequences." the encoding is optimized for a world where the existing
infrastructure was ascii, and the goal was to extend it without breaking
anything.

the irony is that utf-16 was supposed to be the "real" unicode encoding (it
started out as fixed-width ucs-2, 2 bytes per character, when unicode only had
65536 codepoints), and utf-8 was supposed to be the filesystem-safe hack. but
utf-8's unix compatibility made it take over the web, and utf-16 got stuck as
an internal representation in java and windows.
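
the size tradeoffs in this section can be checked with python's built-in
codecs. this is a minimal sketch, not part of telnetlib3: the sample
characters and the ``byte_lengths`` helper are made up for illustration, and
the ``-le`` codec variants are used so no BOM is counted.

```python
# sketch: bytes per character under each encoding discussed in this section.
# the sample characters are illustrative choices, not taken from any mud.
SAMPLES = {
    "ascii": "a",
    "latin accented": "\u00e9",   # e with acute accent
    "cjk": "\u4e2d",              # common chinese character ("middle")
    "emoji": "\U0001F600",        # outside the basic multilingual plane
}
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]


def byte_lengths(ch):
    """return {encoding: encoded length in bytes} for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in SAMPLES.items():
    print(label, byte_lengths(ch))

# native encodings for comparison: latin-1 for the accented letter,
# gbk and big5 for the cjk character
print("latin-1:", len(SAMPLES["latin accented"].encode("latin-1")))
print("gbk:", len(SAMPLES["cjk"].encode("gbk")))
print("big5:", len(SAMPLES["cjk"].encode("big5")))

# utf-16 pairs every ascii character with a zero byte, which is what
# breaks null-terminated C strings
print("hi".encode("utf-16-le"))
```

the cjk row is the whole argument in one line: 3 bytes in utf-8, 2 in utf-16,
4 in utf-32, and 2 in gbk or big5.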

what we do about it
-------------------

telnetlib3 handles charset negotiation (see charset-vs-mtts.rst). we default
to utf-8 and detect client capabilities via mtts. for clients that don't
negotiate, we should assume utf-8 and accept that legacy latin-1 clients will
see garbage for anything outside ascii. that's the world we live in.

see also
--------

- charset-vs-mtts.rst in this directory
- rob pike's utf-8 history email (2003): search "UTF-8 history rob pike"
- the original fss-utf proposal from ken thompson's archives (sep 2 1992)
- unicode.org utf-8 spec