utf-8's ascii bias and what it means for muds
==============================================

the short version
-----------------

utf-8 was designed to be backwards compatible with ascii (a 1963 american
standard). that compatibility is baked into the bit structure of every byte.
english text passes through at 1 byte per character with zero overhead; every
other language pays extra::

    ascii (english letters, digits)   1 byte
    latin accented chars (é, ö, ñ)    2 bytes
    cjk (chinese, japanese, korean)   3 bytes
    emoji, historical scripts         4 bytes

compare native cjk encodings like big5 or gbk, where those same characters
are 2 bytes: utf-8 makes cjk text ~50% larger than its native encoding. the
entire single-byte space (0xxxxxxx, high bit clear) is reserved for the 128
ascii characters, which are overwhelmingly english/american.

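the per-character costs above are easy to verify directly (python shown here;
any language with explicit utf-8 encoding works the same):

```python
# byte cost per character class under utf-8
samples = [
    ("ascii 'a'", "a"),
    ("latin 'é'", "\u00e9"),
    ("cjk '中'", "\u4e2d"),
    ("emoji '😀'", "\U0001f600"),
]
for label, ch in samples:
    print(label, "->", len(ch.encode("utf-8")), "byte(s)")
```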
why this matters for muds
-------------------------

old muds from the 80s/90s used latin-1 (iso 8859-1), which encodes accented
characters (é, å, ç, ö) as single bytes in the 128-255 range. latin-1 worked
fine when every terminal also spoke latin-1.

utf-8 replaced latin-1 as the default everywhere, but utf-8 is only backwards
compatible with the ascii subset (bytes 0-127). the upper half of latin-1 is
encoded differently in utf-8 (as 2-byte sequences). when a utf-8 terminal
connects to a latin-1 mud, the accented characters come through as mojibake.

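the breakage runs in both directions, and both are two lines of python to
reproduce:

```python
# a latin-1 mud sends 'é' as the single byte 0xe9; a utf-8
# terminal can't decode that lone byte and shows U+FFFD
latin1 = "café".encode("latin-1")                  # b'caf\xe9'
print(latin1.decode("utf-8", errors="replace"))    # caf�

# the reverse: utf-8's 2-byte sequence 0xc3 0xa9 read by a
# latin-1 client splits into two wrong characters
utf8 = "café".encode("utf-8")                      # b'caf\xc3\xa9'
print(utf8.decode("latin-1"))                      # cafÃ©
```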
nobody goes back to update ancient mud codebases, so the accented characters
just broke. the path of least resistance is to strip out accents and write in
pure ascii. swedish scene folks had a term for it: "dumb swedish" -- writing
swedish without accented characters, like someone who doesn't know the language
properly. the same thing happens with portuguese, french, german, any language
that relied on latin-1's upper range.

these muds aren't limited to ascii by design; they used latin-1 just fine for
years. the problem is that utf-8 broke compatibility with those bytes, and since
muds write to raw sockets, there are several layers working against getting
printf("swedish") to come out correctly on the other end.

the history
-----------

utf-8 was designed by ken thompson on a placemat in a new jersey diner one
evening in september 1992. rob pike was there cheering him on. they went back
to bell labs after dinner, and by the following monday they had plan 9 running
(and only running) utf-8. the full system conversion took less than a week.

the key design criterion that distinguished their version from the competing
fss-utf proposal was #6: "it should be possible to find the start of a character
efficiently starting from an arbitrary location in a byte stream." the original
proposal lacked self-synchronization. thompson and pike's version has it -- any
byte that doesn't match 10xxxxxx is the start of a character.

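criterion #6 in practice is a short backward scan; a minimal sketch:

```python
def char_start(buf: bytes, i: int) -> int:
    """back up from an arbitrary offset to the start of the utf-8
    sequence containing it: continuation bytes are 10xxxxxx, and
    any other byte begins a character."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = "naïve".encode("utf-8")    # b'na\xc3\xafve'
# offset 3 lands on the continuation byte 0xaf; the character
# ('ï') starts one byte earlier, at offset 2
assert char_start(buf, 3) == 2
```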
the bit packing::

    0xxxxxxx                             1 byte,  7 free bits
    110xxxxx 10xxxxxx                    2 bytes, 11 free bits
    1110xxxx 10xxxxxx 10xxxxxx           3 bytes, 16 free bits
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes, 21 free bits

the number of leading 1s in the first byte tells you how many bytes are in the
sequence. simple, self-synchronizing, ascii-compatible. it won because of that
ascii compatibility -- pragmatic adoption beat fairness to non-latin scripts.

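reading the length from the leading 1s is a few lines (a sketch that assumes a
valid start byte -- no validation of malformed input):

```python
def seq_len(first_byte: int) -> int:
    """sequence length from the count of leading 1 bits in the
    first byte. no leading 1s means a 1-byte ascii character."""
    if first_byte < 0x80:               # 0xxxxxxx
        return 1
    n = 0
    while first_byte & (0x80 >> n):     # count leading 1 bits
        n += 1
    return n                            # 110.. -> 2, 1110.. -> 3, 11110.. -> 4

assert seq_len("a".encode("utf-8")[0]) == 1
assert seq_len("é".encode("utf-8")[0]) == 2
assert seq_len("中".encode("utf-8")[0]) == 3
assert seq_len("😀".encode("utf-8")[0]) == 4
```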
the alternative was utf-16 (2 bytes for most characters, fairer to cjk), but
it's not ascii-compatible at all. pragmatism won.

source: rob pike's email from april 2003, correcting the record that ibm
designed utf-8. "UTF-8 was designed, in front of my eyes, on a placemat in a
New Jersey diner one night in September or so 1992."

the encoding design space
-------------------------

utf-8 couldn't have done the fairness thing without giving up the one property
that made it win. the entire first-byte design (0xxxxxxx = ascii) is what makes
it backwards compatible. give that up to make room for 2-byte cjk and you
basically reinvent utf-16 but worse.

the three real options:

utf-8: english wins, everyone else pays more. but ascii-compatible, so every
existing unix tool, every C string function, every file path, every
null-terminated string just works. that's why it won.

utf-16: roughly fair across living languages. the entire basic multilingual
plane (latin, cyrillic, arabic, hebrew, greek, cjk -- basically everything
people actually type) is 2 bytes flat. supplementary stuff (emoji, historical
scripts) goes to 4 bytes via surrogate pairs. cjk goes from 3 bytes (utf-8)
down to 2, english goes from 1 byte up to 2. java and windows chose this
internally. but it breaks every C string assumption (null bytes everywhere in
english text), has byte-order issues (big endian? little endian? here's a BOM
to sort it out), and it's STILL variable-length because of surrogate pairs, so
you don't even get O(1) indexing.

utf-32: perfectly fair, perfectly wasteful. everything is 4 bytes. dead simple
to index into. but english text is 4x larger, and cjk is 2x larger than its
native encodings. nobody wants that for storage or transmission.

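the per-character trade-off across the three options can be printed directly
(the -le variants avoid counting the BOM that python's plain utf-16/utf-32
codecs prepend):

```python
# bytes per character under each of the three encodings
for ch in ("a", "é", "中", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1 / 2 / 3 / 4
          len(ch.encode("utf-16-le")),  # 2 / 2 / 2 / 4 (surrogate pair)
          len(ch.encode("utf-32-le")))  # always 4
```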
thompson and pike's design criteria were about unix filesystem safety and ascii
compatibility. fairness across scripts wasn't on the list -- criterion #1 was
"don't break /" and criterion #2 was "no ascii bytes hiding inside multibyte
sequences." the encoding is optimized for a world where the existing
infrastructure was ascii, and the goal was to extend it without breaking
anything.

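criterion #2 is checkable in a loop: every byte of a multibyte sequence, lead
and continuation alike, has the high bit set, so an ascii byte like '/' (0x2f)
can never appear inside one:

```python
# no ascii bytes hide inside multibyte sequences: all of their
# bytes are >= 0x80, so '/' (0x2f) and NUL (0x00) stay unambiguous
for ch in ("é", "中", "😀"):
    encoded = ch.encode("utf-8")
    assert all(b >= 0x80 for b in encoded)
    assert 0x2F not in encoded
```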
the irony is that utf-16 was supposed to be the "real" unicode encoding (it was
originally fixed-width at 2 bytes, when unicode only had 65536 codepoints), and
utf-8 was supposed to be the filesystem-safe hack. but utf-8's unix
compatibility made it take over the web, and utf-16 got stuck as an internal
representation in java and windows.

what we do about it
-------------------

telnetlib3 handles charset negotiation (see charset-vs-mtts.rst). we default to
utf-8 and detect client capabilities via mtts. for clients that don't negotiate,
we should assume utf-8 and accept that legacy latin-1 clients will see garbage
for anything outside ascii. that's the world we live in.

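one wrinkle when assuming utf-8 on a raw socket: a multibyte sequence can be
split across tcp reads, so decoding has to be incremental. a minimal sketch
using the stdlib (this illustrates the principle, not telnetlib3's actual
internals):

```python
import codecs

# buffers partial sequences between reads; errors="replace"
# turns genuinely invalid input (e.g. latin-1 bytes) into
# U+FFFD instead of killing the session
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

raw = "héllo".encode("utf-8")       # b'h\xc3\xa9llo'
first, second = raw[:2], raw[2:]    # split mid-'é' (0xc3 | 0xa9)
out = decoder.decode(first) + decoder.decode(second)
assert out == "héllo"               # survives the split
```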
see also
--------

- charset-vs-mtts.rst in this directory
- rob pike's utf-8 history email (2003): search "UTF-8 history rob pike"
- the original fss-utf proposal from ken thompson's archives (sep 2 1992)
- unicode.org utf-8 spec