Wednesday, February 9, 2011

Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment?

I know it is customary, but why? Are there real technical reasons why any other encoding would be a really bad idea, or is it just based on the history of encodings and backward compatibility? In addition, what are the dangers of not using UTF-8 but some other encoding (most notably UTF-16)?

Edit: By interacting, I mostly mean the shell and libc.

  • I believe it's mainly the backward compatibility that UTF-8 gives with ASCII.

    For an answer to the 'dangers' question, you need to specify what you mean by 'interacting'. Do you mean interacting with the shell, with libc, or with the kernel proper?

  • Yes, it's for compatibility reasons. UTF-8 is backward compatible with ASCII. Linux/Unix were ASCII-based, so it just made (and still makes) sense.

    From Steve K
  • Partly because file systems expect NUL ('\0') bytes to terminate file names, so UTF-16 would not work well (a sketch below illustrates the problem). You'd have to modify a lot of code to make that change.

    dan04 : Windows added support for UTF-16 by making a duplicate version of the entire Windows API. Adding support for UTF-8 would have been much simpler.
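
    As a minimal C sketch of that problem (using "ab" as an arbitrary file name), note that byte-oriented libc calls stop at the first zero byte, so the UTF-16LE form of even a plain ASCII name gets silently truncated:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "ab" encoded as UTF-16LE: every ASCII character carries a 0x00 high byte */
        const char utf16le[] = { 'a', '\0', 'b', '\0', '\0', '\0' };
        /* "ab" encoded as UTF-8: identical to the ASCII bytes, no embedded NULs */
        const char utf8[]    = { 'a', 'b', '\0' };

        /* strlen(), open(), execve(), ... all stop at the first zero byte,
         * so the UTF-16 "file name" is seen as just "a".                   */
        printf("UTF-16LE length as seen by libc: %zu\n", strlen(utf16le)); /* 1 */
        printf("UTF-8 length as seen by libc:    %zu\n", strlen(utf8));    /* 2 */
        return 0;
    }
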
  • I thought 7-bit ASCII was fine.

    Seriously, Unicode is relatively new in the scheme of things, and UTF-8 is backward compatible with ASCII and uses less space (roughly half) for typical ASCII-heavy files, since it uses 1 to 4 bytes per code point (character), while UTF-16 uses either 2 or 4 bytes per code point (see the sketch at the end of this answer).

    UTF-16 is preferable for internal program usage because of the simpler widths. Its predecessor UCS-2 was exactly 2 bytes for every code point.

    Mark Baker : I don't see that widths are much simpler. You still have to scan the whole string. If you're dealing with a lot of CJK text, then UTF-16 can actually be more compact than UTF-8 and may be worth using for that reason; otherwise I'd stick with UTF-8 everywhere.
    Cade Roux : Right, UTF-16 has lost the big advantages UCS-2 had.
    ΤΖΩΤΖΙΟΥ : (UTF-16 has lost the big advantages UCS-2 had) …but gained the full range of Unicode characters.
    From Cade Roux
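
    As a small sketch of the size comparison (the sample code points below are arbitrary picks), this just counts the bytes each encoding needs per code point:

    #include <stdio.h>
    #include <stdint.h>

    /* Bytes UTF-8 needs for one code point: 1 for ASCII, up to 4 for code
     * points above U+FFFF.                                                */
    static int utf8_bytes(uint32_t cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    /* Bytes UTF-16 needs: 2 within the Basic Multilingual Plane, 4 (a
     * surrogate pair) for anything above U+FFFF.                          */
    static int utf16_bytes(uint32_t cp)
    {
        return cp < 0x10000 ? 2 : 4;
    }

    int main(void)
    {
        const uint32_t samples[] = { 0x41, 0x3B1, 0x4E2D, 0x1F600 }; /* A, alpha, CJK, emoji */
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
            printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes\n",
                   (unsigned) samples[i], utf8_bytes(samples[i]), utf16_bytes(samples[i]));
        return 0;
    }
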
  • Modern Unixes use UTF-8, but this was not always true. On RHEL2 -- which is only a few years old -- the default is

    $ locale
    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=

    The C/POSIX locale is expected to be a 7-bit ASCII-compatible encoding (the sketch at the end of this answer shows how a program picks up whatever locale is configured).

    However, as Jonathan Leffler stated, any encoding which allows for NUL bytes within a character sequence is unworkable on Unix, as system APIs are locale-ignorant; strings are all assumed to be byte sequences terminated by \0.

    dan04 : It doesn't have to be an ASCII-compatible encoding, but the POSIX standard does say "A byte with all bits zero shall be interpreted as the null character independent of shift state. Thus a byte with all bits zero shall never occur in the second or subsequent bytes of a character." This means UTF-16 and UTF-32 aren't allowed, but UTF-8 is.
    From ephemient
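
    As a minimal sketch of how a program adopts that setting through libc (plain setlocale() plus nl_langinfo(); the exact codeset name printed varies by platform):

    #include <stdio.h>
    #include <locale.h>
    #include <langinfo.h>

    int main(void)
    {
        /* Adopt whatever LANG/LC_* specify; without this call a C program
         * stays in the default "C" locale.                                */
        setlocale(LC_ALL, "");

        /* Prints something like "ANSI_X3.4-1968" (i.e. ASCII) under LANG=C,
         * or "UTF-8" under LANG=en_US.UTF-8.                               */
        printf("codeset: %s\n", nl_langinfo(CODESET));
        return 0;
    }
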
  • I think it's because programs that expect ASCII input won't be able to handle encodings such as UTF-16. For characters in the 0-255 range, UTF-16 stores a zero high byte, which those programs will see as a NUL (0) character, the value used in many languages and systems to mark the end of a string. This doesn't happen in UTF-8, which was designed to avoid embedded NULs and to be byte-order agnostic, as sketched below.
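
    A bare-bones UTF-8 encoder (a sketch, no validation) shows the bit layout that guarantees this: every byte of a multi-byte sequence has its high bit set, so no zero byte can appear except for U+0000 itself, and the bytes come in one fixed order.

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one code point; continuation bytes are always 10xxxxxx
     * (0x80..0xBF), so none of them can be zero and there is no
     * byte-order question.                                            */
    static int encode_utf8(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80)    { out[0] = (unsigned char) cp;             return 1; }
        if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                            out[1] = 0x80 | (cp & 0x3F);             return 2; }
        if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                            out[1] = 0x80 | ((cp >> 6) & 0x3F);
                            out[2] = 0x80 | (cp & 0x3F);             return 3; }
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = encode_utf8(0x20AC, buf);   /* U+20AC EURO SIGN */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);        /* prints "E2 82 AC " */
        putchar('\n');
        return 0;
    }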

  • As jonathan-leffler mentions, the prime issue is the ASCII null character. C traditionally expects a string to be null-terminated, so standard C string functions will choke on any UTF-16 character containing a byte equivalent to an ASCII NUL (0x00). While you can certainly program with wide character support, UTF-16 is not a suitable external encoding of Unicode for filenames, text files, or environment variables.

    Furthermore, UTF-16 and UTF-32 come in both big-endian and little-endian byte orders. To deal with this, you'll either need external metadata like a MIME type, or a byte order mark (BOM), and a BOM brings problems of its own (a small detection sketch follows the comments below):

    Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.

    The predecessor to UTF-16, which was called UCS-2 and didn't support surrogate pairs, had the same issues. UCS-2 should be avoided.

    ΤΖΩΤΖΙΟΥ : If UCS-2 should be avoided, then MS Windows should be avoided too :)
    MSalters : Apparently Windows does support surrogate pairs, unlike UCS2.
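
    A rough sketch of BOM detection (using the standard BOM byte values; the program itself is just an illustration), which also shows why a BOM in front of "#!" defeats shebang handling: the kernel only recognizes a script when the file starts with those two literal bytes.

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char b[4] = { 0 };
        size_t n = fread(b, 1, sizeof b, f);
        fclose(f);

        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            puts("UTF-8 BOM (optional, but it hides a leading #!)");
        else if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            puts("UTF-16 little-endian BOM");
        else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            puts("UTF-16 big-endian BOM");
        else if (n >= 2 && b[0] == '#' && b[1] == '!')
            puts("shebang script, plain 8-bit text");
        else
            puts("no BOM detected");
        return 0;
    }
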
  • I believe that when Microsoft started using a two-byte encoding, characters above 0xFFFF had not been assigned, so using a two-byte encoding meant that no one had to worry about characters being different lengths.

    Now that there are characters outside this range and you have to deal with characters of different lengths anyway, why would anyone use UTF-16? I suspect Microsoft would make a different decision if they were designing their Unicode support today.

    From Mark Baker
