gabriel rosenkoetter on 31 Mar 2004 23:44:02 -0000


Re: [PLUG] UTF-8, Second Opinion


On Wed, Mar 31, 2004 at 05:09:07PM -0500, Paul wrote:
> OK.  If we limit the question about UTF-8 to graphical Web browsers and 
> e-mail clients, then is the use of UTF-8 for character coding of Web 
> pages and e-mail a good thang?

I only view emails in a terminal, and I'm far from alone in that, so
I'd still say that UTF-8 is contraindicated there. There are methods
for stating clearly what character set you're using, though, and
smart MUAs (both mutt and pine included) do the best thing possible
given that information, so it's less offensive than forcing a TERM
setting on me (as Red Hat does).
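
As a rough illustration of "stating clearly what character set you're
using", here is a minimal Python sketch built on the standard library's
email package. The subject and body are made up for the example; the only
point is that the charset ends up in the Content-Type header, where a
receiving MUA can find it.

```python
from email.message import EmailMessage

# Hypothetical message; the subject and body are placeholders.
msg = EmailMessage()
msg["Subject"] = "charset test"
msg.set_content("日本語のテキスト", charset="utf-8")

# The declaration a receiving MUA would look at.
content_type = msg["Content-Type"]
declared_charset = msg.get_content_charset()

print(content_type)
print(declared_charset)
```

An MUA that honors the header can pick the right decoder, or at least
know that it cannot, instead of guessing.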

It *completely* makes sense for web pages (I'd rather see Japanese
characters than garbled ASCII, even if I can't read either; I at
least know what language they're trying to speak, since the various
Asian language fonts are visually distinguishable).

> And, if a Web page or e-mail does not 
> specify its encoding, is UTF-8 a reasonable default?

No. 7-bit ASCII is the only reasonable default for viewing in that
case in order to maintain backwards compatibility. That said, I'm
pretty sure (though I'm not bothering to check) that most
Latin-character locales will do that for you.
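
To make the backwards-compatibility point concrete, here is a small
Python sketch (the sample strings are invented for illustration). It
shows that pure 7-bit bytes decode to the same text under ASCII, Latin-1,
and UTF-8, while a single 8-bit Latin-1 byte is already enough to make a
strict UTF-8 decoder give up.

```python
# Pure 7-bit data: every decoder tried here agrees on it.
seven_bit = b"plain old text"
codecs_to_try = ("ascii", "latin-1", "utf-8")
decoded = {name: seven_bit.decode(name) for name in codecs_to_try}
all_agree = len(set(decoded.values())) == 1

# One 8-bit Latin-1 byte (0xE9) is not valid UTF-8 on its own.
eight_bit = "café".encode("latin-1")
try:
    eight_bit.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(all_agree, utf8_ok)
```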

> I guess if you're running a text-based browser or e-mail client in a 
> terminal, you might not like the use of UTF-8, right?

Well, I'm basically resigned to being treated like a second-class
citizen if I'm web browsing with a text-based browser. You (or,
rather, your MUA) shouldn't *send* me UTF-8 without saying that it is,
and it may or may not be safe to assume that mail with no declared
character set will display correctly as UTF-8 (so you may get garbage
when viewing 7-bit ASCII email if your MUA is stupid, or you force it
to be stupid), but that'd be your problem. :^>

> In my case, I'm concerned about "internationality" (I think I just 
> created that word.) and my ability to send and view Japanese text in 
> addition to English text.  There are Japanese encodings such as 
> ISO-2022-JP, EUC-JP, and Shift_JIS.  But, if UTF-8 can handle "all" 
> languages, isn't it better to not worry about regional encodings and use 
> the theoretically universal encoding?

Unicode keeps getting touted as the way to fix all sorts of
character set differences, but it's a false win. It only gets you
text, not various other behaviors (think Japanese keyboards; they've
got three modes, none of which is "Kanji" exactly). Locales are
still the right way to do this; Unicode is just extra fluff.

The problem with assuming that UTF-8 will magically fix this for you
is that if someone sends you ISO-2022-JP text (which is 7-bit, so it
looks like ASCII on the wire), decoding it as UTF-8 still won't
display it properly.
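
A quick Python sketch of that failure mode, using a made-up sample
string and Python's built-in "iso2022_jp" codec. ISO-2022-JP switches
character sets with escape sequences and keeps every byte below 0x80, so
a UTF-8 decoder accepts the bytes without complaint and simply shows the
wrong characters.

```python
text = "日本語"  # sample string, chosen only for illustration

# Encode the way a Japanese MUA using ISO-2022-JP would.
wire_bytes = text.encode("iso2022_jp")
is_seven_bit = all(b < 0x80 for b in wire_bytes)

# A receiver that assumes UTF-8 gets no error, just escape-sequence junk.
as_utf8 = wire_bytes.decode("utf-8")

# A receiver that knows the real charset recovers the text.
as_declared = wire_bytes.decode("iso2022_jp")

print(is_seven_bit, as_utf8 == text, as_declared == text)
```

The absence of a decoding error is the nasty part: nothing tells the
receiver that the guess was wrong.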

-- 
gabriel rosenkoetter
gr@eclipsed.net
