[KS] unicode

Sat May 30 08:37:19 EDT 2015

On Sat, May 30, 2015, at 16:41, Frank Hoffmann wrote:
> For example, including the 
> character 事 three times instead of one time, that would then allow the 
> same reversibility that we had shown is possible with 요 遼 and 료 遼 
> for Korean. I wondered why this was not done.

Have you really thought your proposal through?

You are essentially suggesting that whenever someone enters a Kanji, it
should be annotated with its pronunciation - but in a way that is
invisible.  This sounds like a recipe for disaster.  A simple example:
when you copy and paste a character into a different context where it
would be pronounced differently, you have created junk data.  Not to
mention the problem of choosing the right pronunciation when characters
are entered through handwriting recognition (increasingly common on
handheld devices) or OCR.
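
To make the copy-and-paste problem concrete, here is a minimal Python
sketch.  It assumes a purely hypothetical scheme in which every
character carries a hidden reading tag; the `Tagged` type and the
helper functions are invented for illustration and correspond to
nothing in Unicode.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    """A character plus an invisible reading tag (hypothetical scheme)."""
    char: str
    reading: str


def visible_text(word):
    # What the user sees on screen: the tags are invisible.
    return "".join(t.char for t in word)


def reading_of(word):
    # What software would derive from the hidden tags.
    return "".join(t.reading for t in word)


# 仕事 "shigoto": here 事 is read "goto".
shigoto = [Tagged("仕", "shi"), Tagged("事", "goto")]

# The user copies 事 out of 仕事 and pastes it in front of 故 to write
# 事故, which is read "jiko".  The hidden tag travels along unchanged.
pasted = [shigoto[1], Tagged("故", "ko")]

print(visible_text(pasted))  # 事故 (looks perfectly fine)
print(reading_of(pasted))    # gotoko (junk; the word is read "jiko")
```

Nothing on screen tells the user that the pasted character still
carries the wrong reading.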

And nobody would understand the purpose.  It's essentially the same as
asking that whenever a Kanji is printed, it should be accompanied by
Furigana.  That would be nice for those of us who are poor at reading
Kanji - but the Japanese have no need for it, and for the same reason
they have no need to embed the pronunciation into the encoding.

Then there is the number of different readings, and the fact that the
set of readings is open (a work in progress) - sometimes you'll meet
Japanese people who tell you that their family name has to be read in a
certain way, even though this is not a dictionary reading of the Kanji.
The character 事, by the way, has five readings in the Unihan database
(http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e8b), and
this does not even include the "goto" reading.

For Chinese the situation is even more hopeless.  Which pronunciations
will you support?  Mandarin?  Cantonese?  Shanghainese?  What about
historical readings?  For Cantonese, by the way, the common input
methods are not based on pronunciation, so you wouldn't even know how
to get the data into the computer in the first place.

The idea of multiple code points for different pronunciations may have
worked for Korean, for the rather small set of Hanja selected in the
1970s with their current "official" readings.  It simply doesn't scale.
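
The Korean duplicates also show how fragile such a distinction is.  As
far as I know, the extra Hanja code points from the Korean standard
ended up in Unicode's CJK Compatibility Ideographs block (starting at
U+F900), each with a canonical decomposition to the ordinary unified
ideograph.  The following Python sketch uses only the standard
`unicodedata` module to show that normalization removes the difference;
the helper name `compat_variants` is made up for this example.

```python
import unicodedata


def compat_variants(unified):
    """Return compatibility ideographs that normalize to `unified`."""
    return [
        chr(cp)
        for cp in range(0xF900, 0xFB00)  # CJK Compatibility Ideographs
        if chr(cp) != unified
        and unicodedata.normalize("NFC", chr(cp)) == unified
    ]


# U+F900 is a duplicate of U+8C48; NFC folds it into the unified one.
assert unicodedata.normalize("NFC", "\uF900") == "\u8C48"

# Look for duplicates of 遼 (U+907C), the character from the example.
for ch in compat_variants("\u907C"):
    print(f"U+{ord(ch):04X} -> U+907C after NFC")
```

So a reading that is encoded only through the choice of code point does
not survive text that passes through normalization.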

The official Unicode answer would simply be that the pronunciation is an
annotation and doesn't belong at the encoding level.  If needed, it's
something that should be represented at a higher level.  For instance,
nothing stops you from writing

<span pronunciation="goto">事</span>

or something similar in HTML (quite likely there is already a standard
for doing this; I didn't look it up).
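
For the record, HTML does have ruby annotation markup (the `ruby`, `rt`
and `rp` elements) for attaching a reading to base text.  A minimal
Python sketch that generates it follows; the function name
`ruby_markup` is made up here, and the reading "こと" is just an
example.

```python
from html import escape


def ruby_markup(base, reading):
    """Wrap `base` in HTML ruby markup annotated with `reading`."""
    # <rp> supplies fallback parentheses for browsers without ruby support.
    return (
        f"<ruby>{escape(base)}"
        f"<rp>(</rp><rt>{escape(reading)}</rt><rp>)</rp>"
        f"</ruby>"
    )


print(ruby_markup("事", "こと"))
# <ruby>事<rp>(</rp><rt>こと</rt><rp>)</rp></ruby>
```

The reading lives in the markup, so the encoded character itself stays
the same wherever it is pasted.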

Best wishes,
 Otfried