[KS] unicode

Fri May 29 22:25:34 EDT 2015

On Sat, May 30, 2015, at 08:33, Frank Hoffmann wrote:
> Your observation seems correct.
> It seems that Korean and Chinese fonts do have these double entries ... 
> I take again the example you introduced first:
> 
> 遼 --> yo  --> \uf9c3 --> 63939 --> 遼
> 遼 --> ryo --> \u907c --> 36988 --> 遼

The origin of this behaviour is not in the fonts, but in the different
national encodings.  The Korean KSC encoding is different from JIS and
GB in this respect.

The reason is very simple: Koreans expect words in Hanja to be sorted by
their pronunciation.  It is not uncommon for Hanja lists / tables /
posters to be sorted by the Hanja pronunciation.  This idea is quite
alien to Japanese or Chinese, who will expect characters to be sorted by
radical/stroke count.

In the 1970s/80s, when the national encodings were made, computers were
not as powerful as today.  Therefore the national encoding was made in
such a way that simply sorting the characters by their code point (that
is, one simply treats the word as a sequence of bytes and sorts
lexicographically by the byte values) will give a reasonable ordering.
The Korean encoding was modeled after the Japanese JIS encoding, but
Koreans wanted the Hanja to be ordered by their pronunciation.  The
first Hanja in KSC is pronounced 가, and so on.  Of course, this posed a
problem for Hanja that have multiple pronunciations.  It was decided to
simply encode these characters MULTIPLE times, once for each
pronunciation.
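
As a rough illustration of the byte-wise sorting described above, here
is a small Python sketch.  It assumes that Python's built-in "euc_kr"
codec is a fair stand-in for the KSC encoding, and the helper names
ksc_bytes and sort_bytewise are made up for this example.

```python
# Sketch only: Python's built-in "euc_kr" codec is assumed here as a
# stand-in for the KSC encoding; the helper names are hypothetical.

RYO = "\u907c"  # the Hanja from the example, read as ryo
YO = "\uf9c3"   # the same Hanja, read as yo (the duplicate entry)


def ksc_bytes(text: str) -> bytes:
    """Encode text to its EUC-KR byte sequence."""
    return text.encode("euc_kr")


def sort_bytewise(words):
    """Sort words lexicographically by their raw encoded bytes."""
    return sorted(words, key=ksc_bytes)
```

Calling ksc_bytes(RYO) and ksc_bytes(YO) should give two different byte
sequences for what looks like the same character, and
sort_bytewise(["다", "가", "나"]) shows that a plain byte sort already
yields the Korean alphabetical order.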

In the original Hanja set considered by KSC there were so few of these
characters that this was considered quite a tolerable overhead.  In
fact, it was argued that preserving the distinction between yo and ryo
is useful for further processing.

Similar arguments would not have made sense for Chinese or Japanese, and
were never considered, as far as I know.  Neither Chinese nor Japanese
encodes the same character multiple times just because of the
pronunciation.

Fast forward a few decades.  When Unicode started to include CJK
languages in the 1990s, a huge effort was made to unify the CJK
character sets, with much controversy - see
http://en.wikipedia.org/wiki/Han_unification, or my (very old) article
http://web.archive.org/web/20100328042929/http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html.

Unicode doesn't believe in multiple code points for the same character -
but to preserve the ability to convert from KSC to Unicode and back
without losing any information, Unicode had to include those extra code
points.  Note that these are special code points, not inside the block
for Hanja, but in a special block for "legacy and compatibility"
characters.  If you look at the two Hanja above, you may notice that
U+907C lies in the Hanja block, while U+F9C3 is in the special block
labeled "CJK Compatibility Ideographs".
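
If you want to check this yourself, the following Python sketch looks
up both code points in the Unicode character database that ships with
Python.  The function name describe is just something I made up for the
example; unicodedata is the standard library module.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode database properties of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "decimal": ord(ch),
        # Character name; it indicates unified vs. compatibility ideograph.
        "name": unicodedata.name(ch, "<unnamed>"),
        # Hex code point(s) the character decomposes to, or "" if none.
        "decomposition": unicodedata.decomposition(ch),
    }
```

For "\uf9c3" the name should identify a CJK compatibility ideograph and
the decomposition should point at 907C, while "\u907c" is a unified
ideograph with an empty decomposition.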

My feeling is that ultimately these extra code points will fall out of
use.  Certainly Korean texts often do not care about the distinction,
because the person entering the text wasn't careful or didn't even know
which pronunciation is used (or the text was digitized using
handwriting recognition or even OCR).

To summarize: these code points for identical characters with different
pronunciations are used exclusively for Korean.  Korean fonts will
contain (identical) glyphs for both code points.  So why is the
distinction preserved when you use a Chinese font, but not with a
Japanese font?  Without having looked at the specific fonts, my guess is
that the Chinese font is fully Unicode-capable:  China has embraced
Unicode, and so the designers included these code points for
Unicode-compatibility, even though they would never be used in a Chinese
text.  Japan, on the other hand, is still very sceptical of Unicode,
and apparently the fonts lack these compatibility code points (even
though the cost of including them would have been close to zero, as
it's simply a mapping to an already existing glyph).  So when you format
using the Japanese font, it seems that a compatibility mapping is
applied, mapping the compatibility code points to their equivalent
unified ideographs.
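
I don't know what your formatting software actually does when the font
lacks the glyph, but one standard way such a mapping can happen is
Unicode normalization, which replaces a compatibility ideograph by its
unified equivalent.  A minimal Python sketch, with function names that
are my own invention:

```python
import unicodedata


def fold_compatibility_ideographs(text: str) -> str:
    """Normalize to NFC, which maps CJK compatibility ideographs to
    their unified equivalents (and normalizes other characters too)."""
    return unicodedata.normalize("NFC", text)


def same_hanja(a: str, b: str) -> bool:
    """Compare two strings while ignoring the yo/ryo style distinction."""
    return fold_compatibility_ideographs(a) == fold_compatibility_ideographs(b)
```

With this, same_hanja("\uf9c3", "\u907c") is true.  Note that the
mapping only goes one way: once the text is normalized, the information
about which pronunciation was meant is gone.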

I hope this clarifies it a bit,
 Otfried