[KS] unicode

Fri May 29 23:06:27 EDT 2015

Great, clear explanations, Professor Cheong!
All understood.

I just opened the KOREAN and the JAPANESE Unicode fonts by Google, the 
already mentioned "Noto" fonts:
http://www.google.com/get/noto/#/
==> Noto Sans CJK JP   (= Japanese)
==> Noto Sans CJK KR   (= Korean)
So, these are "Made in America" fonts, and very recent ones. What I see 
is that the actual glyphs inside the two fonts are 100% identical. Both 
consist of 61768 glyphs. Both include, for example, the "yo" and the 
"ryo" versions (= identical visuals, as we know), at no. 40720 and 
no. 58821 of the font's internal glyph numbering. BUT when I use the 
earlier mentioned TypeTool program to search, entering either the 
Unicode value or the character itself, I find both entries only in the 
KOREAN font. In the Japanese font (again, all its glyphs are identical 
to the Korean ones) the "Find" function shows me only ONE of the two. 
In real-life computing that means 
exactly what you explained. If Andrew would, for example, copy/paste 
the (a) yo and (b) ryo (遼 / 遼) to some word processor and then use 
the Japanese version of the Noto font set, the overlaying code pages 
would "redirect" one of the two to be identical with the other (and 
that information is thereby lost). 
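
To make that "redirect" concrete, here is a minimal Python sketch. It
assumes (my guess, not something I verified for the Noto fonts or for
TypeTool) that the redirect behaves like standard Unicode
normalization, which replaces a compatibility ideograph with its
unified counterpart. The helper names describe() and collapse() are my
own, made up for illustration.

```python
import unicodedata

# The two code points from the example in this thread.
RYO = "\u907c"  # 36988, in the main CJK ideograph block
YO = "\uf9c3"   # 63939, in the "CJK Compatibility Ideographs" block


def describe(ch):
    """Return the code point and the official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def collapse(text):
    """Apply NFC normalization (assumed here to model the font 'redirect')."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for ch in (YO, RYO):
        print(describe(ch), "->", describe(collapse(ch)))
```

After collapse() both inputs are the same code point, so the yo/ryo
distinction can no longer be recovered from the text.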

So, the history of this is quite different from what I had thought it 
is. Thank you for this.

Don't you agree that Unicode lost a very big chance there (except for 
the Korean font encoding, where it is less important, because there are 
not that many such characters with dual pronunciation in Korean)? I 
mean, if that Korean method of dual (or triple, etc.) entries had been 
applied to the Japanese Kanji, then the conversion would be completely 
reversible. That seems to make so much more sense than creating a 
one-way conversion. 
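
What I mean by "reversible" versus "one way" can be sketched like
this. The assumptions are mine: Python's built-in "euc_kr" codec stands
in for the Korean KSC encoding, and NFC normalization stands in for the
one-way mapping. The function names are hypothetical.

```python
import unicodedata

YO, RYO = "\uf9c3", "\u907c"  # the two code points discussed in this thread


def roundtrip_ksc(text):
    """Unicode -> KSC bytes -> Unicode (the euc_kr codec is an assumption)."""
    return text.encode("euc_kr").decode("euc_kr")


def one_way(text):
    """Fold compatibility ideographs into their unified equivalents."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # Reversible: each code point has its own byte sequence in KSC.
    print(roundtrip_ksc(YO) == YO, roundtrip_ksc(RYO) == RYO)
    # One way: two different inputs end up as a single output.
    print(len({one_way(YO), one_way(RYO)}))
```

Two inputs that map to one output cannot be told apart afterwards,
which is exactly the loss I am complaining about.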

Thanks again.
Frank

On Sat, 30 May 2015 11:25:34 +0900, Otfried Cheong wrote:
> On Sat, May 30, 2015, at 08:33, Frank Hoffmann wrote:
>> Your observation seems correct.
>> It seems that Korean and Chinese fonts do have these double entries ... 
>> I take again the example you introduced first:
>> 
>> 遼 --> yo  --> \uf9c3 --> 63939 --> 遼
>> 遼 --> ryo --> \u907c --> 36988 --> 遼
> 
> The origin of this behaviour is not in the fonts, but in the different
> national encodings.  The Korean KSC encoding is different from JIS and
> GB in this respect.
> 
> The reason is very simple: Koreans expect words in Hanja to be sorted by
> their pronunciation.  It is not uncommon for Hanja lists / tables /
> posters to be sorted by the Hanja pronunciation.  This idea is quite
> alien to Japanese or Chinese, who will expect characters to be sorted by
> radical/stroke count.
> 
> In the 1970s/80s, when the national encodings were made, computers were
> not as powerful as today.  Therefore the national encoding was made in
> such a way that simply sorting the characters by their code point (that
> is, one simply treats the word as a sequence of bytes and sorts
> lexicographically by the byte value) will give a reasonable ordering. 
> The Korean encoding was modeled after the Japanese JIS encoding, but
> Koreans wanted the Hanja to be ordered by their pronunciation.  The
> first Hanja in KSC is pronounced 가, and so on.  Of course, this posed a
> problem for Hanja that have multiple pronunciations.  It was decided to
> simply encode these characters MULTIPLE times, once for each
> pronunciation.
> 
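
[FH: A small sketch of the byte-order sorting described above, under my
own assumptions: Python's "euc_kr" codec stands in for KSC, and U+4F3D
is my pick for a Hanja that sorts at the very start under 가. The
function names are made up for illustration.]

```python
def ksc_sort_key(word):
    """Sort key: the raw bytes of the word in KSC (euc_kr codec assumed)."""
    return word.encode("euc_kr")


def sort_hanja(words):
    """Sort by byte value only, with no dictionary or reading table."""
    return sorted(words, key=ksc_sort_key)


if __name__ == "__main__":
    GA = "\u4f3d"   # assumed example of a Hanja filed under 가
    RYO = "\u907c"  # the entry filed under ryo
    YO = "\uf9c3"   # the duplicate entry filed under yo
    for ch in sort_hanja([YO, GA, RYO]):
        print(f"U+{ord(ch):04X}", ksc_sort_key(ch).hex())
```

The same visual character lands in two different places of the sorted
list, once for each reading, purely because of its byte value.
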
> In the original Hanja set considered by KSC there were so few of these
> characters that this was considered quite a tolerable overhead.  In
> fact, it was argued that preserving the distinction between yo and ryo
> is useful for further processing.
> 
> Similar arguments would not have made sense for Chinese or Japanese, and
> were never considered, as far as I know.  Neither Chinese nor Japanese
> encodes the same character multiple times just because of the
> pronunciation.
> 
> Fast forward a few decades.  When Unicode started to include CJK
> languages in the 1990s, a huge effort was made to unify the CJK
> character sets, with much controversy - see
> http://en.wikipedia.org/wiki/Han_unification, or my (very old) article
> http://web.archive.org/web/20100328042929/http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html.
> 
> Unicode doesn't believe in multiple code points for the same character -
> but to preserve the ability to convert from KSC to Unicode and back
> without losing any information, Unicode has to include those extra code
> points.  Note that these are special code points, not inside the block
> for Hanja, but in a special block for "legacy and compatibility"
> characters.  If you look at the two Hanja above, you may notice that
> U+907C lies in the Hanja block, while U+F9C3 is in the special block
> labeled "CJK Compatibility Ideographs".
> 
> My feeling is that ultimately these extra code points will fall out of
> use.   Certainly Korean texts often do not care about the distinction,
> because the person entering the text wasn't careful or didn't even know
> which pronunciation is used (or the text was digitized using
> handwriting recognition or even OCR).
> 
> To summarize: these code points for identical characters with different
> pronunciation are exclusively used for Korean.  Korean fonts will
> contain (identical) glyphs for both code points.  So why is the
> distinction preserved when you use a Chinese font, but not with a
> Japanese font?  Without having looked at the specific fonts, my guess is
> that the Chinese font is fully Unicode-capable:  China has embraced
> Unicode, and so the designers included these code points for
> Unicode-compatibility, even though they would never be used in a Chinese
> text.   Japan, on the other hand, is still very sceptical of Unicode,
> and apparently the fonts lack these compatibility code points (even
> though the cost for including them would have been close to zero, as
> it's simply a mapping to an already existing glyph).  So when you format
> using the Japanese font, it seems that a compatibility mapping is
> applied, mapping the compatibility code points to their equivalent
> Unicode ideographs.
> 
> I hope this clarifies it a bit,
>  Otfried
> 
> 
> 
> 

--------------------------------------
Frank Hoffmann
http://koreanstudies.com