[KS] unicode

Frank Hoffmann hoffmann at koreanstudies.com
Sun May 31 18:36:08 EDT 2015

Dear Professor Cheong:

Thank you again.
On the first point below (I include your entire mail here as a quote): 
I am not sure I understand it correctly.
Duplicate "code points" (definition here, for everyone to follow: 
http://en.wikipedia.org/wiki/Code_point) ...  we find them listed here: 
http://unicode.org/charts/  ... look for "East Asian Scripts" ... let 
me come back to Andrew's original example here: 요 遼 and 료 遼.
We find this ideograph, as you pointed out to me in your earlier mail, 
in two places: once in the "regular" character table, and then in the 
"CJK Compatibility Ideographs" set. Here is a listing of that CJK 
Compatibility Ideographs set with ALL the code points etc., as it is 
(or should be) implemented in modern Unicode fonts:
We find 遼 (요) on page 12 of that PDF document (U+F9C3) -- while the 
료 遼 version, as you pointed out earlier, is in the regular table.
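
The duplication is easy to verify with nothing but Python's standard 
library -- a small sketch using the two code points discussed above, 
U+F9C3 and U+907C:

```python
import unicodedata

compat = "\uf9c3"   # 遼 from the CJK Compatibility Ideographs block
regular = "\u907c"  # 遼 from the regular CJK Unified Ideographs block

# Two distinct code points for the same-looking character...
print(hex(ord(compat)), hex(ord(regular)))  # 0xf9c3 0x907c

# ...but canonical normalization (even NFC) folds the compatibility
# code point back into the regular one, erasing the distinction:
print(unicodedata.normalize("NFC", compat) == regular)  # True
```

So a text that uses U+F9C3 to mark the 요 reading loses that 
information the moment any normalization step touches it.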

[Attachment: yo.jpg, image/jpeg, 8481 bytes: 
<http://koreanstudies.com/pipermail/koreanstudies_koreanstudies.com/attachments/20150531/d15b1da3/attachment.jpg>]

But this does NOT just mean setting two "code points" (= two codes) for 
the same glyph (for the same 'image' that depicts a Chinese character); 
rather, the glyph itself is present twice -- the image of that 
character is actually doubled, NOT just a second code reference to it. 
Exactly that allows reversibility.
But when you say we COULD add further "code points," then you are 
saying, in plain words: not doubling the actual ideographs (the images 
showing the characters), but adding more codes to the SAME ideograph. 
Isn't that how e.g. the JAPANESE fonts are encoded already? Even if we 
take the example above, the "CJK Compatibility Ideographs" for Korean 
HANMUN characters with dual pronunciation are NOT being used; instead 
there are code points to the SAME ideograph ... basically "redirects" 
or "double assignments" ... to put it in other words: typing "red 
apple" and "green apple" both produce the same image of an apple. THAT 
is then not reversible.
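
A minimal sketch of such a "double assignment," modeled on the 
code-point-to-glyph table (cmap) inside a font -- the glyph number here 
is invented for illustration:

```python
# Hypothetical excerpt of a font's cmap table: code point -> glyph index.
# Both code points point at the SAME glyph, so they render identically,
# and the reverse direction (glyph -> code point) is ambiguous.
cmap = {
    0x907C: 1234,  # 遼, regular CJK Unified Ideographs block
    0xF9C3: 1234,  # 遼, CJK Compatibility Ideographs block
}

assert cmap[0x907C] == cmap[0xF9C3]  # the same "image of an apple"

# Going backwards, one glyph maps to two code points -- not reversible:
sources = [hex(cp) for cp, glyph in cmap.items() if glyph == 1234]
print(sources)  # ['0x907c', '0xf9c3']
```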

Am I completely misunderstanding you here? (Well possible.)

You wrote:
> The main question is whether this effort is worthwhile.   (...)  I
> guess that JLP software can already predict the reading of a text quite
> accurately using dictionaries and context, so there may simply not be
> enough incentive anymore to build a system that would support the
> creation and maintenance of these annotations.

Yes, sure! As a historian I would consider this a done deal, and too 
late now to change the basics! Let me just say that, unlike in your 
example with the different pronunciations of the Latin letter "o," 
Chinese characters are NOT used in that many languages, and thus there 
are not that many variations, so that kind of generic logic is not 
quite valid, in my view. In terms of computing I would consider it a 
big missed chance to put that into Unicode. All the other "solutions" 
you briefly mentioned are and will be work-arounds; none of them can 
possibly be clean ones that work 100% of the time, and all of them are 
far more (!) complicated. What could have been done with a simple font 
encoding then requires extensive and complicated code and math. I 
think you agree with me here. ... I was therefore just wondering why 
that had not been done, and now I (and we all) have a better 
understanding of the process. Given the far stronger power of the 
Internet giants these days, if that same process had started 15 years 
later, I can't imagine the decisions would still be as "academically" 
limited as they now are. All is history and historically determined, 
even computer code!
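
To make the "reading selector" scheme from your mail concrete, here is 
a hypothetical sketch -- the selector code points and the reading table 
are invented for illustration only; no such selectors exist in Unicode:

```python
# Invented "reading selectors" (hypothetical code points, NOT in Unicode):
READING_SELECTORS = {
    "\U000F0000": "ryo (료)",  # hypothetical selector for reading #1
    "\U000F0001": "yo (요)",   # hypothetical selector for reading #2
}

def annotate(text):
    """Pair each character with the reading named by a following selector."""
    out, prev = [], None
    for ch in text:
        if ch in READING_SELECTORS and prev is not None:
            out.append((prev, READING_SELECTORS[ch]))
            prev = None
        else:
            if prev is not None:
                out.append((prev, None))
            prev = ch
    if prev is not None:
        out.append((prev, None))
    return out

# 遼 followed by the hypothetical "yo" selector:
print(annotate("\u907c\U000F0001"))  # [('遼', 'yo (요)')]
```

The base character stays the single canonical 遼; the reading travels 
alongside it and can be dropped without changing the text -- the 
opposite of encoding the reading into duplicate code points.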

Thanks, that was fun!

On Mon, 01 Jun 2015 00:23:54 +0900, Otfried Cheong wrote:
> On Sun, May 31, 2015, at 06:41, Frank Hoffmann wrote:
>> At this point in time this argument seems null and void to me. The new 
>> "Noto" fonts, just as an example -- the "all-inclusive" version -- are 
>> 115 MB large. So, maybe if we triple large numbers of characters for 
>> the Kanji, it may then be 200 or 300 or 500 MB at most. 
> Actually, adding duplicate code points for glyphs already in the font
> would take very little extra space in the font.  You see, a modern font
> contains a table that maps code points to glyph numbers.  So if two code
> points are to be rendered with the exact same glyph, the second one
> really only needs a few extra bytes for an additional mapping.
>> Possible advantages of doing this?
>> - There might be many more that would come up as a result of the 
>> technical possibility that I cannot think of now -- but what comes to 
>> mind first is certainly translation software: that now works with 
>> dictionaries, same as for Chinese and Korean.
> I'm sure there are many advantages of having readings available.  One
> obvious application would be for visually impaired computer users - they
> need to rely on screen reading software, which would benefit greatly
> from having the reading available.
> So clearly it is a good idea to build systems where Kanji are
> (automatically) annotated with the reading.  But the question is -
> should this be done through the character encoding?   
> (1) For reading data to be useful, it needs to be reliable.  This is
> only possible if users understand what's going on: their editor would
> have to be able to show the reading (for instance when the mouse is
> hovering over the Kanji), warn when the suffix of a word starting with
> Kanjis is changed (which might cause the reading to change), etc.  Users
> should not need to know about character encoding, so the encoding level
> seems the wrong place to put this annotation.
> (2) Consider the following four English words:
> rhomb, comb, tomb, bomb.
> Each of them contains the letter "o", but it is pronounced differently
> in each of these words.  I personally would find it very helpful if
> every English word was annotated with its pronunciation!  Shall we
> create different versions of the letter "o" for its different
> pronunciations?
> No, clearly not.  An "o" is an "o" - it's a character, not a sound.  And
> for the same reason it would be wrong to make duplicate copies of the
> Kanji.
> (3) We have long left the stage where "one codepoint, one character" was
> true.   Unicode nowadays often represents a single character using a
> sequence of codepoints, for instance to describe accents.
> For Kanji, there is already a set of "Variation Selectors", that allow
> you to specify which precise variant (e.g. glyph) of a Kanji you wish to
> use: http://www.unicode.org/charts/PDF/UFE00.pdf
> IF you wanted to represent readings on the encoding level (and that's
> still a big IF), one could do the same:  Create a set of "Reading
> selectors", and represent a Kanji as:
> <Kanji codepoint> <reading selector>
> This way you can easily allow 100 different readings or more for each
> Kanji without having to worry about font file sizes, or, more
> importantly, about the waste of code points.
> ---
> The main question is whether this effort is worthwhile.   Someone more
> familiar with Japanese language processing would have to answer that.  I
> guess that JLP software can already predict the reading of a text quite
> accurately using dictionaries and context, so there may simply not be
> enough incentive anymore to build a system that would support the
> creation and maintenance of these annotations.
> Best wishes,
>  Otfried

Frank Hoffmann

More information about the Koreanstudies mailing list