[KS] unicode

Otfried Cheong otfried at airpost.net
Sun May 31 11:23:54 EDT 2015

On Sun, May 31, 2015, at 06:41, Frank Hoffmann wrote:

> At this point in time this argument seems null and void to me. That new 
> "Noto" fonts, just as an example, the "all-inclusive" version of it, is 
> 115 MB large. So, maybe if we triple large numbers of characters for 
> the Kanji, it may then be 200 or 300 or 500 MB at most. 

Actually, adding duplicate code points for glyphs already in the font
would take very little extra space in the font.  You see, a modern font
contains a table that maps code points to glyph numbers.  So if two code
points are to be rendered with the exact same glyph, the second one
really only needs a few extra bytes for an additional mapping.
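
To make the size argument concrete, here is a toy model of that mapping. It is only a sketch: the dictionaries, the 2000-byte outline size, and the use of the private-use code point U+E000 as the "duplicate" are assumptions for illustration, not the layout of any real font file.

```python
# Toy model of a font, NOT a real font file format.
glyphs = {0: b"\x00" * 2000}     # glyph number -> outline data (size made up)
cmap = {0x751F: 0}               # code point -> glyph number

def add_alias(cmap, new_code_point, existing_code_point):
    """Make new_code_point render with the glyph of existing_code_point."""
    cmap[new_code_point] = cmap[existing_code_point]

# U+E000 is a private-use code point, used here as a hypothetical duplicate.
add_alias(cmap, 0xE000, 0x751F)

print(len(glyphs))                      # number of glyph outlines stored
print(cmap[0xE000] == cmap[0x751F])     # do both code points share a glyph?
```

The alias adds one entry to the mapping and no new outline data, which is where nearly all of the space in this model goes.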

> Possible advantages of doing this?
> - There might be many more that would come up as a result of the 
> technical possibility that I cannot think of now -- but what comes to 
> mind first is certainly translation software: that now works with 
> dictionaries, same as for Chinese and Koreans.

I'm sure there are many advantages to having readings available.  One
obvious application would be for visually impaired computer users - they
need to rely on screen reading software, which would benefit greatly
from having the reading available.

So clearly it is a good idea to build systems where Kanji are
(automatically) annotated with the reading.  But the question is -
should this be done through the character encoding?   

(1) For reading data to be useful, it needs to be reliable.  This is
only possible if users understand what's going on: their editor would
have to be able to show the reading (for instance when the mouse is
hovering over the Kanji), warn when the suffix of a word starting with
Kanji is changed (which might cause the reading to change), etc.  Users
should not need to know about character encoding, so the encoding level
seems the wrong place to put this annotation.

(2) Consider the following four English words:

rhomb, comb, tomb, bomb.

Each of them contains the letter "o", but it is pronounced differently
in each of these words.  I personally would find it very helpful if
every English word was annotated with its pronunciation!  Shall we
create different versions of the letter "o" for its different

No, clearly not.  An "o" is an "o" - it's a character, not a sound.  And
for the same reason it would be wrong to make duplicate copies of the
same Kanji for its different readings.

(3) We have long left the stage where "one codepoint, one character" was
true.   Unicode nowadays often represents a single character using a
sequence of codepoints, for instance to describe accents.
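
A small Python check of the accent case, using only the standard unicodedata module. The choice of "é" as the example character is mine; any accented letter with a precomposed form would do.

```python
import unicodedata

composed = "\u00e9"       # "é" as one precomposed code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Same character to the reader, different code point sequences.
print(len(composed), len(decomposed))

# Normalization converts between the two representations.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```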

For Kanji, there is already a set of "Variation Selectors" that allow
you to specify which precise variant (e.g. glyph) of a Kanji you wish to
use: http://www.unicode.org/charts/PDF/UFE00.pdf
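
In code, such a sequence is simply the Kanji followed by a selector from the U+FE00..U+FE0F block shown in that chart. The sketch below pairs U+4FAE with VARIATION SELECTOR-1 purely for illustration; it does not check whether that particular pairing is a registered sequence, and strip_variation_selectors is a helper name made up here.

```python
import unicodedata

base = "\u4fae"           # a CJK unified ideograph
selector = "\ufe00"       # first code point of the Variation Selectors block
sequence = base + selector

print(len(sequence))                  # one visible character, two code points
print(unicodedata.name(selector))     # official name of the selector

def strip_variation_selectors(text):
    """Drop U+FE00..U+FE0F, leaving only the base characters."""
    return "".join(ch for ch in text if not 0xFE00 <= ord(ch) <= 0xFE0F)

print(strip_variation_selectors(sequence) == base)
```

Software that does not care about the variant can ignore the selector and still sees the same Kanji.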

IF you wanted to represent readings on the encoding level (and that's
still a big IF), one could do the same: create a set of "Reading
selectors", and represent a Kanji as:

<Kanji codepoint> <reading selector>

This way you can easily allow 100 different readings or more for each
Kanji without having to worry about font file sizes or, more
importantly, about wasting code points.
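
A rough sketch of how such a scheme could look to software. Everything here is hypothetical: Unicode defines no reading selectors, so private-use code points starting at U+F0000 stand in for them, the limit of 256 selectors is arbitrary, and the small reading table for U+751F is only an illustrative sample.

```python
# Hypothetical scheme: private-use code points stand in for the
# non-existent "reading selectors".
READING_SELECTOR_BASE = 0xF0000   # Supplementary Private Use Area-A
READING_SELECTOR_COUNT = 256      # arbitrary limit for this sketch

# Illustrative sample: reading index -> romanized reading.
READINGS = {"\u751f": ["sei", "shou", "nama"]}

def with_reading(kanji, index):
    """Return <Kanji codepoint> <reading selector>."""
    return kanji + chr(READING_SELECTOR_BASE + index)

def is_reading_selector(ch):
    offset = ord(ch) - READING_SELECTOR_BASE
    return 0 <= offset < READING_SELECTOR_COUNT

def decode(text):
    """Return (character, reading or None) pairs for the text."""
    result = []
    for ch in text:
        if is_reading_selector(ch) and result:
            kanji, _ = result[-1]
            index = ord(ch) - READING_SELECTOR_BASE
            table = READINGS.get(kanji, [])
            reading = table[index] if index < len(table) else None
            result[-1] = (kanji, reading)
        else:
            result.append((ch, None))
    return result

def strip_readings(text):
    """Remove the selectors, leaving ordinary text."""
    return "".join(ch for ch in text if not is_reading_selector(ch))

annotated = with_reading("\u751f", 2) + "\u30d3\u30fc\u30eb"
print(decode(annotated)[0])
print(strip_readings(annotated))
```

The font never needs to know about the selectors: a renderer can skip them the same way strip_readings does, so no glyphs and no Kanji code points are duplicated.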


The main question is whether this effort is worthwhile.   Someone more
familiar with Japanese language processing would have to answer that.  I
guess that JLP software can already predict the reading of a text quite
accurately using dictionaries and context, so there may simply not be
enough incentive anymore to build a system that would support the
creation and maintenance of these annotations.

Best wishes,
