[KS] unicode

Thu Jun 4 10:13:29 EDT 2015

On Mon, Jun 1, 2015, at 07:36, Frank Hoffmann wrote:
> let me come back to Andrew's original example here: 요 遼 and 료 遼.
> http://unicode.org/charts/PDF/UF900.pdf
> We find this ideograph, as you pointed out to me in your earlier mail, 
> at two places, once in the "regular" character table, and then in the 
> "CJK Compatibility Ideographs" set. Here is a listing of that CJK 
> Compatibility Ideographs set with ALL the code points etc. as it is or 
> should be implemented in modern Unicode fonts: 
> http://unicode.org/charts/PDF/U2F800.pdf
> We find 遼 (요) on page 12 of that PDF document (\uf9c3) -- while the 
> 료 遼 version, as you pointed out earlier, is in the regular table.

Actually that character is U+2F9C3, a character in Unicode plane 2.  It
also doesn't look anything like U+907C.  The character we were
discussing is U+F9C3 in plane 0 (the BMP), on page 6 of this link: 
http://unicode.org/charts/PDF/UF900.pdf 

> But this does NOT just mean setting two "code points" (= two codes) for 
> the same glyph (for the same 'image' that depicts a Chinese character), 
> but the glyph itself is present twice, the image of that character is 
> actually doubled -- NOT just a code reference to it. Exactly that 
> allows reversibility.

Reversibility comes from the fact that there are two code points for the
two pronunciations of this same character.   The font has absolutely
nothing to do with it.  After all, a font is only a mechanism to convert
a sequence of characters to a picture on paper or on the screen.   You can
do the reversal Hanja -> Hangul by working strictly on a text file, with
no font ever coming into play! (In fact, the font contains no
information that would allow you to do this reversal.)
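
As a minimal sketch of this point, the following Python snippet does the
Hanja -> Hangul reversal on a plain string, with no font involved.  The
two-entry table READINGS and the function name to_hangul are made up for
this illustration; a real table would have to cover all Hanja.

```python
# Hypothetical two-entry reading table, using the example from this thread.
# The two keys look identical on screen but are different code points.
READINGS = {
    "\u907c": "료",  # U+907C, from the regular table
    "\uf9c3": "요",  # U+F9C3, from the CJK Compatibility Ideographs block
}

def to_hangul(text):
    """Replace each Hanja found in READINGS by its Hangul reading."""
    return "".join(READINGS.get(ch, ch) for ch in text)

if __name__ == "__main__":
    print(to_hangul("\u907c"), to_hangul("\uf9c3"))
```

The lookup is keyed on the code point alone, which is all that a text
file contains.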

As the font is simply a mechanism to convert characters to "pictures",
it is possible to map different characters to the same picture (glyph). 
 This mechanism has been available in TrueType fonts for a long time,
and is certainly in all modern font formats.  Several characters can use
the same glyph, and the glyph needs to be in the font only once.
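
A toy model of such a character map, as a rough sketch only: the names
CMAP, GLYPHS and glyph_for, and the glyph name "liao", are invented here
and do not come from any real font or font library.

```python
# Toy model of a font's character map: code point -> glyph name.
CMAP = {
    0x907C: "liao",
    0xF9C3: "liao",  # a second code point that uses the same glyph
}

# The outline itself is stored only once, under its glyph name.
GLYPHS = {"liao": "<outline data>"}

def glyph_for(ch):
    """Return the glyph name the toy font uses for the character ch."""
    return CMAP[ord(ch)]
```

Both characters resolve to the same glyph, and nothing in the table says
anything about readings.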

> But when you say we COULD add further "code points" then you say, in 
> plain words, not doubling the actual ideographs, the images showing 
> characters, but adding more codes to the SAME ideograph. Isn't that how 
> e.g. the JAPANESE fonts are encoded already. Even if we take this 
> example above, the "CJK Compatibility Ideographs" for Korea HANMUN 
> characters with dual pronunciation are NOT being used but instead there 
> are code points to the SAME ideography ... basically "redirects" or 
> "double assignments" ... typing "red apple" and "green apple" both 
> times produce the same image of an apple, to put it in other words. 
> THAT is then not reversible.

The "redirection" that you observed when you switched to the Japanese
fonts, where the character U+F9C3 was replaced by U+907C, is not
directly caused by the font.   Again, a font is merely a mechanism to
map characters to pictures.  A font cannot possibly modify a piece of
text!   What's happening is certainly not that the font contains a
redirection from U+F9C3 to U+907C - in fact the problem is that the font
contains NO mapping for U+F9C3.

What happened instead is (probably roughly) the following:  When you
changed the font, your text processor noticed that the font does not
contain all the characters in the piece of text.   Now, in most cases it
would simply display those characters with a different font.  But in
this case it looked up in its internal tables that the character in
question was a character that has an equivalent form, and it noticed
that this equivalent form is available in the font.  So it replaced
U+F9C3 with the equivalent form U+907C, losing reversibility. 
Personally I think this is wrong:  it should at least have warned the
user that it is modifying the text that is being formatted.  This could
be reported as a bug - not of the font, but of this specific text
processor.  
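
One standard operation that has exactly this effect is Unicode
normalization.  Whether the text processor in question actually uses
normalization is an assumption; the sketch below only shows, with
Python's standard unicodedata module, that the replacement is a pure
text operation and that it cannot be undone afterwards.

```python
import unicodedata

original = "\uf9c3"  # the compatibility ideograph discussed above
normalized = unicodedata.normalize("NFC", original)

# After normalization only the equivalent form is left; nothing in the
# result records that the character used to be U+F9C3.
print(f"U+{ord(original):04X} -> U+{ord(normalized):04X}")
```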

> it a big missed chance to put that in Unicode. All other "solutions" 
> you briefly mentioned are and will be work-arounds, none of them can 
> possibly be clean ones, ones that work 100% of the time, and they 
> are all far more (!) complicated. What could have been done with a 
> simple font encoding then requires extensive and complicated code and 
> math.

Actually the possible solution I mentioned as (3), with "Reading variant
markers" would be completely indistinguishable for the user from the use
of separate code points.   It certainly does not require complicated
code or math.  

If you are used to a model where the two mappings

  code point -> character       and then      character -> glyph

are simple one-to-one mappings, then you may think that reading variant
markers are complicated.  But Unicode left this model behind about
twenty years ago.    In Unicode, a sequence of code points can represent
a single character (for instance, a letter and several accent markers
define a single accented letter, or a sequence of conjoining jamo
elements define a single Hangul syllable); and many languages cannot be
typeset without the ability to automatically choose between different
glyphs based on the context that a character appears in (e.g. Arabic).  
The Unicode formatting algorithm can handle these and more complicated
issues (for instance, left-to-right and right-to-left writing
intermixed) - adding my reading markers (which after all have no visible
effect) is a trivial change to this algorithm.
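
The two sequence examples can be checked with Python's standard
unicodedata module.  This is only a sketch: normalization is used here
to make the equivalence visible, it is not the formatting algorithm
itself, and the variable names are chosen for this illustration.

```python
import unicodedata

# A letter followed by a combining acute accent: two code points.
accented = "e\u0301"
# Three conjoining jamo (initial, vowel, final): three code points.
jamo = "\u1112\u1161\u11ab"

# NFC composes each sequence into its single precomposed code point.
composed_accent = unicodedata.normalize("NFC", accented)
composed_syllable = unicodedata.normalize("NFC", jamo)

print(len(accented), len(composed_accent))
print(len(jamo), len(composed_syllable))
```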

It's only the hobbyist's simplified use of Unicode that would need a bit
more care to handle this. 

And really "font encoding" has nothing to do with it.   The fact that
some characters are reading variants of each other is completely
independent of fonts - a text processor would need to know about this
fact without reference to the specific font.  What a nightmare if
reading alternatives depended on the font in use!  These variants would
have to be defined inside the Unicode support library of the OS. 

Again, a font is simply a mechanism to map characters (or simplified,
code points) to glyphs.   The font has NOTHING to do with the mapping
from Hangul to Hanja (that is done by the input method), or with the
reverse mapping, when that is possible.   The font certainly contains no
information that would make either mapping possible.

Best wishes,
 Otfried