[KS] unicode

Frank Hoffmann hoffmann at koreanstudies.com
Thu Jun 4 19:38:16 EDT 2015


Dear Professor Cheong:

Many, many thanks!
Your last posting is an eye-opener! I indeed last looked closely at 
fonts 20 years ago. So I guess, that is what mostly informed my 
understanding of the structural setup -- and while I had a concept of 
how Unicode works, I indeed missed out completely on ONE essential 
element. You wrote:

 > These variants would
 > have to be defined inside the Unicode support library of the OS.

THAT is the key to my misunderstanding. I was taking it for granted 
that the Unicode tables and everything else are built into the Unicode 
fonts (and that I just do not see them with the tools I use). So, yes, 
sorry, my area is network administration & programming, and some 
*selected* areas of system administration -- but with no regular 
computer science education I certainly miss out on things I am 
otherwise not working with. So, again, thank you for this! It all makes 
sense now, and I can puzzle things together (and know where to look).

Also, as a PS to two of my earlier statements:
(a) Unicode compatibility characters: I see there already are Unicode 
normalization scripts that do the remapping (thereby resulting in the 
loss of reversibility of Chinese characters in Korean with two 
pronunciations) -- just as I had suggested could be done -- and that is 
e.g. implemented by Wikipedia and various mainstream computer programs 
(I think the Windows version of MS Word also).
(b) In this same connection I now found documents at the Unicode 
Consortium that also clearly show that the DIRECTION -- as you and also 
Professor Muller had pointed out -- goes in the exact opposite 
direction from what I would think makes sense (for the sake of looking 
slick and slim and united). How exactly this works technically (as I 
understand now) makes no difference there.
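
To illustrate the remapping mentioned under (a), here is a minimal sketch using Python's standard `unicodedata` module. It only shows the effect on the one character discussed in this thread; which programs apply such normalization, and when, is not something the sketch can tell.

```python
import unicodedata

compat = "\uF9C3"    # CJK COMPATIBILITY IDEOGRAPH-F9C3 (the duplicate 遼)
unified = "\u907C"   # CJK UNIFIED IDEOGRAPH-907C (the regular 遼)

# Before normalization the two code points are distinct strings.
distinct_before = compat != unified

# Each of the four normalization forms replaces the compatibility
# ideograph with the unified one, so the distinction is gone afterwards.
normalized = {form: unicodedata.normalize(form, compat)
              for form in ("NFC", "NFD", "NFKC", "NFKD")}
lost_after = all(value == unified for value in normalized.values())

print(distinct_before, lost_after)
```

Once a text has passed through any of these forms, nothing in the text itself records which of the two code points was originally typed.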

One add-on question .... to squeeze out the last I can from you as the 
expert on this :)
I stumbled over a KLDP project, a Hanja-Hangul converter that 
also allows automated conversion into various styles, including Han 
unification, North Korean, etc. It looks interesting and could POSSIBLY 
be useful in Korean studies. It's Python code, and works fine in 
terminal/SSH mode. 
http://kldp.net/projects/hanja/
My simple question: is there a PHP or HTML framework that allows such 
Python code to be accessed/run (so it can be accessible via the Web)? 
Didn't see any implementation of that. ... Hope I did not stretch it 
too far.
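
One possible direction, sketched under assumptions: instead of PHP, Python code can be exposed on the Web directly through the WSGI interface, for which the standard library ships a small reference server (`wsgiref`). In the sketch below, `convert` is a hypothetical placeholder that returns its input unchanged; it is NOT the KLDP project's API, which would have to be looked up and called in its place.

```python
from urllib.parse import parse_qs

def convert(text):
    """Hypothetical stand-in for the real converter; returns text unchanged."""
    return text

def app(environ, start_response):
    """Minimal WSGI application: /?text=... returns the converted text."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    text = query.get("text", [""])[0]
    body = convert(text).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# To try it locally (blocks until interrupted):
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

The same `app` object could also be run behind a regular web server through a WSGI adapter, so the converter itself would not need to be rewritten.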


Best,
Frank



On Thu, 04 Jun 2015 23:13:29 +0900, Otfried Cheong wrote:
> On Mon, Jun 1, 2015, at 07:36, Frank Hoffmann wrote:
>> let me come back to Andrew's original example here: 요 遼 and 료 遼.
>> http://unicode.org/charts/PDF/UF900.pdf
>> We find this ideograph, as you pointed out to me in your earlier mail, 
>> at two places, once in the "regular" character table, and then in the 
>> "CJK Compatibility Ideographs" set. Here is a listing of that CJK 
>> Compatibility Ideographs set with ALL the code points etc. as it is or 
>> should be implemented in modern Unicode fonts: 
>> http://unicode.org/charts/PDF/U2F800.pdf
>> We find 遼 (요) on page 12 of that PDF document (\uf9c3) -- while the 
>> 료 遼 version, as you pointed out earlier, is in the regular table.
> 
> Actually that character is U+2F9C3, a character in Unicode plane 2.  It
> also doesn't look anything like U+907C.  The character we were
> discussing is U+F9C3 in plane 0 (the BMP), on page 6 of this link: 
> http://unicode.org/charts/PDF/UF900.pdf 
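
The difference between the three code points mentioned here can be checked with a short sketch. The helper `plane` is introduced only for this illustration; the names and decompositions come from whatever version of the Unicode Character Database ships with the Python in use.

```python
import unicodedata

def plane(code_point):
    """Return the Unicode plane number (0 is the BMP) of a code point."""
    return code_point >> 16

for cp in (0x907C, 0xF9C3, 0x2F9C3):
    ch = chr(cp)
    print(f"U+{cp:04X}  plane {plane(cp)}  "
          f"{unicodedata.name(ch, '(unnamed)')}  "
          f"decomposition: {unicodedata.decomposition(ch) or '(none)'}")
```

U+F9C3 lies in plane 0 and decomposes to U+907C, whereas U+2F9C3 lies in plane 2 and is an unrelated character that only happens to share the last four hex digits.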
> 
>> But this does NOT just mean setting two "code points" (= two codes) for 
>> the same glyph (for the same 'image' that depicts a Chinese character), 
>> but the glyph itself is present twice, the image of that character is 
>> actually doubled -- NOT just a code reference to it. Exactly that 
>> allows reversibility.
> 
> Reversibility comes from the fact that there are two code points for the
> two pronunciations of this same character.   The font has absolutely
> nothing to do with it.  After all, a font is only a mechanism to convert
> a sequence of characters to a picture on paper/ on the screen.   You can
> do the reversal Hanja -> Hangul by working strictly on a text file, with
> no font ever coming into play! (In fact, the font contains no
> information that would allow you to do this reversal.)
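
A minimal sketch of such a font-free reversal, working purely on text. The two-entry table `READINGS` is an assumption built from the example in this thread (료 for the regular code point, 요 for the compatibility one); a real converter would need a complete table.

```python
# Illustrative reading table based on the example in this thread;
# it covers only the two code points for 遼.
READINGS = {
    "\u907C": "료",   # U+907C, regular (unified) ideograph
    "\uF9C3": "요",   # U+F9C3, compatibility ideograph
}

def hanja_to_hangul(text):
    """Replace each Hanja found in READINGS by its Hangul reading."""
    return "".join(READINGS.get(ch, ch) for ch in text)

sample = "\u907C / \uF9C3"
print(hanja_to_hangul(sample))
```

The two characters look identical on screen, yet the lookup distinguishes them because it sees only code points and never consults a font.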
> 
> As the font is simply a mechanism to convert characters to "pictures",
> it is possible to map different characters to the same picture (glyph). 
>  This mechanism has been available in Truetype fonts for a long time,
> and is certainly in all modern font formats.  Several characters can use
> the same glyph, and the glyph needs to be in the font only once.
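
As a toy model of this mechanism (not the actual TrueType data layout): glyph outlines are stored once, and a character map points several code points at the same glyph. The names `GLYPHS`, `CMAP`, and `glyph_liao` are hypothetical and exist only for this illustration.

```python
# Toy model of a font: each glyph outline is stored once, and a character
# map (in the spirit of a font's cmap table) points code points at glyphs.
GLYPHS = {
    "glyph_liao": "<outline data for 遼>",   # hypothetical glyph name
}
CMAP = {
    0x907C: "glyph_liao",   # unified ideograph
    0xF9C3: "glyph_liao",   # compatibility ideograph, same picture
}

def render(text):
    """Return the glyph names a renderer would draw; the text is not changed."""
    return [CMAP.get(ord(ch), ".notdef") for ch in text]

text = "\u907C\uF9C3"
print(render(text))                          # two characters, one shared glyph
print(len(set(render(text))), len(GLYPHS))
```

The mapping runs only from characters to glyphs, so rendering never feeds anything back into the text.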
> 
>> But when you say we COULD add further "code points" then you say, in 
>> plain words, not doubling the actual ideographs, the images showing 
>> characters, but adding more codes to the SAME ideograph. Isn't that how 
>> e.g. the JAPANESE fonts are encoded already? Even if we take this 
>> example above, the "CJK Compatibility Ideographs" for Korean HANMUN 
>> characters with dual pronunciation are NOT being used but instead there 
>> are code points to the SAME ideograph ... basically "redirects" or 
>> "double assignments" ... typing "red apple" and "green apple" both 
>> times produce the same image of an apple, to put it in other words. 
>> THAT is then not reversible.
> 
> The "redirection" that you observed when you switched the font to the
> Japanese fonts, and the character U+F9C3 was replaced by U+907C is not
> directly caused by the font.   Again, a font is merely a mechanism to
> map characters to pictures.  A font cannot possibly modify a piece of
> text!   What's happening is certainly not that the font contains a
> redirection from U+F9C3 to U+907C - in fact the problem is that the font
> contains NO mapping for U+F9C3.
> 
> What happened instead is (probably roughly) the following:  When you
> changed the font, your text processor noticed that the font does not
> contain all the characters in the piece of text.   Now, in most cases it
> would simply display those characters with a different font.  But in
> this case it looked up in its internal tables that the character in
> question was a character that has an equivalent form, and it noticed
> that this equivalent form is available in the font.  So it replaced
> U+F9C3 with the equivalent form U+907C, losing reversibility. 
> Personally I think this is wrong:  it should at least have warned the
> user that it is modifying the text that is being formatted.  This could
> be reported as a bug - not of the font, but of this specific text
> processor.  
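
The behaviour guessed at above can be sketched as follows. This is an assumption about what such a text processor might do, not a description of any particular program; `displayable` and `coverage` are hypothetical names, and the warning list stands in for the notification the user arguably should get.

```python
import unicodedata

def displayable(text, coverage):
    """Sketch of the substitution described above.

    `coverage` is the set of characters the chosen font can draw.
    Returns the (possibly altered) text and a list of warnings.
    """
    out, warnings = [], []
    for ch in text:
        equivalent = unicodedata.normalize("NFC", ch)
        if ch in coverage or equivalent == ch or equivalent not in coverage:
            out.append(ch)            # draw as is, or leave it to font fallback
        else:
            out.append(equivalent)    # this is the step that alters the text
            warnings.append(
                f"U+{ord(ch):04X} replaced by U+{ord(equivalent):04X}")
    return "".join(out), warnings

japanese_font = {"\u907C"}            # covers U+907C but not U+F9C3
print(displayable("\uF9C3", japanese_font))
```

The substitution happens in the program's own logic, with its internal tables; the font contributes nothing except the list of characters it covers.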
> 
>> it a big missed chance to put that in Unicode. All other "solutions" 
>> you briefly mentioned are and will be work-arounds, none of them can 
>> possibly be clean ones, ones that work 100% of the time, and they 
>> are all far more (!) complicated. What could have been done with a 
>> simple font encoding then requires extensive and complicated code and 
>> math.
> 
> Actually the possible solution I mentioned as (3), with "Reading variant
> markers" would be completely indistinguishable for the user from the use
> of separate code points.   It certainly does not require complicated
> code or math.  
> 
> If you are used to a model where the two mappings
> 
>   code point -> character       and then      character -> glyph
> 
> are simple one-to-one mappings, then you may think that reading variant
> markers are complicated.  But Unicode left this model behind about
> twenty years ago.    In Unicode, a sequence of code points can represent
> a single character (for instance, a letter and several accent markers
> define a single accented letter, or a sequence of conjoining jamo
> elements defines a single Hangul syllable); and many languages cannot be
> typeset without the ability to automatically choose between different
> glyphs based on the context that a character appears in (e.g. Arabic).  
> The Unicode formatting algorithm can handle these and more complicated
> issues (for instance, left-to-right and right-to-left writing
> intermixed) - adding my reading markers (which after all have no visible
> effect) is a trivial change to this algorithm.
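
The two examples of "several code points, one character" given above can be reproduced with a short sketch using Python's standard `unicodedata` module; the particular letter and syllable are chosen here only for illustration.

```python
import unicodedata

# A base letter followed by a combining acute accent: two code points,
# one accented letter.
accented = "e\u0301"

# Three conjoining jamo (choseong HIEUH, jungseong A, jongseong NIEUN):
# three code points, one Hangul syllable.
jamo = "\u1112\u1161\u11AB"

for seq in (accented, jamo):
    composed = unicodedata.normalize("NFC", seq)
    print([f"U+{ord(c):04X}" for c in seq], "->",
          [f"U+{ord(c):04X}" for c in composed])
```

In both cases NFC folds the sequence into a single precomposed code point, which shows that software handling Unicode text already has to treat code point sequences as units.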
> 
> It's only the hobbyist's simplified use of Unicode that would need a bit
> more care to handle this. 
> 
> And really "font encoding" has nothing to do with it.   The fact that
> some characters are reading variants of each other is completely
> independent of fonts - a text processor would need to know about this
> fact without reference to the specific font.  What a nightmare if
> reading alternatives depended on the font in use!  These variants would
> have to be defined inside the Unicode support library of the OS. 
> 
> Again, a font is simply a mechanism to map characters (or simplified,
> code points) to glyphs.   The font has NOTHING to do with the mapping
> from Hangul to Hanja (that is done by the input method), or with the
> reverse mapping, when that is possible.   The font certainly contains no
> information that would make either mapping possible.
> 
> Best wishes,
>  Otfried
> 

--------------------------------------
Frank Hoffmann
http://koreanstudies.com

