Unicode, or the Universal Character Set (UCS), was developed to end once and for all the problems associated with the abundance of character sets used for writing text in different languages. It is a single character set whose goal is to be a superset of all others used before, and to contain every character used in writing any language (including many dead languages) as well as other symbols used in mathematics and engineering. Any charset can be losslessly converted to Unicode, as we’ll see.
Great reading about Unicode and how an OS deals with it is here:
http://www.cs.bell-labs.com/sys/doc/utf.html
Could someone tell me how to losslessly convert ISCII (Indian Script Code for Information Interchange) text to Unicode? Are there programs out there to do it?
It's just a thought, but /usr/bin/iconv? (on Linux)
"Any charset can be losslessly converted to Unicode, as we'll see"
is completely false.
Unicode unifies different characters used in China/Korea/Japan into only one code point. This means that when a Korean sends a mail to a Japanese person, they will use the same char to mean different things. Think of getting an email from a friend where 'i' and 'e' were unified into only one char.
The TRON encoding contains nearly 200,000 different characters, and has support for 1,500,000.
But in the usual "everyone is a Euro-American" view, I guess Unicode is a good thing.
I've read the same thing (on the TRON pages). The TRON idea (let people decide what charset to use, and let it fit seamlessly into the system) seems superior to the "one charset to rule them all" approach of Unicode.
"This means that when a Korean sends a mail to a Japanese person, they will use the same char to mean different things"
Isn’t this language rather than characters?
Presumably your putative Korean and Japanese would not be attempting to read Korean as Japanese or Japanese as Korean? (European) languages often share words, and their meaning is not necessarily uniform.
This is the famous "bone problem" issue. There is a character, used to represent the word "bone", which greatly differs in appearance between Chinese and Japanese. For various nationalistic reasons, many people insist that there must be distinct code points for this. They argue that one can not distinguish them, or that an ugly character image may be used.
Well, that's like arguing that the different letter forms of 'a' should have different code points. Does your 'g' end in a hook or a loop? Does your 'y' have a curved or straight descender? Hey, they're different, so they should have different code points, right? What if you wanted to put both kinds of 'g' in the same document?
It may well be that a Chinese-style bone glyph is seriously jarring or even unreadable to a Japanese person. Oh well. Some people hate some fonts. News at 11, OK? There are many fonts that I hate too, so I don't use those fonts.
This logic doesn't sit well with some. It's taken as an insult that the "bone" character in one language could be equivalent to the "bone" character in another language. Oh well. Your nationalistic issues are your problem; the encoding is perfectly fine.
I'm no expert on Asian languages or alphabets, but it seems to me as though this "bone" problem is more of a font problem than an encoding problem.
I’m not sure if this was what you were trying to say or not.
If it's a single logical character (as 'a' in English and 'a' in French are a single logical character), then it should be the same code point. Unicode specifies what the text contains *logically*, not what the text should look like. Unicode apps need to have higher-level functionality to specify the language for each fragment of text. The app then must use that functionality to adapt the appearance of the text to the language in use. This would mean substituting stuff like the Japanese "bone" glyph when the text is marked as being in Japanese.
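To make the logical-character point concrete, here is a minimal Python sketch; the little "run" dictionaries with a lang field are purely illustrative, not any real API:

```python
import unicodedata

# U+9AA8 ("bone") is one logical character, whatever it looks like on screen.
bone = "\u9aa8"
print(unicodedata.name(bone))   # CJK UNIFIED IDEOGRAPH-9AA8
print(len(bone))                # 1 -- a single code point

# The language is separate, higher-level metadata that travels with the text;
# a renderer would use it to pick a Japanese- or Chinese-style glyph.
japanese_run = {"text": bone, "lang": "ja"}   # illustrative structure only
chinese_run = {"text": bone, "lang": "zh"}
assert japanese_run["text"] == chinese_run["text"]   # same code point either way
```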
I didn't know that UTF-8 was a subset of Unicode 🙂
Oh, the ever-famous U+9AA8, yet another Japanese troll.
See for yourself:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9AA8
The character rendered for zh, zh-tw, ja:
http://oku.edu.mie-u.ac.jp/~okumura/texfaq/japanese/bone.png
Notice that zh-TW and ja are right-left inverted relative to each other. Does this matter?
Honestly, I’ve never seen anybody really caring, *except* for some Japanese people. FWIW, I am a Korean.
I'll be the first to confess, I don't really know too much about Unicode, nor do I really know the details of this 'bone' issue, which I'll take as fact for now. But I think verbat does have a valid complaint. Whether it's solely Unicode's fault is another issue.
If Japanese/Chinese/Korean…have different letters that mean/look reasonably different, they should have been given their own encoding or separate letter codes for the ones that differ. We can't draw a direct parallel between European languages where English/French…all share a similar alphabet. French just adds a few accented letters and whatnot. Maybe a better comparison would be to compare English to Greek or Russian, where you can definitely see similar characters, but also different ones. I mean, would it be okay if they decided to use the letter 'a' in place of Greek's alpha, because they looked similar enough?
Bah, we need linguists in here to clear all these ifs and maybes up.
"We can't draw a direct parallel between European languages where English/French…all share a similar alphabet"
You could use the (handwritten) French figure one, which resembles an uncrossed seven. The French seem to use the international norm in print.
“Could someone tell me how to losslessly convert ISCII (Indian Script Code for Information Interchange) text to Unicode? Are there programs out there to do it?”
Please go through this doc http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf and you will see that the Unicode Standard encodes Devanagari characters in the same relative positions as those encoded in positions A0–F4 (hex) in the ISCII-1988 standard. So character conversion should be relatively easy. After that it's a matter of fonts.
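As a very rough sketch of what a table-driven ISCII-to-Unicode conversion could look like in Python (the two table entries are illustrative values read off the ISCII chart and should be verified against IS 13194; a real converter needs the full table plus handling for ISCII's ATR/EXT script-switch codes and nukta composition):

```python
# Hypothetical, partial byte-to-code-point table; fill in from the ISCII
# standard / Unicode chapter 9 before using it on real data.
ISCII_TO_UNICODE = {
    0xA4: "\u0905",  # DEVANAGARI LETTER A (assumed ISCII position)
    0xB3: "\u0915",  # DEVANAGARI LETTER KA (assumed ISCII position)
}

def iscii_to_unicode(data: bytes) -> str:
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                 # ASCII passes through unchanged
        else:
            out.append(ISCII_TO_UNICODE.get(b, "\ufffd"))  # U+FFFD for unmapped bytes
    return "".join(out)

print(iscii_to_unicode(b"\xb3"))  # क (if the table entry is right)
```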
“Bah, we need linguists in here to clear all these ifs and maybes up.”
Yeah, well, professionals can bicker and be petty too; it's not JUST the uninformed. Personally, I wouldn't mind seeing UTF-32 or some such adopted as a standard. That way, everyone's happy, and text files are only 4x bigger. (Think: most text files don't even fill their block on the hard drive.)
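For a rough feel of the numbers, a quick Python comparison of the same text in the three encoding forms (the sample strings are arbitrary):

```python
samples = {
    "English": "The quick brown fox",
    "Japanese": "\u9aa8\u306f\u56fa\u3044",   # a short kanji/kana phrase, all in the BMP
}
for label, text in samples.items():
    print(label,
          "utf-8:", len(text.encode("utf-8")),
          "utf-16:", len(text.encode("utf-16-be")),
          "utf-32:", len(text.encode("utf-32-be")))
# ASCII text quadruples in size under UTF-32; BMP CJK text goes from
# 3 bytes/char (UTF-8) or 2 bytes/char (UTF-16) to 4 bytes/char.
```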
You should understand that this is a far wider problem than handling CJK.
Unicode is a Eurocentric view of the world, imposed by companies, not by technology or logic.
This may be of interest to you:
http://tronweb.super-nova.co.jp/unicoderevisited.html
And I am European.
Chinese: Taiwan, Singapore, basically any place that isn’t ruled by the communists use the “traditional” characters. Mainland China uses “simplified” characters.
Korea: Has its own written (and almost completely phonetic) system called "Hangul", created in the 15th century.
Japan: also has a more modern written system than the traditional Chinese system.
Korea and Japan (as well as other Asian countries) use traditional Chinese characters in a similar fashion to the way the West might use Latin: official things, like titles in medicine, perhaps in some official documents, money, temples.
In other words… Koreans will use Hangul, Japanese will use their more modern system (the name escapes me now), mainland Chinese will use simplified, and other Chinese places will use traditional.
So there is no "bone" to pick with Unicode in the context of basic communication.
I am also Korean, but you seem not to know that there are three variations in CJK glyphs. Did you even try to study Japanese or Chinese????? Or do you even care about those glyphs?
There's Traditional Chinese (called Jung-Ja in Korea), which is used by Korea and Taiwan. Then there's Simplified Chinese (Gan-Ja), used by mainland China. And there is "Kanji" (Yak-Ja), which is an in-between simplification of Traditional and Simplified, used in Japan.
Also, think about all the combination possibilities of Hangul… it's roughly as many as Traditional Chinese.
YOU ARE WRONG. JAPANESE DOES NOT USE TRADITIONAL CHINESE. They use a simplified variation of Traditional Chinese (not as oversimplified as the Simplified Chinese used by mainland China), called Kanji in Japanese and Yak-Ja in Korean.
Unicode didn't unify traditional characters (jungja) and simplified characters (ganja). Let's get the facts correct. Most traditional/simplified characters have separate code points. Check this yourself with the Unihan DB.
http://www.unicode.org/charts/unihan.html
The "bone" issue is about variant characters (icheja), which is completely different. I think they are unified for good(TM). If you don't agree, here's some entertaining reading:
ISO/IEC JTC1/SC2/WG2 N2326 (2001-04-01, i.e., April Fools' Day)
N2326: Proposal to encode additional grass radicals in the UCS
http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2326.pdf
112 code points for grass radicals. Good luck ever finishing the encoding at all.
You are incorrect that Singapore uses traditional characters. A policy decision was taken a very considerable time ago to switch from those to the simplified characters, which have been in use and taught in schools for a considerable time.
There is also a romanised phonetic form referred to as hanyu pinyin, which is used for writing Chinese personal names in roman letters and for describing pronunciations, but it is not a language as such.
The real problem might be in the education. People in Japan learned to write certain chars a certain way (pilhoek), and Koreans/Taiwanese learned another way. When the differences collide, there might be some confusion, especially in Han-glyph-centric countries like Japan and Taiwan, etc… And those older-generation folks (and some younger ones) will care about it.
But you know, modern Korean people don't use Hanja anymore, and they don't give a damn whether it's Yakja or Jungja… to them, it is no big deal… (sigh)
Personally, it's sad to see Hanja forgotten in Korea…
"This logic doesn't sit well with some. It's taken as an insult that the "bone" character in one language could be equivalent to the "bone" character in another language. Oh well. Your nationalistic issues are your problem; the encoding is perfectly fine."
No, it's not that simple. The Chinese and the Japanese have simplified the "Kanji" (Chinese characters) several times in their history, independently, which has led to things like the originally identical character looking considerably different in the two languages. Sometimes you wouldn't even recognize the character if you didn't know. For many characters people know both the "traditional" and the revised form, but ever since the radical simplification of the Chinese characters it gets ever harder to recognize what those characters stand for if you are used to the Japanese way of writing Kanji.
This is more like, say, if the Germans still used Fraktur (check this URL if you don't know Fraktur: http://en.wikipedia.org/wiki/Fraktur ) and the English didn't, and then someone insisted that the two systems are essentially the same because they both use the Roman alphabet, so you can easily intermix them. The problem is that if you intermix the two, some people not used to Fraktur wouldn't be able to read a single word. And in most cases it would look so out of place that the whole text would get very hard to read.
It is interesting that the link you point to explicitly points out that Fraktur is a typeface, not a separate script. People have to get out of the way of thinking that says a Unicode string is more or less a representation of what should be on the screen. There is an enormous amount of processing that must be done before a Unicode string can be displayed. Part of that processing is generating glyphs from the string in a way that is presentable. Higher-level behaviors, such as what glyphs get shown for a logical code point, are left to higher-level layers.
Look at it this way: English and French use the same script. The string "donnez-moi le petit pain" is a perfectly valid string in the script used by English. Of course, it makes no sense to those who don't understand French (and very little to those who do). In order to properly understand the text, you need to know what language it's in. That's a higher-level concept than the underlying idea of a sequence of characters.
@verbat:
Thank you for the very good link http://tronweb.super-nova.co.jp/unicoderevisited.html
The article is not biased at all, because it backs itself up with hard data: because of the so-called surrogate pairs in newer versions of Unicode, Unicode is unnecessarily space- and computation-inefficient.
"Accordingly, the new and improved Unicode has essentially become an inefficient 32-bit character encoding system, since 94 percent of the grand total of 1,114,112 character code points (1,048,576) are encoded with 32-bit encodings [7]."
Indeed, why all this logical unification of letters if almost all characters would need 32 bits to be stored?
For me, it's insane that so-called computer scientists can come up with such a monstrosity. The TRON way is thus much better because it minimizes the space and computation required.
As the article says, the Basic Multilingual Plane (which has logical unification all over) is not enough for e.g. Japanese. The inefficient surrogate-pair technique has to be used, which ends up using too much space, and digitized Japanese documents would have to use unnecessarily more disk space and bandwidth than Euro-American ones.
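For concreteness, here is how a single supplementary-plane kanji (U+20B9F, picked here just as a representative character outside the BMP) comes out of the three encoding forms in Python; the surrogate pair shows up as two 16-bit units in UTF-16:

```python
import struct

ch = "\U00020B9F"                        # CJK UNIFIED IDEOGRAPH-20B9F
print(len(ch))                           # 1 -- one code point
print(len(ch.encode("utf-8")))           # 4 bytes
print(len(ch.encode("utf-16-be")))       # 4 bytes -- two 16-bit code units
print(len(ch.encode("utf-32-be")))       # 4 bytes

high, low = struct.unpack(">2H", ch.encode("utf-16-be"))
print(hex(high), hex(low))               # 0xd842 0xdf9f -- the surrogate pair
```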
Read the article at the above link; it's worth it.
@Rayiner:
You say "There is an enormous amount of processing that must be done before a Unicode string can be displayed".
It's because of this unification. If the simpler and more space-efficient approach of TRON were taken, this processing would not be needed.
Yep. The bifferemce detveem the chinese “bome” and the japanese “bone” is imsigmificamt. Vhy do those idiots keep on deimg such imconsiberate jerk-offs?
I have this remnant idea (probably wrong) that Unicode is unable to handle bon-ji (or bonji), and that it is some kind of Indian script (maybe Sanskrit). Care to shed some light?
I develop for a website that draws people mainly from the USA, Western Europe and Japan. It drives me crazy, because they all write in the guestbook, and it takes a lot of guessing to work out what happens with the characters. For instance: if I define the page to be UTF-8 (the server sends the page as UTF-8 and I include a meta tag), does this also control the forms in the page?
And then, how to store it in the database? I use SQL Server 2000, and use nvarchar and ntext, which use UCS-2 (which is a pain, because Enterprise Manager displays each byte as a character).
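For what it's worth, here is a small Python sketch of why the UCS-2 column looks like byte soup when a tool shows each byte as a character (it assumes the column effectively holds UTF-16-LE bytes, which is roughly what NVARCHAR stores):

```python
text = "\u9aa8"                        # one CJK character from the guestbook
utf16 = text.encode("utf-16-le")       # roughly what the NVARCHAR column stores
print(list(utf16))                     # [168, 154] -- two bytes for one character
print(repr(utf16.decode("latin-1")))   # '¨\x9a' -- the "characters" a byte-oriented viewer shows
print(text.encode("utf-8"))            # b'\xe9\xaa\xa8' -- what a UTF-8 page sends over the wire
```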
And then step 3, display the entries, again in a page specifying encoding in UTF-8. Well, results differ with different browsers and different language settings of the PC, so I have no idea what Japanese or Polish visitors may see.
Argh!
But here’s another excellent page if you need one:
http://www.cs.tut.fi/~jkorpela/chars.html
Talk about the Tower Of Babel!
Azureus creates file/directory names that look exactly the same as those from BitTornado, yet they are not the same. So when I try to switch between the two BitTorrent clients, I get two file/directory names that look identical but occupy different listings in my file manager (Total Commander).
I can read the file/directory names from Azureus directly in my file manager, but I need a CJK viewer like NJStar to read the file/directory names from BitTornado.
Most programs will not recognise the file names from Azureus, and opening these files will either crash the programs or produce a "file not found" error. I have to rename the files with romanised names before they can be used.
Most programs will recognise the file/directory names from BitTornado, but they need NJStar, otherwise they will display nonsense.
The weirdest thing is that NJStar will crash Total Commander when I delete file/directory names from Azureus. So whether or not to use NJStar is a headache.
AND VERY few Chinese websites use Unicode at all. Most of them use simplified GB (China/Singapore). Many use traditional Big5 (HK/Taiwan). Fortunately Firefox autodetects the character encodings very well, except that sometimes, when I see nonsense, I have to choose the encoding manually.
NOW, about typing in the CJK codes, I'LL NEED a whole bottle of aspirin before I can begin to explain the whole MESS. So I'll spare you the details.
"It's because of this unification. If the simpler and more space-efficient approach of TRON were taken, this processing would not be needed."
Not at all. Maybe for Asian languages, but many other scripts will still require large amounts of processing. Arabic and Indic scripts have all sorts of mandatory ligatures, context-sensitive substitutions, and context-sensitive repositioning. These things need to be done regardless of what the character encoding looks like. By comparison, the substitution of language-specific glyphs is an easy process.
That's precisely why the Unicode standard separates logical code points from displayable glyphs. If a character is the same underlying thing in two different languages, even if they look different, then they are the same character, and should be represented by the same code point.
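As a small illustration of that contextual shaping, take the Arabic letter beh: one logical code point, with several compatibility presentation forms that a shaping engine (not the user) normally selects. A quick Python check of the character names:

```python
import unicodedata

beh = "\u0628"                                   # the logical code point
print(unicodedata.name(beh))                     # ARABIC LETTER BEH

# Presentation-form code points for its contextual shapes; kept in Unicode for
# compatibility, normally produced by the font/shaping layer.
for cp in (0xFE8F, 0xFE90, 0xFE91, 0xFE92):
    print(hex(cp), unicodedata.name(chr(cp)))
# -> the ISOLATED, FINAL, INITIAL and MEDIAL forms of BEH
```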
Thanks, I certainly hadn't thought of context-sensitive substitutions in some languages :/. I'm a European.
"That's precisely why the Unicode standard separates logical code points from displayable glyphs. If a character is the same underlying thing in two different languages, even if they look different, then they are the same character, and should be represented by the same code point."
I don't understand your point. What if "d" looked like a "b" in Spanish (but was the same character)? Then you might receive:
“The bifferemce detveem the chinese “bome” and the japanese “bone” is imsigmificamt.”
Maybe, just maybe, they do have something to complain about after all.