The Unicode Consortium has launched a very controversial project known as Han Unification: an attempt to create a limited set of characters shared by the so-called “CJK languages” (Chinese, Japanese, and Korean). Instead of recognizing these languages as having their own writing systems that share some common ancestry, the Han unification process views them as mere variations on some “true” form.
To help English readers understand the absurdity of this premise, consider that the Latin alphabet (used by English) and the Cyrillic alphabet (used by Russian) are both derived from Greek. No native English speaker would ever think to try “Greco Unification” and consolidate the English, Russian, German, Swedish, Greek, and other European languages’ alphabets into a single alphabet. Even though many of the letters look similar to Latin characters used in English, nobody would try to use them interchangeably.
Pretty damning explanation of how some of the most popular languages in the world are treated as second class citizens by the Unicode Consortium. Not coincidentally, this consortium is pretty much entirely run by American and European men and (a few) women.
The problem with Han unification is that Chinese has over 50,000 characters. Japanese and Korean would be pushed aside as second-class citizens if we prioritized based on how widespread a language is.
And the Japanese and Koreans would never stand for that.
If you count every character that was ever used in the history of China, then yes. However, only a few thousand characters are actually being used today.
How so?
1. The Dai Kan-Wa Jiten contains about 50,000 characters, so at least Japanese should be covered.
2. Most of the kanji are the same. Obviously they would add the ones that are present in one language, but not the others.
Also, they cannot base it on modern Chinese, because simplification mapped a set of traditional characters onto a smaller set of modern ones, so not even the Chinese will be satisfied.
Actually, I am not really sure what is wrong with this idea. Unless they use this to include only one version of a character; then it is really bad. Is that the case?
No, the total number of characters ever used is around 90,000. Even if only a few thousand characters are used regularly, some characters have niche uses.
If you read the article, the author explained the need to encode really rare characters.
QR codes have a mode ONLY for Japanese characters. About the size of Emoji.
But the point is well taken – in his case, Bengali, they would have to add a few new characters, and that should be done before things like Emoji.
Maybe if we still wrote in APL…
Many Bangladeshis I have encountered use Sylheti, and it is, to Western eyes, peculiar in that letters are overprinted on each other to form syllables or whole words. The Chinese languages are a different challenge.
The challenges that this guy would have us face are as large as a switch from pixel-based imagery to vector-based. In a Sinocentric world, Unicode may be replaced and the West may spend a while catching up.
It’s kind of dumb, IMO, that we’re pretty much all connected now, yet we have all these different languages. I wish the world could standardize on one language going forward. It wouldn’t have to be English, and in fact probably shouldn’t be. I’d learn whatever (new) language everyone decided on.
On the other hand, I suspect translation tools will get so good one day, that it might not even matter.
I think it would probably have to be English. It’s the only language that has permeated the globe; there are thousands or even millions of English speakers in almost every country. Mandarin and Hindi probably have more speakers by sheer numbers, but those speakers are concentrated in their respective countries, and their languages have tended not to extend much beyond their borders.
Even in India, English is dominant… in a country with many different languages, it’s the only one that’s common across the whole.
We looked to sell software there a few years ago, and there was no desire to do a local translation – prospective customers felt it was better to use English than to spend time translating the system into Hindi, Telugu, and others.
As a wonderful Hindi woman I once worked with explained to me “English is the language of business, so if you want to do international business? You learn it”.
So English would probably be the best choice: it has the most software support, probably the most speakers worldwide, and from what I have seen it’s a lot easier to put on a keyboard than most others.
Lots of less educated people don’t speak English. Of course, the target for your software would be the more educated people, so English-only is probably acceptable if supporting multiple languages would have high costs associated with it.
Except given China’s growth in all measures, learning Chinese will be required for doing business with Chinese companies.
All together now:
“Take my love, take my land. Take me where I cannot stand. I don’t care, I’m still free. You can’t take the sky from me.”
I really, really hope not. English phoneme and syllable pronunciation is based on tradition; there are no general rules for it. It is quite primitive.
That creates the quite strange situation of knowing how to write a word but not how to pronounce it.
Also, it has no accent marks and cannot represent intonation graphically, again resorting to tradition to resolve that.
Anyone who has visited different parts of the USA or Great Britain knows this; it causes a bit of confusion.
It is good as a second language, being grammatically simple and having a compact symbology, but that’s it.
Esperanto the commies will say!! xD (read with Bioshock Infinite voice)
I’m joking; American English is a “de facto standard” and that’s OK, it’s a pretty good tool for the job. We don’t need to reinvent the wheel.
Sadly, languages are very tied up with politics and nationalism and all that stupid shit… so it will be very difficult to standardize on something (even more difficult if US leadership continues to decline).
Hi,
American English is only a “de facto standard” in a small part of the world where people can’t even capitalise proper nouns. The English used by all other countries is English. 🙂
– Brendan
Well.. yeah… nevermind.
When the wise man points to the moon, the fool looks at the finger.
When a wise man points at a Moon, a wiser man looks away.
I hope not; the English language has pretty much the most ridiculous orthography you can find anywhere.
Oh I don’t know. There’s always French…
I vote for Asimov’s Galactic Standard.
Let me venture a wild, wild guess: you are a native English speaker?
Hey wait, I had an even better idea. Let’s standardise on one computer language. It doesn’t have to be Pascal.
If we did that, I’d vote for a constructed trade language to keep irregularities and language gotchas to a minimum. Any native language chosen would incite annoyance in those who don’t speak it and have to learn it, and would contain any number of irregularities and pitfalls that would further increase frustration. Constructed languages can be learned faster, and wouldn’t cause many complaints of favoritism toward a certain group of people or country.
Of course, we can make the same argument for technology. We have all these interconnected computers, and not one modern universal filesystem to share devices amongst them all (no, FAT32 does not count).
Look up Esperanto: http://eo.wikipedia.org/wiki/Esperanto
You can’t read it? Well, that is because that language went nowhere and basically proved that you cannot invent a new language. (The only exception I know of is the Korean script, Hangul, but that is an invented script, not an invented spoken language.)
Then again, Singapore basically proved that forcing people to switch to English is great for your economy.
It’s a pretty low statement to add that “the consortium has very few women”. It’s low, because sex has nothing to do with character coding; it’s just a show of rampant sexism in reverse.
Regardless, if you bothered to check, there are also members like the governments of India and Bangladesh represented there. And anyone (really, anyone) can join.
So Thom, just because you see names that you don’t like doesn’t mean that it’s sexism or racism or some sort of Western superiority at work.
I agree, but please don’t … he will punish us with another gamergate article.
Can’t figure out why you were modded down for that question. I don’t see what gender has to do with this issue either. Whether Unicode is run by males, females, or little green men doesn’t change the issue nor the difficulties it will likely create.
I took that as women having something innate against pictograms, but didn’t dare to write it as it seemed politically incorrect.
Clicks, clicks, clicks…
Add a reference to gender equality in your article, and it automatically gets 23.56454% more clicks[1]
Gender equality is for the 2010s what “green” was for the 2000s.
[1]Number may not be entirely correct. 30% of scientists say 50% of all statistics are fraudulent.
I’m not outraged by the idea. If it was reasonably readable, then the possibility of font substitution would allow me to read documents in my language on a device that didn’t have fonts installed for my language.
I learned all these languages to some extent (not that I really speak them) and also some Chinese and Japanese.
The first group of languages actually shares quite a few characters, and I do not see a reason why e.g. ‘x’ or ‘o’ would need a separate code for each of these languages.
The same goes for Japanese and Chinese (I have no clue about Korean). The tree symbol 木 (http://en.wiktionary.org/wiki/%E6%9C%A8) could be one code slot for all scripts.
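In fact, for 木 this is already the case: Han unification assigned it a single code point, U+6728, and the language-specific glyph style is left to the font. A minimal Python sketch (standard library only) to confirm:

```python
import unicodedata

# 木 "tree" is one unified code point shared by Chinese, Japanese
# and Korean text; which glyph style you see depends on the font.
tree = "\u6728"
print(tree, f"U+{ord(tree):04X}", unicodedata.name(tree))
# prints: 木 U+6728 CJK UNIFIED IDEOGRAPH-6728
```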
Yes, if you use hairdryer or toaster as your device then you may not get all the characters …
Yes, fair enough. I guess I was half thinking of Latin and related characters, and half thinking of Han related characters, and it doesn’t really fit together.
I do actually run into an issue with macrons sometimes, with fonts not supporting them. It’s not really the same, because there is a font to fall back on, but I wonder if it could have been less of an issue if Unicode had avoided LETTER WITH ACCENT code points in favour of just the COMBINING ACCENT ones. (Although I also realise that Unicode was aiming for compatibility with existing encodings here.)
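For what it’s worth, the precomposed and combining forms are canonically equivalent, so software can normalize between them. A minimal Python sketch (standard library only), using a macron as the example:

```python
import unicodedata

# "ā" exists both as a precomposed code point and as "a" followed by
# U+0304 COMBINING MACRON; they are distinct code point sequences but
# canonically equivalent under Unicode normalization.
precomposed = "\u0101"   # ā, LATIN SMALL LETTER A WITH MACRON
combining = "a\u0304"    # a + combining macron

print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```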
Well… where I live (Belgium), English is very widely used in communications with people from foreign countries, which means it is used a LOT. Drive 1 hr in any direction and you will be in a foreign country.
My native language is Dutch, but most of us also speak French.
We are used to speaking English, and I’m sure Thom will relate to this.
Having to speak English to communicate with foreign people is no big deal, and it works well. However, changing the codepage, and taking away the very characters used to express yourself in your native language is another thing entirely. IMHO, it is just not acceptable.
The arrogance of the original article is pretty obvious to me. It whines at length about the necessity of using CGJ for Bengali input, which is pretty much dictated by the complexity of the script itself. The criticism of the Han unification project is even less sound. The author compares the CJK languages to Latin and Cyrillic, which don’t actually share symbols at all. Instead it would make sense to compare the CJK languages to languages using the Latin alphabet (e.g. Polish and Dutch), so that the comparison becomes fair: on either side there would be languages using a common subset of letters derived from a single source, alongside their own unique letters developed on top of that source.
The only seed of reason in that article is the difficulty of inputting complex scripts on traditional PCs. Again, the main problem here is not the limited number of keys on a common PC keyboard, but limited support for input methods (IMEs) and the complexity of the problem at hand. This problem may be solved via thoughtful engineering (e.g. the aforementioned Han unification project of the Unicode Consortium), education of Western software engineers, and dedicated effort by native-speaker groups, but absolutely not by nationalist whining and misrepresenting a technical problem as a political one.
Ultimately, if Unicode is that bad, why is nobody pushing a sane alternative? The very fact that Chinese, Japanese, Korean and Indian phone manufacturers don’t push their own alternative to Unicode already shows that this whining is utter nationalistic nonsense.
The stupidity of this logical fallacy makes you a good candidate for a 3-year free membership in the Unicode consortium.
What exactly is a logical fallacy here?
The author compares the CJK languages to Latin and Cyrillic, which don’t actually share symbols at all.
What do you mean by that? “AaCcEeIi*OoPpXx” are exactly the same in both, as well as “BKMHT” (upper case) and “y” (lower case).
*Cyrillic is not the same as modern Russian.
Actually, the relation is reversed here: these glyphs came to mimic Latin letters, but they didn’t always look that way. This is pretty much the opposite of the CJK languages’ situation.
these glyphs came to mimic Latin letters, but they didn’t always look that way
Under Peter the Great, they were indeed adjusted to be more like Latin ones, but they hadn’t looked that much different before.
Anyway, this doesn’t change the fact that Latin and Cyrillic share quite a few letters.
Depends on the definition of “much”. They were different enough to inhibit the reuse of type in early Slavic typesetting.
You get this impression because you cherry-pick glyphs. The only Cyrillic letters that truly share glyphs with Latin are “І”, “Ј”, “О”, “Р”, “С” and “Х” (6 letters). The distinct look of “Д” and “Е” was mostly weaseled out by modern typography, though some Slavic languages retained it until the early XX century, and one may still find a distinctly Slavic look for these letters in common print in some regions, e.g. in the Balkans. Other letters were never truly the same between the Cyrillic and Latin scripts. (Mind you, italics and handwriting also matter here.)
And now compare this to the CJK languages to actually see how utterly nonsensical the whole comparison is.
I agree. It does come off as rather whiny. He never explains the problem with spelling his own name; we just have to take his word for it. The other issue he describes has been solved: ৎ has been given its own code point. It’s really just a word-final form of the letter ত “ta” with the vowel suppressed. It’s not that it was previously missing; it’s just that this specific form was a bit more cumbersome to write.
A short explanation might be in order. (Disclaimer: I’m not an expert. Please correct me if I’m wrong.) The Bengali writing system is an abugida, which means that each consonant has an inherent vowel which can be modified using different diacritics. Some abugidas simply add a diacritic that suppresses the vowel when they need to form a consonant cluster. Others, like Bengali, combine elements from the consonants into complex ligatures.
From what I can gather, this is what the name Aditya is supposed to look like: আদিত্য. The problem seems to be with the last syllable “tya”. To produce it one has to write ত “ta” + ্ (suppress vowel) + য “ya”. In this case the letters don’t really combine into one; rather, the য “ya” changes shape, making it look like a separate letter. I don’t know if the point he’s trying to make is that the combining “-ya” should have its own code point.
For completeness’ sake I’ll answer my own post.
As far as I understand, the alternate form ্য of the letter য has more uses than just as a member of a consonant conjunct. It can also be combined with a vowel to form some special or foreign vowel sounds. (I’d compare it to the German umlaut marks: ä, ö, ü.) One still has to use the vowel-suppressing mark to produce the ্য, which makes it really counter-intuitive.
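To make the encoding concrete, here is a minimal Python sketch (standard library only) listing the code points behind আদিত্য; the virama (hasanta) between ত and য is what triggers the conjunct “-ya” form:

```python
import unicodedata

aditya = "\u0986\u09A6\u09BF\u09A4\u09CD\u09AF"  # আদিত্য
for ch in aditya:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# U+0986 BENGALI LETTER AA
# U+09A6 BENGALI LETTER DA
# U+09BF BENGALI VOWEL SIGN I
# U+09A4 BENGALI LETTER TA
# U+09CD BENGALI SIGN VIRAMA
# U+09AF BENGALI LETTER YA
```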
As a person who speaks Korean, I’m all for the Han unification as long as it completely unifies all the Han derivations into one.
These will definitely offend a lot of Japanese. But for Koreans, maybe not that much.
That’s because you don’t use the characters.
To you, even if you have Hanja-compatible names, it’s like getting a Chinese tattoo. You also have only a handful of repeated surnames based on some Chinese warlord you don’t care about.
Japanese and Traditional Chinese characters and shapes have been evolving for millennia. A variant character, even within the languages, is meaningful and has sentimental value. Many variations have existed since the very beginning, and people throughout the centuries have chosen one or another to name themselves and the places they inhabited. Common people didn’t have surnames originally, but when they chose one, they took all that baggage with them.
Would you spit on your ancestors’ name?
Would you enjoy seeing your name spelled as Rapist when using a slightly different font because the character for Dry wasn’t considered unique enough by the Unicode committee?
The only thing that can be salvaged from Unicode is UTF-8 and only the original form before it got dried by UTF-16 groupies.
Dude, that sounds very racist. And using the word “rapist” in an IT-related forum is very rude. I’m not even a Korean. (I’m a Canadian.) But I also speak French.
But seriously. What’s with Japanese people’s hate on Korea?
Anyways, I do care for China and its leadership in technological standardization for the whole East Asia. China’s leadership will obviously benefit both Japan and Korea in the long run.
P.S. Actually, Koreans do use Chinese characters, in academia and specialized contexts. I know my Japanese clients say that Koreans are “out of touch” with Chinese characters, but this is not true.
But seriously, we are talking about CJK languages here, which means that Koreans do use Chinese characters (just not as much as the Japanese).
We had several different ways of writing Latin letters. All of those have been forcibly unified, with the differences in style delegated to fonts. How is that different? Except that it happened over 100 years ago?
Not quite what he said, re-read it. He said we wouldn’t consider unifying Latin/Cyrillic/Greek into one, despite that all of these have descended from the Greek alphabet.
Have a look at the insular letters in the Latin Extended-D block. These were apparently deemed significant and different enough to justify their own code points.
http://en.wikipedia.org/wiki/Insular_script#Unicode
http://en.wikipedia.org/wiki/Latin_Extended-D
I’m not sure this is quite as ridiculous as it first appears. I think at least some of these forms were used only in certain circumstances (the start of words?) and the more familiar forms were used otherwise. This being the case, if these letters were encoded purely as the same code point, then you’d either need a supplementary Insular font for every font you wanted to use, and switch between the regular font and the Insular one within words, or you’d need a renderer smart enough to pick the desired appearance from the context, and you wouldn’t be able to refer to these forms out of context (e.g. when discussing their use). Both of these seem problematic. I think it might have been better to use the same code point plus a modifier though (and same for many of the other characters in that block).
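To illustrate, a minimal Python sketch (standard library only, assuming the standard Unicode character names for the insular letters) showing that the insular forms really are separate code points:

```python
import unicodedata

# Insular forms got their own code points in Latin Extended-D rather
# than being folded into the ordinary Latin letters they correspond to.
for name in ("LATIN SMALL LETTER INSULAR D",
             "LATIN SMALL LETTER INSULAR R",
             "LATIN SMALL LETTER INSULAR S"):
    ch = unicodedata.lookup(name)
    print(ch, f"U+{ord(ch):04X}", name)
```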
“No native English speaker would ever think to try “Greco Unification” and consolidate the English, Russian, German, Swedish, Greek, and other European languages’ alphabets into a single alphabet.”
Funny that that’s exactly what Unicode does for most European languages.
Just take a look at this site:
http://unicode-table.com/en/
On the right side you can even see which part applies to which country.
As regards the news itself, Unicode is broken for logographic/syllabary scripts (“Asian moonspeak” in common parlance).
Last time I checked, Chinese, Japanese and Korean keyboards did not have 50,000 keys. See, the characters are made of “strokes”, which are much fewer in number (fewer than 300 or so).
So it would have been possible to store the strokes and some control characters, and then “construct” the characters during display. But backward compatibility called, and it wants every single character to map to a single code point (number).
Imagine if Unicode tried to store every possible syllable from the Greek, Russian, English etc. languages and added new code points every time a new syllable was added to the language. Yes, that’s what Unicode does with Asian moonspeak.
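Interestingly, Unicode does include a related (descriptive, not rendering) mechanism: the Ideographic Description Characters (U+2FF0–U+2FFB), which spell out how an ideograph is composed from parts. A minimal Python sketch (standard library only):

```python
import unicodedata

# ⿰ (U+2FF0) means "compose left-to-right": 林 "grove" is visually
# 木 "tree" placed beside 木 "tree". Renderers are not required to
# draw the composition; the sequence merely describes it.
ids = "\u2FF0\u6728\u6728"  # ⿰木木
composed = "\u6797"         # 林

print(ids, "describes", composed)
print(unicodedata.name("\u2FF0"))
# IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
```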
PS: Okay, the above is not 100% correct, but good for perspective.