The first time I learned about UTF-8 encoding, I was fascinated by how well-thought and brilliantly it was designed to represent millions of characters from different languages and scripts, and still be backward compatible with ASCII.[…]
Designing a system that scales to millions of characters and still be compatible with the old systems that use just 128 characters is a brilliant design. ↫ Vishnu Haridas
On a slightly related note, if you are ever bothered or annoyed by text online rendering as unknown squares, you most likely are just missing the proper fonts to render them. At least on most Linux and BSD systems, all you need to do is install the entire set of Noto fonts, including those for every single non-Latin script. Assuming your package manager has sane naming conventions, it’ll most likely come down to something like
sudo dnf install google-noto*
or whatever your system’s package install command is. After installing a whole slew of font files, your system will be able to render virtually every script under the sun.
After installing this massive font set, you can do things like write and render in hieroglyphics, write Ea-nāṣir’s name the way it’s supposed to be written, and render all kinds of other scripts and symbols without ever having to look at one of those blank squares again.
That’s helpful, since I’ve got this clay tablet of a one-star review, but whoever wrote it just put a bunch of squares.
It’s a happy coincidence that ASCII was originally designed to work on systems with 7-bit chars, a bit short of an 8-bit byte.
I wasn’t really around at the time but I think the reason ASCII targeted 7 bits was to be compatible with data networks that only offered 7 data bits + 1 parity bit.
https://shubmehetre.com/posts/why-ascii-uses-7-bits/
It’s the same reason encodings like uuencode and base64 (think email) restrict their output to characters that fit in 7 bits.
It provides a rather obvious solution for extending the character set using that unused bit. It’s quite rare for things like this to happen, but it’s nice that it did, because otherwise it would have been necessary to break compatibility.
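For what it’s worth, the role of that eighth bit can be sketched in a few lines of Python. This is only an illustration, and the helper name utf8_bits is made up for this example: it shows each UTF-8 byte in binary, so you can see that ASCII bytes keep the top bit at 0 while every byte of a multi-byte sequence has it set to 1.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-digit binary strings.

    Hypothetical helper, for illustration only.
    """
    return [f"{byte:08b}" for byte in ch.encode("utf-8")]


for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", " ".join(utf8_bits(ch)))

# Plain ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "hello".encode("ascii") == "hello".encode("utf-8")

# Every byte of a non-ASCII character has the high bit (0x80) set,
# so it can never be mistaken for an ASCII byte.
assert all(byte & 0x80 for byte in "é€😀".encode("utf-8"))
```

Old software that only understands 7-bit ASCII still reads the ASCII parts of a UTF-8 file correctly, which is the compatibility being praised here.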
Alfman,
I would even go so far as to say: the Internet was a product of happy accidents.
If we were to “design” it, it would be a clunky, much less useful system. And more locked down than the Chinese version of TikTok.
sukru,
I’d say the internet inherited a lot of issues because it was designed for much simpler networks. The most obvious being the 32-bit address space, which still holds many of us back even decades after IPv6. It’s not the only thing that hasn’t aged well: small MTUs needlessly create inefficiencies and place a huge strain on backbone routers. Obviously most payloads are far larger than 1.5 KB.
Engineers could clearly do better today given how much we’ve learned. However, I agree: back when it was designed, the internet wasn’t built by corporations with malicious intent. Today that’s not really a given; corporations have become obnoxiously proficient at manipulating hardware & software standards for their own agenda.
I always look at email and am turned off by how poorly it’s aged. There’s a lot of legacy bloat and hacks, and it’s a nuisance because of it. I think most admins would agree that a redesign would be very helpful, except that if the modern tech giants had their way, email would be replaced with closed non-federated networks that they control. It’s unfortunate but this happens a lot when older more open & federated technologies get replaced.
Even though engineers can make better standards, the reality is it’s up to mega corporations to use the standards and make them popular. Otherwise a standard doesn’t really work without a substantial user base. I’m often critical of Visa and Mastercard, which are another example of corporations being responsible for some of the most insecure standards in e-commerce. It would be easy even for individuals like me to create better, cryptographically secure standards, but the bottleneck isn’t with engineering so much as adoption. Unless you are a giant yourself with a large user base already, like Apple convincing stores to take Apple Pay, developing a superior standard doesn’t guarantee relevancy.
Email has aged beautifully and gracefully compared to Usenet. I didn’t realize what a shitpile of protocols that was until I tried standing up a moderated group.
UTF-8 is NOT a brilliant design. Yes, it’s a very good design, but I think “brilliant” is overstating it.
Early versions of Unicode represented every character as a 16-bit value (UCS-2). I believe early versions of Windows NT were designed to support UCS-2; variable-length encoding wasn’t in the early plans.
Then they realised that characters would be represented by a base code point plus combining character code points, i.e. Unicode would never have fixed-width characters.
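To make the base-plus-combining point concrete, here is a small sketch using Python’s standard unicodedata module. The variable names are just for this example. It compares “é” written as one precomposed code point with “é” written as a plain “e” followed by a combining accent.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
combined = "e\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

# Same character on screen, different number of code points.
print(len(precomposed), len(combined))

# Different byte counts in UTF-8 as well.
print(len(precomposed.encode("utf-8")), len(combined.encode("utf-8")))

# A naive comparison treats the two spellings as different strings...
print(precomposed == combined)

# ...until both are normalized to the same form (NFC composes them).
print(unicodedata.normalize("NFC", combined) == precomposed)
```

So even a fixed-width encoding of code points would not give you fixed-width characters as a reader sees them.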
16-bit Unicode should never have been invented, because you have to worry about byte ordering.
They should have realised from day 1 that a single character can be represented by a different number of bytes, and they should have only defined UTF-8 in the standard.
UTF-16 surrogate pairs are an abomination, a really nasty design to support 16-bit characters that never should have existed.
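For anyone who hasn’t run into them, this is roughly how a surrogate pair is built. It’s a minimal sketch of the UTF-16 arithmetic, with a made-up function name (to_surrogates), checked against Python’s own UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000     # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)    # U+1F600, the 😀 emoji
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in big-endian UTF-16 encoder.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert "\U0001F600".encode("utf-16-be") == expected
```

Code that assumes one 16-bit unit per character breaks on exactly these pairs, which is a large part of why they are so disliked.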
They also went overboard deciding there needs to be 1 million+ possible code points (similar to how IPv6 grossly overestimated the number of required IP addresses). The latest version of Unicode, 17.0, has 159,801 assigned characters. The fact is they should have limited Unicode to 18 bits = a maximum of 262,144 code points. I could write a much improved Unicode 2.0, but of course it will never be supported 🙂
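The numbers in this argument are easy to check. The sketch below takes the 159,801 figure from the comment above at face value (it is not verified here) and compares it with an 18-bit limit and with the actual ceiling of U+10FFFF. As far as I can tell, that ceiling is simply what UTF-16 can address: the 16-bit base range plus 1,024 × 1,024 surrogate combinations.

```python
ASSIGNED = 159_801            # figure quoted in the comment, assumed correct

proposed_limit = 2 ** 18      # the 18-bit code space suggested above
max_code_point = 0x10FFFF     # highest code point Unicode allows
code_space = max_code_point + 1

# UTF-16 reach: 2**16 directly encoded values plus 1024 * 1024 surrogate pairs.
utf16_reach = 2 ** 16 + 1024 * 1024
assert code_space == utf16_reach == 17 * 2 ** 16

print(proposed_limit)               # size of an 18-bit code space
print(code_space)                   # size of the actual code space
print(max_code_point.bit_length())  # bits needed for the largest code point

# Share of each space already used by the quoted figure.
print(f"{ASSIGNED / proposed_limit:.1%} of an 18-bit space")
print(f"{ASSIGNED / code_space:.1%} of the actual space")
```

By this count an 18-bit space would already be roughly 61% full, against about 14% of the space Unicode actually has.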
tom9876543,
To be fair, byte ordering matters with UTF-8 as well; it’s just more obvious that you’re using network byte ordering with UTF-8 when reading one byte at a time. There’s no reason programmers can’t use the exact same method to read UTF-16 one byte at a time, and it would work just fine there too. However, it seems silly to read UTF-16 one byte at a time instead of two. The issue, of course, is that x86 notoriously went its own way with byte ordering.
You could get away with less, but it’s about future proofing and not risking the same mistake again. In computer engineering we have a history of not leaving ourselves enough space and paying compatibility costs later. Like hard disks that outgrew BIOS & DOS’s ability to address them. Or like the Y2K problem, the upcoming year 2038 problem, etc.
https://en.wikipedia.org/wiki/Year_2038_problem
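As a quick illustration of the 2038 case, assuming the usual setup of a signed 32-bit counter of seconds since 1 January 1970 UTC, the last representable moment works out like this:

```python
from datetime import datetime, timedelta, timezone

MAX_INT32 = 2 ** 31 - 1  # largest value a signed 32-bit integer can hold

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_moment = epoch + timedelta(seconds=MAX_INT32)

# One second after this, a signed 32-bit time counter overflows.
print(last_moment.isoformat())
```

Nobody picking 32 bits in the 1970s expected the format to still be in use at that point, which is the same kind of trap being described here.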
18-bit code points seem like a lot, but keep in mind these aren’t perfectly packed, so consumption can go up faster than you might think. IMHO it’s beneficial to give ourselves more room. That said, while UTF-8’s killer feature was compatibility with ASCII, UTF-8 isn’t really optimal for storage. For better or worse, the encoding is biased towards Latin characters. This could have been solved with an alphabet selector or even a simple compression scheme that stores characters in even fewer than 8 bits.
Alfman, sorry, but “byte ordering matters with UTF-8” is wrong. “There’s no reason programmers can’t use the exact same method to read UTF-16 one byte at a time and it would work just fine there too.” – that is incorrect. The programmer has to determine whether it’s big endian or little endian. Unicode has a special character for it (the byte order mark, U+FEFF).
Endianness is only an issue for multi-byte data. Google search AI result literally says “Endianness does not apply to single-byte data.”
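The disagreement is easy to see in practice. Here is a small sketch with Python’s built-in codecs (the helper name hex_bytes is invented for this example): UTF-8 produces one fixed byte sequence, UTF-16 produces two possible ones, and the byte order mark is what tells a reader which of the two it has.

```python
import codecs


def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated hex.

    Hypothetical helper, for illustration only.
    """
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


sample = "A€"

print("utf-8    ", hex_bytes(sample, "utf-8"))      # one possible byte order
print("utf-16-be", hex_bytes(sample, "utf-16-be"))  # big endian
print("utf-16-le", hex_bytes(sample, "utf-16-le"))  # little endian

# With a byte order mark in front, the generic "utf-16" decoder
# works out the byte order by itself.
big = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
assert big.decode("utf-16") == little.decode("utf-16") == sample

# Without the mark, guessing the wrong order silently yields other characters.
assert sample.encode("utf-16-be").decode("utf-16-le") != sample
```

The UTF-8 bytes are defined as a sequence, so there is nothing to swap; the UTF-16 units are 16-bit numbers, and those can be stored either way round.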
“18-bit seems like a lot but consumption can go up faster than you might think.” – wrong again. The latest Unicode standard from this year has about 160,000 code points as per Wikipedia. In 1990, the Unicode Consortium should have been able to list all of the different languages and how many characters they have, and come up with an approximate number of required code points (with extra buffer room added as well). It seems they just decided 1 million would be a good number with no scientific reasoning.