The first time I learned about UTF-8 encoding, I was fascinated by how well thought out and brilliantly it was designed to represent millions of characters from different languages and scripts while still being backward compatible with ASCII.
[…] Designing a system that scales to millions of characters and still stays compatible with the old systems that use just 128 characters is a brilliant design.
↫ Vishnu Haridas
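For the curious, here’s what that backward compatibility looks like in practice, as a quick Python sketch (the sample strings are arbitrary): plain ASCII text produces exactly the same bytes under UTF-8, while anything outside ASCII becomes a multi-byte sequence whose bytes all have the high bit set, so they can never be confused with ASCII characters.

```python
ascii_text = "Hello"
print(ascii_text.encode("ascii").hex(" "))   # 48 65 6c 6c 6f
print(ascii_text.encode("utf-8").hex(" "))   # 48 65 6c 6c 6f -- identical bytes

mixed = "Hëllo, 世界"
for ch in mixed:
    b = ch.encode("utf-8")
    # Non-ASCII characters take 2-4 bytes, and every one of those bytes is >= 0x80.
    print(f"U+{ord(ch):04X} {ch!r} -> {b.hex(' ')}")
```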
On a slightly related note, if you are ever bothered or annoyed by text online rendering as unknown squares, you are most likely just missing the proper fonts to render it. At least on most Linux and BSD systems, all you need to do is install the entire set of Noto fonts, including those for every single non-Latin script. Assuming your package manager has sane naming conventions, it’ll most likely come down to something like sudo dnf install google-noto* or whatever your system’s package install command is, and after installing a whole slew of font files, your system will be able to render virtually every script under the sun.
After installing this massive font set, you can do things like write and render hieroglyphics, write Ea-nāṣir’s name the way it’s supposed to be written, and render all kinds of other scripts and symbols without ever having to look at one of those blank squares again.

That’s helpful, since I’ve got this clay tablet of a one-star review, but whoever wrote it just put a bunch of squares.
It’s a happy coincidence that ASCII was originally designed to work on systems with 7-bit chars, one bit short of an eight-bit byte.
I wasn’t really around at the time, but I think the reason ASCII targeted 7 bits was to be compatible with data networks that only offered 7 data bits + 1 parity bit.
https://shubmehetre.com/posts/why-ascii-uses-7-bits/
It’s the same reason network encoding protocols like uuencode/base64 (think email) only use 7 bits.
It provides a rather obvious solution for extending the character set using that unused bit. It’s quite rare for things like this to happen, but it’s nice that it did, because otherwise it would have been necessary to break compatibility.
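To illustrate that 7-bit cleanliness, here’s a quick Python sketch (the payload is just random sample bytes): base64 maps arbitrary binary data onto a 64-character subset of ASCII, so every output byte fits in 7 bits and survives a link that strips or repurposes the eighth bit.

```python
import base64
import os

payload = os.urandom(16)                 # arbitrary binary data (random sample)
encoded = base64.b64encode(payload)

print(encoded)                           # only A-Z, a-z, 0-9, '+', '/' and '=' appear
print(all(b < 0x80 for b in encoded))    # True: every byte is 7-bit safe
```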
Alfman,
I would even go so far as to say: the Internet was a product of happy accidents.
If we were to “design” it, it would be a clunky, much less useful system. And more locked down than the Chinese version of TikTok.
sukru,
I’d say the internet inherited a lot of issues because it was designed for much simpler networks. The most obvious is the 32-bit address space, which still holds many of us back even decades after IPv6. It’s not the only thing that hasn’t aged well: small MTUs needlessly create inefficiencies and place a huge strain on backbone routers. Obviously most payloads are far larger than 1.5KB.
Engineers could clearly do better today given how much we’ve learned. However I agree, back when it was designed, the internet wasn’t built by corporations with malicious intent. Today that’s not really a given, corporations have become obnoxiously proficient at manipulating hardware & software standards for their own agenda.
I always look at email and am turned off by how poorly it’s aged. There’s a lot of legacy bloat and hacks, and it’s a nuisance because of them. I think most admins would agree that a redesign would be very helpful, except that if the modern tech giants had their way, email would be replaced with closed, non-federated networks that they control. It’s unfortunate, but this happens a lot when older, more open and federated technologies get replaced.
Even though engineers can make better standards, the reality is that it’s up to megacorporations to use the standards and make them popular; a standard doesn’t really work without a substantial user base. I’m often critical of Visa and Mastercard, which are another example of corporations being responsible for some of the most insecure standards in e-commerce. It would be easy even for individuals like me to create better cryptographically secure standards, but the bottleneck isn’t with engineering so much as adoption. Unless you are a giant yourself with a large user base already, like Apple convincing stores to take Apple Pay, developing a superior standard doesn’t guarantee relevancy.
Email has aged beautifully and gracefully compared to Usenet. I didn’t realize what a shitpile of protocols that was until I tried standing up a moderated group.
runciblebatleth,
Haha, yeah there are a lot of standards in a similar boat. By agreeing to replace them, we risk our flawed but open standards being replaced by a company pushing something more closed and proprietary 🙁
Email is actually awesome.
At least SMTP (and POP / IMAP).
So many times I have used telnet to debug my setup. Even today, it is possible to connect to Gmail’s servers directly to chat with them (and possibly send email).
The same old commands like MAIL FROM / RCPT TO have survived for decades:
https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol#SMTP_transport_example
(And yes, 7-bit ASCII helps a lot here. We can dump attachments in uuencode, but I have never tried that personally)
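For anyone who wants to poke at it, here’s a minimal Python sketch of that exchange over a raw socket. The host and addresses are hypothetical, and most real servers will want STARTTLS and authentication before accepting mail, so treat this as a way to look at the greeting and the classic commands rather than a working mailer.

```python
import socket

HOST, PORT = "mail.example.com", 25      # hypothetical MTA; substitute your own

def chat(sock, line=None):
    """Send one SMTP command (if given) and print the server reply."""
    if line is not None:
        print(">>>", line)
        sock.sendall(line.encode("ascii") + b"\r\n")
    print(sock.recv(4096).decode("ascii", "replace").rstrip())

with socket.create_connection((HOST, PORT), timeout=10) as s:
    chat(s)                              # 220 greeting
    chat(s, "EHLO client.example.org")
    chat(s, "MAIL FROM:<alice@example.org>")
    chat(s, "RCPT TO:<bob@example.com>")
    chat(s, "QUIT")
```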
sukru,
Most providers have moved to encryption (which I am in favor of). You can pipe through stunnel/openssl channels, but it’s tedious, and other tools provide better diagnostics and get the job done faster. Despite it being antiquated and a bit jarring, I wouldn’t say email admin is a daily grind; honestly, most days and even years are uneventful once things are working. But it’s really the exceptions that have driven me to hate email, because more often than not the fault is somewhere else and you’re at the whim of whether bigger fish give a damn about your problem.
UTF-8 is NOT a brilliant design. Yes, it’s a very good design, but I think “brilliant” is overstating it.
In early versions of Unicode, every character was represented as a 16-bit value – UCS-2. I believe early versions of Windows NT were designed to support UCS-2; variable-length encoding wasn’t in the early plans.
Then they realised that characters would be represented by a base code point + combining character code points, i.e. Unicode would never be fixed-width characters.
16-bit Unicode should never have been invented, because you have to worry about byte ordering.
They should have realised from day 1 that a single character can be represented by a varying number of bytes, and they should have defined only UTF-8 in the standard.
UTF-16 surrogate pairs are an abomination: a really nasty design to prop up a 16-bit encoding that never should have existed.
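For anyone who hasn’t run into them, here’s what a surrogate pair actually looks like, as a small Python sketch (the emoji is just an arbitrary example of a code point that doesn’t fit in 16 bits):

```python
ch = "\U0001F600"                        # U+1F600, beyond what 16 bits can hold

print(ch.encode("utf-16-be").hex(" "))   # d8 3d de 00 -> surrogate pair D83D + DE00
print(ch.encode("utf-8").hex(" "))       # f0 9f 98 80 -> a plain 4-byte UTF-8 sequence

# How the pair is derived from the code point:
v = ord(ch) - 0x10000                    # a 20-bit value
high = 0xD800 + (v >> 10)                # 0xD83D
low = 0xDC00 + (v & 0x3FF)               # 0xDE00
print(hex(high), hex(low))
```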
They also went overboard in deciding there need to be 1 million+ possible code points (similar to how IPv6 grossly overestimated the number of required IP addresses). The latest Unicode version, 17.0, has 159,801 code points. The fact is, they should have limited Unicode to 18 bits = a maximum of 262,144 code points. I could write a much improved Unicode 2.0, but of course it will never be supported 🙂
tom9876543,
To be fair, byte ordering matters with UTF-8 as well; it’s just more obvious that you’re using network byte ordering when you read UTF-8 one byte at a time. There’s no reason programmers can’t use the exact same method to read UTF-16 one byte at a time, and it would work just fine there too; however, it seems silly to read UTF-16 one byte at a time instead of two. The issue, of course, is that x86 notoriously went its own way with byte ordering.
You could get away with less, but it’s about future-proofing and not risking making the same mistake again. In computer engineering we have a history of not leaving ourselves enough space and paying compatibility costs later. Like hard disks that outgrew BIOS and DOS’s ability to address them. Or like the Y2K problem, the upcoming year 2038 problem, etc.
https://en.wikipedia.org/wiki/Year_2038_problem
18-bit code points seem like a lot, but keep in mind these aren’t perfectly packed, so consumption can go up faster than you might think. IMHO it’s beneficial to give ourselves more room. That said, while UTF-8’s killer feature was compatibility with ASCII, UTF-8 isn’t really optimal for storage. For better or worse, the encoding is biased towards Latin characters. This could have been solved with an alphabet selector or even a simple compression scheme that stores characters in even fewer than 8 bits.
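To put that Latin bias in concrete terms, here’s a small Python sketch (the sample strings are arbitrary) comparing how many bytes per character each script costs in UTF-8:

```python
samples = {
    "English": "hello",
    "Greek": "γειά",
    "Russian": "привет",
    "Chinese": "你好",
    "Emoji": "😀",
}

# ASCII letters cost 1 byte each, Greek/Cyrillic 2, CJK 3, and emoji or other
# astral-plane code points 4.
for name, text in samples.items():
    encoded = text.encode("utf-8")
    print(f"{name:8} {len(encoded) / len(text):.1f} bytes per character")
```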
Alfman, sorry, but “byte ordering matters with UTF8” is wrong. “There’s no reason programmers can’t use the exact same method to read UTF16 one byte at a time and it would work just fine there too.” – that is incorrect. The programmer has to determine if it’s big endian or little endian. Unicode has a special character (the BOM) for exactly that.
Endianness is only an issue for multi-byte data. Google search AI result literally says “Endianness does not apply to single-byte data.”
“18bit seems like a lot but consumption can go up faster than you might think.” – wrong again. The latest Unicode standard from this year has about 160,000 code points as per Wikipedia. In 1990, the Unicode Consortium should have been able to list all of the different languages and how many characters they have, and come up with an approximate number of required code points (with extra buffer room added as well). It seems they just decided 1 million would be a good number with no scientific reasoning.
tom9876543,
No it’s not. UTF-8 uses network byte ordering. If you wanted to, you could write an algorithm to process UTF-8 characters faster by opportunistically reading them 4 bytes at a time (ignoring the unneeded bits), and then you’d clearly have to account for byte ordering. It’s natural to think about algorithms that process one byte at a time, but that doesn’t change the fact that byte ordering is significant.
Ah, you’re right. The RFC recommends BE be assumed, but I guess in practice this doesn’t happen.
https://en.wikipedia.org/wiki/UTF-16
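For illustration, a quick Python sketch of how the same text comes out under the two UTF-16 byte orders, and how the BOM is what lets a decoder tell them apart (the last comment assumes a typical little-endian machine):

```python
s = "héllo"

print(s.encode("utf-16-be").hex(" "))  # 00 68 00 e9 00 6c 00 6c 00 6f
print(s.encode("utf-16-le").hex(" "))  # 68 00 e9 00 6c 00 6c 00 6f 00

# The plain "utf-16" codec prepends a BOM (U+FEFF) in the platform byte order,
# typically ff fe ... on little-endian hardware, so the decoder knows which
# ordering it is looking at.
print(s.encode("utf-16").hex(" "))
```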
O/T but “Google search AI result literally says…” made me laugh.
I guess this may be the future of citing sources for arguments though :-/
That statement isn’t wrong, but UTF-8 is obviously not single-byte data. The fact that multibyte sequences can be broken down into single-byte reads doesn’t change the fact that endianness matters. I recently implemented a CAN bus parser that reads one byte at a time, and I couldn’t just ignore byte ordering. In CAN bus, the endianness issues are even more complicated thanks to the fact that fields can start at arbitrary bit locations.
Sorry, but I checked before making the statement. While theoretically we could fill every last code point, in practice some sections have been left sparse, probably to help keep things more organized.
It’s easy to define a standard looking backwards, but when looking forwards it makes sense to leave some more headroom. Obviously you’re free to disagree, but I don’t believe 4 bytes was an unreasonable upper bound.
It applies to any bit grouping that has individual addressing. So it also applies to single-byte data, but nobody does bit-level addressing. A single byte is little-endian.
No, a single byte is big-endian by nature, just like natural numbers, where the greater digits/weights are on the left.
We “enumerate” bits starting from 7 (left) down to 0 (right), or 15 (left) down to 0 (right), … Even hexadecimal does so “naturally”.
Yet transmission over a serial line can start from either bit 7 or bit 0 (right shift or left shift), so you’d better be sure that both ends are configured correctly.
Kochise,
I agree, big endian does seem more “natural” to me too. It would be so much easier if everything did that consistently, and I think the internet pioneers did try to make it happen. However, communication protocols can implement it either way; fitting our expectations isn’t technically necessary. If not for x86 going against the grain, big endian would probably be dominant everywhere and we wouldn’t even have to think about it.
Yes, you are right; however, the bit ordering for most wireline protocols doesn’t typically need to be configured or negotiated, because it’s part of the standard. I don’t think I’ve ever seen a wire protocol where this was even configurable in software. It’s standardized, and it’s the hardware engineer’s responsibility to get it right.
https://en.wikipedia.org/wiki/Serial_port
Because the byte is almost universally the smallest addressable unit (the hardware doesn’t let us read fractions of a byte), software devs are not usually affected by bit ordering within a byte, regardless of whether the host is big endian or little endian. But the order of the bytes being written onto the wire has to be defined by the higher-level application protocol, even if bit ordering is taken care of by the hardware.
The simple fact is, the Unicode Consortium should have been able to calculate in the late 1980s that there are X written languages, each language has Y characters, and so the total number of characters across all languages is about 150,000.
They should have decided to limit Unicode to 18 bits = 262,144 code points.
Then a “brilliant” theoretical design of UTF-8 would only need a maximum of 3 bytes per code point.
A 4-byte encoding of a single code point is inefficient, and IMHO this is not a “brilliant” design. Another example of the less-than-ideal implementation is the BOM, which wouldn’t be required if they had never wasted time developing 2-byte encodings.
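For reference, here are the payload bits each UTF-8 sequence length carries as actually specified (7, 11, 16 and 21 bits for 1 to 4 bytes), which is why an 18-bit code space would need a different 3-byte bit layout than the one UTF-8 defines. A quick Python check at the boundary code points:

```python
# Payload bits per UTF-8 sequence length: 1 byte -> 7, 2 -> 11, 3 -> 16, 4 -> 21.
boundaries = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]

for cp in boundaries:
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {len(encoded)} bytes ({encoded.hex(' ')})")
```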
tom9876543,
This makes a standard suitable for the past, but Unicode’s responsibility was forward-looking.
You’re focusing an awful lot on the word “brilliant”. Sure, it is overstated, but then so too are your criticisms. Most of the code points you refer to don’t even need the fourth byte. Allowing for one in the spec as an option gives the future more headroom without breaking any existing software that’s already UTF-8 compliant. Those who’ve learned lessons from Y2K will have a more natural appreciation for why forward-looking standards are important: it’s cheap to introduce a forward-looking standard from the start, but going back to fix software and migrate data decades later costs trillions. Or else we risk text file incompatibilities that are anathema to Unicode’s mission.
I would have done things differently too, but at least they learned from those mistakes when they did UTF-8.
Alfman,
And none of this matters with efficient coding algorithms, like Huffman codes (basically, a very simple greedy algorithm can generate optimal code trees).
The final on-disk or on-wire representation will not need to waste any bits on unused information, and we can still keep UTF-8 as the interchange format.
(Many compression formats like ZIP already use Huffman codes, along with Lempel-Ziv.)
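As a concrete sketch of that greedy algorithm (the sample string is arbitrary), here’s a minimal Huffman code builder using a heap:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build an optimal prefix code for text via the classic greedy algorithm."""
    freqs = Counter(text)
    # Heap entries: (weight, tiebreaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # degenerate single-symbol input
        return {sym: "0" for sym in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)        # take the two lightest subtrees...
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))  # ...and merge them
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
for sym, code in sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])):
    print(repr(sym), code)
```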