Humans speak countless different languages. Not only are these languages incompatible, but runtime transpilation is a real pain. Sadly, every standardisation initiative has failed.
At least there is someone to blame for this state of affairs: God. It was him, after all, who cursed humanity to speak different languages, in an early dispute over a controversial property development.
However, mankind can only blame itself for the fact that computers struggle to talk to each other.
And one of the biggest problems is the most simple: computers do not agree on how to write letters in binary.
↫ Cal Paterson
For most users, character encoding issues are not something they have to deal with. Programmers and other people who work at the lower levels of computing, however, run into them far more often than they should.
Interesting article!
I am more in favor of assuming UTF-8 than using statistics. I actually do think assuming UTF-8 has gotten more reliable over the years. Ten or so years ago, Unicode characters would regularly break documents and HTML pages; these days I rarely come across issues with UTF-8. It’s become the standard format for text nearly everywhere: databases, web, console, text files, etc. Other standards still exist, but they’re becoming more niche (IMHO that’s a good thing).
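As a rough illustration of what “assume UTF-8 first” looks like in practice, here is a minimal Python sketch (the file name and the windows-1252 fallback are just placeholder assumptions): try a strict UTF-8 decode, and only fall back to a legacy codepage when that fails.

```python
# Minimal sketch of "assume UTF-8 first, fall back only on failure".
# The path and the windows-1252 fallback are illustrative assumptions.
def read_text(path, fallback="windows-1252"):
    data = open(path, "rb").read()
    try:
        return data.decode("utf-8")    # valid UTF-8 (including plain ASCII) succeeds
    except UnicodeDecodeError:
        return data.decode(fallback)   # last resort: an assumed legacy codepage

print(read_text("notes.txt"))
```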
The article did allude to a major shortcoming of modern computer files: there’s no standard header for metadata. Every single format ends up reinventing the wheel: MP3 files that describe the author, office documents, JPG, PNG, etc. It’s too late to fix this now. But imagine a world where all files had standard metadata fields and standardized APIs to manipulate them. This would solve so many of our organizational and indexing problems. I can’t exactly blame the creators of unix for not foreseeing this need, but realistically, had they tackled this early on, there’s a chance standard file metadata operations could have ended up being part of the greatest common denominator.
Indeed, when the creators of Unix decided that plain-text files would be a stream of bytes containing just the text and nothing else, they made it impossible for text files to have any kind of metadata (unless you use filesystem-level metadata which tend to be lost when copying to FAT32 partitions or when uploading to services like Dropbox). Which of course means no character encoding metadata either.
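For what it’s worth, filesystem-level metadata can carry an encoding hint today. The sketch below uses Linux extended attributes (the “user.charset” attribute name is made up), and it demonstrates exactly the fragility described above: the attribute vanishes on FAT32 copies, zip archives, and most uploads.

```python
# Sketch: stash an encoding hint in a Linux extended attribute (xattr).
# Only works on filesystems with xattr support; the attribute name is invented.
import os

path = "notes.txt"                                  # hypothetical file
os.setxattr(path, "user.charset", b"windows-1253")  # write the hint

encoding = os.getxattr(path, "user.charset").decode("ascii")
text = open(path, "rb").read().decode(encoding)     # decode using the stored hint
```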
As a Greek person, I’ve dealt with this problem numerous times. For example, SRT subtitles are usually 8-bit Windows codepage text files, which meant you had to set your DVD player to the correct encoding to decode the text correctly (which wasn’t straightforward and usually required setting your preferred subtitle language to Greek, setting your region to Greece, or even applying a firmware patch). I still have to do this on Kodi sometimes. Another issue is the English version of Windows, which defaults to the windows-1252 codepage, and you have to dig into the Control Panel to change it so your txt files work correctly. But if text files included encoding metadata, it wouldn’t matter what my “preferred” language is to render any txt file correctly.
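The usual workaround, sketched below in Python (the file names are made up, and windows-1253 is only an assumption about the source), is to re-encode the subtitle file to UTF-8 once so the player no longer has to guess. Of course, you still have to know or guess the original encoding first, which is the whole problem.

```python
# Sketch: convert a Greek SRT from an assumed windows-1253 codepage to UTF-8.
with open("movie.el.srt", "r", encoding="windows-1253") as src:
    text = src.read()
with open("movie.el.utf8.srt", "w", encoding="utf-8") as dst:
    dst.write(text)
```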
About the need for a standardised metadata header: it would’ve been nice, but it was never going to happen. I mean, there are even different 8-bit encodings for the same language, even after an effort was made to standardise those things (for example ISO 8859 vs the Windows codepages). Your best bet for standardised metadata is filesystem-level metadata, and making sure it isn’t lost (it will be, so don’t bother).
kurkosdr,
I agree with you on both counts. It’s one of those things that might have been successfully standardized in the beginning, but if implemented today would end up being a poorly supported niche feature that can’t be relied on to work across platforms and services.
I’m trying to think of a word that succinctly describes this concept: “could have worked out at the beginning, but it’s too late now”, but I’m drawing blanks. Anyone have a good word for this?
“Reverse Boarfoxing”
The Wild Boar and the Fox, by Aesop
A WILD BOAR was whetting his tusks against a tree, when a Fox coming by, asked him why he did so; “for,” said he, “I see no reason for it; there is neither hunter nor hound in sight, nor any other danger that I can see, at hand.” “True,” replied the Boar; “but when danger does arise, I shall have something else to do than to sharpen my weapons.”
M.Onty,
I found no hits for this word combination. Creating an original term could be fun but I’m not sure we can get it to stick, haha. Webster’s going to want to know how you use it in a sentence?
“You should have invested in apple years ago, but there’s no use reverse boarfoxing now” 🙂
I think the term “too little, too late” fits this.
“I can’t exactly blame the creators of unix for not foreseeing this need”
Would a standard created by UNIX have solved the problem? First of all, UNIX “standards” went through a rather lengthy period where they were anything but. Cue the XKCD.
Even if the UNIX world managed to agree, the size and success of the DOS and then Windows world meant that no UNIX standard stood a chance of being a standard. Look at your own findings below about web documents processed through Windows. You can almost assume UTF-8 these days (and ASCII is a subset). But Java and .NET are still UTF-16 and not going anywhere.
The UNIX world did take a run at the problem by assuming they could define a universal “Portable Character Set” and make it part of the definition of UNIX. The Open Group put it into the Single Unix Specification.
https://pubs.opengroup.org/onlinepubs/9699919799/
According to that standard, “the characters in Portable Character Set are defined in the ISO/IEC 10646-1:2000 standard”. But ISO/IEC 10646 has been revised and withdrawn dozens of times. It describes the UCS ( Universal Character Set ). Here is the latest version:
https://www.iso.org/standard/76835.html
Since that document says that it “specifies seven encoding schemes of the UCS: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE”, I don’t think we have much hope.
Maybe our forefathers understood the importance of standards and that is why they left it out. We just were not ready for it yet. Perhaps someday we will mature enough as a species to agree on these kinds of things.
tanishaj,
DOS and windows weren’t invented in a vacuum. A lot of the concepts unix invented found their way into DOS and windows like pipes, socket apis, even ASCII itself. IMHO it’s not a stretch to think a metadata API could have been copied by mac/windows/linux/bsd/etc such that all operating systems would support it today. I think metadata was one of microsoft’s plans for alternate data streams in ntfs, but by the time windows invented it, it was already too late to normalize it as a standard OS primitive. I do think an earlier metadata standard could have resulted in a different outcome, though.
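To illustrate the alternate data streams point: NTFS does let you attach named streams to a file today, something like the sketch below (the “:encoding” stream name is made up, and this is Windows/NTFS only). But the stream is silently dropped on FAT32, zip archives, and most uploads, which is exactly why it never became a dependable cross-platform primitive.

```python
# Sketch: per-file metadata in an NTFS alternate data stream (Windows only).
# The ":encoding" stream name is invented for illustration.
path = r"C:\docs\notes.txt"            # hypothetical file on an NTFS volume

with open(path + ":encoding", "w") as ads:
    ads.write("windows-1252")          # store the hint in a named stream

with open(path + ":encoding") as ads:
    print(ads.read())                  # -> windows-1252
```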
ASCII is a subset of UTF-8, but extended ASCII is not, which I highlighted in my post. Anyway, the use of UTF-16 inside a language is becoming less relevant, because even those languages need to support UTF-8 as a matter of course, even if they do so by translating to and from a different internal format.
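A tiny sketch of what “UTF-16 inside, UTF-8 at the boundary” means: the same string round-trips through both encodings, so a UTF-16 runtime can keep its internal format and still read and write UTF-8.

```python
# The same text survives both encodings; only the byte-level representation differs.
s = "naïve ελληνικά"
utf16_bytes = s.encode("utf-16-le")   # roughly what a UTF-16 runtime holds internally
utf8_bytes = s.encode("utf-8")        # what goes into files and over the wire
assert utf16_bytes.decode("utf-16-le") == utf8_bytes.decode("utf-8") == s
```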
Those documents speak to text encoding in the POSIX.1-2017 standard and later, but they do not speak to the lack of file metadata and APIs that I am talking about. They actually highlight a perfect use case for standardized file metadata.
That argument doesn’t rebut the case for metadata standards. It would have been very useful to standardize this half a century ago… but they lacked the foresight (i.e. our hindsight) to realize it would become a problem.
>concepts unix invented found their way into DOS and windows like pipes, socket apis, even ASCII itself.
ASCII is a UNIX standard? That’s demonstrably nonsense; the first ASCII standard is from 1963 and development of UNIX did not start until many years after that.
Minuous,
The point was that DOS didn’t grow in a vacuum, it incorporated ideas from unix. And metadata would have been another candidate for copying between early operating systems.
Slapping metadata on (especially if it’s an unrestricted, open-ended indication of the underlying format) just makes the file unprocessable by most computer software. One has to put a stop somewhere. It’s like inventing arbitrary USB signalling and voltage specs and then calling it a day because it’s announced on the socket.
dsmogor,
I’d say it’s the lack of information about the character encoding used that makes files not consistently processable. Relying on assumptions and heuristics is a hack, one that is objectively worse than passing along an explicit encoding property. Software would never be worse off for having that metadata. Even in the case where software doesn’t support a foreign language encoding (i.e. it is not installed), this type of factual information about the encoding is still valuable to report back to the user, instead of just taking a wild guess based on statistical analysis of known languages.
For a thought experiment, imagine we’re creating a video container format. Instead of having a metadata standard that describes the codec used (like fourcc), somebody says that information isn’t needed because a heuristic algorithm can guess the codec based on a statistical analysis of the data stream. The algorithm will take a sample of the stream and then guess the likely encoding using heuristics. Even if it could work most of the time, hopefully it’s clear why it’s a terrible solution compared to including the codec information in the standard. Well, it’s the same deal with text encoding; it’s the exact same problem. It’s not that a standard isn’t warranted for text encoding, for the exact same reasons, but rather that we did not have a standard from the beginning, and adding one now is too late.
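To make the thought experiment concrete, here is a toy sketch of a made-up text container whose first line declares the encoding, analogous to a fourcc. Nothing about it is a real standard; it just shows how little it would take to make guessing unnecessary once the declaration exists.

```python
# Toy format: a one-line "ENC:<codepage>" header followed by the payload.
# Entirely hypothetical; for illustration only.
def write_tagged(path, text, encoding):
    with open(path, "wb") as f:
        f.write(b"ENC:" + encoding.encode("ascii") + b"\n")  # explicit declaration
        f.write(text.encode(encoding))                       # payload bytes

def read_tagged(path):
    with open(path, "rb") as f:
        header = f.readline()                     # e.g. b"ENC:windows-1253\n"
        encoding = header[4:].strip().decode("ascii")
        return f.read().decode(encoding)          # no heuristics required

write_tagged("demo.tagged", "γειά σου κόσμε", "windows-1253")
print(read_tagged("demo.tagged"))
```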
For actively developed software, I don’t think it’s impossible to develop standards incrementally, but it’ll take a long, long time and a lot of unrewarded effort. But then again, we have a host of specially developed OSes, and the issues around UTF-8 seem far, far more important than having another OS to choose from; perhaps the problem isn’t sexy enough.
I suppose someone being cynical will tell us that eventually AI will deliver a Universal Translator that works for everyone and everything. Who would you trust to run that: Meta, Google, Microsoft, or Apple?
If there were a standardized library (ICU?) provided by the people who try to standardize the way people exchange text and data on computers, maybe it wouldn’t be such an issue. But UTF-8 is hard to use, because there is no one-size-fits-all API to perform this, and so many legacy text files in various code pages are still around, saved in different media formats.
Translators need heaps of data to train. For some niche languages there’s simply not enough of it in readily available digital form.
So, we need to read the entire file in order to run the heuristics to figure out the encoding? Or is there a magic amount of a file that we can read to be pretty sure which encoding it is? And are there proper, good libraries in the major languages that can do it Right(tm) that we can just pull in? Would love it if he pointed to some actual implementations, because I don’t see the average, even if good, programmer having the knowledge to properly implement that. Or the time. And then we end up with a whole bunch of separate implementations that have to be maintained (usually poorly).
Drizzt321,
I believe most if not all real-world implementations perform a statistical evaluation on an arbitrary length of text (i.e. not the complete file).
There’s no standard heuristic for this, but there are dozens of questions about this on stack exchange with answers, here is one:
softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
I’m seeing tons of open source implementations for many languages.
github.com/superstrom/chardetsharp
github.com/chardet/chardet
github.com/sv24-archive/charade
github.com/Joungkyun/libchardet
I don’t remember installing it, but “chardetect” is installed on my debian desktop. May have been installed by default.
An old mozilla paper on this may be of interest:
https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
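Roughly, using the chardet library linked above, the sampling approach looks like this (the input file is hypothetical): feed the detector chunks from the start of the file and stop as soon as it is confident. The result is still only a guess with a confidence score, never a guarantee.

```python
# Sketch: incremental detection with chardet's UniversalDetector,
# reading only as much of the file as the detector needs.
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open("mystery.txt", "rb") as f:              # hypothetical input file
    for chunk in iter(lambda: f.read(4096), b""):
        detector.feed(chunk)
        if detector.done:                         # confident enough; stop early
            break
detector.close()
print(detector.result)   # e.g. {'encoding': 'windows-1253', 'confidence': 0.7, ...}
```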
While the goal for heuristics is to be right most of the time, it rubs me the wrong way to have software relying on chance instead of real standards…uck.
I noticed that the mozilla paper about byte encodings contains encoding errors….how funny is that!
You can see that the mozilla page is full of ‘?’ characters where the browser can’t parse the character. Firefox reports that page as “UTF-8”, but the contents of the page contain extended byte sequences that aren’t valid UTF-8.
I checked using a hex editor and those are hex 0xA0 bytes. It turns out the text was likely written in an editor that uses the Windows-1252 code page to produce non-breaking spaces….
https://www.ascii-code.com/CP1252/160
HTML always collapses consecutive spaces into one space on screen, and incidentally this is why text editors that target HTML (like MS FrontPage or Dreamweaver) convert extra spaces into non-breaking space characters instead, which is what happened here. However, the correct byte sequence for a non-breaking space in UTF-8 isn’t “0xA0”, it’s “0xC2 0xA0”. Hence the character glitches. Obviously the old Windows-1252 text content got pasted/imported into a UTF-8 website, resulting in a page containing text from two encodings.
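The mix-up is easy to reproduce at the byte level; a quick Python check of the bytes in question:

```python
# 0xA0 is a non-breaking space in Windows-1252, but an illegal lone byte in UTF-8;
# the correct UTF-8 encoding of the same character is 0xC2 0xA0.
raw = b"\xa0"
print(raw.decode("windows-1252"))     # '\xa0' -- non-breaking space
print("\u00a0".encode("utf-8"))       # b'\xc2\xa0' -- what UTF-8 expects
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)                        # lone 0xA0 is rejected, hence the '?' glyphs
```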
I am so pleased that mozilla’s paper about character encoding is itself a test case of what can happen when the character encoding is wrong! This is the kind of irony that makes me smile 🙂
Very nice catch and indeed quite wonderful in its own tragic way.
Yeah, I think what is “true” here is that there are those that “change” and then there is Windows which “demands”.
In all my experiences dealing with i18n situations, it’s those that “demand” and will never change that are usually the reason why things can’t work.
So, while Windows can argue that “their way” is the “the only way” because “it’s good”, it actually causes a lot of problems.
So much so that in some of my software I have to restrict functionality on Windows.
Sure, just like with so many other “standards” (not), we could capitulate and adopt a strict Windows-only policy as the worldwide standard. I know Microsoft would appreciate that.