Windows was an early adopter of Unicode, and its file APIs have used UTF‑16 internally since Windows 2000 (it used to be UCS-2 in the Windows 95 era, when the Unicode standard was only a draft on paper, but that’s another topic). Using UTF-16 means that filenames, text strings, and other data are stored as sequences of 16‑bit units. For Windows, a properly formed surrogate pair is perfectly acceptable. However, issues arise when string manipulation produces isolated or malformed surrogates. Such errors can lead to unreadable filenames and display glitches, even though the operating system itself can execute files correctly. But we can create them deliberately as well, which we can see below.
↫ Zafer Balkan
What a wild ride and an odd corner case. I wonder what kind of odd and fun shenanigans this could be used for.
My understanding is that Windows accepts unpaired surrogates because they decided that was the least painful way to be backwards compatible with NTFS filesystems created when UCS-2 didn’t impose any special restriction on those codepoints.
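For anyone who wants to see what separates a well-formed pair from an unpaired surrogate, here is a minimal Python sketch. The helper names `utf16_units` and `find_lone_surrogates` are made up for illustration, and the code only inspects 16-bit code units; it makes no claim about what NTFS or the Windows APIs do internally.

```python
def utf16_units(text):
    """Encode text as a list of UTF-16 code units, letting lone surrogates through."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def find_lone_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a valid pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: only valid when a low surrogate follows immediately.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate that was not consumed as the second half of a pair.
            lone.append(i)
        i += 1
    return lone


# U+1F600 needs a surrogate pair in UTF-16; the second name ends in half a pair.
well_formed = utf16_units("a\U0001F600")
broken = utf16_units("a\ud83d")
```

A UCS-2 era system would have stored both sequences without complaint, since it treated every 16-bit value as a character in its own right.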
Unicode filenames were always a bad solution to a non-problem; this is just one more illustration of why.
Minuous,
That’s easy to say if your language is Latin-based, but otherwise it’s rather unfair to insist that filesystems shouldn’t be able to represent your language.
I think the Linux implementation is the simplest: treat filenames as bytes and let higher-level applications interpret what they mean in terms of characters.
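As a rough sketch of what "let the application interpret the bytes" can look like, Python's `surrogateescape` error handler (the mechanism from PEP 383) maps undecodable bytes to lone surrogates so the original byte string can be recovered exactly. The filename and the helper names `to_text` and `to_bytes` below are hypothetical.

```python
def to_text(name: bytes) -> str:
    """Decode a byte filename for display, smuggling undecodable bytes as lone surrogates."""
    return name.decode("utf-8", "surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse to_text and recover the exact bytes the filesystem stored."""
    return text.encode("utf-8", "surrogateescape")


# A hypothetical name written by a Latin-1 program: 0xE9 is "é" there,
# but on its own it is not valid UTF-8.
raw_name = b"caf\xe9.txt"

shown = to_text(raw_name)      # the stray byte becomes the lone surrogate U+DCE9
restored = to_bytes(shown)     # byte-for-byte identical to raw_name
```

The kernel never has to care which encoding the name was written in; only the layer that displays it does.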
My objections to Unicode have more to do with mission creep: hijacking Unicode for emojis with colors and so on went too far. Now what you see is entirely dependent on the device you use. IMHO the scope should have been limited to natural alphabets, and emojis should have used a markup language that doesn’t pollute the Unicode namespace.
Care to elaborate on why you think it was a non-problem, or why Unicode was a bad solution?
Without Unicode, a lot of people weren’t able to name their files in their own native language, or if they could, the files could not easily be shared across regions with different languages.
I would agree that UTF-16 is not as good an encoding as UTF-8, but that’s only with the power of hindsight. Windows NT development started in November 1989; UTF-8 wasn’t announced until January 1993, more than three years later. And UTF-8 would not necessarily fix this specific problem: it is still possible to encode UTF-8 incorrectly, so Windows might still have the same kinds of issues displaying or handling it.
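To illustrate the point that UTF-8 can be malformed too, here is a small Python sketch that runs a few byte sequences through the strict UTF-8 decoder. The function name `is_well_formed_utf8` and the sample bytes are chosen for illustration only.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "complete U+1F600": b"\xf0\x9f\x98\x80",
    "truncated sequence": b"\xf0\x9f\x98",      # last byte of a 4-byte sequence missing
    "overlong encoding": b"\xc0\xaf",           # "/" padded out to two bytes
    "encoded lone surrogate": b"\xed\xa0\xbd",  # U+D83D written out as if it were a character
}

results = {label: is_well_formed_utf8(data) for label, data in samples.items()}
```

Only the first sample passes. The last one is the UTF-8 counterpart of the unpaired surrogate problem, so a filesystem that stored UTF-8 names without validating them would face much the same display issues.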