Some time ago, a very weird issue was reported to me about a Nextcloud system. The user uploaded a file with an “ö” in its name to an SMB share that was configured as an external storage in the Nextcloud server. But when accessing the folder containing the file over WebDAV, it did not appear (no matter which WebDAV client was used). After ruling out the usual causes (wrong permissions, etc.), I analyzed the network traffic between the WebDAV client and the server and saw that the file name was indeed not returned after issuing a PROPFIND. So I set some breakpoints in the Nextcloud source code to check whether it was also missing from the SMB server’s response. It was returned by the SMB server, but when the Nextcloud system requested more metadata for the file (with the path in the request), the SMB server returned a “file not found” error, which led Nextcloud to discard the file. How can it happen that the file is first returned by the SMB server when listing files, but then the server suddenly reports an error when requesting more metadata?
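For illustration only (generic Python, not the actual Nextcloud or SMB code paths): an “ö” can be stored either precomposed (NFC, U+00F6) or decomposed (NFD, “o” plus U+0308). If the enumeration returns one form and the metadata request sends the other, a byte-for-byte comparison on the server no longer matches, which produces exactly this “listed but not found” behaviour.

    import unicodedata

    nfc = "\u00f6"    # precomposed: LATIN SMALL LETTER O WITH DIAERESIS
    nfd = "o\u0308"   # decomposed: "o" + COMBINING DIAERESIS

    print(nfc == nfd)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True: equal after normalization

    # A server that stores the NFD name but receives an NFC path (or vice versa)
    # and compares names byte-for-byte reports "file not found":
    stored = {nfd + ".txt": {"size": 42}}
    requested = nfc + ".txt"
    print(requested in stored)  # False, even though the file shows up in a listing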
Special characters must be second only to time, dates, and timezones when it comes to weird behaviour in code.
File under yet another annoying and avoidable problem. In the hardware and safety-critical world, standard tests, protocols, and certifications exist. It’s time for the software world to catch up.
I imagine a few “legends in their own minds” and corporations getting fat off the hamster wheel of broken code and unnecessary duplication won’t want this level of accountability. Nobody does, but I prefer problems to be fixed and stay fixed before they become a problem I get to hear about…
This is text encoding, though, so it depends on as many factors as date encoding. If your text is encoded in a specific codepage rather than a standard, it needs to be converted; if your text is encoded per an ISO standard, that needs to be accounted for. You seem to believe everything is simple, but the legacy says otherwise.
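To make the codepage point concrete, here is a small, generic sketch (plain Python, nothing specific to the system above): the same stored byte means different things depending on whether you assume a legacy codepage or UTF-8, so a conversion step is unavoidable somewhere.

    raw = b"\xf6"  # one byte, as a legacy codepage-encoded name might store "ö"

    print(raw.decode("latin-1"))   # 'ö' under ISO 8859-1 / Windows-1252
    print("ö".encode("utf-8"))     # b'\xc3\xb6' -- UTF-8 uses two bytes for the same letter

    try:
        raw.decode("utf-8")        # the legacy byte is not valid UTF-8 on its own
    except UnicodeDecodeError as err:
        print("needs conversion:", err)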
Unicode is known for causing such issues, but this is Nextcloud we are talking about – I didn’t have to dig nearly this far to uncover hundreds of other issues. Frankly, Nextcloud is not production-quality software. It is slow, buggy, and very difficult to maintain in working order. Which is a pity, since privacy by design kind of hinges on deploying it yourself.
Yeah… Nextcloud. I used to run it, but I’d agree it’s a pain and a very messy code base.
One aspect that hasn’t really been mentioned yet is the strong Anglo-centric bias in computing. APIs, language keywords, etc. are all biased towards English speakers, and this looks like another manifestation of the same thing. Rather than a nefarious plot to disenfranchise non-English speakers, these seem like a side effect of so much early work being done in the United States, and of developers always rushing to just “make it work” instead of thinking through the problem for the long term. They just run with something that solves their immediate problem, and don’t test or design for cases that won’t immediately impact them. Over the long term, open source software tends to combat both of these effects, but it’s a struggle. At least in this case the code was available to find and adjust. If it were proprietary software, that would not be possible.
Tom probably knows this, but as a Swedish speaker I’d just like to interject that there might be perfectly normal reasons to name a file ‘ö’, as it is a single-letter noun meaning ‘island’ in my mother tongue.
Of course, the correct thing here is to NOT use a normalization routine on an already stored filename, unless it’s being used to rename the un-normalized one. If it’s stored un-normalized, read it un-normalized, period. Any normalization should have happened before the initial save.
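A minimal sketch of that rule (hypothetical helper, not Nextcloud code): normalize exactly once on the write path, and on the read path look files up by whatever name the storage actually returned.

    import unicodedata

    class Store:
        """Toy in-memory 'filesystem' keyed by exact name strings."""

        def __init__(self):
            self.files = {}

        def save(self, name, data):
            # Normalize once, before the name is ever persisted.
            self.files[unicodedata.normalize("NFC", name)] = data

        def listdir(self):
            return list(self.files)

        def stat(self, name):
            # Read path: use the name exactly as stored/listed; do NOT re-normalize.
            return self.files[name]

    store = Store()
    store.save("o\u0308.txt", b"island")   # decomposed input, persisted in NFC
    for name in store.listdir():
        print(name, store.stat(name))      # lookups with listed names always succeed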
Sarreq Teryx,
One might also treat it as a binary value and punt the problem to higher-level software. The thing about having normalization and collation in the FS is that changing code pages can have an impact on where files are physically located in an index on disk. One rationale for treating file names as bytes is that Unicode is a changing standard with new code points added periodically, and supporting it at the OS/FS layers is more effort than it’s worth. For its part, Linux basically treats a filename as binary and doesn’t try to interpret character mappings at all.
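Roughly what “filenames are just bytes” means in practice (a sketch for Linux, e.g. ext4; run it in a scratch directory): two names that render identically but differ only in normalization form are two independent files, and the kernel does not treat them as related.

    import os
    import tempfile
    import unicodedata
    from pathlib import Path

    d = Path(tempfile.mkdtemp())

    # Visually the same "ö.txt", but different byte sequences on disk.
    (d / "\u00f6.txt").write_text("NFC form")   # name bytes: c3 b6 2e 74 78 74
    (d / "o\u0308.txt").write_text("NFD form")  # name bytes: 6f cc 88 2e 74 78 74

    names = os.listdir(d)
    print(len(names))  # 2 on Linux: the filesystem stores the bytes untouched
    for n in sorted(names):
        print(n.encode("utf-8"), unicodedata.is_normalized("NFC", n))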
Windows is different, though; there, Unicode characters get some degree of normalization/preprocessing prior to indexing to achieve case-insensitive filesystem semantics. NTFS deals with Unicode changes by saving the code mapping inside FS metadata at the time of format. This enforces consistent Unicode mappings within the file system; however, it may get out of sync with other (perhaps newer) Unicode mappings in other software.
Here’s a detailed discussion of this from a past article:
https://www.osnews.com/story/30417/how-to-enable-case-sensitivity-for-ntfs-support-for-folders/
It seems like things ought to be canonical, because having multiple ways to represent one file name seems like a bad idea, and there’s also some benefit to case-insensitive collation and matching. On the other hand, treating file names as bytes makes implementations and protocols significantly more straightforward. The problem is that user-facing software has to deal with collation and canonicalization, and this can make file system access very inefficient. Beyond that, it can lead to inconsistent behavior between programs. The ramifications can be more than just inconvenient: you may have a web server exposing local file system conventions to the world, but URLs may fail to resolve if Unicode characters are represented inconsistently between browsers/web servers/application servers/databases/operating systems/file systems/etc.
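As a concrete illustration of the URL point (a generic sketch, not any particular web server’s behaviour): the two forms of “ö” percent-encode to different URLs, so a link built from one form misses a file stored under the other unless some layer normalizes consistently.

    import unicodedata
    from urllib.parse import quote

    nfc = "\u00f6.html"   # precomposed name
    nfd = "o\u0308.html"  # decomposed name

    print(quote(nfc))  # %C3%B6.html
    print(quote(nfd))  # o%CC%88.html -- a different URL for the "same" name

    # Only once every layer agrees on one form do the links line up:
    print(quote(unicodedata.normalize("NFC", nfd)) == quote(nfc))  # True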
This is why it’s important, yet hard to be consistent.
Seems like a bad idea to change things in between; best to leave that problem to the end-user application (possibly with help from the operating system).
Lennie,
I agree that has merit, but when things are left to user space there are things that can’t be taken advantage of, like collated file system indexes. This is the reason listing files using ls or a Linux file explorer in a folder containing lots of files (like WordPress content directories for a large site) is so slow. Anyway, there’s no doubt that leaving things to user space is the easiest thing to do.
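To make the “collation happens in user space” point concrete (a generic sketch, not how ls itself is implemented): the filesystem hands entries back in whatever order its on-disk index uses, and the locale-aware ordering users expect is an extra per-entry pass done by the application.

    import locale

    # Entries roughly as a directory enumeration might return them:
    names = ["zebra.txt", "Österreich.txt", "apple.txt", "orange.txt"]

    print(sorted(names))  # plain code-point order: "Ö" sorts after "z"

    # Locale-aware collation has to be applied by the tool, entry by entry:
    locale.setlocale(locale.LC_COLLATE, "")   # pick up the user's locale, if set
    print(sorted(names, key=locale.strxfrm))  # e.g. apple, orange, Österreich, zebra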