The Document Foundation, which develops LibreOffice, is mad at Microsoft for the levels of complexity in the Microsoft 365 document format. They claim Microsoft intentionally makes this format’s XML schema as complex and obtuse as possible to lock users into the Microsoft Office ecosystem.
This artificial complexity is characterised by a deeply nested tag structure with excessive abstraction, dozens or even hundreds of optional or overloaded elements, non-intuitive naming conventions, the widespread use of extension points and wildcards, the multiple import of namespaces and type hierarchies, and sparse or cryptic documentation.
In the case of the Microsoft 365 document format, the only characteristic not present is sparse or cryptic documentation, given that we are talking about a set of documents totalling over 8,000 pages. All the other characteristics are present to a greater or lesser extent, making life almost impossible for a developer trying to implement the schema.
↫ Italo Vignoli
I feel like this was widely known already, since I distinctly remember the discussions around the standardisation process for the Office Open XML file formats. Then, too, it was claimed that Microsoft’s then-new XML file formats were far more complex and obtuse than the existing, already standardised OpenDocument file formats, and that there was no need to push Microsoft’s new file formats through the process.
These days, you might wonder how relevant all of this still is, but considering vast swaths of the private, corporate, government, and academic world still run on Microsoft Office and its default file formats, it’s definitely still a hugely relevant matter. As an office suite, you are basically required to support Office Open XML, and if Microsoft is making that more complex and obtuse on purpose, that’s a form of monopoly abuse that should be addressed.
> I feel like this was widely known already, since I distinctly remember the discussions around the standardisation process for the Office Open XML file formats.
Define “widely”.
Among the people who follow such things? Sure, but one can hardly claim that such would constitute a sufficient portion of even the technically inclined general populace to be considered as “widely known,” as the expression is normally used.
Office Open XML was (and is) a successful derailment program by Microsoft to hobble the then newly minted OpenDocument standard: 8,000 pages of vague and convoluted document descriptions. At the time of “standardization”, a lot of the Office Open XML standard came down to “do it like Office version x”. This “format” was never about standardization and openness. It was created to strengthen “the moat” around MS Office and, by extension, Windows. It created an almost impossible task for competing office suites to implement this byzantine format, and it gave MS the opportunity to tell governments and organizations that they make use of an open standard for their documents. On paper, fully true. In practice, anything but.
Never mind the ballot stuffing MS perpetrated to get an ISO approval…
Thing is, I’m guessing they basically structured OOXML to be as similar to the predecessor binary formats as possible, so that they could reuse code and easily convert from one to the other. The logic being, supporting ODF as default would put an unnecessary burden on Microsoft, since existing MS Office features would be supported in a less than perfect manner. And since they were the 800 lb gorilla with the much larger user base, from their perspective, they should be the ones to dictate the format that’s best suitable for their needs. Ultimately, you may not like it, but considering that the ODF formats were in all likelihood similarly based on the old StarOffice formats, it seems fair enough.
At the end of the day, it’s not much use belly-aching – the only thing that can turn the tide is individual (and institutional) action. If you want to further the cause of ODF, then insist on using and sending ODF formats everywhere and see how it goes. Microsoft has actually improved its ODF support quite a lot in the last few years, to the point where Office 2024 actually supported ODF 1.4 before LibreOffice did. And in case the recipient is still using an older version of Office, maybe they can consider it a gentle nudge for them to try LibreOffice instead.
Forcing a nearly unimplementable “standard” through the ISO approval process by buying off national standard bodies and ballot stuffing at ISO itself is “fair enough”?
It is condoning behavior like that which got us the current situation, where megacorporations can buy laws that favor them and put every one of us mere mortals at a severe disadvantage.
I’m not condoning any of that, I am merely talking about technical merits compared to ODF. No doubt OOXML may seem convoluted, but “unimplementable” is objectively not true.
I’d say it’s de facto unimplementable in the same way that Windows spent most of its life as a de facto monopoly.
Seems to me it’s a “de facto” standard more than anything else, considering all the major office apps out there have managed to implement it to a more or less decent degree of fidelity. The exact same thing can be said about ODF, by the way: the only real measure of fidelity is comparing a given app’s implementation to that of LibreOffice.
“Do it like” (proprietary) “Office version x” makes it unimplementable in full. Decompiling Office after breaking the shrink wrap is a breach of contract, and any resulting code that implements it the way “Office version x” does is tainted code.
OOXML results in near full support by competing Office Suites, but never near enough. It’s the paper cuts on other suites than MS Office that keep people tied to MS.
Ok, but by the same merit, I can’t think of any office apps, open source or not, that implement ODF so as to be fully cross-compatible with LibreOffice without any paper cuts. Seems to me to be in the nature of document format specifications.
This reminds me of the IT security certifications (Cyber Essentials, ISO xxxxx, whatever).
The main classic vendors will offer easy-to-comply controls to hide all the complexity, the major “certified certification auditors” expect such controls and the institutions preparing your organization to adhere to the standards all will refer to how the major vendors do things.
In theory, nothing stops you from being fully compliant with fully open source solutions, but you would be in for a bag of hurt trying to prove that your custom solution does the same thing as whatever checkbox does in AWS or Azure.
The ISO documents are, in theory, available, but usually behind paywalls. It’s a classic case of crosswanking, not unlike we see between government regulators and the “regulated” industries/institutions.
It makes sense when you realize the truth: This is not a document format.
This is just serialized data where they replaced the old binary serializer with an xml based one.
If it were a document format, I would expect to be able to edit parts of it in isolation. But I have been in the situation where I needed to update a few parts of an Excel file programmatically, using features that our Excel libraries did not support, so I had the pleasure of having to read the docs and try to work with the unzipped contents.
I quickly found out that simple things, like the exact order of elements in the XML structure, were super important, even inside container elements that just hold a long list of possible children, each allowed at most once. If you needed to insert one, you had to find exactly the right place, even though you had no idea which of the other possible elements were already there. So basically, you needed a list of all the elements that must come before the one you wanted to insert.
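To make the ordering problem concrete, here is a minimal Python sketch of that “list of everything that must come before” approach. It is an illustration, not a complete implementation: the helper name `insert_in_schema_order` is hypothetical, and `CHILD_ORDER` is an abbreviated, assumed subset of the child sequence for a worksheet element that should be checked against the actual schema before use.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

# Assumption: abbreviated, illustrative child order for <worksheet>.
# The real sequence in the schema is much longer.
CHILD_ORDER = ["sheetPr", "dimension", "sheetViews", "sheetFormatPr", "cols",
               "sheetData", "mergeCells", "hyperlinks", "pageMargins"]


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def insert_in_schema_order(parent, child, order=CHILD_ORDER):
    """Insert child after the last sibling that the order list places at or before it."""
    rank = {name: i for i, name in enumerate(order)}
    new_rank = rank[local_name(child.tag)]  # KeyError if the element is not in the list
    position = 0
    for i, existing in enumerate(parent):
        existing_rank = rank.get(local_name(existing.tag))
        # Siblings missing from the order list are skipped, not reordered.
        if existing_rank is not None and existing_rank <= new_rank:
            position = i + 1
    parent.insert(position, child)
    return position


sheet = ET.fromstring(f'<worksheet xmlns="{MAIN}"><sheetData/><pageMargins/></worksheet>')
insert_in_schema_order(sheet, ET.Element(f"{{{MAIN}}}mergeCells"))
print([local_name(c.tag) for c in sheet])  # ['sheetData', 'mergeCells', 'pageMargins']
```

The point of the sketch is that the caller cannot simply append: even when only two siblings are present, the new element has to be placed by consulting the full ordering.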
But what made it click for me that this is just a serialization format was how strings in an Excel sheet are saved. The sheet’s XML holds only a reference to a shared string, so unlike numbers, strings are not saved with the sheet. Then there is a shared strings XML file that has all your different strings from all your sheets in one long list. You would think the reference would be to an id or something, but no, it is an “array” index. So if you add or remove a string anywhere but the end of that list, you need to rewrite all sheets, because their shared string references shift.
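The knock-on effect of those index-based references can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the helper name `remove_shared_string` is hypothetical, the XML fragments are hand-written stand-ins for the real parts of a workbook, and bookkeeping attributes on the string table are not updated.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN}


def remove_shared_string(sst, sheets, index):
    """Drop entry `index` from the string table and renumber every cell that points past it."""
    # Cells marked t="s" store an index into the table in their <v> child.
    refs = [cell.find("m:v", NS)
            for sheet in sheets
            for cell in sheet.iterfind(".//m:c[@t='s']", NS)]
    if any(int(v.text) == index for v in refs):
        raise ValueError(f"shared string {index} is still referenced")
    sst.remove(sst.findall("m:si", NS)[index])
    touched = 0
    for v in refs:
        if int(v.text) > index:
            v.text = str(int(v.text) - 1)  # every later index shifts down by one
            touched += 1
    return touched


sst = ET.fromstring(
    f'<sst xmlns="{MAIN}"><si><t>alpha</t></si><si><t>beta</t></si><si><t>gamma</t></si></sst>')
sheet1 = ET.fromstring(
    f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">'
    f'<c r="A1" t="s"><v>0</v></c><c r="B1"><v>2</v></c></row>'
    f'<row r="2"><c r="A2" t="s"><v>2</v></c></row></sheetData></worksheet>')
sheet2 = ET.fromstring(
    f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">'
    f'<c r="A1" t="s"><v>2</v></c></row></sheetData></worksheet>')

print(remove_shared_string(sst, [sheet1, sheet2], 1))  # 2
```

Removing one unused string from the middle of the table touches a cell in each of the two sheets, while the numeric cell `B1` (which has no `t="s"` marker) is left alone. Appending to the end of the table would not disturb existing references, which is why only insertions and removals elsewhere force the rewrite.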