Microsoft Office, like many companies in recent months, has slyly turned on an “opt-out” feature that scrapes your Word and Excel documents to train its internal AI systems. This setting is turned on by default, and you have to manually uncheck a box in order to opt out.
If you are a writer who uses MS Word to write any proprietary content (blog posts, novels, or any work you intend to protect with copyright and/or sell), you’re going to want to turn this feature off immediately.
↫ Dr. Casey Lawrence
The author of this article, Dr. Casey Lawrence, mentions the opt-out checkbox is hard to find, and they aren’t kidding. On Windows, here’s the full snaking path you have to take through Word’s settings to get to the checkbox: File > Options > Trust Center > Trust Center Settings > Privacy Options > Privacy Settings > Optional Connected Experiences > Uncheck box: “Turn on optional connected experiences”. That is absolutely bananas. No normal person is ever going to find this checkbox.
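If you’d rather not trust the checkbox (or you manage several machines), the same behaviour can apparently also be set through Office’s per-user privacy policy keys. Here’s a minimal sketch in Python; the registry path and value name are my best reading of the documented Office 2016+ privacy policy settings, so treat them as assumptions and verify them against Microsoft’s own documentation first:

```python
# Hedged sketch (Windows only): write the per-user Office privacy policy
# that governs "optional connected experiences". The key path and value
# name are assumptions based on the documented Office 2016+ privacy
# policy settings; verify against Microsoft's documentation before use.
import winreg

KEY_PATH = r"Software\Policies\Microsoft\Office\16.0\Common\Privacy"
VALUE_NAME = "ControllerConnectedServicesEnabled"  # assumed value name
DISABLED = 2  # policy convention: 1 = allowed, 2 = not allowed

def disable_optional_connected_experiences() -> None:
    """Create/open the policy key and mark optional connected experiences as disabled."""
    with winreg.CreateKeyEx(winreg.HKEY_CURRENT_USER, KEY_PATH, 0,
                            winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, DISABLED)

if __name__ == "__main__":
    disable_optional_connected_experiences()
    print("Policy written; restart the Office apps for it to take effect.")
```

The Office ADMX/Group Policy templates expose what appears to be the same setting under their Privacy/Trust Center section, which is probably the saner route for an organisation.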
Anyway, remember how the “AI” believers kept saying “hey, it’s on the internet so scraping your stuff and violating your copyright is totally legal you guys!”? Well, what about when you’re using Word, installed on your own PC, to write private documents, containing, say, sensitive health information? Or detailed plans about your company’s competitor to Azure or Microsoft Office? Or correspondence with lawyers about an antirust lawsuit against Microsoft? Or a report on Microsoft’s illegal activity you’re trying to report as a whistleblower? Is that stuff fair game for the gobbledygook generators too?
This “AI” nonsense has to stop. How is any of this even remotely legal?
It’s not just Word, actually, but all of Office (PowerPoint, for example, sends everything you type so its Designer feature can offer design suggestions for your slides based on their content). Just last week I made a presentation about AI for a network of non-profit orgs and included a slide about disabling this. Oh, and this checkbox _will_ check itself back on. It’s done that to me more than once.
Microsoft has not been very forthcoming: https://answers.microsoft.com/en-us/msoffice/forum/all/does-microsoft-word-scan-my-documents-to-train/88ed5ccb-7890-40eb-b55f-d59e5d0b43af
IMHO people are upset over the wrong facet of this. AI is being treated as a scapegoat for the bigger issue of corporate entitlement to private data. Yes, it’s wrong that they’re using private data without permission to train AI, but this problem has roots that long predate AI. It’s merely the latest incarnation of a long trend of corporations screwing around with our privacy for their own commercial gain: Google, Facebook, et al. have notoriously hoarded private data and spied on users to sell ads, for example. So I’d agree that violating user privacy is harmful, but it’s harmful not because of AI; it’s because it’s wrong for corporations to invade people’s privacy at all. It’s not a problem with AI in and of itself. We should be protesting the unauthorized corporate collection of private data on the basis that it’s not their data; it shouldn’t be relevant whether they’re using the data for AI or for any other commercial reason.
I partially agree with you.
Yet there is a demonstrable link between the emergence of AI and the acceleration of these power grabs over user data. Every big tech corp is now in a race to present the first acceptable AGI to the world (mostly, to their shareholders), so they can capture the perceived infinite $ to be made once it can replace almost anyone. The scenario might not be realistic, but that’s moot: it’s what their shareholders expect of them, so they go all in.
Thing is, they thought the vast content repository that is the Internet would be a good enough source at first, but it ain’t. So now they’re looking for ways to get more, with greedy moves like the one discussed here.
So yeah, you’re right that AI is not the only reason for big corps to suck up users’ data, but whereas before the idea was to know users better in order to sell them more stuff, now it’s the output of users’ brains that’s at stake, in an attempt to imitate their abilities. That’s definitely a big change, and it’s pretty much tied to what AI is.
IMHO it’s a huge leap from using publicly available data that any human would be entitled to read in the open to using private data that was never published. I’ve always been concerned about corporations abusing their control over our data. For one, their centralized data silos can enable a dramatic escalation into a police state (i.e. Apple or Google scanning private documents, recording GPS locations without permission, Windows/Office taking a peek at private documents, etc.). These invasions of privacy are/should be alarming in and of themselves, even without AI.
So let me ask this more directly: Is there anyone who thinks corporations should be entitled to use private data for their own purposes as long as it’s not AI?
Everyone seems to be making a fuss about AI, but to me these invasions of privacy are appalling for any purpose.
I don’t necessarily agree with that, but it shouldn’t matter why they want to invade user privacy; it shouldn’t be allowed for any reason. To me it doesn’t make sense to preach respect for privacy, but only in the context of AI.
That’s a bigger debate over whether humans should be pursuing AGI…and I do sympathize with those concerns.
The smoke and mirrors that make up LLM-based AI platforms in particular are driven by that corporate entanglement. They are not separable.
CaptainN-,
Why? Obviously corporations train LLMs, but it does not follow that private data needs to be invaded to do it. I can concede there is a debate over using public data (i.e. humans are allowed to learn from public data; should AI be allowed to?). But using private data crosses a very different line, and I don’t think even those of us who are proponents of AI could defend that.
No other entity can afford to throw billions of dollars down the drain on scamware like this. And they all need more information than has currently been produced by all of humanity in all of history to make the magic trick look like it has promise, so they are DESPERATE for more data. The only entities positioned to do that, in terms of economic power, political power, institutional power, all the kinds of structural power in our socio-economic system, are these corporations. In short: we have a large-scale, data-driven magic trick, and a centralized economy with power concentrated in the hands of a few very wealthy technical elites, trying to pump up a new extractive technology. If you take out any of the legs of that stool, it all falls down. They are entangled.
CaptainN-,
It’s expensive, sure, but I don’t agree that ChatGPT-like LLMs need more private data; what they need is less noise. Private documents don’t reduce the noise.
I disagree. This data is not likely for knowledge-base building; there’s too much noise and no quality assurance. It’s more likely they are training AI for a mundane purpose: optimizing Word workflows. By seeing the kinds of edits millions of users make at scale, AI can become more proficient at suggesting those edits before humans make them (welcome back, Clippy!).
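Just to illustrate that speculation (and only that; nothing here is confirmed Microsoft behaviour), an edit-suggestion model would need training data shaped roughly like pairs of what users typed and what they changed it to:

```python
# Purely hypothetical sketch, not confirmed Microsoft behavior: the kind
# of training signal an "edit suggestion" model would need, i.e. pairs of
# (text as typed, text after the user's own revision) collected at scale.
from dataclasses import dataclass

@dataclass
class EditExample:
    before: str  # text as the user originally typed it
    after: str   # text after the user's own edit

# Imaginary examples standing in for "millions of users' edits at scale".
examples = [
    EditExample("Their going to the meating at 3pm",
                "They're going to the meeting at 3 p.m."),
    EditExample("Please find attached the the report",
                "Please find the attached report"),
]

# A real system would fine-tune a model on pairs like these; this only
# shows the shape of the dataset such a feature would be harvesting.
for ex in examples:
    print(f"{ex.before!r} -> {ex.after!r}")
```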
Again though, it’s enraging that companies feel entitled to access our data without consent; that’s a bigger problem than the AI itself.
>”This “AI” nonsense has to stop. How is any of this even remotely legal?”
LOL, you should welcome your new AI overlords. Just hope they give you the tasty insects to eat instead of the disgusting slimy ones.
Almost like people have been through 2020 and still have no clue how the world actually works lol.
Almost indeed.
It’s legal because they make all the laws. That’s not even complicated.
Your future legal code. Written for you by AI. What a glorious future it will be.
This can NOT be legal. Word is used by many organisations, and some of the things written in it are strictly confidential.
The ironic part is… I don’t suffer from the data grab described in the article because I don’t use Office with a microsoft account.
Mandatory meme: that’s it, we’re switching to GNU/Linux, or better, to LibreOffice. As for the whole AI thing, it is indeed becoming creepy, and in my opinion it will damage Windows’ reputation. End users buy the Windows OS and then realize it’s infested with Recall, the Office suite is tracking what they write and sending it to Microsoft’s “AI”, and then they open up Notepad and there it is again, a Clippy named AI. The cloud was somewhere else, still somewhat separate from Windows, so that could be tolerated; this whole AI era has consumed Windows as a whole. IMHO it will start to harm Windows, and some people will start to find it too creepy to use.
Thom, what copyright does the act of training a neural net with copyrighted data violate? Are you claiming that the weights on the nodes of an MLP are somehow copies of data? Now, if the MLP spits out verbatim copies, then this is copyright infringement and can be pursued on a case-by-case basis, but the weights themselves aren’t copyright infringement.
I remember when you were in favor of fair use and transient copies, btw, so I wonder: do you have any long-standing principles?
It’s a violation of people’s privacy (we are not talking about published data here, so fair use doesn’t apply), but that’s the magic of TOSes: nobody reads them, yet they are legally binding. I bet there is something in the TOS that says they can opt you into this without explicitly asking. So it shouldn’t be fair game, but thanks to the magic of TOSes (aka shrink-wrap contracts), it unfortunately is.
But again, these two (training AI on published vs private data) are two completely different things, why can’t you understand this?
I don’t know if you are playing devil’s advocate here.
Leaving aside how you’re redefining the problem to make it look like this new usage (private, unpublished data used to train AIs) has always been acceptable and presumably “fair” (IANAL, but I’m curious to read a lawyer’s analysis of how the incriminated feature would indeed constitute fair use)…
I’d love to have a look at how MS argued in its PIA that the risk of data leakage, as low probability as it might be, was acceptable without user consent (in GDPR terms, necessarily “opt-in”). I’m pretty sure any supervisory authority (even the Irish one) would conclude this has to stop being opt-out to be legal.
Nope, I am pointing out that TOSes are legally enforceable despite nobody reading them. It wasn’t a problem until now (it’s not like they could get you for anything), but now that they have started to mine your private data, it has become a problem.
Also, we don’t know if Microsoft does this in GDPR countries, keep in mind that not every country in the world is an EU member state. In non-GDPR countries, they are free to TOS away your privacy rights.
Well, I live in one, use a localised version of Office at work, with an M365 account hosted in an availability zone based in the EU, and the thing is indeed activated by default.
worsehappens,
Did you actually catch office in the act of uploading your information?
The reason I ask is that it’s possible the setting has to be enabled AND you need a non-EU IP address for Microsoft to collect private data.
I don’t have any insider info here; it’s just a guess that they might use IP geolocation to enforce EU compliance, similar to what Apple is doing on iOS to comply with EU court orders.
Well, I certainly didn’t take the time to set up a MITM proxy to certify it, but the fact that said “connected experiences” do run and produce suggestions when the checkbox is activated is enough for me.
worsehappens,
Ok. I’d like to have more detailed information, but I understand that you wouldn’t have it.
Ideally we’d know exactly what’s being transferred because details matter. Not only is it important to know exactly what’s being sent for privacy reasons, but also to better understand potential attack surfaces. But when it comes to proprietary software, we don’t often get to know the details.
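For anyone who does want to look, a minimal mitmproxy addon along these lines would at least log which hosts Office talks to and how much it uploads. The hostnames below are guesses on my part, not a confirmed list of “connected experiences” endpoints:

```python
# log_office_uploads.py: a minimal mitmproxy addon sketch.
# Run with:  mitmdump -s log_office_uploads.py
# Assumes the machine's traffic is routed through the proxy and the
# mitmproxy CA certificate is trusted. The host filter below is a guess,
# not a confirmed list of "connected experiences" endpoints, and Office
# may pin certificates for some services, so this won't see everything.
from mitmproxy import http

WATCHED = ("office.com", "officeapps.live.com", "microsoft.com")  # assumed hosts

def request(flow: http.HTTPFlow) -> None:
    host = flow.request.pretty_host
    if any(s in host for s in WATCHED) and flow.request.content:
        print(f"{flow.request.method} {host}{flow.request.path} "
              f"({len(flow.request.content)} bytes of request body)")
```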
kurkosdr,
+1, exactly.
Unless we were to create new laws. An LLM that summarizes content (and does not reproduce work verbatim as you indicated) does not violate traditional copyrights.
I would be more than happy for courts to invalidate TOS and shrink-wrap agreements that take away consumer privacy rights. European courts seem a lot more progressive on this front, but until that actually happens in the US, our courts will go on promoting corporate interests over public interests, enforcing these unfair, one-sided terms even when they’re BS 🙁
>”Unless we were to create new laws. An LLM that summarizes content (and does not reproduce work verbatim as you indicated) does not violate traditional copyrights.”
Did you ever notice that most politicians who write the new laws are lawyers themselves? At least in America, if the lawyers who are trial lawyers write these laws correctly, they will be opening up a multi-trillion dollar litigation goldmine for themselves. For how long do you think they will be able to withstand the allure of re-writing the laws to benefit their own trial lawyer practices?
andryprough,
Sure. The self-serving corruption is rampant in government and I don’t see that getting better. Not that government hasn’t been through bumpy roads in the past, but future generations may see this period of history as an explosion of self-serving corruption because the “checks and balances” that have previously acted as guardrails against abuse have largely been eroded.
I’d argue the weights can be considered to be a form of lossy compression of the input data, in combination with the full data set. So if an MP3 on disk can be copyright infringing even without converting it back to audio, then so can weights derived from copyrighted data without converting it back to the original text or a paraphrasing.
The1stImmortal,
“Lossy compression of something specific” does not equal “generalization”.
We should all be in agreement that reproduction of copyrighted work needs to be discouraged. And admittedly some LLMs have notoriously been found to exactly reproduce training data…frankly these occurrences need to be subject to copyright infringement charges just like for anyone else. However the duplication of training data is usually not intentional. It’s a sign that the NN is over-fitting the training data rather than generalizing it correctly.
I’m really not trying to provide AI with a weaselly excuse for getting away with copyright infringement, but if the NN is successfully generalized and does not reproduce original works, then it does not infringe copyrights on traditional copyright grounds.
Now, say you wanted to expand copyright to include generalization above and beyond basic reproduction; we would need to tread carefully with that idea. It would represent a dramatic increase in the scope of copyright infringement. There would be implications for humans, who have long taken the right to generalize other people’s work for granted. Even websites like OSNews could be in legal jeopardy. If generalizations are considered infringement too, then the lawyers are going to have a field day with all of the new ground brought into the fold by expanding the scope of copyright.
I’m saying the statistical model used to “generalize” is itself a form of lossy compression of a collection of data, capable of reproducing an approximation of the input data, to a varying degree of fidelity depending on the effective compression level, on demand with the right settings and context. That’s necessarily true of the process.
Do the same with audio: create a highly compressed archive capable of reproducing approximations of the input data through statistical models of that data (even if doing so across multiple tracks) and you have the same problem with audio.
It just also happens that you can get it to spit out other things by exploiting the commonalities between pieces of input data. But it’s still a compressed, lossy archive of the input data.
The1stImmortal,
You really have to stretch the meaning of “lossy compression” far beyond its normal meaning for a re-written generalization of a work to be considered a compression of it. 🙂 I’ll disagree with you that it’s lossy compression in any traditional sense, but I believe I understand your point and we can still discuss it.
My point was that if we allow copyright to be expanded to include generalizations, which are not traditional grounds for copyright infringement, then this new interpretation would represent a significant expansion of copyright. This has a dramatic impact on human authorship too. It would no longer be sufficient just to worry about direct copying; authors would now have to be careful not to generalize someone else’s work. How is that even supposed to work in a copyright infringement case?
You may be visualizing these generalization copyrights only in the context of AI, but humans do the same thing. This is what I mean by treading carefully.
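To put rough numbers on the “lossy compression” framing being debated above, here’s a quick back-of-the-envelope sketch; every figure is an illustrative assumption rather than a measurement of any particular model:

```python
# Rough back-of-the-envelope numbers for the "weights as lossy compression"
# framing. Every figure below is an illustrative assumption, not a
# measurement of any particular model.
params = 7e9          # assumed model size: 7 billion parameters
bytes_per_param = 2   # fp16 weights
weight_bytes = params * bytes_per_param

train_tokens = 2e12   # assumed training corpus: 2 trillion tokens
bytes_per_token = 4   # very rough average for raw text
corpus_bytes = train_tokens * bytes_per_token

print(f"weights: {weight_bytes / 1e9:.0f} GB")                  # ~14 GB
print(f"corpus:  {corpus_bytes / 1e12:.0f} TB")                 # ~8 TB
print(f"ratio:   roughly {corpus_bytes / weight_bytes:.0f}:1")  # ~571:1

# Even granting the analogy, the "compression" is so extreme that an
# individual document can only survive with real fidelity if the model
# over-fits (memorizes) it, which is exactly the disputed case above.
```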
macOS is also infested with AI nonsense, and it only promises to get worse.
Local AI is not bad in the same way that indexing of files isn’t. But if the AI phones back home to upload any data, it is bad.
kurkosdr,
I think this distinction is very important too. When we’re talking about locally running AI (like llama.cpp and Stable Diffusion), it doesn’t need to send data up to “the cloud”, so privacy doesn’t become an issue. Not only does local execution alleviate privacy issues and data leaks, it also addresses other factors like planned obsolescence and the ability to access your own data & apps long term (i.e. data & software preservation).
A game that uses AI & LLMs locally isn’t a problem in the way it would be if the same functions were implemented “in the cloud” (i.e. SimCity-fication). A word processor with local AI rewrite features isn’t a privacy problem. A graphics package with generative AI capabilities (like background fill) can offer impressive benefits, but making these features dependent on remote vendor services causes far more harm, in terms of both privacy and dependency, than the same AI features running locally.
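For illustration, here’s the kind of local inference I mean, sketched with the llama-cpp-python bindings (assuming the package is installed and a GGUF model file has already been downloaded; the path is a placeholder, not a real file):

```python
# Minimal local-inference sketch using the llama-cpp-python bindings.
# Assumes `pip install llama-cpp-python` and a GGUF model file already
# downloaded locally; the path below is a placeholder, not a real file.
from llama_cpp import Llama

llm = Llama(model_path="models/some-local-model.gguf", n_ctx=2048)

result = llm(
    "Rewrite this sentence more concisely: "
    "The meeting has been moved to a later time on Thursday.",
    max_tokens=64,
)
print(result["choices"][0]["text"])
# Everything above runs on the local machine; no document text ever
# leaves for a vendor's cloud service.
```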
Unfortunately, tech companies these days see these dependencies as a desirable goal. Given two solutions, one that runs locally and one that traps users inside the vendor’s cloud, the latter is preferred for reasons of control. Even good & welcome features can be ruined by this.
An example is supporting Python in Excel, a very welcome improvement over VBA… except that Microsoft used it as an opportunity to tether Excel to Azure cloud services. There was no reason for Python to be implemented as a cloud service instead of as a local Excel scripting language, but the Azure dependency was an internal development goal, and so this feature got corrupted by the cloud agenda.
https://www.osnews.com/story/136767/microsoft-brings-python-to-excel/
https://www.reddit.com/r/excel/comments/16tohx5/we_developed_python_in_excel_one_of_microsoft/
Much of today’s technology is being corrupted in a similar fashion and AI is no exception 🙁
>antirust lawsuit
cniles are getting desperate
Thanks. Turned off.
Hopefully I will remember to check some time later whether it has turned itself back on “magically”, as @worsehappens suggests…
I recall reading somewhere that in Soviet Russia, the secret police put up posters saying: “With an iron fist, we will drag you to happiness.”
What was old is new again, sigh…
I mean, yes, that is an insanely difficult checkbox to find. No question. The UX sucks. But motivations? Microsoft, doing Microsoft things, would have put it exactly where it is even if it had not been controlling something they don’t want people to see.
The only thing missing from the sequence to disable that setting, is a sign above the checkbox saying “Beware of the leopard”…
(https://www.goodreads.com/quotes/40705-but-the-plans-were-on-display-on-display-i-eventually)
Not defending AI stupidity or Microsoft, but I do have to point out that this feature does nothing of the sort nor does Microsoft use Word (online or desktop app) for LLM training:
https://www.howtogeek.com/is-microsoft-using-your-word-documents-to-train-ai/
And if you don’t trust a simple article where they talked directly to MS, you can read the MS terms of service. It spells out exactly what they do and do not use your content for, including how Copilot/AI uses it. None of this is obfuscated. None of this is difficult to find or requires a legal background. Here’s the US TOS; I presume other countries’ versions are similar (or better) in clarity.
https://www.microsoft.com/en-US/servicesagreement
Privacy is a very important issue, and the chipping away at it by corporations is a real concern. Getting out the torches and pitchforks over poorly researched (or maybe even purposefully misleading, I don’t know…) articles DOES NOT HELP SOLVE THE PROBLEM. Stop spreading misinformation. You sure as shit can disagree with the TOS and Microsoft and AI. Bog knows I do. But be smarter than the typical user and don’t spread misinformation.
Computer nerds are some of the dumbest smart people I know. smh.
drugajin,
I do appreciate the links providing more information and updates. But I feel the author focused on the wrong questions…
I don’t care about “LLM training”; I care that my private data isn’t being sent to MS without my permission, and unfortunately this all-important question was neither asked nor answered. Does any private data get sent to Microsoft’s cloud services for any reason at all? If yes, then Microsoft is still guilty IMHO and the public should still be enraged!
I’ll grant you that things can go off the rails when we’re forced to speculate. But not for nothing, it is Microsoft’s fault alone that they’ve failed to document what they’re doing with user data. Even now it’s not clear.
Does user data get sent to Microsoft, though? Because that is what I’m most concerned about when it comes to privacy. My privacy concern does not go away just because they’re not using the data to train AI. I don’t like that people are so focused on the AI part and not the privacy part.
Did you actually read the TOS? I’m skeptical that’s the case. If you did, maybe read it again and take it in, because it literally answers the question of what is and isn’t sent to them when you use their services, and what they use it for. In pretty clear language.
If you don’t trust that MS follows their own TOS or is lying about it, that’s fine and even understandable considering it’s Microsoft, but that is a whole different conversation.
Continuing to spread misinformation about things like this makes it harder to get people to take privacy concerns seriously and allows bad actors/corporations to continue acting maliciously.
drugajin,
It’s 17k+ words long. I doubt any of us here can truly claim to understand the whole thing without a lawyer. Anyway, I did search for this specific issue and came up empty-handed with the keywords I tried.
I still have no specific information on what Microsoft’s “connected experiences” actually does with private user data without making my own assumptions. The “AI Services” section might be relevant… maybe… but that’s just a guess, since the connection isn’t explicit.
Furthermore, even after reading that section, it doesn’t really provide enough clarity on privacy matters. People who are legally responsible for patient privacy and sensitive financial information may have real concerns with these terms, and the TOS doesn’t make it at all clear whether a new version of MS Word could be uploading this information and making them vulnerable.
That’s not even my problem… companies need to be more explicit about taking user data. Too often they try to hide behind a vague TOS. Whether I like it or not, a TOS can be a legal way for companies like Microsoft to cover their asses. But whether you like it or not, it doesn’t make their actions ethical; users are right to call out corporate behaviors and privacy invasions they find objectionable, regardless of the TOS.
So do you have a clear and thorough answer to what they are doing with “connected experiences”? I haven’t seen it. If you do then please link/quote it here because it’s relevant to this article and I think everyone here wants to know.
>>So do you have a clear and thorough answer to what they are doing with “connected experiences”? I haven’t seen it. If you do then please link/quote it here because it’s relevant to this article and I think everyone here wants to know.
No. Do your own homework. None of this is difficult to figure out, nor would I consider it onerous for anyone interested in a) using the service and b) understanding how their data is used by said service. You’re making this sound like an impossible task that is beyond the capability of someone with basic tech, reading, and comprehension skills. It took me all of 10 minutes of fucking around on DDG and reading links/docs/TOSes. You’re at least as smart as I am, probably smarter, so it should take you a comparable amount of time. I’m not going to lie and say it was a fun use of my time, but it’s also not difficult to find and understand in the slightest.
And besides all that, I posit that if anyone here is whargarbling about trash like this, they should put their money where their mouth is and move to privacy-respecting software, which *gasp* may require relearning or recreating workflows. If privacy is that big of a concern (AND IT SHOULD BE!), then do something about it and stop being lazy. Otherwise this is just more pissing and moaning about Micro$haft (and others like Google) with no action, and more spreading of misinformation.
drugajin,
So provide a link. I appreciated the first link, but these responses haven’t provided any new information.
I think users have good reason to be concerned when their software is potentially uploading their data to a service without explicit permission.