In the world of open source, relicensing is notoriously difficult. It usually requires the unanimous consent of every person who has ever contributed a line of code, a feat nearly impossible for legacy projects. chardet, a Python character encoding detector used by requests and many others, has sat in that tension for years: as a port of Mozilla’s C++ code it was bound to the LGPL, making it a gray area for corporate users and a headache for its most famous consumer.
Recently the maintainers used Claude Code to rewrite the whole codebase and release v7.0.0, relicensing from LGPL to MIT in the process. The original author, a2mark, saw this as a potential GPL violation.
↫ Tuan-Anh Tran
Everything about this feels like a license violation, and in general a really shit thing to do. At the same time, though, the actual legal situation, what lawyers and judges care about, is entirely unsettled and incredibly unclear. I’ve been reading a ton of takes on what happened here, and it seems nobody has any conclusive answers, with seemingly valid arguments on both sides.
Intuitively, this feels deeply and wholly wrong. This is the license-washing “AI” seems to be designed for, so that proprietary vendors can take code under copyleft licenses, feed it into their “AI” model, and tell it to regurgitate something that looks just different enough so a new, different license can be applied. Tim takes Jim’s homework. How many individual words does Tim need to change – without adding anything to Jim’s work – before it’s no longer plagiarism?
I would argue that no matter how many synonyms and slight sentence structure changes Tim employs, it’s still a plagiarised work.
However, what it feels like to me is entirely irrelevant when laws are involved, and even those laws are effectively irrelevant when so much money is riding on the answers to questions like these. The companies who desperately want this to be possible and legal are so wealthy, so powerful, and have sucked up to the US government so hard that whatever they say might very well just become law.
“AI” is the single-greatest coordinated attack on open source in history, and the open source world would do well to realise that.

If a tool makes something trivial to write, there’s no longer any significant effort to protect from being stolen. The books containing logarithm tables were very valuable human work in the past, but a trivial computer program can write one in milliseconds today.
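(Literally so; a minimal sketch that tabulates what once filled a book:)

```python
import math

# The working contents of a classic book of logarithm tables, in two lines:
# log10 of every value from 1.000 to 9.999 in steps of 0.001.
table = {x / 1000: round(math.log10(x / 1000), 5) for x in range(1000, 10000)}
print(table[2.718])  # 0.43425
```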
And surely there’s nothing creative in the code produced by LLMs. If any creativity does slip in by means of verbatim copying, it’s easy to check for.
I’m a fan of knowledge being unshackled, though fear how it will be misused. It’s a loss for proprietary code and restrictive licenses, but not for code, and means of coding, being democratized. Maybe open source wasn’t the final destination, just an imperfect step along the way. Maybe we won’t keep needing to fight over contracts.
For better or worse, ideas can still be patented, and patents will still apply.
Thom Holwerda,
I do not agree with this. AI is merely a tool. Of course tools can be used abusively but AI tools themselves are neither pro nor anti-FOSS.
When you read open source licenses, including the GPL, there is nothing in them that prohibits their use for AI. The GPL is not just implicitly compatible with training AI; it goes so far as to explicitly reject all prohibitions on how the code can be used downstream. The only requirement is that derivative works also be GPL licensed. So the argument should not be that we cannot train AI on open source, because this violates both the text and spirit of FOSS. IMHO a more solid argument would be that derivative works should themselves be FOSS, including AI works. If one truly believes in the virtues of the GPL & FOSS, then this is what supporters should be clamoring for.
The fly in the ointment is that AI works are not eligible for copyright. No copyright => no AI generated open source. All FOSS licenses are predicated on there being a copyright to enforce.
SlothNinja,
This isn’t obvious to me, is there some legally settled case law that I’m not aware of? I’d like to learn more about it, let me know!
Alfman,
There was a recent one about AI generated art.
But… that is bollocks without setting a threshold or a guideline.
Today *all* digital art is basically AI generated:
– The brushes you use in Illustrator
– The photographs you “take”
– The filters, smudges, and effects that you use
All of these use one form or another of AI. Even the “auto correct” of today uses modern versions of BERT (the first proper LLM).
Says WHO? Under what authority or jurisdiction?
The courts:
https://www.reuters.com/legal/government/us-supreme-court-declines-hear-dispute-over-copyrights-ai-generated-material-2026-03-02/
Thank you for the link.
I see “Visual Art” and I see “US Supreme Court” which makes me wonder if/how that would apply to software/code or 3D printouts and/or in Europe/China/Japan/Korea?
A rather obscure judgement on a rather obscure subject in a very obscure country at best 😛
Andreas, the US is very aggressive in making all economic partners have similar copyright protections.
However, if this ruling ends up affecting software, the right to repair movement will have a big stepping stone as the DMCA explicitly relies on copyright. Reverse engineering and hacking would become a free for all again if nothing is copyrightable.
I am well aware of that, though it looks like they overreached quite some time ago and the world is starting to show them the finger. The EU never accepted software patents.
Serafean,
Thanks for linking that!
At least within the limited scope of this case, it seems the judges found that a human using AI does not get to copyright AI generated work. That is interesting in and of itself, although technically this is different from what we’ve been talking about. The judges did not rule on whether the licenses of the originating works used to train an AI applies to the derived work. This is very relevant to whether GPL applies or not.
My own feeling about this is that using AI does not remove copyright in principle. So to the extent that a copyright would have applied without AI, it should still apply with AI. There is no change.
Alfman,
https://www.eff.org/deeplinks/2025/06/two-courts-rule-generative-ai-and-fair-use-one-gets-it-right
The models exist, they have ingested the data. That is done under fair use, so it does “remove” copyright.
Serafean
Thanks for linking it. I don’t have enough time to deeply analyze those cases right now. But from the EFF’s own summary of them the two cases do not come to the same conclusion. It seems like things are still being fought in courts.
I would still say it has less to do with LLMs “removing copyright” and more to do with transformative works not infringing. The original copyright is still 100% in effect and an LLM is still not allowed to infringe on it. The crux of the issue of course is that an LLM copies the ideas and not the expressions. However humans do this too. IIRC parts of “The Da Vinci Code” were ripped off from previous authors.
https://www.cbsnews.com/news/code-author-admits-reworking-parts/
There’s no denying LLMs can generate copied ideas, in fact they do it very well. However copyright was not protecting those ideas in the first place, only specific expressions are protected. And there’s a reason it works this way, copyrights were not intended to create a legal monopoly on ideas, just a legal monopoly on one’s expressions. The point is nuanced, but still very significant.
I may be wrong, but I believe when Thom says “AI” in the context of this discussion, he is speaking of the entire “AI” industry, particularly the big companies pirating and otherwise consuming all content that exists on the Internet.
When you say “AI is merely a tool”, you conveniently leave out that this tool is built almost entirely from stolen content. I know you and I don’t agree on much when it comes to “AI”, but you have to admit that the tool would be nearly useless if it were to be built solely from properly licensed and willingly offered content. The companies behind the “AI” software know this, or they wouldn’t be illegally slurping up every byte of copyrighted content they can.
An “AI” that is built only on public domain fiction, for example, will only regurgitate classical feeling and sounding results. The fact that they are actually trained on almost 100% pirated content means they can produce modern sounding and feeling narrative works for that would-be author who just can’t seem to write well enough to publish their own books without plagiarizing established, living, modern authors. Now they can plagiarize all they want and when someone calls them on it they can say they were “vibe writing” and get away with it.
When it comes to source code licensing, the issues are obvious and no amount of hand-waving can counter the fact that “AI” (i.e. the evil corporations behind the algorithms, and the rampant IP theft used to feed them) is destroying open source licensing right before our eyes.
Morgan,
My point is that when we’re talking about open source code licenses, it already IS properly licensed and willingly offered. There is no copyright issue to train AI because the licenses are expressly permissive of derivative works. I accept that some dislike AI, but for better or worse FOSS licenses like GPL are compatible with AI. On copyright grounds the criticism is not the training of AI, but that derivative works might not be properly licensed, that’s the only copyright issue I see.
I completely understand that some are vehemently against AI even if it follows copyrights. However the fact remains that a lot of FOSS software is being licensed under permissive terms that don’t rule out AI in any legal way. There might need to be new licenses to add AI restrictions.
Fundamentally, the problem is that we’re in the middle of a fight over whether the copyright concept of “derived work” has any meaning in the era of LLMs.
“Field of endeavour” restrictions are forbidden by all three major definitions of this (FSF Four Freedoms, Debian Free Software Guidelines, Open Source Definition) but they shouldn’t be necessary.
Either the LLM is producing derived works (in which case restrictions on AI are unnecessary and we’re seeing massive worldwide copyright infringement) or it’s not, and everything that’s not protected by trademarks or patents is fair game as long as you launder it through an LLM first, because you weren’t delegated the ability to restrict Fair Use in your licenses. (Which would be a very bad situation, since companies like Disney can still sue you for infringement of trademarked characters, and companies like Amazon can still sue you for infringing their software patents, but pretty much anything you do can be copied willy-nilly.)
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
I was focusing on the rights granted by the terms of open source licenses. But yes you are right that copyright law itself has other provisions that might make derived uses permissible without requiring explicit permission. Off the top of my head…
1) fair use rights
2) reproduce facts without permission
3) transformative use
https://www.copyrighted.com/blog/what-is-transformative-use-copyright
LLMs are undeniably transformative, assuming we go by the same standards that have been applied to humans. Assuming training goes well, LLMs transform expressions into generalizations of ideas. Copyrights only deal with the infringement of expressions and not generalizations. Generalizations are allowed even if they are derivative. Throughout history we’ve allowed humans to create transformative works based on other works. Modern LLMs are opening the debate over whether transformative works should still be allowed when the process gets mechanized. If we want to continue allowing humans to generalize works but create new restrictions for machines, then we’d need new laws to make the distinction.
All of these are important points that need to be debated.
An LLM does not change an expression’s copyright status. Either the derived expression infringes on the source expression or it does not. In other words, an unbiased examiner should not know or care whether the derivative work was generated by an LLM or a human when making an objective judgement on whether the work infringes. So the LLM changes nothing here. However, what it does do is make the process of generating new expressions of existing ideas a whole lot easier. Whether we should allow the automation of new derivative expressions warrants a debate for sure, but we should recognize that this lies outside of traditional copyrights.
The specific issue at hand in the article is that the “AI” generated code was used to bypass the terms of the LGPL to relicense the work under MIT, by poisoning the clean room rewrite process[1]. The question of whether that is actually illegal is indeed for the courts to decide (if this goes to court), but it’s undeniable that a world where “AI” can be used in such a way warrants a sense of extreme caution, something that you and others who choose to be blind to the issue will handwave away as “gee, too bad but that’s the way it is now”. Your position is the same as demolishing a building with ICBMs instead of controlled demolition, and saying “oh well, some people died in the fallout, but it’s okay because the building still collapsed and that’s what’s important”. You seem to care more about how quickly we can get to the end goal and fuck the consequences of how we get there, because after all it’s “just a tool”.
1.
I do actually agree with your assessment: it does not matter who/what wrote the new code, but how it was written. “Clean room” justifies relicensing (see Unix –> BSD); a simple translation into another language should not allow for relicensing. This would look reasonable and fair to me.
But now look at the matter at hand, which is encoding detection. In my own experience, Claude does not need any code template to accomplish this, and it was probably easier and more productive to just summarize the use case, the major edge cases, and the final compatibility API/command line parameters.
Clean room as it comes in my book.
On second thought, I correct myself and do see an actual problem of course: even when a ‘clean room’ was intended, there was no guarantee that Claude didn’t ‘cheat’ and look up GitHub repositories and read the actual code from there (which defeats the purpose of a ‘clean room’ approach).
But again, how does a human ‘clean room’ engineer guarantee that he never looked at similar code before (especially when an OS solution was available)?
I spent a lot of time on an SQL parser and formatter (which shows its age). I could rewrite it better from scratch (ignoring resources), but of course I would be influenced by good and bad ideas I have seen before.
Morgan,
Yes, that’s kind of the point. The issue is not that the software was trained on LGPL code; there was never a prohibition on derived works. The reason FOSS licenses exist is to explicitly make derived works permissible. The main thing that “copyleft” licenses care about is that the derived work is properly licensed. Assuming the derived work is not otherwise exempt from copyright law, then there may indeed be a problem: derived works may be improperly licensed according to the terms set out in the FOSS license. So I think we can agree on that.
@Alfman,
I don’t agree with that either, but rather see it as a massive opportunity:
1) People are now enabled to contribute to existing OS projects, solving their particular problems, when they did not have the technical skills or understanding before.
Examples:
– I have been using Claude AI to rewrite the H2 MVStore backend, implementing parallel prefetching and squeezing out 10% additional throughput. It was actually the AI explaining the MVStore logic to me. Without it I would never have achieved anything close to that.
– JSQLParser sees a recent surge in bug fixes from presumably Chinese contributors. High quality work! I am sure that has been done with the help of AI too.
2) Adding features and maintaining OS makes you visible to potentially paying clients (if you have some marketing and presentation skills). I got paid by different corporate and municipal clients just for implementing specific features into existing OS (after developing/maintaining this code for years).
The (clean room) rewrite of Unix into BSD was what exactly (in the scope of “License whitewashing”)?
The only change now is having a tool making such rewrites feasible and faster.
AI is in no way a clean room approach. But I guess instead of wasting energy on arguing about it, one must understand it goes both ways. You can now take the source code of Windows, run it through some LLM and say AI made the output, maybe call it Doors due to the trademark, and it’s now GPL. This is just as legit as what was done with chardet. So if one thing goes, then the other should be just fine too.
Excuse me, why?!
As stated above, I have implemented a deep change in H2 MVStore allowing for parallel prefetching. To my best knowledge, there is no similar or comparable software around, and it took more than 10 full days to get this right. So how would that NOT qualify as a `clean room` implementation?
Of course you are not the author of code that you didn’t write, and neither is the AI. Law doesn’t grant that right to non-humans. On top of that, that code is based on the work of other authors and stripped of that information altogether; for most licences and basic assumptions that is not legal. But OK, let’s say you are right, fine: Doors doesn’t share one single line of code with Windows, on top of that most of it is written in Rust, the user interface elements are unified, and the AI features removed… So hence totally legit and can be redistributed and sold as such. AI produced it after all, and that is that; we simply have to assume it’s legal.
I can get behind that easily, except the “I am not the author” part. Example: if I use a machine for wood turning (maybe even a CNC), then I am still the maker of this fantastic piece of art (and not the vendor of the CNC machine). Your argument says that I own it only when I chiseled it from the trunk with my teeth and fingernails (or else it goes to the chisel vendor).
Of course you are the author of that piece of art, or better, of the code you wrote; we don’t have to be abstract here. If somebody, or something, else made it through some creative process, let’s say an AI producing a derivative work from the work of other authors, then you are not an author of the result. Think of it as claiming authorship of a sand dune nature created. Just because you reserved a ticket to Dubai, that doesn’t make you the author of a sand dune. Currently nobody is, as AI can’t claim authorship, and on top of that such art highly likely infringes the rights of the original authors the resulting work is derived from. But yeah, it’s time, I guess, to test this in court, let’s say with something like the mentioned Doors, to see how long it would take for Microsoft to jump on it when it’s their work fed into an LLM and then resold as some other original work produced by AI. How long do you think it would take, once one started selling Doors, before Microsoft would react? When it comes to other people’s work, that works for them just fine; it’s AI after all.
With AI, inputting an idea into the prompt, you could try to claim authorship of that idea, but not of the output. And in general it is considered that an idea by itself is insufficient to be credited as anything more than that. So that sequence of words you wrote in the prompt, that is something you can try to claim authorship of. The rest, forget it; you need to say AI made it, that is, you can’t claim you are the author. But I know, who is going to sue you? AI can’t be an author either. Well, the original author might at some point in the future: the author that made the work the AI derived from and didn’t credit.
Plus another minor/major consideration if, for example, you are an author/maintainer of some FOSS or beyond: you can’t set any licence or ToS on AI generated parts of the codebase, and other people can choose to disregard that part completely, as only an author could do that.
Geck,
I would hold off on making such blanket statements.
Yes, if you give a prompt and it generates a random image, there is no copyright in there.
But if it is assisting you to write code, you would still retain ownership. After all, we have been doing this for decades with extended code assistance, like ReSharper or CLion, which would generate a significant portion of boilerplate or refactors with simple “prompts” from a menu.
There was never any such extended rule that “all keystrokes must belong to you”.
The truth will be somewhere in between, and will probably take several large court cases to reach.
Andreas Reichel,
Exactly. This has been done repeatedly throughout software history. As long as the code expressions are not taken verbatim from the source, the reproduction of ideas therein is not infringement under copyright law. Now we’re introducing a new scenario where this work is accomplished by humans versus automation, but there’s no provision to distinguish between these in copyright law.
If we wanted to say the AI generated work does infringe but the human generated work does not, then we require new copyright laws to make this distinction. However this could end up being problematic. When an examiner is making a determination based on similarities, the evidence is right in front of them. But now if they have to consider who made the work above and beyond the similarities, evidence for this may not be readily available. Moreover this seems to increase the risk of discriminatory application of copyright law.
1) One party might claim they didn’t use a machine when they did.
2) Another party might allege a machine was used when it was not.
In both cases the examiner would now have to make a determination not based on the similarity of work, but based on allegations that could be unprovable.
Of course that is like comparing apples and oranges, as explained. But OK, the same question to you. Hypothetically, if I take leaked Windows source code and feed it into an LLM for the AI to produce a compatible Doors operating system, written mostly in Rust but still a drop-in replacement for Windows: can one then legally redistribute and sell such an operating system without any issues involved?
Geck,
??
I did not say there would be no issue. If a human did that and got in trouble, then it would be no different using an LLM. My view has always been that LLMs are just tools, not that using LLMs gives you free rein to violate copyrights. Those who are against AI can dislike my opinions, which is fair enough. But still I’d like to at least be understood: I am not advocating for new LLM rights, only for copyright laws to be applied consistently regardless of whether LLMs are used or not.
Richard Stallman, creator of the GPL, was cancelled, that was basically the day we lost the GPL battle and when big tech won the war in this regard. GPL is now mostly just about legacy.
Geck,
Large corporations now employ most of the active open source contributors. Everything from Linux to the small essential libraries is either directly maintained by people working corporate jobs, or those projects are heavily steered by such people.
And corporations do not like the GPL. They love the MIT and Apache licenses instead. Even BSD is “problematic”. (Who remembers About pages with hundreds of lines acknowledging open source portions?)
It is not too hard to see where this will go.
GPL was a great license in a time when all the code was proprietary. For better or worse, those days are long gone.
For example, there is no GPL code in the mobile world at all. All libraries are licensed under either the Apache or MIT license. Even LLVM is not GPL.
GPL is tolerated only on servers or other places where there is a clear separation between the kernel and other parts of the system, like AOSP.
It’s possible to train AI using permissive licenses and completely avoid the GPL. Some models even do that already.
a_very_dumb_nickname,
Ironically, I’d say much more code is taking MIT/Apache code and closing it down. However those libraries form an “Internet foundation”, a core everyone shares, so they survive not on competition but on the necessary cooperation born of the need for standards.
Very different times, indeed.
Only a problem if you think ideas can be private property… much different in indigenous sociology where the person belongs to the land.
Difficult to take money out of minds of homo economicus
That’s far from settled, but my guess is the powers that have power, will decide that generated code is the firewall, and that it’s all perfectly not infringing (despite that we all know it is.) Just a guess – the people with power really want AI to be useful, and they’ll change whatever laws they need to to make sure it is.
CaptainN-,
Depends on how they got there, though.
One of the important milestones for the “open” PC platform was Compaq reverse engineering the IBM PC BIOS. That happened in a “clean room”: a separate, firewalled group did the reverse engineering and wrote the specs, and another independent group implemented those specs.
What will happen today is probably similar. The first AI can take an open source repo and document the high level functions, algorithms, and interfaces, without sharing any piece of code (except new pseudocode and protocols).
And the second AI can take those instructions and implement a “fresh” version, maybe in a completely new language like Mojo.
You can even put a third AI in between that cleans up the original findings and writes a clean spec document, to make it even more separated.
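Sketched as code, under the assumption of a generic chat-completion call (the llm() helper below is a hypothetical stand-in, not a real API):

```python
# Hypothetical three-stage "clean room" pipeline as described above.
# llm() is a stand-in for whatever chat-completion client you use.

def llm(instructions: str, text: str) -> str:
    raise NotImplementedError("plug in your model client of choice here")

def clean_room_rewrite(original_source: str) -> str:
    # Stage 1: the "dirty" model reads the code but may only emit a spec:
    # high level functions, algorithms, interfaces, pseudocode.
    spec = llm(
        "Document the high-level functions, algorithms and interfaces. "
        "Output NO code, identifiers or literals from the input.",
        original_source,
    )
    # Stage 2 (the optional extra firewall): clean up the spec, seeing no code.
    spec = llm("Rewrite this specification clearly and completely.", spec)
    # Stage 3: the firewalled model sees only the spec, never the original.
    return llm("Implement this specification from scratch.", spec)
```

Whether such a firewall would satisfy anyone the way Compaq’s human one did is, of course, exactly the open question.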
sukru,
I find this all to be very interesting in terms of philosophy. However what about another possibility: the software is observed and recreated from those observations without using any code at all? Sometimes this is the way human developers go about copying ideas.
Of course AI aren’t doing this yet, but it seems plausible that one day AI may be able to recreate a program / game / OS / etc just by watching a screen recording of said program. It would be hard to make the legal case that licenses on the original source code apply. The new source code would have no connection to the original source code. I wonder how people would feel about this?
Alfman,
There is a wide gap until then, though. When I asked Gemini, for example, to make me a Tetris game in Python using pygame, it took less than a minute to get something working.
But when I asked it to have specific features, like a better screen organization, it messes up. Really bad.
Why?
Because it has no concept of a world model.
It can regurgitate extremely good software patterns. But it cannot understand why a text was overflowing into the game well, without actually understanding visuals.
(I’m not saying it cannot be done. It will take a while)
sukru,
I think it’s a training problem. Most LLMs are trained to make indirect inferences about what code will do without the benefit of actually running it. Think about how hard this is to do. As software developers we have the ability to test and run complex code bases to see what they do. This process offers an inbuilt correction factor for our own mistakes. If we could only infer what source code did without running it, the difficulty of software development would go up by several orders of magnitude – yet that is exactly how today’s LLMs try to solve coding problems.
I have some ideas for how this can be solved, though it would involve more than analyzing source code patterns in the abstract. AI would need to be trained to understand cause and effect between source code and software results. I don’t think this is that far off, at least with appropriate kinds of training.
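As a minimal sketch of one form such result-based feedback could take at generation time (generate() is a hypothetical stand-in for a code model; this illustrates the loop, not any real training pipeline):

```python
import subprocess, sys, tempfile

def run_candidate(code: str, test: str) -> tuple[bool, str]:
    """Actually execute candidate code plus its tests, capturing the outcome."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stdout + proc.stderr

def generate_with_feedback(prompt: str, generate, test: str, rounds: int = 3):
    """Generate code, run it, and feed real failures back to the model,
    instead of trusting one-shot inference."""
    feedback = ""
    for _ in range(rounds):
        code = generate(prompt + feedback)   # generate() = hypothetical model call
        ok, output = run_candidate(code, test)
        if ok:
            return code                      # the tests pass: accept the result
        feedback = "\nYour previous attempt failed with:\n" + output
    return None                              # give up after a few rounds
```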
Alfman,
That would be awesome to have, and would be a billion dollar question. Or let’s say trillion.
It is even harder than solving P=NP, since we already have proof it is quite literally impossible.
(We cannot even prove the software will behave a certain way even if we have precise records of its 100 prior runs with exact same input)
Rice’s Theorem: all non-trivial semantic properties of programs are undecidable
Security research would be trivial otherwise.
We can just observe “in this particular run, with this input it did this”, nothing more.
Or we can do formally verified software. But that is not only expensive, it is also not general purpose.
sukru,
How is it impossible? The fact that we have complex software proves it is already possible, and we haven’t needed to solve P=NP to get here.
Software is deterministic. Input differences can be very subtle, such as an input coming in on a different clock tick, but it’s still deterministic. Variation might actually be desirable for training an AI to be more resilient against subtle differences. Nevertheless, if for some reason you need 100% determinism, it still seems doable to me to create a VM that controls all input down to the clock.
Software developers usually face the opposite problem: we actually want to witness those rare edge conditions rather than avoid them! To this end we use input fuzzing tools designed to help create much more input variation. I don’t think any of this is a problem.
I think the mistake is in applying a theorem that’s about complexity at the mathematical extremes to rule out the possibility of using AI to solve problems that are not mathematically complex. Most application UIs, even “complex” ones, are mathematically trivial. Coding them can take a lot of human time and effort, but it’s not something that will keep mathematicians awake at night wondering how the heck we do it. I believe the problem is well within the scope of AI.
So although AI might have trouble solving P=NP because of these theorems on complexity… it would be a mistake to assert those same theorems as the reason why AI can’t solve everyday coding work. I think it can, and I don’t think we’re that far off. That said though, I don’t think LLM models alone get us there; we need to incorporate software results into AI training & debugging, not merely look at source code – that makes software development significantly harder to do and is the limiting factor today IMHO.
As a thought experiment: ask human developers to write software the same way we expect LLMs to do it. A request prompt with no tests/debugging/anything like that, just the expectation of outputting satisfactory code in one shot. What do you think their success rate would be? Even professionals like us would end up outputting very imperfect and buggy code. The expectations on LLMs to solve programming this way would be superhuman if they could achieve it. They can’t achieve it though, not like this.
In theory a 100% perfect entity would never need to test the output, and inferences could be 100% reliable. In reality, though, we’re not perfect and we need regular feedback from real world results to incorporate back into the development process. I think this is a solvable problem for AI, but it means AI will need to rely less on the infallibility of inferences and more on result-based feedback. IMHO this is what LLMs are missing today.
Alfman,
This is not a theoretical exercise. We know all non-trivial software is inherently undecidable. It has been proven beyond doubt.
We cannot even say they will behave the same with the same exact input every time.
(Have you never encountered hard to repeat flaky bugs? No flaky tests?)
Otherwise it would be possible to write perfect security and static analysis tools. And everyone would be happy.
sukru,
I know it’s not theoretical: software is deterministic. It would create a lot of events, but in principle you can account for every software interaction on every clock tick with 100% determinism every single time. For the software to behave differently on different iterations fundamentally requires the input to be different. On a real processor those changes may be very nuanced, like input events happening at different times, changes in operating system responses, etc. The software itself is still strictly deterministic! This means that, in principle, if you have a virtual machine with a precise record of all the inputs, then when you replay the inputs you will get the same output every single time with 100% reproducibility (barring things like hardware faults).
Yes, we all have, but the inputs are necessarily different.
For example in tool assisted speed-runs we can use the fact that software is deterministic to perfectly replay an entire game from beginning to end. Not just most of the time, but every single time. This only works because software is deterministic. Technically it would be physically possible to design a CPU architecture with unpredictable state transitions, but usually the goal is to stick to well defined transitions and unpredictable transitions would be considered faulty.
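The record-and-replay idea in miniature (a toy sketch, not any real TAS tool; all “nondeterminism” is reduced to a recorded seed plus a timestamped input trace):

```python
import random

def run_game(seed: int, events: list[tuple[int, str]]) -> int:
    """Toy 'game' loop: the same seed plus the same timestamped inputs
    produce the same final state on every replay."""
    rng = random.Random(seed)        # all randomness derives from the recorded seed
    state = 0
    for tick, key in events:         # recorded (clock tick, keypress) pairs
        state = (state * 31 + tick + ord(key) + rng.randrange(1000)) % 2**32
    return state

trace = [(1, "a"), (5, "b"), (9, "a")]             # the captured input trace
assert run_game(42, trace) == run_game(42, trace)  # perfect, repeatable replay
```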
I’m not denying that race conditions can get very complicated, but it doesn’t make them non-deterministic with respect to IO.
We can do that already, writing compilers that guarantee the failure modes they protect against mathematically cannot happen. The thing about “security” is that it’s never finalized because the scope is nearly infinite, but in terms of solving specific security problems, static analysis tools can be very effective. Rust is a good example.
Alfman,
I know where you are coming from, however… when we look at the theory of computation, the real world is a lot different.
It is not, but let’s entertain the idea that it somehow is.
That would seem to be the case. But that is the HALTING problem, the very first lesson taught in the theory of computation, and it is easily proven to be undecidable.
We can never “iterate” 100% of all possible states of a non-trivial general purpose program.
Exactly.
Real programs have hidden state, in addition to their inherent undecidability. Again, this adds to, not explains or replaces, that.
Yes, if we have a perfect virtual machine, set the environment exactly, including timers, have precise recording of all clock cycles, interrupts, and cache hits, and so on, we can “replay” perfectly for a certain input.
But you should see here why this is a far cry from the real life.
I think we are mixing two things here:
1 – Yes, software is non-deterministic in practice
2 – But, it is also undecidable with strong proof in theory
They are not the same thing.
Even if we have perfect hardware with perfectly deterministic execution, any non-trivial program will stay undecidable.
(This assumes we have general purpose programs. Something like seL4 microkernel has mathematical proof of correctness)
(Looking back I am probably the one that started undecidability and non-determinism confusion, sorry…)
sukru,
The halting problem does not disprove or contradict deterministic algorithms though. Deterministic algorithms concern themselves with repeatability.
https://en.wikipedia.org/wiki/Deterministic_algorithm
The question of whether an algorithm halts or not is tangential to the question of whether an algorithm is repeatable.
Incidentally, there are more issues with using the halting problem. The halting problem proof fundamentally depends on a theoretically infinite state machine. Math lets us make theorems about the properties of hypothetical machines like this, however such a machine does not actually exist in the real world. The halting problem is provably wrong for finite state machines.
While it would be practically difficult to enumerate every possible state of a finite state machine, it is still mathematically finite and therefore is mathematically 100% guaranteed to eventually halt or re-enter a previous state (i.e. loop). The contradiction genuinely disappears.
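In sketch form, that halt-or-loop argument is just cycle detection (a toy, assuming the program’s whole state fits in one hashable value):

```python
def decide_halting(step, start):
    """For a machine with finitely many states, halting IS decidable:
    run it, remembering every state seen; a repeat proves an infinite loop."""
    seen = set()
    state = start
    while state is not None:         # convention: None means the program halted
        if state in seen:
            return "loops forever"   # revisited state => cycle => never halts
        seen.add(state)
        state = step(state)
    return "halts"

# Toy 8-bit counter: add 3 mod 256, halting upon reaching 0.
print(decide_halting(lambda s: None if (s + 3) % 256 == 0 else (s + 3) % 256, 1))
```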
Yes, it’s natural to note that evaluating so many states is computationally intractable, and I’d agree. Nevertheless it still mathematically invalidates the contradiction that the halting problem set up. That proof depends on infinite state machines. Since the mathematical validity of the halting proof fails on finite state machines, it should not be used to make irrefutable conclusions about them. This is something most lessons I’ve seen on the halting problem skip, unfortunately, but I find it quite important because too often people want to invoke the halting problem even in contexts where the mathematical proof does not hold.
Regardless of all this though, even if you still wanted to say the halting problem applies to finite state machines (which it can’t), that still does not contradict the notion of software repeatability. Software will still run deterministically whether or not we know if it will halt.
You can make the state machine as big as you want, but that doesn’t change the nature of deterministic algorithms.
I get these aren’t the same.
1) I still have to disagree: software is deterministic, you have to change the input to get a different output. I maintain those input changes can be very subtle (like a minute change in the timing of an event), but in principle if you provide *identical* input to a program, the output will also be *identical*. That’s what’s meant by deterministic.
2) You don’t have to know the result of an operation to know that it will evaluate the same way each time.
Edit:
I was a bit confused by that…haha.
Anyway I think we’re getting a bit off track in terms of the AI, I don’t know where to go from here.
Alfman,
Yes we veered too much from the main topic.
But I would not dismiss Turing machines, nor the applicability of the theory of computation to modern programs.
I would really suggest looking into how the von Neumann architecture maps to Turing machines, and how our current RAM sizes already give practically infinite memory states.
(A 1TB workstation has 2^40 bytes, or 2^43 bits, an immeasurable number of states: 2^(2^43), roughly 2^(8 trillion). The age of the universe is about 10^60 ≈ 2^200 Planck times, which is the upper limit on events. And this is small; I’ve worked with 100PBs. We already exhaust the upper limit of our universe trying to “brute force” such a computation. In fact we would need about 2^(8,000,000,000,000−200) universes.)
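(Those magnitudes check out; as a quick sketch:)

```python
import math

bits = 8 * 2**40                        # 1 TB of RAM = 2^43 bits
print(f"memory states = 2^{bits:,}")    # 2^8,796,093,022,208 ~ 2^(8.8 trillion)
print(f"universe events ~ 2^{math.log2(1e60):.0f}")  # ~2^199 Planck times
```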
Those “theoretical” things have pretty much proven implications on real life.
Again “where is my perfect static analyzer”?
sukru,
We can handwave computational tractability by pointing to a very large number of states, but any amount short of infinite breaks the proof mathematically, so “practically infinite” doesn’t work.
I understand your point that the number of states is enormous, but even so it doesn’t matter because the halting problem applies only to an infinite state machine and is disprovable otherwise. Consider that the halting problem is often cited as the reason an efficient generic solution doesn’t exist “X is not possible to solve because of the halting problem”. However because the halting problem is wrong for finite machines, it’s no longer mathematically valid to cite it as the reason an efficient halting solution cannot exist for them.
I’m a bit confused by the question because there are plenty of static analyzers that catch faulty behaviors perfectly. But here “perfect” means they catch a fixed set of predetermined problems, not that they perfectly solve every problem today and in the future. So we’re probably not thinking the same thing.
For anybody thinking gen AI can be clean room and original, there seem to be heavy doubts regarding that: https://lcamtuf.substack.com/p/large-language-models-and-plagiarism https://dl.acm.org/doi/10.1145/3543507.3583199 https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/
This suggests that not only the original chardet may be harmed, but potentially any other FOSS project the AI was trained upon. And for most AI models that seems to be whatever they could find online: https://archive.is/1EzVK/image https://www.anthropic.com/transparency
There have been concerning court cases too: https://www.twobirds.com/en/insights/2025/landmark-ruling-of-the-munich-regional-court-(gema-v-openai)-on-copyright-and-ai-training
I personally don’t understand anybody who’s declaring this is clearly fair use or any gen AI LLM code is clearly unproblematic as a dependency. It seems like everything gen AI touches, it potentially taints and harms.
I personally prefer to stay far away from it. Anyway, that’s just me. I’m not a lawyer. This isn’t legal advice.
Every single piece of AI generated code I write, I license under GPLv3. AI for commercial use is the largest theft in history. I would love for someone to develop an AI operating at the level of Claude or ChatGPT Codex trained entirely on GPL code, for use by open source projects.
Darkmage,
Others don’t seem very pleased when I say it, but yes, I agree. My opinion is that the FOSS community should put forward their own FOSS-license-respecting LLM. This would serve two purposes:
1) Secure the future of open & license-respecting LLMs, to make sure the field doesn’t become monopolized by private companies.
2) Offering an open source license-respecting LLM would help shift more users away from license-disrespecting LLMs. This is sort of like LimeWire versus Netflix: offer users no legitimate options and they will violate copyrights to get what they want, but make legitimate paths pragmatic and people will hop on board. We want good choices to be easy!
Tell me you don’t know how AI coding tools work without telling me you don’t know how AI coding tools work.
There is a need for a GPLv4 that prevents this and makes sure that if an AI learns from your code the result also has to be GPL’d. The only question is how. How would you prove something is an AI rewrite when the author denies it?
It is not just Open Source. The Windows source code is available on GitHub and elsewhere. The AI you use to write code has almost certainly read it. There are many “source available” proprietary programs. There have been many leaks and breaches. The AI has read it all.
If we have a good enough test suite, there is nothing stopping us from having AI write an Open Source Windows OS for us.
> “AI” is the single-greatest coordinated attack on open source in history
By “open source” here we obviously mean the GPL. Permissive licenses already support use for any purpose. Using AI to rewrite Clang or LLVM seems like a great waste of time when you can simply use the original directly.
“AI” does not care about Open Source. It does not target it specifically. AI does not consider the GPL at all which is, of course, the problem. People want an LLM that reads GPL code to either not use the GPL code as inspiration or to license the LLM output as GPL as well.
But we do not demand this of human coders. Modifying GPL source code requires me to release my changes as GPL. But the license lets me “study” GPL code without obligation. As a human, if I read every GPL project’s source code there is and then write original code myself, there is no expectation that my code would also be GPL. But the GPL code I read is certainly an input into the code that I write. Same with an LLM. The LLM is just far, far faster.
LeFantome,
That’s a training implementation detail. In principle we can train an AI on anything we like. Just because some unauthorized source code exists doesn’t mean we have to use it. IMHO the FOSS community should make an explicitly GPL compliant LLM so that it can be used in GPL software without rustling anyone’s feathers.
Indeed, I’ve made this point before too. Copyright traditionally lets people learn, apply, and teach what they learn without it constituting infringement. Now we have LLMs that automate this process. People want to bar this from happening, but it seems that copyright laws would need to change, because humans have been doing it forever, only far more laboriously.
Some tools, like for example autoconf, generate code and impose on that code the license the author wants. Again, in the case of autoconf, the license doesn’t impose any conditions on the generated text, and in fact encourages you to modify the code, so if you know enough about the code you can change the license the generator will spit out.
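A trivial illustration of that pattern (a hypothetical generator, not autoconf itself):

```python
# Hypothetical code generator whose output carries whatever license its user
# chooses, independent of the license of the generator itself.
HEADER = "# This file was generated; it is distributed under: {lic}\n"

def emit_module(func_name: str, lic: str = "MIT") -> str:
    """Emit a stub source file; the user, not the tool, picks the license."""
    return HEADER.format(lic=lic) + f"def {func_name}():\n    pass\n"

print(emit_module("configure_stub", lic="LGPL-2.1"))
```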
What I am trying to say here is that the LGPL, like other open source licenses, doesn’t have any restriction on how others read, understand, and reimplement the code. The problem very likely needs to be further considered by lawyers if we want some form of protection. And then: if it’s Python, it ships with the sources.