It seems Microsoft is absorbing GitHub deeper into the company. GitHub’s CEO Thomas Dohmke is stepping down, and GitHub will be folded into a new department within Microsoft. Which department will become the new steward of GitHub, and of the massive pile of open source code it’s hosting?
You already know.
Still, after all this time, my startup roots have begun tugging on me and I’ve decided to leave GitHub to become a founder again. GitHub and its leadership team will continue its mission as part of Microsoft’s CoreAI organization, with more details shared soon. I’ll be staying through the end of 2025 to help guide the transition and am leaving with a deep sense of pride in everything we’ve built as a remote-first organization spread around the world.
↫ Thomas Dohmke
GitHub will become part of a new “AI” engineering group inside Microsoft, led by a former Facebook executive, Jay Parikh. As The Verge notes, this new group includes platform and development tools and Dev Div teams, “with a focus on building an AI platform and tools for both Microsoft and its customers”. In other words, Microsoft is going to streamline taking your code and sucking it up into its “AI” slop machines.
If you’re hosting code on GitHub, the best time to move it somewhere else was yesterday, but if you haven’t yet, the second best time is today. Unless you want your code to be sucked up into Microsoft and regurgitated to sloppify Windows and Office, you should be moving your code to GitHub alternatives.
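Moving is less work than it sounds. As a rough illustration (the remote URL below is just a placeholder; substitute whichever forge you land on), mirroring an existing local clone amounts to adding a second remote and pushing everything to it. A minimal Python sketch, shelling out to git:

```python
# Minimal sketch: mirror a local clone to another forge.
# The remote URL is a placeholder; use your own Codeberg, sr.ht,
# or self-hosted address.
import subprocess

NEW_REMOTE = "git@codeberg.org:youruser/yourproject.git"  # hypothetical

def mirror_repo(path: str = ".") -> None:
    """Push every branch and tag of the repo at `path` to NEW_REMOTE."""
    subprocess.run(["git", "-C", path, "remote", "add", "mirror", NEW_REMOTE], check=True)
    subprocess.run(["git", "-C", path, "push", "mirror", "--all"], check=True)
    subprocess.run(["git", "-C", path, "push", "mirror", "--tags"], check=True)

if __name__ == "__main__":
    mirror_repo()
```

The code is the easy part; issues, wikis, and CI pipelines take more effort to carry along.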
OSS became way too dependent on this single repository portal. Its downfall would be more than welcome for a healthier ecosystem.
Nope, GitHub is effective; GitLab is hard to work with and is, in a sense, “closed source”. Gogs/Gitea or Codeberg, perhaps? Framagit if you are a rebel.
GitLab isn’t closed source. I’m actually using Heptapod, which is a fork with Mercurial support. It used to be the best thing in the world, until they dropped (free) support for the Docker packages 🙁
I put it in “double quotes” because their SaaS is far from being as friendly and effective as GitHub. Even registering on GitLab can be a burden, with the validation link never arriving in your mailbox.
CapEnt,
The problem is that it has always been like this, for several decades now.
Before GitHub, or even Git, there were Subversion and SourceForge. Almost all projects were hosted there.
The main reason is not convenience, I would say, but reliability. There were so many other smaller sites, even some from giants (does anyone remember Google’s “Google Code”?), yet they could disappear on very short notice.
Okay, you can migrate your code. What about your Wiki? Issues? KB? Feature Requests? and so on?
And this brings network effects. Anything you want is either there or mirrored there on GitHub. And they offer a nice path for closed source, commercial usage. Your company can easily continue using GitHub, since most developers already know it.
Now, there is Bitbucket from Atlassian, and they have been around for a long time, hence pretty reliable. However, the interface is different, and their tooling is not as good. But it is probably the best #2 at the moment.
I was amazed at how OK GitHub remained after being acquired. I thought Microsoft had finally learned its Skype lesson of not fucking with things that are not broken. But here we go, may the enshittification begin (proper)!
Ensh*AI*ttification.
If you have a moral objection to AI stuff, I recommend SourceHut and its flagship sr.ht instance. I use it for some small projects myself, and it is exactly as unobtrusive as I want a git forge to be.
Thom Holwerda,
I understand that people don’t like the idea of Microsoft using GitHub to train their AI, and I support moving to GitHub alternatives; I think it’s shortsighted to be so dependent on centralized providers, including GitHub. However, the problem isn’t just where a project is hosted, it’s also the license. A lot of FOSS projects may discover that their FOSS licenses prohibit developers from adding new restrictions on downstream users. Anyone who truly opposes their code being used to train AI will have to find a way to ditch the GPL altogether, since the GPL is all or nothing: additional restrictions, such as ones on AI training, aren’t permitted.
Alfman,
It goes beyond that, though.
Yes, having convenient access to basically all OSS code is a major plus for AI training.
But one of the next frontiers is solving “feature / bug requests”. Since the issue system is also deeply integrated into GitHub, Microsoft and their AI team can look at all the requests, the discussions that happen on them, the pull requests that come in, those that are rejected and why, and those that are accepted. And they can even go further down the line to check whether a change actually solved the issue, or whether the same / similar bug report was opened again.
This is much more important, as it will help train AI to solve things like:
“Add LLVM compiler support”
“Enable build on FreeBSD”
or “Fix screen flicker when mouse goes out of the view”
This cannot easily be done by looking at static code, or even changelogs.
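To sketch what that kind of mining could produce (purely illustrative; the data structures here are made up rather than the real GitHub API), a training pair might couple an issue and its discussion with the pull request that closed it, plus a signal for whether the fix actually stuck:

```python
from dataclasses import dataclass

# Hypothetical, simplified records; a real pipeline would pull these
# from the forge's issue and pull-request data.
@dataclass
class Issue:
    title: str
    discussion: list[str]   # comment thread
    reopened_later: bool    # was the same/similar bug filed again?

@dataclass
class PullRequest:
    diff: str
    review_comments: list[str]
    merged: bool

def to_training_example(issue: Issue, pr: PullRequest) -> dict:
    """Pair a request with the change that addressed it, plus a quality signal."""
    return {
        "prompt": issue.title + "\n" + "\n".join(issue.discussion),
        "completion": pr.diff,
        # "merged and never reopened" is a much stronger label than
        # "it compiles": it encodes whether the fix actually held up.
        "label": pr.merged and not issue.reopened_later,
    }

example = to_training_example(
    Issue("Fix screen flicker when mouse leaves the view", ["repro steps..."], False),
    PullRequest("--- a/compositor.c\n+++ b/compositor.c\n...", ["LGTM"], True),
)
print(example["label"])  # True
```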
sukru,
I’ve been predicting that AI will get much better at testing & debugging its own output by itself. It will gain competency over iterative development without the need to “outsource” these tasks to humans. This will improve quality, and eventually the AI will even be able to fix bug reports without a person giving instructions. Obviously we aren’t there yet, but we could get much closer over the next decade.
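A rough sketch of the loop I have in mind (everything model-related here is a hypothetical stand-in; only the test-running part is concrete):

```python
import subprocess

def generate_patch(bug_report: str, feedback: str) -> str:
    """Stand-in for whatever model produces a candidate fix (hypothetical)."""
    raise NotImplementedError

def apply_patch(workdir: str, patch: str) -> None:
    """Stand-in for applying the candidate diff to the working tree (hypothetical)."""
    raise NotImplementedError

def run_tests(workdir: str) -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    proc = subprocess.run(["pytest", "-x"], cwd=workdir,
                          capture_output=True, text=True, timeout=300)
    return proc.returncode == 0, proc.stdout + proc.stderr

def iterate(bug_report: str, workdir: str, max_rounds: int = 5) -> bool:
    """Generate / test / repair loop with no human inside the iterations."""
    feedback = ""
    for _ in range(max_rounds):
        patch = generate_patch(bug_report, feedback)
        apply_patch(workdir, patch)
        passed, output = run_tests(workdir)
        if passed:
            return True
        feedback = output  # failing output becomes context for the next attempt
    return False
```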
This of course brings us to matters of trust. Humans could be selected to audit AI output. However, this costs money, so I don’t think most code will undergo human review, and besides, it won’t be long before AI is creating more code than we can keep up with.
For better or worse it seems likely that humans will become less involved in programming long term. Project managers could be next after that.
Alfman,
AI can one day reach that. AlphaZero was exactly that in the game of Go: there was no training except what it played against itself (a very important iteration after their previous AlphaGo, which trained on human matches).
That uses “reinforcement learning”.
However…
In order to do that, they need to actually be able to run and evaluate programs. That is not always easy.
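A minimal sketch of what “run and evaluate” could look like, assuming candidates are plain Python programs scored against reference input/output cases; the sandboxing that makes this safe in practice is exactly the hard part being hand-waved here:

```python
import subprocess

def reward(candidate_source: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected stdout) cases a candidate program passes.

    A real system would run this inside a proper sandbox; the timeout is the
    only concession here to candidates that never terminate.
    """
    passed = 0
    for stdin_data, expected in cases:
        try:
            proc = subprocess.run(["python3", "-c", candidate_source],
                                  input=stdin_data, capture_output=True,
                                  text=True, timeout=2)
            if proc.returncode == 0 and proc.stdout.strip() == expected:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # counts as a failed case, not a crash of the evaluator
    return passed / len(cases)

print(reward("print(int(input()) * 2)", [("3", "6"), ("10", "20")]))  # 1.0
```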
Yes, self-play reinforcement learning has proven to be extremely effective.
The reason AlphaGo was able to beat humans when it did was mostly about having enough horsepower to throw at it. Given enough compute, all problems become shallow. Of course, the compute power on planet Earth is limited, and that practically limits the search space of the problems we are trying to solve.
While programming involves a large search space, I do think it is within reach of reinforcement learning using upcoming technology. And just like Go, there will be a lot of skepticism before the milestone of beating humans is reached, but after the fact it will seem like it was always inevitable.
Alfman,
I’d say it is not about the search space, but rather the “reward” function (“loss” in practice).
For a game like Go, or even more complex ones like StarCraft, there is a definite result: win or lose.
For software, any Turing-complete machine will have intractable algorithms and other limits of computation, for example the halting problem: one can never decide, for all cases, whether a given program will terminate or not.
That means no AI will ever be able to handle all algorithms. They will just be very good at what we can do.
And even that can be misleading. For example, they can assume they “solved” an issue by making all the compiler errors go away, and maybe even passing all existing tests. But the design could be the opposite of what we want.
(Bottom line: we still have a very good amount of job security)
sukru,
I don’t know if you want to get into it, but people often use the halting problem wrongly to draw bad conclusions. The *only case* the halting problem ever disproves is one of self-referential contradiction. Unfortunately, people hear that it disproves the general case and end the lesson there, without understanding that this disproved case doesn’t actually apply to regular, non-contradictory software. Useful software does not contain self-contradiction, and the “halting problem” does not disprove the existence of a halting algorithm for software that we actually want to use.
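For reference, the self-referential construction the proof relies on looks like this sketch, where halts() is the hypothetical oracle the proof assumes and then refutes:

```python
def halts(program) -> bool:
    """Hypothetical oracle the proof assumes exists: True iff program() halts."""
    ...

def troublemaker():
    # The one self-referential construction the proof actually rules out:
    # ask the oracle about ourselves, then do the opposite of its answer.
    if halts(troublemaker):
        while True:
            pass   # oracle said we halt, so loop forever
    return         # oracle said we loop, so halt immediately
```

Ordinary, useful programs never embed a halting oracle for themselves and then invert its answer, which is the point.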
Also, the proof has other assumptions. A Turing machine differs from real hardware in one crucial aspect: it’s defined as a machine with infinite capacity, whereas real hardware is strictly finite: states will either end up in a loop or terminate. The halting problem does not result in any contradiction on finite hardware.
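On finite hardware that observation can even be made mechanical, at least in principle: record every machine state seen, and a repeated state means a loop, while running out of transitions means it halts. A toy sketch over an explicit state-transition function (utterly impractical for real memory sizes, but decidable):

```python
def decides_halting(step, initial_state) -> bool:
    """True if the finite-state machine halts, False if it loops.

    `step` maps a state to the next state, or to None when the machine halts.
    Because the state space is finite, this always answers (in principle).
    """
    seen = set()
    state = initial_state
    while state is not None:
        if state in seen:
            return False   # revisited a state: it will loop forever
        seen.add(state)
        state = step(state)
    return True            # reached a halting state

# Toy machine counting down from 5, which halts:
print(decides_halting(lambda s: s - 1 if s > 0 else None, 5))  # True
# Toy machine bouncing between 0 and 1 forever:
print(decides_halting(lambda s: 1 - s, 0))                     # False
```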
Even within the theoretical domain of infinite-state programs, the halting problem only disproves a two-state halting function, but I actually count four unique states. The halting problem does not contradict or disprove the existence of a halting algorithm that can classify programs into these four states.
You’ll already be familiar with 1-3, but an enumeration of all possibilities provides a fourth:
This new case has the interesting property of not simply halting nor looping nor contradicting itself.
AI can only work on algorithms that fit within the limits of the memory and time allotted. If we keep adding more over the long term, though, I doubt there will be any algorithms that AI can’t outperform humans on, although if you think there are, those may be interesting to discuss.
Well yeah, reinforcement learning has always been a matter of defining the right fitness function. We can equate software specs to being a type of fitness function. The AI may find “bad” answers to the fitness function: not so much local maxima as answers to ambiguous or poorly defined specs. We know this happens in real-world software developed by humans too, because the project specs weren’t clear enough. Therefore I envision AI software development being an interactive process, similar to how we interact with clients to clarify design questions and update the specs. Between reinforcement learning for algorithms, conformance-testing bots, and LLMs being able to communicate specs with clients, we may well end up with AI being able to do the whole job.
Short term I agree, long term though I think human devs will find themselves outmatched.
Alfman,
Maybe I constructed the argument poorly.
The issue is not AI limits in general, but specifically AI playing by itself.
AlphaGo was based on human feedback. The current LLMs are similar (including ChatGPT, which used actual PhD-level researchers to provide “correct” baselines).
AlphaGo Zero played only against itself.
No LLM can do that for programming (nor can any AI in general).
Because the outputs are programs. And programs are generic. They cannot be “proven” to be fit for a purpose in general.
(Not talking about “return 2+2”, but maybe “return random()” when asked to “write a program that returns 4”. The first can easily be proven correct, of course; the second… not so much, as the AI might have an initial random seed that would always return 4. Expand from there to other, more complex cases.)
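A tiny illustration of that failure mode: if the evaluation harness always reseeds the generator with the same value, a wrong program that happens to hit 4 under that seed will look correct on every single run.

```python
import random

def candidate() -> int:
    # an AI "solution" to "write a program that returns 4"
    return random.randint(1, 10)

def naive_eval(seed: int) -> bool:
    # harness that always reseeds with the same value, so the
    # candidate's output is identical on every evaluation run
    random.seed(seed)
    return candidate() == 4

# Roughly one seed in ten happens to produce 4; under any of those seeds
# the harness reports success forever, even though the program is wrong.
print([s for s in range(50) if naive_eval(s)])
```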
So, there will be a human in the loop.
You cannot have LLMs reviewing other LLMs’ code.
Though… some places do that.
“We have an Xcode plugin that writes code…
and we have a GPT that writes software specs from customer requests…
and we use automated LLM-based tools to write acceptance / integration / unit tests based on software specs…
and we have a Git/Gerrit plugin that does peer review automatically…”
Do you see where this will go horribly wrong?
sukru,
Are you counting the fitness function itself as human feedback? If so, that’s interesting food for thought! In principle, reinforcement learning can train NNs from scratch starting with zero knowledge; providing human data is a shortcut. Of course, AI models aren’t immune to local maxima, although neither are humans, so it would be interesting to conduct more research in this area 🙂
Yes I agree. Although reinforcement learning is a great way to train AI to optimize to a fitness function, when the fitness function becomes “mimic human IO as closely as possible”, then at best it could only hope to approximate the humans and never become better than the source material. This is a limiting factor for LLMs and why I don’t consider LLMs that well suited to software development (other than as a starting point).
However when it comes to programming, there’s no reason to limit AI to mimicry. AI can, and I expect ultimately will, be trained using fitness functions that create models that are better than human at programming.
I don’t know what “fit for a purpose in general” means. Fitness for purpose has to mean a specific purpose. I would concede programs have a large search space, but I disagree: programs can absolutely be graded on fitness and at university many of them were.
On deterministic computers (that aren’t experiencing faults), programs *always* return the same values for the same input. “random()” is just a function called “random”, but it doesn’t actually have the real mathematical properties of randomness, because it can’t. Under the hood, “random()” may be defined in such a way that it feeds a seed through a hidden global variable. This may be good enough to approximate randomness, but hiding inputs and global variables behind the facade of a function call doesn’t fundamentally contradict the program’s deterministic behavior in any way. It’s still 100% deterministic.
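As a minimal sketch of that point, here is a toy random() built on a linear congruential generator (glibc-style constants, chosen just for illustration) whose entire behavior is a pure function of a hidden global seed:

```python
# Hidden global state: the "seed" the caller never sees.
_state = 12345

def random() -> float:
    """Looks like randomness to the caller, but is a pure function of _state."""
    global _state
    _state = (1103515245 * _state + 12345) % 2**31  # deterministic LCG step
    return _state / 2**31

# Re-running this script always prints exactly the same three numbers.
print([round(random(), 6) for _ in range(3)])
```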
I don’t follow why humans need to be in the loop, other than supplying the fitness functions. Also hopefully my paragraph above explains why I don’t think LLMs are the panacea for software development, but I don’t think they are the end of the road either – reinforcement learning will get us further like it has with go and other AI milestones.
You seem to be especially focused on LLMs writing code, and on that topic I agree with you LLMs are the wrong tool to use for programming. We may be talking across each other because I’m thinking more long term. Long term I envision the role of LLMs to work primarily as human interfaces where they already excel, but the actual programming will be outsourced to more specialized AI models trained using reinforcement learning techniques. Unlike LLMs, these models are not limited to mimicking existing human code patterns.
Alfman,
My point exactly. The LLM can come to a false conclusion and use random(), as its random seed could be fixed in the “evaluations” across runs.
When it hits real life, the function will not work.
Yes it is possible we might be talking across each other.
My argument is that the “fitness function” is fundamentally undecidable. There is no “this is good code” function. We can only have heuristics, like “this is what people write when they encounter a bug report on GitHub” and “it compiles, runs, and passes tests”.
Both of which require human input.
(Again if you try to write the specifications using LLM as well — the grading functions — it fails in hilarious ways).
sukru,
Complex, maybe; I don’t know about fundamentally undecidable, though. Even human intelligence itself is the product of nature’s fitness function. The real question is how much compute power it takes; obviously nature had millions of years, but then we don’t strictly have to restart from scratch. Anyway, we’re not talking about AGI; that’s a new topic.
Human input beyond the fitness function? I’ll grant you the initial fitness functions may not be adequate and training can benefit from several rounds of interactive improvements. But once you have a good fitness function and training parameters, the learning phase doesn’t usually require more human input.
Alfman,
Maybe I am unable to articulate this.
So, I asked Google’s Gemini for help:
My code is so bad that I really want to keep it on GitHub so I can pollute the AI training with my garbage.
> Unless you want your code to be sucked up into Microsoft and regurgitated to sloppify Windows and Office, you should be moving your code to GitHub alternatives.
Removing projects from github does remove the value the site gets from one’s social presence.
But the code itself? The “AI” industry will scrape it regardless of location or license. All one can hope for in that area is that they have to expend a bit more effort to pirate it.
What I find so amusing is how many people are against AI scraping their sites and code, yet these same individuals are perfectly fine submitting themselves to AI for identity verification, age verification, video call transcription, and all sorts of biometrics. These are the very same people who are perfectly fine taking and uploading a selfie of themselves in order to open an account on the latest mobile finance/payments app.
tuaris,
I find that tons of people oppose it, but the dilemma is when you’re not given much choice and become coerced. I’ve noticed that more theme parks these days are using photos for entry verification on multi-day passes. If you abstain on principle, you’ll need to pay more to stay anonymous. The IRS has begun forcing people to take photos of themselves to access their own files; if you abstain, you may be blocked from your own account. YouTube is apparently starting to roll out age verification; if it flags your account, you’ll be forced to supply your ID or else stay locked out of age-restricted content. I think Facebook has been doing that as well.
https://abcnews.go.com/GMA/Family/youtube-begins-rollout-new-ai-age-verification-tool/story?id=124619026
I don’t like these massive tracking operations and it sucks that privacy keeps being chipped away, but these changes are happening regardless and people who abstain are going to find themselves locked out of an increasing number of services.
I wonder how much proprietary code is going to get leaked by the AI.