Yesterday I highlighted a study that found that AI and ML, and the expectations around them, are actually causing people to work harder and longer, instead of less. Today, I have another study for you, this time focusing on a more long-term issue: when you use something like ChatGPT to troubleshoot and fix a bug, are you actually learning anything? A professor at MIT divided a group of students into three, and gave them a programming task in a language they did not know (FORTRAN).
One group was allowed to use ChatGPT to solve the problem, the second group was told to use Meta’s Code Llama large language model (LLM), and the third group could only use Google. The group that used ChatGPT, predictably, solved the problem quickest, while it took the second group longer to solve it. It took the group using Google even longer, because they had to break the task down into components.
Then, the students were tested on how they solved the problem from memory, and the tables turned. The ChatGPT group “remembered nothing, and they all failed,” recalled Klopfer, a professor and director of the MIT Scheller Teacher Education Program and The Education Arcade.
Meanwhile, half of the Code Llama group passed the test. The group that used Google? Every student passed.
↫ Esther Shein at ACM
I find this an interesting result, but at the same time, not a very surprising one. It reminds me a lot of when I went to high school: I was part of the first generation whose math and algebra courses were built around using a graphing calculator. Despite being able to solve and graph complex equations with ease thanks to our TI-83, we were, of course, still told to include our “work”, the steps taken to get from the question to the answer, instead of only writing down the answer itself. Since I was quite good “at computers”, and even managed to do some very limited programming on the TI-83, it was an absolute breeze for me to hit some buttons and get the right answers – but since I knew, and know, absolutely nothing about math, I couldn’t for the life of me explain how I got to the answers.
Using ChatGPT to fix your programming problem feels like a very similar thing. Sure, ChatGPT can spit out a workable solution for you, but since you aren’t aware of the steps between problem and solution, you aren’t actually learning anything. By using ChatGPT, you’re not actually learning how to program or how to improve your skills – you’re just hitting the right buttons on a graphing calculator and writing down what’s on the screen, without understanding why or how.
I can totally see how using ChatGPT for boring boilerplate code you’ve written a million times over, or to point you in the right direction while still coming up with your own solution to a problem, can be a good and helpful thing. I’m just worried about a degradation in skill level and code quality, and how society will, at some point, pay the price for that.
One additional aspect of this is that programming oftentimes involves subtle choices: small adjustments for the environment the code will run in, or adjustments for behavior that might only happen in certain cases (such as rare error paths). It’s the kind of thing that requires you to understand what’s going on, because if you don’t, you’re likely to leave small issues behind that will make the code brittle or even dangerous, if any kind of security is involved. The opposite of that understanding has been copy-pasting junk from Stack Overflow, which to this day is full of upvoted answers that are fundamentally wrong and which have propagated known bad patterns through projects. Using ChatGPT basically turns this up to 11. It’s the automation of cargo culting, and a guarantee that bugs will be introduced everywhere at an even faster pace than before.
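To make the “rare error path” point concrete, here is a minimal Python sketch; the scenario and the function names are made up for illustration, and not taken from the study, the article, or any particular Stack Overflow answer. The first version is the kind of snippet that gets copy-pasted because it works in a quick demo; the second is what you end up writing once you actually understand the environment the code runs in:

import json
import urllib.request
from urllib.error import HTTPError, URLError

# The copy-pasted "solution": works on the happy path, but any rare failure
# (timeout, HTTP 500, truncated body) silently turns into None, and callers
# that assumed a dict break somewhere far away from the real cause.
def fetch_config_naive(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.loads(resp.read())
    except Exception:
        return None

# Written with the environment in mind: transient network failures get a
# couple of retries, while a server that answers with garbage fails loudly.
def fetch_config(url, retries=2):
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return json.loads(resp.read())
        except (HTTPError, json.JSONDecodeError) as exc:
            raise RuntimeError(f"unusable config from {url}") from exc
        except (URLError, TimeoutError) as exc:
            if attempt == retries:
                raise RuntimeError(f"config unreachable after {retries + 1} attempts") from exc

Neither version is difficult to write, but only someone who understands what can go wrong will even notice that the first one needs fixing.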
See my comment on the previous article in this series, where I described so-called AI output as “ultra-processed content”. If one group of university students were to eat only ultra-processed foods such as pot noodles and frozen pizza, and another had to cook fresh ingredients from scratch, which would spend the least time on food preparation? But despite the time saved, the first group might not have the better study performance, and may even get scurvy.
My primary use of ChatGPT has been as a tutor, having someone explain concepts and ideas, and being able to ask questions to understand the topic far better than I ever could from pre-existing sources. I wish this aspect of LLMs was pushed more, rather than it being seen as a magical source of solutions to be blindly copied.
Nairou,
+1
This is a very good use of an LLM. LLMs are getting a lot of hate for what they don’t do, but honestly I am blown away by what they can do. They’re very good at quickly getting answers out of an unimaginably large corpus of information, and on top of that, they’re very good and patient at the repeated queries that clarify and refine answers.
Some people have complained that these LLMs don’t provide citations for the sources their information came from, and that could be a very good enhancement. However, the commercial and legal implications of providing links to sources for LLM content could backfire. Consider the case where a piece of information is linked to a source, and that source then sues to have the information removed. Traditionally, when human editors do this, copyright law does not consider it infringement as long as the text is sufficiently reworded and the information is not being copied verbatim. But it does seem that many people have strong feelings that LLMs should be declared to infringe copyright anyway, despite not copying sources verbatim.
I think this is a great question, but as ChatGPT is still evolving and doesn’t yet produce optimal answers, there’s still a non-academic incentive for humans to become more proficient. However, what are the long-term implications once automated code generation becomes equal to or better than humans? It would be interesting to study what impact this could have on future generations. We might emulate this today by giving students/workers their own proficient oracle/mechanical turk for a period of time. If provided with an oracle that always gave them the right answers, would students bother learning anything?
It’s pretty clear productivity would go up… but as the author and others are noting, this can create an odd disconnect between productivity and education, for a net negative on education itself. This poses a rather strange question for us as a society: which is more important, education or productivity? Saying that education is not important would be taboo, but all employers really care about is productivity. In the workforce, education is merely a means to an end. If it’s more efficient to replace humans who are proficient at X with humans using computers that are proficient at X, then the humans who are proficient at X simply become obsolete. It’s just the way the world works, and there’s plenty of precedent for it. There will always be some academic types, but the corporate world only uses education as a means to an end. It seems that once automated code generation becomes proficient, those educated with these skills will by and large eventually become redundant.
I’ll say it once, I’ll say it 1000 times: AI is a good tool in the toolbox for experienced programmers who already know what they’re doing. It’s not a tool suitable for beginners.
IMHO, development AI is at its best when it’s used in autocomplete mode or explainer mode. Autocomplete is good because when you already know what you’re going to write and the AI offers the exact fragment (or a slightly more efficient version), you can either agree with the result and accept the proposal, or keep writing and ignore it. AI is very strong when it’s only dealing with a few lines or a simple block. Explainer mode is good too, especially when you’re working with an existing project in languages you aren’t accustomed to; you’ll get a textbook breakdown of what the code is doing. BASH isn’t a language I work with regularly, so when I ask it to explain a fragment, the contextual breakdown (and even explanations of possible issues), along with doc links, is quite nice.
It’s when you go into chat mode and start acting like a manager instead of a developer, telling it to do work for you, that GPT becomes a serious detriment… It’s also the mode new developers use the most. The students here really prove it; they weren’t actually developing, so they don’t understand the code.
What would have been interesting in this study would have been making the students complete two projects. Have three groups: two groups learn the language, and one group use GPT for the first project. Then, for the second project, have two groups use GPT and one group continue to learn. I’d be willing to bet the group that learned first and THEN used chat would have had the best of both worlds by the second project: the faster dev time + a good understanding of the code.
Kver,
In your experiment I think you are right. However, in the context of your experiment there is no opportunity cost to taking the time to learn coding. What if there were? For example, what if you had one group of students who learned to code and were allowed to use AI to generate code, and another group who didn’t take much time to learn coding but used that time to learn other topics like networking and databases? Assuming we reach a point where AI code generation is competent, I would submit to you that maybe it would be more beneficial to forego a coding education and use that time to study other skills.
I liken this to doing calculations by hand versus using something like wolframalpha. Somebody who is really proficient with the tools can run circles around those doing calculations by hand at complex problem solving. I don’t care for absolute arguments and would never say that nobody should learn to do things by hand, but in the real world, is it actually that important to make low-level skills a prerequisite when practical automation exists?
Don’t get me wrong, this is tough for me to admit, since I really do appreciate low-level mathematics and coding. I like the challenge, and the personal milestone of knowing that I can do things on my own, but to be frank, it’s not clear that it’s a productive use of time when there is higher-level automation that works.
It’s neat because we have a lot of examples where machines have taken the “middle” of a discipline, kind of like how you’re describing.
Like the math example and dollar-store calculators. Students might not learn pen-and-paper long division and multiplication, but instead they’re using calculators and learning, say, geometric formulas. It’s not that the students are “less developed”; it’s more that they’re learning more higher-level content in exchange. Does it really matter if students know long division? We look back at teachers who said “you won’t always have calculators” and laugh; maybe in 10 years we’ll look back at how we say “You won’t always have AI”/”You can’t rely on AI” and find that AI is just our generation’s pocket calculator rant.
Put another way: do the characters in Star Trek know C++ when they tell the holodeck to make a historic pub scene? Or does it even matter, when they can produce set pieces in seconds that would take teams of hundreds of people years to make today? In that example, someone who spent their time learning history would be more productive and accurate than someone who knows development, because they could far more accurately instruct the AI in scene reconstruction.
Kver,
A lot of good points. Your statement about history majors being hypothetically more productive than software developers seems comical in the context of society’s current skill hierarchies. But if AI eventually tackles coding like it did chess, then that’s quite an interesting point to ponder.
This is something I’m worried about too. Even if I didn’t have a host of other problems with the current swath of overly hyped machine learning models (the companies behind them, the data they get trained on, the environmental issues their training and usage cause, etc.), I still would not want to use them.
They have to be good enough to replace me entirely in a task, otherwise they are next to useless. Even if they become able to find some kind of solution to 95% of the problems that I face in a problem domain I work in, I would still need to keep all of my skills, insight and experience sharp for the last 5%.
Imagine for a moment a compiler that compiles about 95% of any program I give it, but then hands the rest over to me for whatever 5% it couldn’t figure out for some obscure reason.
That means I would still have to maintain competence in writing assembly, and I would also have to understand the first 95% of the compiled result. And the best way to do that is to just write that 95% myself instead.
Thankfully though, compilers have in fact entirely replaced my need to be able to write any assembly. They can do that because they’re well-written algorithms that can cover the problem 100%. And the compilers themselves can be understood and fixed directly if they exhibit some kind of failure.
This is a perpetual problem with many attempts at machine learning solutions. They’re often unreliable in ways that range from incredibly unhelpful to outright dangerous.
A car autopilot that works 99.9% of the time is wildly dangerous, because the better it works without reaching 100%, the larger the risk that you’ve become complacent or inattentive during the 0.1% of the time that it actively tries to kill someone.
Book Squirrel,
Why can’t something just be a tool to improve work efficiency? Plenty of professionals have assistants that help them work more efficiently, even though the assistant cannot replace them entirely. So what if AI won’t finish your job for you? Helping you tackle some of the preliminary work is not useless. This all-or-nothing view seems arbitrary.
Those aren’t the numbers I would use personally, but I’ll take them at face value… You are saying that even if somebody/something were willing and able to do 95% of your job and you only had to do 5%, you would complain? Sheesh, it sounds like a first-world problem, haha. I wish I could have that amount of help.
I appreciate the point you are making here, but we need to consider that AI doesn’t have to be perfect to be better than a human; it only has to be better than normal humans. And, well, our human track record in vehicles isn’t great. Of course AI failures are bad, but setting the bar at superhuman and then criticizing AI for failing to reach that bar implies a contest that is rigged in favor of humans. Giving humans the benefit of the doubt if/when their failure rates are higher could actually be more dangerous, statistically.
My point isn’t about the current state of self-driving cars – I have no experience there. The point is that an objective determination about their merits isn’t going to be based on argumentation, but rather on empirical data. That’s what you need to make your case.
I have plenty of tools that make life easier for me, I even write new ones every once in a while. But they’re built on algorithms that usually perform tedious and specific tasks completely so that I don’t have to worry about doing that manual labor at all. And those tools have mechanical failure modes that can usually be understood and fixed entirely if they fail at their task.
There is a difference between “somebody” and “something” here. If it’s a “somebody” then I can establish trust with them, they can convey to me what they’re capable of, and if they fail, they tend to do so in ways that I can understand and assist with as a fellow human. And both of us can usually learn from that experience as well.
If it’s a “something” in the sense of an algorithm or machine, then it can be understood. Except if this “something” is, say, an ML model. Then it occupies a strange territory where I cannot trust it, because it doesn’t operate in the context of the human experience and seems incapable of reasoning in any way that makes sense to me as a human, and I cannot understand it as an algorithm either, because it’s a reinforcement-trained clump of weights without any human-readable structure.
And this reality severely limits their actual practical use cases.
My issue with autopilots (or maybe I should say with the idea of self-driving cars) in traffic is related to this. I don’t think there should be any more cars in a city than absolutely, strictly necessary, because they’re inherently dangerous to the people outside them. There is no way to make them totally safe. But there is a basic, human, gut-level trust between me and car drivers that none of them actually *want* to kill me. And to the extent that I have to coexist with cars as a cyclist, that is something I can work with when I sometimes have to navigate car-occupied streets safely. In short, for my safety I can act to help these poor drivers notice me, to compensate for the fact that they have somewhat worsened sight and hearing from within those 4-wheeled murderboxes.
But there is nothing like this that I will ever be able to establish with a car-driving ML model. Its physical sensors may in theory detect me more easily, but I have no way to relate to anything that goes on in its “head”, so I would have no idea how to act as a cyclist to be safer around it.
I respect your choice not to use AI tools, but understand that not everyone will share your aversion to AI.
Well, my reason for stating it that way is that there’s only a difference if we view it through a biased perspective; otherwise we should be looking at and judging the technology as a black box, without regard to the who/what that’s inside. This was the motivation for the Turing test.
https://en.wikipedia.org/wiki/Turing_test
I am a proponent of blind testing because it helps to eliminate prejudice from our conclusions.
The car doesn’t “want to kill you”. I think the word “want” is distracting here, because it implies a degree of consciousness that’s not present in modern AI. AI doesn’t “want” anything other than to optimize the fitness function we give it, that’s it. If the training included killing people, I’d agree with you about it being a problem, but that would be a very different, human problem.
Just saying that AI causes car accidents is very misleading. We can agree that AI drivers will have accidents, but these need to be contextualized in terms of our own failure rates.
Subjective argumentation is exactly what I was calling into question though. The case for danger needs to be made using empirical data.
Let’s up the stakes with a thought experiment: say that some day AI surgeons exist and have a statistical failure rate one fifth that of a human surgeon. Your child gets into an accident and requires a life-saving operation. Do you go with the statistics and allow the AI surgeon to operate on your child, or do you go with your AI bias and demand a human surgeon with a known 5× higher failure rate?
This experiment reminds me of when I was unwittingly part of something methodologically similar. As an undergrad in computer science in the early 2000s, I had a couple of semesters with the worst teacher I ever had. She was a recent immigrant to the US who struggled with English and had been hired because she made a very good impression when presenting about her area of expertise to students and faculty; she had completed her master’s degree but had never taught. She struggled to explain anything beyond her area of expertise. Part of it was the language barrier, but part was also her apparent lack of knowledge about the subject. We students quickly learned we could just Google her assignment questions and find them, with answers, on random professors’ websites, from which she had taken them verbatim without attribution.
One semester I had her for a UNIX class (I had already switched my laptop to Linux, which impressed everyone), held in an isolated lab on campus that was “off network” from the rest of campus so we could safely play with networking without breaking anything, but was so physically distant that the campus’s T1 line slowed down to dial-up speeds. When the final exam came, which consisted of hand-writing commands on paper, we each had the choice of either taking it in the isolated lab with actual machines (running Fedora in VMWare) in front of us, or in the normal classrooms where the Internet connection was fast. I was the only student who chose to take it in the remote lab. My classmates chose to take it with Internet access, so they could Google the questions.
Like the students in the article who only had access to Google, I did very poorly on the exam, but I failed honestly. I chose to let the test evaluate what I actually learned rather than my ability to second-guess the teacher.
Sort of reminds me of the panic that surrounded the introduction of copy-pasta back in the day. Certain academic circles were worried that students would, *en masse*, start to plagiarize.
A consequence of the grading-theater that has infested education.