Transcript (auto-generated)
0:08
Hello, everybody.
0:09
Welcome back to the Stack Overflow podcast.
0:11
A place to talk all things software and technology.
0:14
I'm your host, Ben Popper, director of content here at Stack Overflow.
0:17
Joined by my colleague, editor of our blog and newsletter, Ryan Donovan. Ryan, in the gen AI era,
0:24
there is an explosion of stuff: text, art, code.
0:29
It's coming out of our ears.
0:31
But one thing we're not really sure about is: what is the quality of this stuff, and will its impact be good,
0:37
yeah, long term, to productivity or the health of an organization?
0:41
So there was an interesting research study that came out, and we are lucky today to get the chance to chat with Bill Harding, who is a programmer and the CEO at GitClear. They did some research to try and assess the code quality of the stuff that comes out of a Copilot or something of that nature.
1:00
So, Bill, welcome to the Stack Overflow podcast.
1:04
Thanks for having me.
1:04
I love talking about code quality.
1:07
You're in the right place.
1:10
Just take a minute.
1:11
Give our listeners a quick flyover.
1:12
How did you get into the world of software and technology
1:15
yourself, become a programmer, and then come to be CEO of a company?
1:19
Yeah.
1:19
So I have been fascinated by programming since I was a teenager.
1:23
I grew up in a time where there weren't a whole lot of other interesting things to do, before cell phones and whatnot.
1:30
And so I've been programming for, I don't know, more years than I should probably admit.
1:36
30 plus at this point.
1:37
And I've been working as a programmer in my own company for about the last 15 years alongside my co-founder.
1:46
And we ended up creating our current project.
1:50
GitClear, which is what the research that we've published is under, in response to having experienced working on large dev teams and having a hard time understanding sort of the visual picture of how much was getting done, and whether we were doing a good job of maintaining code that would be adaptable as time passed and that could be changed at a high velocity.
2:18
So that really was something I was passionate about both because I wanted to understand how our team was functioning.
2:25
But also just as a programmer, I find it really intriguing to look at my own programming velocity and understand how it changes from day to day and from project to project.
2:36
And so that is how I weaseled my way into becoming a CEO, which is really kind of an ad hoc job for me.
2:44
I'm mostly a programmer.
2:45
But since somebody has to be the CEO, that's me.
2:48
All right.
2:49
Yeah.
2:50
Yeah, I was actually looking at the study the other day, and it's interesting: we talk about code quality, and how do you measure code quality?
2:57
And most people do it by looking at the code itself.
3:00
And this study is interesting because you're looking at the signals around how the code is kind of committed and used.
3:07
So can you talk a little bit about the study itself?
3:11
Yeah, of course.
3:12
So what we wanted to understand was not just: is code that is suggested by AI valid, will it compile,
3:24
Will it not introduce security problems and that sort of thing?
3:28
A lot of the research we've seen previously suggests that AI is pretty good at suggesting code that can run; at least by the time a developer commits it, that code is usually sufficiently accurate that it can be used.
3:44
But that doesn't really indicate whether the code that is being written by AI is going to be comprehensible to future maintainers.
3:54
And so what we looked at were the indicators around whether the code that has been added in the last couple of years, since AI has really proliferated, has a higher rate of being copy/pasted code, which in my experience tends to be challenging to maintain.
4:14
Because when you change one thing, the implication is that you should probably be changing it in all the other locations where that identical code lives.
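The maintenance hazard being described can be sketched in a few lines. This is a hypothetical illustration with invented names, not code from the study:

```python
# Hypothetical illustration: the same validation block pasted into two
# handlers. Fixing a rule in one copy silently leaves the other stale,
# which is the maintenance cost of copy/pasted code.

def create_user(email: str) -> dict:
    if "@" not in email or email.startswith("@"):  # pasted copy #1
        raise ValueError("invalid email")
    return {"email": email}

def invite_user(email: str) -> dict:
    if "@" not in email or email.startswith("@"):  # pasted copy #2
        raise ValueError("invalid email")
    return {"email": email, "invited": True}

# The DRY alternative: extract one shared helper, so a rule change
# lands in exactly one place and every caller picks it up.
def validate_email(email: str) -> None:
    if "@" not in email or email.startswith("@"):
        raise ValueError("invalid email")
```

Once both handlers call `validate_email`, a changed rule can no longer diverge between the two code paths.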
4:23
So we looked at that, and we found that indeed, copy/pasted code had become far more prevalent in 2023 than it had been in 2020
4:33
and the year after. We also looked at the speed at which code that had been recently committed was subsequently changed.
4:42
This is often referred to as code churn.
4:44
And we found that there was, I think, about a 50% higher rate that code would be transformed or updated again within two weeks of it having been pushed to the main branch.
4:58
And there was a reduction in the amount of code that was getting changed after a year or two years had passed.
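As a rough illustration of the churn metric being described: the toy sketch below just tracks when each line was first written and whether it was re-edited within two weeks. The actual GitClear methodology is more involved, and the sample data is invented:

```python
from datetime import datetime, timedelta

# Toy commit log: (timestamp, ids of lines touched). In a real analysis
# these would be mined from `git log -p`; the data here is invented.
commits = [
    (datetime(2023, 1, 1), {"a", "b", "c"}),   # three new lines land
    (datetime(2023, 1, 10), {"b"}),            # "b" reworked after 9 days
    (datetime(2023, 3, 1), {"c"}),             # "c" reworked after ~2 months
]

def churn_rate(commits, window=timedelta(days=14)):
    """Fraction of new lines re-edited within `window` of first being
    written -- the 'churn' signal discussed above."""
    first_seen = {}
    churned = total = 0
    for ts, lines in commits:
        for line in sorted(lines):
            if line in first_seen:
                if ts - first_seen[line] <= window:
                    churned += 1
            else:
                first_seen[line] = ts
                total += 1
    return churned / total

print(churn_rate(commits))  # 1 of 3 tracked lines churned within 2 weeks
```

Here only "b" counts as churn, because "c" was reworked well outside the two-week window.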
5:05
And what we have found is that the process of keeping repos maintainable as they age is really dependent on developers finding time and finding ways to go and replace and update the legacy code that inevitably is going to deteriorate over time.
5:24
We are seeing less of that happening since AI has proliferated.
5:28
So those were a couple of the signals that we really honed in on with our research. Just so folks have a clear picture, and no pun intended: what is the business model for your company,
5:40
GitClear? And what was sort of the research pool that you were able to access here? Like, you know, who is working on this code, and was there sort of a control group that you were measuring them against?
5:52
Yeah, so the mission of GitClear is to help developers write better code and to work with less tech debt on a day-to-day basis.
6:04
And we especially want to make it easier for developers to review code because pretty much all the developers I talked to would much rather be writing code than reviewing code.
6:14
And so we've built a code interpretation engine that took upwards of three years for us to initially architect.
6:23
I think that if we were a standard VC funded company, we would have probably not been allowed to spend three years just writing a code interpretation engine.
6:33
But since this was something that I was just personally fascinated by as a developer myself, I wanted for us to be able to recognize code in the same way that developers can.
6:46
And so not just looking at diffs as a bunch of deletions and additions, but looking at a diff as a combination of deletions, additions, updated code, moved code (moved being when you cut a method and paste it somewhere else), copy/pasted code, which is what we looked at for the study, and find/replace code.
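A toy version of that kind of classification might look like the following. This is only a sketch using exact line matching, with invented sample data, and nothing like the real multi-year engine:

```python
def classify_diff(deleted_lines, added_lines):
    """Toy diff classifier in the spirit described: a line that is both
    deleted and re-added is treated as 'moved'; an added line appearing
    multiple times is 'copy/pasted'; the rest are plain adds/deletes."""
    moved = set(deleted_lines) & set(added_lines)
    copied = {l for l in added_lines if added_lines.count(l) > 1}
    return {
        "moved": moved,
        "copy_pasted": copied - moved,
        "added": set(added_lines) - moved - copied,
        "deleted": set(deleted_lines) - moved,
    }

# Invented sample: the helper def moved between files, one line was
# pasted twice, and one line is a genuinely new addition.
diff = classify_diff(
    deleted_lines=["def helper():"],
    added_lines=["def helper():", "total += fee", "total += fee", "log(x)"],
)
```

A real engine would match whole methods and tolerate whitespace or renames, but the categories are the same ones named above.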
7:06
And so having all of that information available to us opened the door to be able to look at really large-scale changes in how the prevalence of copy/pasted code has changed over the last few years.
7:24
And that is where we were able to see the increase in copy pasted code.
7:29
GitClear uses this information to allow developers... well, uses it in a lot of ways.
7:35
But the main way is that we allow developers to see a diff of their work, either on an individual commit, an ad hoc group of commits, or a pull request, where you don't have to review as much code if you can minimize the attention that is getting devoted to looking at code that was merely moved from one file to another, or from one part of a file to another.
7:58
When you can have that level of granularity in interpreting the changes that are happening, it opens the door for a lot of time that can be saved reviewing code.
8:09
And so that's why we started measuring it in the first place.
8:12
But then it had the happy, I guess, side benefit of allowing us to look at the changes that were happening, both across open source projects, where we currently make it possible for people to visit what we call the open repos section of our site, where there's about 50 projects: React, React Native, TensorFlow, VS Code.
8:37
For all these large-scale open source projects, we allow people to browse GitClear's data.
8:43
And so we can also analyze those alongside our customers' repos, for the customers that have opted into anonymized data sharing. Between those two sources,
8:52
we had, I believe, about 153 million changed lines of code that we analyzed over the four-year period between 2020 and 2023.
9:02
So that was sort of what we do and how we use that. Of the copy/pasted code that came before this latest explosion of gen AI, how much came directly from Stack Overflow?
9:12
I'm just kidding.
9:13
You can't.
9:16
Yeah.
9:16
No.
9:16
And we did some research a while back on how much code was being copied from Stack Overflow answers.
9:22
And it was something like one in four people copying from a Stack Overflow answer.
9:28
But one of the things I thought was really interesting and telling was the jump in how much code was edited, updated, or removed within two weeks.
9:39
And I think that went from like 64% in 2022 to like 74%.
9:45
Do you think that speaks to the sort of, like, somebody's-got-to-fix-this-janky-code-you-just-committed situation? Or are people using analysis tools to try to fix things that may have passed human review?
10:01
That seems to be, I think, the reasonable interpretation that a lot of people have taken in seeing this data.
10:09
And it does square with my personal experience using AI assistants over the last 18 months: especially if it's the end of the day and I'm kind of fatigued and just, I don't know, having a harder time thinking through the full implications of the suggestions that I'm given, I'll be more inclined to accept those suggestions, even if I'm not positive that they are going to be functional across all of the circumstances in which the method might be invoked.
10:41
And so I believe what might be happening is that it's more common that this not-fully-considered code is getting pushed.
10:49
And then shortly thereafter, either you'll have your CI tests come back and indicate that there's a problem, or, you know, God forbid, the worst situation happens, where you have something that impacts customers' ability to access a feature or use the site.
11:06
And so that is the other case that I've personally contributed to on occasion: pushing code that was suggested to me, but that ends up creating subtle bugs down the road that I probably wouldn't have created prior to having these suggestions just available through one keypress, at any time, in any method where I'm programming.
11:31
Mhm.
11:32
So one of the things I've also heard is that some of the AI code is a little less readable, a little over-engineered.
11:40
Do you have any indications of whether some of these edits are just sort of like making this cleaner for humans to read?
11:47
Yeah, we don't have that information specifically in the report that we've published so far.
11:53
But I think that is certainly an area that would be great for us, or for another entity, to look into, hopefully in the coming year. I think that AI is really good at suggesting what to add.
12:09
It's always at the ready with one or two or more suggestions for what you can add to your existing file.
12:18
But it doesn't have a path yet through which one can reduce code, or DRY up code when you have multiple methods that share a similar intent.
12:31
And so I think that it is reasonable to imagine that this proliferation of more and more added code might well have a tendency to be more complicated because it's reproducing what you've already created.
12:46
But in the context of whatever module you are currently implementing for X or Y new features.
12:52
So I suspect that may be the case, but we don't have any data at this point to specifically speak to that.
13:00
Right. And you looked at copy/pasted code; it seems like a lot of what ends up being AI-generated is almost like an autocomplete, you know, like you're writing something and it can just kind of finish it out for you.
13:12
Did you also look at that or is there not a way to track that in a similar method?
13:17
The way that we designated copy/paste for this study was: if there was a commit where the same line would occur in more than two places, and that line was not a keyword; we can identify language-specific keywords.
13:34
So obviously you're going to have begin and end and that kind of thing; you're going to see potentially many instances of those within a commit.
13:42
But hopefully you aren't going to have many instances of a 10-line if block that are committed together.
13:49
And so that was what we assessed and that was what we saw becoming more common in the last 18 months.
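As described, the heuristic flags a non-keyword line that appears in more than two places within a single commit. A minimal sketch, with the keyword list and the sample commit invented for illustration:

```python
from collections import Counter

# Language keywords that legitimately repeat within a commit and should
# not count as copy/paste (this list is an invented stand-in).
KEYWORDS = {"begin", "end", "else", "pass", "return", "{", "}"}

def flag_copy_paste(added_lines, min_occurrences=3):
    """Return added lines that occur in more than two places within a
    single commit, ignoring blank lines and language keywords."""
    counts = Counter(
        line.strip()
        for line in added_lines
        if line.strip() and line.strip() not in KEYWORDS
    )
    return {line for line, n in counts.items() if n >= min_occurrences}

# Invented sample commit: "end" repeats but is a keyword; the
# compute_total line repeating three times is the copy/paste signal.
commit = [
    "x = compute_total(cart)", "end",
    "x = compute_total(cart)", "end",
    "x = compute_total(cart)", "end",
]
print(flag_copy_paste(commit))  # {'x = compute_total(cart)'}
```

The keyword filter is what keeps structural tokens like `end` from drowning out the real duplicated-logic signal.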
13:57
And so whether or not those blocks of identical code were completely suggested by the AI, or were the result of somebody authoring the first half of it themselves
14:09
and then accepting the AI suggestion as the other half of it,
14:12
we can't say. I certainly have anecdotally experienced the situation where AI will suggest that I insert a new method that is entirely the same as an existing method, you know, 20 lines that are identical to what we already have.
14:29
And so I know that this happens even with the latest version of Copilot.
14:35
But yeah, we couldn't specifically look at, when that code was inserted, what had been there prior to the developer accepting the suggestion.
14:47
Yeah.
14:47
And another one I thought was really interesting was the drop in moved code and viewing that as a drop in refactoring.
14:57
Can you talk a little bit about the thinking behind that conclusion,
15:01
and what it sort of means for AI code?
15:04
Yeah, absolutely.
15:05
I think that's one of the more interesting and perhaps underappreciated aspects of what we saw: historically, moved code is a huge percentage of the overall change that the average developer will make in the course of their daily work.
15:22
We found that in 2020, moved code was, I wanna say, around 30%...
15:30
yeah, 25%.
15:32
So that was the figure in 2020.
15:36
That was more than we detected as deleted code, more than updated, more than copy/pasted.
15:40
The only thing that happened more frequently than moving code was adding code.
15:45
And so it's something that I believe is really integral to the average developer's commit: you're trying to rearrange code in a way that allows you to reuse similar methods as much as possible.
15:58
And reusing similar methods typically means moving a method that had started within the module or the class for some specific feature, and then extracting it to a utility file or a utility library.
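That kind of "move" refactor, lifting a method out of one feature's class into a shared utility, might look like this hypothetical before/after (all names invented):

```python
# Before: formatting logic trapped inside one feature's class.
class InvoiceReport:
    def _format_currency(self, cents: int) -> str:
        return f"${cents / 100:.2f}"

# After the refactor, the method is *moved* out to a shared utility
# (say, utils/money.py), where any feature can reuse it, and every
# caller exercises the same, collectively tested code path.
def format_currency(cents: int) -> str:
    return f"${cents / 100:.2f}"

class ReceiptEmail:
    def render_total(self, cents: int) -> str:
        return format_currency(cents)  # reuse instead of pasting a copy
```

In a diff viewer that only counts additions and deletions, this shows up as churn; counted as a "move," it is recognized as refactoring.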
16:13
And so it's, I guess, a signature of human developers that they will often be finding opportunities to reuse code, which implies moving code.
16:24
And since there is no analog in how AI assistants currently work (they don't have a way to suggest removing code, only adding it), I think that is reflected in the data that we see, where the 25% of all changes that was moved code in 2020 has now shrunk to only 17% of all changes in 2023.
16:50
And that is a pretty significant change relative to where we started.
16:55
And it definitely tracks with the experience that I have as a developer using Copilot: when there's just a single tab press that can get me the answer to whatever I'm doing, I'm going to be less likely to go looking for an existing method that I might be able to repurpose.
17:15
And that seems to be what's happening at a larger scale in the last 18 months: there are fewer opportunities people are undertaking to take an existing method, move it somewhere else, adapt it, and then use it across those multiple locations.
17:31
One of the other big studies to come out that made claims about productivity was from GitHub itself, and obviously, you know, each company, yours and theirs, has, you know, an incentive to discuss this kind of stuff aligned to their business.
17:45
But I guess maybe I would ask, do you think that both can sort of be true at the same time?
17:51
Which is to say that they don't necessarily contradict each other?
17:54
You know, if theirs says developers finish their tasks, you know, 55% faster, that could be true.
18:02
But also then later they have to, as you say, be moved or changed.
18:05
And then, you know, you mentioned you're feeling, you know, burnt out at the end of the day and you're just looking for a little bit of help.
18:10
You know, they mentioned things like developer satisfaction increasing, or folks feeling like they could stay in the flow state longer because they could avoid some repetitive tasks.
18:20
Based on your own experience and some of the research, do you think that,
18:22
yeah, both of those things could be true? And then, you know, you'd have to go a little deeper, I guess, to measure, sort of, like, well, you know, what is the cost-benefit analysis of,
18:32
Like you're producing more code, you're feeling more energized, you know, you're getting projects done quicker.
18:37
But then as you said, maybe you're reverting them more frequently, you know, you're churning through stuff, you're moving stuff and maybe the stuff that's in there is not quite as high quality.
18:46
Yeah, absolutely.
18:47
I think that is what all of the data that has been produced to date suggests: there's this story that AI is suggesting usually good code, usually valid code and functional code.
19:01
And by virtue of accessing that valid functional code more quickly than it would take if you had to go through, you know, a directory that had a bunch of different potentially reusable methods.
19:14
Of course, it's going to be faster to just recreate the method in your file.
19:19
And especially if you are a developer that is new to the project, you might not even be aware that there is an existing method that can make whatever transformation you are looking to make within the existing architecture of the project.
19:33
And so if you can save the time, having to go look up that method, then yes, you're going to be 55% more productive in the short term.
19:42
But it's really a question of what does that imply?
19:45
A year or two or three down the line.
19:47
When for years, there is a low percentage of code getting moved.
19:52
Thus, the implication is, there's a low percentage of similar methods being consolidated and DRYed up so that they become reusable.
20:01
Of course, the other benefit to reusing methods is that you're going to have better test coverage around them: the more times that you use your print-currency method, or whatever method it is that multiple modules need to access, the more avenues through which you are testing the degenerate cases for it.
20:20
And so when AI is making suggestions that will work well enough to pass your tests, well enough to finish the given ticket that you're working on,
20:31
but then imply that down the line you're going to have three to five similar methods, none of which have been really well tested around the edges,
20:40
I think that is the risk that a lot of companies are taking right now without necessarily knowing that they're taking that risk.
20:49
There's several new large language models that have these, you know, massive million token context windows and we were talking on the podcast the other day about a program that takes your entire repo and puts it in the context window.
21:02
Do you think something like that will improve the problem?
21:06
Like it's not just copying in janky code?
21:09
It has the entire code base as the context. One likes to hope.
21:14
So I think that at some point it is going to become apparent, to the extent that teams are measuring how their velocity is changing over time.
21:25
I think they will measurably see that, as their lines of code continue to increase, their velocity tends to decrease.
21:33
So at some point, teams that want to, you know, maintain a project for five or 10 years and have that ability to change things and to add things be as fast in the future as it is when the project begins.
21:47
I think it's going to be necessary for teams to find opportunities to reuse their existing code.
21:54
But so far, I have not seen any evidence that any team has succeeded in proposing a way to do this.
22:02
And moreover, I haven't even seen any precedent for how that would be presented to users.
22:08
I know, you know, it's fairly commonly known in projects that you'll have a couple files that are like the dungeon of the repo, where if you go into this file, it's gonna be a mess, and hard to understand, and hard to maintain.
22:24
And so you just kind of try to avoid that.
22:28
And it's not very common that teams will go back and specifically make a ticket to revisit methods like that, because usually what management is telling you is: get more done, you know, get this next ticket done.
22:41
We don't have time to go revisit sloppy code just because it is unpleasant to work in.
22:47
But I think that unless you have some kind of incentive, or unless you have a technical leader that is aware that the long-term cumulative detriment of tech debt is going to eventually slow down the team to the extent that they can't get their projects done,
23:10
I don't know that they're going to carve out time specifically for that rewriting and specifically for that cleanup.
23:18
And so that's what I wonder with regards to the larger token windows is, does that mean that teams actually would stop and look for opportunities to clean up their code?
23:28
And if not, then we would have to hope that these newer LLMs can just afford people opportunities to allow code to be moved, to approve moved code in the course of their normal development.
23:42
And it's a little bit hard for me to imagine what kind of UI that would look like.
23:46
It's not just going to be a tab press, because it has to be removing code and adding code.
23:51
So, I think that's a pretty difficult problem, but one where I would imagine that interest for it is going to increase as awareness increases, that we are adding code more than is advantageous for our long term interests.
24:06
Right?
24:07
Let me ask you a question, and maybe talk our book a little bit. You know, you mentioned, right,
24:11
like it would be perhaps more productive, although more work, to go back and look through the code base to see what kind of, you know, methods or things you and your colleagues had already implemented; maybe you could reuse them, knowing that they sort of fit into the code base, but alter them in a different way.
24:26
So if, for example, the gen AI system was trained on your own code base, as opposed to all, you know, code on the internet or something, it had that style, and it could say something more along the lines of: well, you know, I think the solution might be X, and here's a citation to where I found it within our code base.
24:44
That's sort of like the Stack Overflow dream of what it would look like.
24:47
Right.
24:48
Yeah.
24:49
Yeah, I think that would be a big step forward from where gen AI has gotten so far, and, you know, given how rapidly these things are evolving, it's certainly conceivable that that will exist.
25:00
But still, I think that is the first step: having gen AI with a large enough token window to enable it to recognize when an existing method could be reused.
25:10
But I think that still doesn't get to the second step that a team is going to need if it hopes that its gen AI is going to produce code on par with what humans can currently produce, which is not just reusing what already exists, but finding opportunities for something that exists to be modified slightly so that it can be used in more different contexts,
25:33
And so that you can have more avenues through which that code is exercised and tested.
25:38
And so that I think will take more than just citations and awareness about what exists.
25:45
It will take an AI that has been trained on how code evolves, and where it can see examples of what humans often do, which is, in the course of evolving code, making tweaks to existing methods that render them more flexible and more applicable to different kinds of similar situations.
26:03
And that is the bigger challenge, where I'm not sure there is a solution on the short-term horizon.
26:14
All right, everybody.
26:15
It is that time of the show where we want to shout out someone who came on Stack Overflow and shared a little bit of their knowledge or contributed a little bit of their curiosity. Awarded two minutes ago to nw: a Famous Question badge, for a question with 10,000 views or more.
26:28
"How to make a grid with integers in Python" helped over 10,000 people.
26:33
So appreciate the question and it has a great answer.
26:36
As always, I am Ben Popper.
26:38
I'm the director of content here at Stack Overflow.
26:40
You can find me on X at Ben Popper.
26:42
Email us with questions or suggestions for the show: podcast@stackoverflow.com.
26:46
Let us know what you want to talk about or come on as a guest.
26:48
And if you enjoyed the conversation today, leave us a rating and a review really helps.
26:52
I'm Ryan Donovan.
26:53
I edit the blog here at Stack Overflow.
26:56
You can find it at stackoverflow.blog.
26:58
And if you want to reach out to me, you can find me on X at R Thor Donovan.
27:03
I am Bill Harding.
27:04
I am the programmer and CEO of GitClear. You can find us at gitclear.com.
27:10
We're also on Twitter slash X as GitClear.
27:14
We are going to be publishing research every quarter this year.
27:18
And so our existing research can be found
27:21
if you just Google "LLM code quality GitClear". You can download our research, which has all of what we've talked about today, along with charts and visualizations to help make the case for better code quality on your team,
27:36
if that's something that matters to you. And yeah, GitClear is working to be a code review tool that can make pull request review faster than what is possible
27:48
on GitHub.
27:48
So any team that is interested in spending less time reviewing code and more time writing code ought to check us out.
27:56
Great.
27:57
We will put a bunch of those links in the show notes.
27:59
Thanks for listening and we will talk to you soon.