dwbapst/Programming_borrowReuseSteal_09-21-20.md


      
Programming and How To Use Code from Other People in Your Code

In programming, we often "borrow, reuse or steal" other people's code. This can blur the line between plagiarized work and original work. So, let's dive a little into how code is reused, and what proper reuse and attribution look like.
Essentially, All Programming is Borrowing

First, all programming depends on previous programming: the code you will write in this class (and pretty much any other data science class) relies on a language (R, Python, Java, etc.) with a large number of commands, and the commands available in a language are what make that language useful. (In R, and many other languages, we refer to 'functions', not commands.) Those 'functions' themselves are pre-packaged segments of code that we can call upon by simply uttering the necessary command -- and they were written by previous developers. So, anyone who develops using such programming languages automatically relies on other people's work constantly.
Now, when we call those 'functions' or 'commands' in a programming language, the computer runs that code, but all we had to do was name the right command and give it the right inputs. The code itself was (in most cases) supplied by a package, containing dozens of commands, all written by the same team. In many cases, these packages are part of a larger ecosystem (like CRAN is, for R), which automates obtaining other packages they depend on, and updating packages when they are updated by their maintainers. The packages are loaded into a system that keeps track of the commands that can be called, usually called a library, such that loading new packages will expand your library of available commands.
But not all code exists in such simple pre-packaged envelopes. Some code is just abbreviated scraps and half-finished fragments of packages -- this is the case for a lot of code on GitHub, for example (GitHub is a giant repository of code projects, most of it open-source). When you use a search engine to figure out the solution to a problem, you may find yourself at StackExchange, a network of hundreds of related question-and-answer sites, which are particularly popular among programmers, engineers, statisticians and people in other data-science related fields. Often, smaller 'snippets' of code will be there to help you figure out the single-line fix to your problems, just a simple cut-and-paste away (hence the slang 'snippets'). Or, perhaps, the person writing the solution doesn't provide the code directly, but you can pretty much infer how to write it just as well anyway. This is the kind of solution you will fall back on, over and over, during long hours of programming. To use any of these code 'snippets' that you will trip over while furiously googling solutions, you'll have to bring the code in yourself, usually by copying and pasting it into your own code.
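As a small illustration of calling a packaged function, here is a sketch in Python (one of the languages mentioned above). The variable names and values are made up for this example. The `statistics` module ships with Python's standard library, so all we do is name the function and hand it inputs:

```python
import statistics  # a module from Python's standard library

# Hypothetical measurements, invented for this example.
measurements = [7, 1, 4, 9, 3]

# We name the command and give it inputs; the code that does the work
# was written by earlier developers and lives inside the module.
middle = statistics.median(measurements)

print(middle)
```

Nothing in this script says how the median is found; that work was done by whoever wrote and maintains the module.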

There's an apocryphal joke that a manager once called a software engineer into their office and told the engineer to sit down. "Someone just told me that 90% of the code you write is copy/pasted from StackExchange!" the manager roared.
The software engineer nodded with agreement. "That's correct."
"Then why should I pay you an engineer's salary when anyone could copy and paste code from the internet?"
The engineer smiled. "Because I know which pieces of code to copy/paste from StackExchange."

That joke captures many aspects of typical programming in a nutshell. In case you need more evidence than my say-so, there is even an academic study that examines how often bits of code from one StackExchange site (StackOverflow) end up in GitHub projects (answer: quite often). Clearly, action movies showing programmers at work must be skipping over the scene where a hacker spends a few hours searching "how to hack the bad guy's computer" before figuring out what code is necessary to trick a door into unlocking itself.
Now, there are a few issues with this approach of 'borrow, reuse or steal'.
Issue 1. Do you understand what this code actually does, and does it actually work?

The first is that you might not understand 100% how the code is solving your problem. To you, the code might as well be doing it as if by magic. But such 'magical thinking' almost certainly applies already in many cases with the existing commands and functionality we rely on from 'real' packages. I don't know the exact nature of the sorting algorithm in R when I call the command sort, and ninety-nine days out of a hundred, I really don't care to know -- and I don't expect you to know either! I just know that for the purposes of sorting that I encounter, that method is sufficient.
The difference, though, if I were using some code for sorting from StackExchange, would be that there isn't anyone maintaining the code. If it were part of a package, someone (a maintainer for the package, or just a user) might have noticed a bug in the code and communicated with the maintainers, and the bug would have been fixed in an update. (We'll talk a lot more about bugs and how we find bugs in code elsewhere in this course.) No one is doing that for the code from StackExchange, and while the scraps of 'useful' code on GitHub might get fixed, many never do. In fact, the most commonly reused bit of code from StackExchange contains a bug, as the original author of that code realized years later.
Formal packages from experienced developers also often have automatic testing suites, making sure that at least in some situations, the code is doing what it is expected to do. Of course, actual test coverage is often very poor across packages, especially those that exist for specialist or boutique analysis. However, it still provides more of a safety net than static, unmaintained snippets of code. The only thing that can ensure that code from a response on StackExchange is correct is you, and your ability to read and understand what the code did. If you use code from another source, you need to understand what it is doing and how it is doing it, and that it works across the range of scenarios you are trying to apply it to. All responsibility falls on you, just as if you'd written that code yourself. It doesn't mean that you could have written the code independently of having checked StackExchange. It just means, once you know one possible solution to your problem, you know how that solution was reached.
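As a minimal sketch of what checking borrowed code "across the range of scenarios you are trying to apply it to" could look like, suppose you had borrowed a short helper for splitting a list into fixed-size pieces. The function `chunk` and its test values are hypothetical, written for illustration only and not taken from any actual StackExchange answer. A few assertions on ordinary and edge-case inputs, shown here in Python, are one way to confirm that you understand what the code does:

```python
def chunk(items, size):
    """Split `items` into consecutive lists of at most `size` elements.

    Hypothetical stand-in for a snippet borrowed from an online answer.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    # Step through the list in strides of `size`, slicing out each piece.
    return [items[i:i + size] for i in range(0, len(items), size)]


# Checks covering the kinds of input we actually expect to pass in.
assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]          # divides evenly
assert chunk([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]]  # leftover element
assert chunk([], 3) == []                                  # empty input
assert chunk([1, 2], 5) == [[1, 2]]                        # size exceeds length
```

If one of these checks fails, or you cannot predict what a check should return, that is a sign you do not yet understand the snippet well enough to take responsibility for it.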
This is very similar to expectations of how teams of professional developers write code together. Different developers are likely to be responsible for different bits of code, but if one developer is writing code that their colleagues don't understand, then there is a considerable danger that someone will write code that breaks someone else's code. Thus, in a team effort, everyone needs to have at least a working understanding of everyone's code in order to ensure that the code is functional.
Issue 2. Could someone else figure out how you wrote this code?

The second issue is that code needs to be transparent to the next person who reads it. Most code that exists isn't very transparent to the average reader, for a variety of reasons that we will explore more in this course later.
Now, if your code consists of extremely minimal statements that involve rather straightforward calls to formally packaged commands, then your code already has a certain degree of transparency automatically. To make the code run exactly as it ran for the author, all a person would need to know is exactly what packages you used (as different versions of the same command might be in different packages), and which versions of those packages you used. However, code that does anything more... nonintuitive, as code transcribed from StackExchange often does, can be much more difficult to read. After all, to solve outlier problems, you often need outlier solutions. In that case, if someone goes back to read your code and tries to figure out what you did, and finds it mingled with code that someone else wrote, in a completely different programming style and with minimal commenting, then it could be quite difficult to figure out what the code is doing without disassembling it and piecing it together themselves.
You might be wondering, 'wait, style matters?' Style matters a great deal, and despite having ultimately so few style conventions to differ on[^1], programmers who use the same language can have strikingly different styles, the result of stubborn adherence to long-held habits. When you copy/paste code from another source, you might simply change the name of a variable or two, pop it into your code, and call it a day. This makes a jarring change in code style, and if there is no commenting, the next programmer who reads your code will only know that you clearly got the code from another source, and will infer that you didn't even understand how the code worked. You aren't doing the bare minimum necessary to make your code human-readable, which implies that you couldn't read it yourself.
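One way to record which packages and versions a script ran with is sketched below in Python, using the standard library's `importlib.metadata` module. The helper name `report_versions` and the package names passed to it are assumptions chosen for illustration; substitute whatever your own script actually imports.

```python
import sys
from importlib import metadata


def report_versions(package_names):
    """Return a dict mapping each package name to its installed version."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Record the interpreter version alongside the package versions.
    print("Python", sys.version.split()[0])
    # Example package names only; list the ones your analysis depends on.
    for name, version in report_versions(["numpy", "pandas"]).items():
        print(name, version)
```

Saving this output next to your results gives the next reader what they need to rebuild the same environment.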
[^1] Almost all code is written in plain-text, so style is limited to how whitespace is used, where line breaks are placed, how code is indented, what conventions are used for naming objects or new functions, etc.
So, code needs to be documented and explained if it is to be reproducible. This is key not just for programming in isolation, but also in programming as a team. Commenting is what helps the rest of your team know what your code does without having to ask you a million questions. Furthermore, it will even help you in a year or five years, when you decide to read this particular chunk of code again.
Finally, making your code fully documented, including attribution, ensures reproducibility. This is particularly relevant to academic analyses, as well as modern data science, where the ability to easily replicate someone's analysis speeds up the ability to communicate, visualize, and add further analyses. The need to replicate computational science is just as integral as, if not more than, the need to replicate the science that is performed on a physical lab bench. Furthermore, by having attribution that includes a URL link to the original source, future developers can check a link (to StackExchange, for example) to see if the code used has possibly been updated, or reported to be buggy, or if a different alternative solution is now recommended. Without the attribution and a URL link, this sort of easy improvement of code by checking a URL would be much harder, leaving the code static despite bugs being found by others using that solution, or despite the need to change some commands because of changes to the underlying code libraries.
Issue 3. Are you within your legal rights to use the code you stumbled upon?

So, if one were to reuse code from a stumbled-upon source like StackExchange, clearly there is a need to personally document and comment that code, to ensure that you understand the code, and that others who might examine your code, both now and in the future, can understand it too. All of that notwithstanding, there are further complications. Do you even have the right to reuse the code in the way that you are reusing it? Do you need to make additional allowances to reuse this chunk of code? Computer code is a product of a specific set of authors, like a book or the lyrics of a song, and thus is subject to copyright. That means that to use someone else's code, they need to issue a license that allows you to reuse it in the circumstances in which you are reusing it. Licenses can place all sorts of restrictions and requirements on your reuse of their code.
In particular, everything on StackExchange is covered by a Creative Commons license ("CC-BY-SA 3.0"), with clauses that require that any reuse of the code attribute the source, and that any product reusing it carry a compatible or identical license with the same conditions. Much of the code on GitHub specifies a specific license as well, often some variety of open-source license. Creative Commons licenses are perhaps the most common form of open-source licenses you are likely to encounter, especially online. Today, because of strong activism from a broad community of software developers, librarians, technologists, artists and writers to develop an open-source culture, there has been a proliferation of works licensed under such 'open source' licenses, which allow relatively broad reuse, with fixed limitations.
The popularity of the Creative Commons licenses is partly due to their modularity, allowing different components to be combined or included as separate clauses indicated in the license name, such that invoking the shortened name itself satisfies licensing requirements. For example, the 'CC BY-SA' license applied to code on StackExchange contains the 'BY' Attribution clause and the 'SA' Share-Alike clause -- requiring that any derivative work must give attribution to the original work, and that all derivative works must carry a license identical to the CC BY-SA license. Two additional clauses that one may run into with the CC license are 'NC', for Non-Commercial (derivative works can only be for non-commercial purposes), and 'ND', for No Derivative Works (any reuse of the material must be verbatim copies of the entire work, without edits or remixing). Licenses you will likely encounter that are widely used for software are the Apache License, the BSD-2 and BSD-3 licenses, the GNU General Public License (GPL), the MIT license, and the Mozilla Public License. All of these differ in small and large ways from CC licenses.
Unfortunately, even software-specific licenses can have unintended consequences, because programming isn't like writing a book. If we were to copy a sentence from a book and put it in a graded paper or our own published book without clearly indicating to the reader that this segment was a direct quote, we'd be committing plagiarism. Copying several pages of text, even when you clearly indicate you are excerpting directly from another work, would still probably be considered beyond the bounds of 'fair use'. Things are less black and white with computer code, and have been for a long time. Is copying a line, or an entire function, from another work that lacks an explicit license a violation of copyright? With attribution? What if the code is something so simple you could have worked out that exact solution independently? What if you instead use this code as inspiration to write a fix to your particular issue? That sounds great, but how different do you need to make the code for it to be considered an independent work?
I don't have simple answers to these questions: these are issues that our society is still actively grappling with. Some of these questions are so sticky, you might need a background in intellectual property law to know which way is up. What I will say is that you should absolutely care about these scenarios and the ethics involved, rather than only caring about them when they become important in hindsight. If you work professionally developing software, and especially if your code is part of the product itself, rather than a service, then you must strive to only use code that explicitly states its license, and use code only within the bounds of the included license. For example, remember how the most widely copied piece of code from StackExchange mentioned above had a bug? It turned out to be used in production code from Oracle, a software company, ironically the employer of the author who wrote that code, though he was not employed there at the time he wrote it, nor at the time that the code was placed without attribution into the Oracle codebase. Whoops! And Oracle is often regarded as a company that takes these licensing issues very seriously, so the offending code was scrubbed and replaced with an alternative piece of code.
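The modular naming described above can be sketched as a small lookup in Python. The function `describe_cc_license` and the one-line summaries are illustrative paraphrases of the clauses as described in this text, not legal definitions; the license deeds themselves are the authority.

```python
# Clause codes described in the text, each with a one-line summary.
CC_CLAUSES = {
    "BY": "Attribution: credit the original work",
    "SA": "Share-Alike: derivative works carry the same license",
    "NC": "Non-Commercial: reuse only for non-commercial purposes",
    "ND": "No Derivatives: reuse only as unmodified copies",
}


def describe_cc_license(name):
    """Expand a short name such as 'CC BY-NC-SA' into clause summaries."""
    # Treat hyphens as spaces so 'CC-BY-SA 3.0' and 'CC BY-SA' both work.
    tokens = name.upper().replace("-", " ").split()
    # Drop the 'CC' prefix and anything non-alphabetic, such as '3.0'.
    codes = [t for t in tokens if t != "CC" and t.isalpha()]
    unknown = [code for code in codes if code not in CC_CLAUSES]
    if unknown:
        raise ValueError(f"unrecognized clause(s): {unknown}")
    return [CC_CLAUSES[code] for code in codes]


for line in describe_cc_license("CC-BY-SA 3.0"):
    print(line)
```

Reading a license name clause by clause like this is a quick first check on what a reuse would oblige you to do, before you read the full license text.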
Why would anyone care? Because licenses have consequences.
Let's consider code released under the CC BY-SA license, as described above. Such code is licensed for reuse only if attribution is given, and only if all derivative works also carry the CC BY-SA license. While this makes sense (perhaps) for creative works, for software it gets complicated, especially for code that has been modified many times. Does each author who has added a modification need to be attributed, or only the most recent? The answer is that all of the authors must be attributed. Furthermore, it means that whatever code you include some StackExchange code in must now carry a CC BY-SA license. It doesn't mean you can't commercially profit off the code, but you will need to make the source code available, and allow for derivative works that follow the same license. For some companies, that's fine, and for others, that's terrible.
Furthermore, the CC BY-SA license restricts all downstream code to the CC BY-SA license forever. In academic science where computational analyses are common, this has the potential for wreaking havoc. Even the most scholarly academic works often will cite certain recent overviews in order to provide a connection to a large number of earlier authors from whom originated many of the basic ideas and contributions in that area. This scenario is exceedingly common in science -- but it's not a scenario that would satisfy the CC licenses with BY and SA clauses. Thus, in day-to-day science, citing a paper which cites some other paper is nearly as good as citing the original paper (if you don't directly refer to or reuse specific elements of the work that you aren't directly citing), but with CC BY-SA, any reference to the original or products derived from the original must be directly attributed in any downstream work, and carry the same license.
Similarly, if code is published under a CC license with both the NC (non-commercial reuse only) and SA clauses, that code can now only be used in non-commercial works, works which themselves must also require that all downstream works are non-commercial. The only way to use that code for a commercial application is to contact the author of the code and ask them for permission (which is effectively its own license). While it's understandable why many creators could sensibly want to use a CC NC-SA license with full understanding of the implications, it is probable that there are many cases, especially in programming, where CC licenses with SA clauses are used without full understanding of the downstream implications for all derivative works. This 'zombie' property of CC-SA licenses is also found in some other non-CC open-source licenses, widely used for software. The most appropriate licenses for academic or non-governmental organization work, where typically we try to force no legal obligations onto derived works, may be a classic public domain license, or the somewhat more formal Creative Commons Zero license, which specifies that the work is meant to be an open-source product, with no expectations made on derivative works.
Again, I don't have easy solutions to recommend here, along the lines of 'use this license and you'll always be fine'. Which license you use for a given work or project is up to you, and dependent on what you're doing, who you're doing it for, and what you envision people doing with the product into the far future.
So What Does This Mean For Us?

I expect that in this class you will regularly find yourself using a search engine with mad abandon, trying to find solutions to a programming obstacle that is tripping you up, and that you will sometimes find elegant solutions that I did not realize existed, and that require minimal alteration to be applied to the presented problem. I expect this because this is what I think many programmers do, and my job is to train you to think like a programmer. I also expect that you may be working together, in pairs or small groups, or even the entire class united against me. This is entirely natural, and expected, as programming in a professional setting is often a group effort.
With that said, if you do that, I will expect that you also do the following.


I expect that you will follow whatever licensing restrictions are placed on the code you've found, whether it be a CC license, an MIT license, a GNU license, a BSD license, a Mozilla Public license, etc. If derivative works must be released under a compatible license, then you should name the license you are using at the top of your code. You do not need to include a license file, containing a complete copy of the license, as is required by some licenses. It will be enough for my purposes, as someone trying to assess your code, if you simply indicate which license you will use somewhere at the start of a programming script.


I expect that you will extensively comment all of your code, to a level that demonstrates you understand how that code works. If you happen to be working with another classmate, I expect your commenting to differ and be in your own voice, as there is no reason for commenting to be identical, even though code might be identical. If I don't think you understand how the code works, then how can I believe that you even trust it to do what you say it will do?


I expect that your commenting will indicate which lines of the code come from an outside source, and where that outside source can be found, and attribute it to the author (if the author is indicated). I expect this regardless of what license is on the original work. In all cases where it is possible, include the URL of the originating source, regardless of whether that is a StackExchange thread, a GitHub repository, or a more private site, like a blog.


I expect that you will reformat the code so it matches whatever styling preferences you are using elsewhere in your code. Styling can encompass how you use whitespace, where you place line breaks, how much you use tabbing to indent nested blocks of code, how you use capitalization or symbols in naming new variables and functions, and many other aspects. This requirement may mean nothing if the solution is extremely simple, but making sure the styling conforms may mean a lot if it is a lengthy chunk in the middle of an already lengthy block of computer code.
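Taken together, the expectations above might look something like the following sketch in Python. Everything specific in it is a placeholder invented for illustration: the URL, the author name, the choice of license, and the function `mean_of_columns` itself. Treat it as one possible layout, not a required template.

```python
# License: CC BY-SA 4.0
# (Named here because the adapted snippet below is share-alike; the
# license choice is an assumption made for this illustration.)


def mean_of_columns(rows):
    """Return the mean of each column in a list of equal-length rows."""
    # --- Begin adapted code ------------------------------------------
    # Source: https://example.com/questions/12345 (placeholder URL)
    # Author: "example_user" (placeholder name)
    # Changes: renamed variables and reformatted to match this script.
    #
    # zip(*rows) regroups the values by column instead of by row, so
    # each tuple in `columns` holds every value from one position
    # across all of the rows.
    columns = list(zip(*rows))
    # --- End adapted code --------------------------------------------

    # Average each column; this part is my own code.
    return [sum(column) / len(column) for column in columns]


print(mean_of_columns([[1, 2], [3, 4]]))
```

The begin and end markers show exactly which lines came from elsewhere, and the comment explaining `zip(*rows)` is there to show that the borrowed line is understood and not just pasted.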


Note that as you develop your own stylistic tendencies more, and you work at solving more difficult stumbling blocks, you will find yourself rewriting the code more and more. Eventually, the line will blur between whether your code is the result of copying and then modifying the code, or is new code, simply inspired by the original code. The line between these is indistinct, and even 'inspirational' sources should be properly attributed. Again, I cannot answer exactly what this distinction is because it is a question our society still hasn't reached a simple answer to. However, as you become more advanced in programming, you will gain a sense of which solutions are really unique, such that copyright could be argued for, versus mechanisms that are so generally well known across the community that no realistic claim could be made on them. Ultimately, this ability to better distinguish common knowledge from intellectual property will best inform you of the line that separates your own original work from derived work.