@math-a3k
Created July 2, 2023 21:18
On Github’s Copilot copyright concerns
======================================
(Submission for https://www.fsf.org/blogs/licensing/fsf-funded-call-for-white-papers-on-philosophical-and-legal-questions-around-copilot)
In Software, a machine-learning model is the application of a mathematical model - usually from Probability and Statistics - to the decision process.
The same mathematical model can be implemented to perform an analysis in other realms - e.g. in gene sequencing it may be called a Bioinformatics model.
The machine “learns” because it bases its “decisions” on the model, taking into account "what has been observed"; the decisions are not "directly" specified in the software.
It is not "if a == 2:", it is more like "if a in pattern_recognized_in_previous_input:".
Which kinds of patterns are recognized, and how, is specified by the mathematical model. Its usefulness - generally speaking - depends on the data supplied and on how adequately the model grasps the underlying process generating them.
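The contrast between a directly specified decision and a "learned" one can be sketched in a few lines. This is only an illustration: the function names and the choice of "pattern" (the set of values seen so far) are assumptions made for brevity, not how any real service works.

```python
# Hypothetical illustration: a hard-coded rule versus a "learned" one.

def hard_coded_rule(a):
    # The decision is written directly into the software.
    return a == 2


def learn_pattern(observations):
    # "Training": the recognized pattern is simply the set of values
    # observed so far (an assumption made to keep the sketch short).
    return set(observations)


def learned_rule(a, pattern_recognized_in_previous_input):
    # The decision depends on what has been observed, not on the source code.
    return a in pattern_recognized_in_previous_input


pattern = learn_pattern([2, 4, 4, 8])
print(hard_coded_rule(4))        # False
print(learned_rule(4, pattern))  # True
```

Changing the observations changes the decisions while the code of both functions stays the same.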
An abstract mathematical model provides relations among entities, which allow conclusions to be deduced. To give the model a "meaning" in a domain, the concrete parts of the domain that map to the entities the model relates must be defined; then the "concrete" model that best applies to the problem can be identified.
Models may be "generic" or "concrete".
When stated, the key parts are "generic" and can be loosely regarded as "parameters in some space" (independently of the model-based or algorithmic approach), e.g. "a population with a normal mean of mu". Those "parameters" are estimated from sample data "observed" from a population or a generating process. Once this has been done - "a population with a normal mean of 3" - the model is "concrete" for a task.
This is generally referred to as "inference", or, in the machine-learning vocabulary, as "training".
Once the model is "concrete", conclusions can be drawn, e.g. "if I observe this, it should be that".
The accuracy or correctness of the conclusions depends on the adequacy of the model for the situation and the "observed" data provided.
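As a minimal sketch of going from a "generic" to a "concrete" model, the "normal mean of mu" example can be written with Python's standard library. The sample values are made up for illustration, and modelling the population as a normal distribution is the assumption being made.

```python
import statistics

# Hypothetical sample "observed" from the population.
sample = [2.0, 3.0, 4.0, 3.0]

# Inference / "training": estimate the generic parameters from the sample.
mu_hat = statistics.fmean(sample)
sigma_hat = statistics.stdev(sample)

# The model is now "concrete": a normal population with the estimated mean.
model = statistics.NormalDist(mu=mu_hat, sigma=sigma_hat)

# Drawing a conclusion: how plausible is an observation at or below 1.0?
print(mu_hat)          # 3.0
print(model.cdf(1.0))  # a small probability: 1.0 would be unusual here
```

How trustworthy that last conclusion is depends on whether a normal model fits the situation and on the sample provided.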
A "software implementation" is the result of providing the machine instructions - specific computational algorithms, data processing, etc. - to make a computer perform the calculations for an input and provide the results according to what the mathematical model specifies in a certain context.
Implementing those models in the decision-making process of a piece of software in order to provide its functionality is referred to as "machine learning" or "artificial intelligence", as the machine makes decisions based on what it has "observed".
A mathematical model is not copyrightable, though its implementations are.
As such, Copilot’s implementation of mathematical models to predict the following code from the code already written, given what it has "observed" from previous code (a "knowledge base"), is copyrightable, and its copyright is held by its authors and/or Github.
Training is not analogous to compilation. Compilation is the process of translating from one language to another, while training is the process of estimating the model’s parameters in order to make the model useful in a certain situation.
Nor is it analogous to linking. Once the inference has been performed on a model and its parameters have been estimated - trained - the model becomes independent of the data: "the knowledge has been abstracted from the sample in the form of concrete parameters". The data is neither linked to the model nor needed for its functioning. Also, different samples (training data) can lead to the same results, which in most situations makes it infeasible to determine exactly which training data was used to arrive at a given conclusion.
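A small sketch of that last point, under the simplifying assumption that the whole "model" is a single estimated mean: two different samples yield the same trained parameters, and the trained result keeps no trace of which sample produced it. The `train` function is hypothetical.

```python
import statistics


def train(sample):
    # The "trained model" is only the estimated parameter;
    # the sample itself is not stored or linked anywhere.
    return {"mu": statistics.fmean(sample)}


model_a = train([1.0, 2.0, 3.0])
model_b = train([0.0, 2.0, 4.0])

print(model_a)             # {'mu': 2.0}
print(model_a == model_b)  # True
```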
Whether a model is trained or untrained makes no difference to the implementation; training only makes it applicable or useful to a certain problem or domain.
By adding data and re-training the model you are not extending the software, you are trying to extend its usefulness. This can be seen as analogous to an OpenStreetMap server installation and its map data: an installation with an empty database may be less useful than one with a "full" database, but the data does not modify the software at all.
The results of training - the estimated parameters of the model - are independent of the software itself and copyrightable by the user (and/or entity) that provided the input and executed the software, analogous to the conclusions of an analysis or any output of a software execution - e.g. calculations, documents or images - if it is executed on the user's computer.
If it is not executed on "owned" hardware, the "Terms of Service" apply.
While it may be "usual" that you hold the copyright of the output while granting the provider the right to store and analyze it, this may not be the case.
In this particular case, the results of the training are copyrightable by Github, as they are the result of the execution of their software by them - or by someone on their behalf - while the results of "predicting the following code according to the code written" (Copilot’s output) "should" be copyrightable by the user who provides the input and executes the software (Copilot’s editor plugin).
The "should" is there because, unless Copilot’s Terms of Service specify otherwise, nothing prevents Github from declaring in them that "by combining the output of the service with your data, your data is automatically licensed to Github".
Copilot’s Terms of Service are not formally publicly available at the time of writing.
Part of the controversy arises from confusion about how machine learning software works, and from the fact that Free Software licenses treat source code as "the preferred form of the work for making modifications to it" but do not explicitly consider it as content.
Machine learning models consider their input as data, independently of its nature - whether it is numbers, text, images or source code (a particular form of text).
In this case, the software learns from "software", but that "software" is not executed, combined, modified, distributed, linked, or interacted with remotely through a computer network: its source code is copied to "extract" the common patterns and is treated as "content" - "the authors' opinion about what a computer should do in a situation".
This type of modelling may be referred to as "Text Mining" or "Natural Language Processing", depending on the emphasis. Examples can be seen in Google's GMail and Grammarly - both suggest natural language snippets and changes while you type, based on the analysis of a "knowledge base". In this case, the natural language would be a "computer programming language".
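To make "source code as data" concrete, here is a deliberately tiny sketch of a next-token suggester that only counts which token followed which in its "knowledge base". It is a toy frequency model chosen for illustration, not a description of Copilot's actual model; the snippets and the `train` and `suggest` names are hypothetical.

```python
from collections import Counter, defaultdict


def train(snippets):
    # Count which token follows which; the snippets are never executed,
    # they are treated as plain text.
    follows = defaultdict(Counter)
    for snippet in snippets:
        tokens = snippet.split()
        for current, following in zip(tokens, tokens[1:]):
            follows[current][following] += 1
    return follows


def suggest(model, token):
    # Most frequently observed follower of `token`, or None if never seen.
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]


knowledge_base = [
    "for item in items :",
    "for key in keys :",
    "for item in range ( 10 ) :",
]
model = train(knowledge_base)
print(suggest(model, "for"))    # item
print(suggest(model, "while"))  # None
```

The trained `model` holds only counts of token pairs, which is the sense in which the patterns are "extracted" from the content.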
Training algorithms on publicly available data should be considered Fair Use, as anyone should have the right to extract knowledge from what is seen, using the tools deemed proper. Otherwise, data analysis would be restricted to those who meet "certain conditions", and only they could achieve higher levels of understanding.
Claims of copyright violation against the snippets and code generated by Copilot should be unlikely to succeed, because the output should be an abstracted result of an automated analysis, analogous to "a common opinion about a pattern that emerged 'objectly' from data analysis". If the output happens to be the same as one or more of the data points used for training - or even as a codebase not included in the training - it should be analogous to "it came to the same conclusion".
"for [item] in [items]:" should not be considered plagiarism. But "Hey, I wrote a 'machine learning software', trained it on a dataset (llvm and gcc), wrote '# give me the best compiler', and it generated a copy of gcc of which I own the copyright" would be unlikely to succeed as a claim, as you would have to demonstrate that the results of your software are analogous to "a common opinion about a pattern that emerged 'objectly' from data analysis" when the machine-learning code is audited - and probably also that you somehow acted in good faith, to reduce the penalty.
As a "closed service"[1], evaluating Copilot’s results as "a common opinion about a pattern that emerged 'objectly' from data analysis" can't be done independently.
What can be done independently is a reproducible comparison between Copilot’s output and a sample of codebases, to estimate the likelihood of copyright infringement while treating the service as a "black box".
This analysis would rely on a definition of plagiarism - the "representation of another author's language, thoughts, ideas, or expressions as one's own original work"[2] - for software under which a "similarity analysis"[3] can be performed.
This is not a simple task and may have profound implications.
Any concrete definition of "equal code", e.g. "literal in X lines", "literal in structure in...", gives "serious" plagiarism the opportunity to avoid detection, which may have worse practical consequences than relying on case-by-case inspection. Although some cases may be discovered, a "serious" plagiarist would know up to which point the code needs to be refactored to avoid being detected or considered, the cost of refactoring being significantly less than that of providing an original implementation. The definition may thus end up doing more harm than the problem it was intended to address.
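The weakness of a fixed definition can be shown with a sketch of one possible reading of "literal in X lines": two pieces of code are "equal" if they share X consecutive identical lines, ignoring indentation. The function names and the sample code are hypothetical; this is one definition among many, not a proposed standard.

```python
def normalized_lines(code):
    # Strip surrounding whitespace and drop blank lines,
    # so that formatting alone does not matter.
    return [line.strip() for line in code.splitlines() if line.strip()]


def shares_literal_run(candidate, reference, x):
    # True if `candidate` contains x consecutive lines that appear
    # literally, in the same order, in `reference`.
    cand = normalized_lines(candidate)
    ref = normalized_lines(reference)
    ref_runs = {tuple(ref[i:i + x]) for i in range(len(ref) - x + 1)}
    return any(tuple(cand[i:i + x]) in ref_runs
               for i in range(len(cand) - x + 1))


original = """
total = 0
for item in items:
    total += item.price
return total
"""

# A trivial "refactoring": rename one variable.
renamed = original.replace("total", "acc")

print(shares_literal_run(original, original, 3))  # True
print(shares_literal_run(renamed, original, 3))   # False
```

A single rename is enough to escape the X = 3 definition, while lowering X to 1 would flag the ubiquitous "for item in items:" line as a match.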
With a given definition, a simple algorithmic approach would be: given a large sample of codebases, for each codebase feed sequentially Copilot with the code as it would have been written and check if the output is contained in others in the sample.
If the match is with a copylefted codebase and the license of the software being written is not compatible, it would be a license violation; if the match is with code under a non-propagating license, it would be a violation when no attribution is given.
This would give an estimate of how likely a person writing software with Copilot’s aid is to commit involuntary plagiarism - and possibly copyright infringement - by accepting Copilot’s suggestions.
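The "black box" procedure described above can be sketched as a loop. Everything here is an assumption for illustration: `suggest` is a hypothetical stand-in for querying the service (no real Copilot API is used), `is_equal_code` is whatever definition of "equal code" has been chosen, and the license checks are left out.

```python
def estimate_match_rate(codebases, suggest, is_equal_code):
    # codebases: dict mapping a name to its list of source lines.
    # suggest: hypothetical callable(prefix_lines) -> suggestion or None.
    # is_equal_code: callable(suggestion, other_lines) -> bool.
    prompts = 0
    matches = 0
    for name, lines in codebases.items():
        others = [other for other_name, other in codebases.items()
                  if other_name != name]
        # Feed the code sequentially, "as it would have been written".
        for i in range(1, len(lines)):
            suggestion = suggest(lines[:i])
            if suggestion is None:
                continue
            prompts += 1
            if any(is_equal_code(suggestion, other) for other in others):
                matches += 1
    return matches / prompts if prompts else 0.0


# Toy run with a stub service that always suggests the same line.
sample = {
    "a": ["x = 1", "y = 2"],
    "b": ["z = 3", "print(z)"],
}
rate = estimate_match_rate(
    sample,
    suggest=lambda prefix: "print(z)",
    is_equal_code=lambda suggestion, other: suggestion in other,
)
print(rate)  # 0.5
```

With a real sample, the matched codebases would then be checked against their licenses as described above, which is where the legal judgment enters.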
If this is not available, the inclusion of Copilot’s output is reliant only on "trust" or "confidence" in the service implementation.
If Copilot’s service were maintained as an "Open Service" or "Public Infrastructure" - public code, public data, non-discriminatory access, e.g. a gratis road, a toll road, a vaccination center or a Public Hospital - expert examination of the code could accept or reject the hypothesis ("a common opinion about a pattern that emerged 'objectly' from data analysis").
While this would not eliminate the problems that may arise from using Service as a Software Substitute[4] (SaaSS), the litigation that may arise from the use of the tool could be minimized or prevented entirely.
In the Free Software space, this can be resolved when acting in good faith by "Change your license, give attribution, or change your code".
Besides being legally copacetic, is there something fundamentally unfair about a proprietary software company building a service off Free Software work?
Github managed to convince enough humans to work on a project and managed to deliver a tool which can increase developers' productivity.
This should be a merit.
Increasing developers' productivity should lead to more and better software.
Free Software is analogous to Public Infrastructure - the more infrastructure is freely available in a society, the more that society can accomplish.
Building services on Free Software is not unfair: you are granted the freedom to use and modify it for any purpose, and the freedom to *not* distribute it - it is better for the service builder, as her Freedom is respected.
Service encapsulation - SaaSS - provides a unique-supplier commoditization-like way of monetization, as the functionality or use of the software can be provided on a per-user basis - analogous to a non-free binary distribution meant to provide the functionality on users' hardware.
Since copyleft is only applicable to distribution - the Software Freedoms of the recipient are guaranteed - is its spirit circumvented?
For software that makes sense to use on a server, the AGPL license addresses this to some extent by requiring that the source code of the program be provided, and the SSPL license to a greater extent by requiring the source code of the program plus all the source code needed to reproduce the service.
SSPL was deemed non-free in spirit[5], as it was introduced in a context where the "No Surrender of Others' Freedom" clause would make many installations unavailable, because of the other software involved in the service, unless another license was obtained from the vendor, leading to vendor lock-in - a consequence which Free Software intentionally minimizes.
"No Surrender of Others' Freedom" clauses do not restrict your freedom, they restrict power over others, as "one's freedom ends when others' start" (social freedom), e.g. "a judge is not more free than any other citizen (both abide by the same law); the person has been vested with the power to restrict others' freedoms"[6].
These licenses still do not resolve the SaaSS problems or risks but, if you do not trust the service provider, they give you the alternative of executing the software on "owned" hardware, guaranteeing your privacy, while any changes made may return to the community.
It is not unfair to build a business on Free Software work as software: commercial use is granted by the Software Freedoms, and encouraged. What may be found unfair is building a competing business with others' work, e.g. closing the source and enhancing it to compete with the original when that is not allowed, which is not the case here.
If the unfairness is perceived in the unequal distribution of the monetization generated by a service built upon Free Software work, and if learning from public sources were not considered Fair Use, would it be solvable by licensing?
Although Free Software licenses shape the market, e.g. by rendering closed-source distribution unavailable, they are unlikely to determine it entirely, and perceived inequalities that may arise in the "internal markets" of the Software Production Chain may not be addressed by them.
In a non-Fair Use world, considering that monetization may be crucial to sustaining a long-term effort on sizable projects, would a dual license approach - Source Code as Content and Source Code as Software - be a practical "for the cause" concession, where those who want to monetize through a form of "closedness" should contribute a part of that monetization to the Free Software community, as they may do with other closed inputs?
Licensing Source Code with something analogous to CC-NONCOMMERCIAL[10] in order to trigger a compensation negotiation would be against its spirit.
ODbL licenses address data as an input for generating knowledge[9], but they do not go beyond the data domain: they do not require that publicly available research based on the data be as free as the data on which it was built; they ensure that combinations and redistributions of the data remain free, so that anyone can extract knowledge from it.
Crossing from the input domain to the output domain, i.e. requiring that the use of free inputs in a process produce free outputs, can't be done generically, as each domain has its specificities.
A No-Derivatives clause, meant to preserve the context of opinions, makes sense in the Artistic or Opinion domains but not in the Software domain. Along these lines, books printed by a press that uses Free Software in its process should not automatically be free as in freedom, as that may go against a No-Derivatives will.
In research, the data sustaining your conclusions may be as important as the conclusions themselves, e.g. in experimental sciences such as Medicine, conclusions without data cannot be incorporated into general praxis. It makes sense to ask for access to the conclusions if they are made public and data was provided.
Being practical, this seems not to be a source of income that may address the sustainability of a project of any size.
Being even more practical, if sustainability is compromised, the way out shouldn't be compromising the spirit; just more hacking is needed :)
Free Software monetizes on work, not on access.
It is the selling of work (future or past) without compromising anyone's freedom that makes it different.
Providing access to past work is what makes it possible for future work in that direction to be incremental, and therefore maximizes the output of the effort for both the individual and the collective.
Limiting access to past work is what makes a unique-supplier commoditization-like logic possible.
Monetizing through an Open Service is selling the work of maintaining the service.
Monetizing through a Closed Service is selling only the results of past work - access to functionality.
Copyright applies to implementations, not to ideas; it does not prevent anyone from building an analogous product or service.
The mechanism designed to generate a monetization stream for those who find fruitful ways of doing things (ideas) is the patent system, which conveys a unique-supplier commoditization-like logic on the idea: the idea is publicly known, but applying it is forbidden unless a license is obtained.
The patent system has been highly questioned, and there seems to be a consensus in the Free Software community that it hurts the Software ecosystem.
Free Software licenses are designed to guarantee the Software Freedoms, in which the "Freedom to Study a Program" is included.
It is in their spirit to foster that study, for whatever purpose.
Consider a case where a teacher learns from studying the source of the most successful Free Software projects and then teaches best practices to the employees of a proprietary software company.
She is building her service off Free Software work and bringing it to the non-free space.
Is she being unfair?
Is an engineer who learns from Free Software codebases and writes proprietary software being unfair?
Is a for-profit activity at scale, like the previous ones, being unfair?
Once Knowledge has been acquired, it becomes free in the acquaintance.
You may be restricted from using or reproducing it, but not from building on it.
Knowledge is a broader cause which includes Software.
[0] Software Freedoms: https://www.gnu.org/philosophy/free-sw.en.html
[1] Open Service: https://opendefinition.org/ossd/
[2] Plagiarism: https://en.wikipedia.org/wiki/Plagiarism
[3] Content similarity detection: https://en.wikipedia.org/wiki/Content_similarity_detection
[4] SaaSS problems: https://www.gnu.org/philosophy/who-does-that-server-really-serve.en.html
[5] OSI on SSPL: https://opensource.org/node/1099
[6] Eduardo Platero
[7] Dual Licensing: https://www.fsf.org/blogs/rms/selling-exceptions
[8] Fair Use: https://en.wikipedia.org/wiki/Fair_use
[9] ODbL licenses: https://opendatacommons.org/faq/licenses/
[10] CC for Data: https://wiki.creativecommons.org/wiki/Data