k0001's handwavy guide to high quality software

This is a work in progress where I try to convey what high quality software means to me, and how to build it. Be warned: This is a very opinionated and handwavy guide.

Copyright Renzo Carbonara, 2016. This work is licensed under a Creative Commons Attribution 4.0 International License.

Basic principles

Code should be written with the following goals in mind.

Purity

The outcome of stateful computations where you don't control the initial state, or where state can be implicitly altered, is practically impossible to predict reliably. Avoid these kinds of computations as much as possible by pushing them to the boundaries of your software.

When relying on a stateful third-party service, it should be possible to replace any impure implementation of that service's interface with a pure implementation that can be used for reference and testing purposes.
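
A handwavy sketch of this idea, using a hypothetical record-of-functions interface for a key-value service (all names here are made up; in real code the impure implementation would talk to the actual service, an IORef merely stands in for it below):

import Control.Monad.Trans.State.Strict (State, gets, modify')
import Data.IORef (modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

-- A made-up interface to a stateful third-party key-value service.
data Store m = Store
  { storeGet :: String -> m (Maybe String)
  , storePut :: String -> String -> m ()
  }

-- Impure implementation, kept at the boundary of the program. In real
-- code this would talk to the actual service; an IORef stands in here.
ioStore :: IO (Store IO)
ioStore = do
  ref <- newIORef Map.empty
  pure Store
    { storeGet = \k -> Map.lookup k <$> readIORef ref
    , storePut = \k v -> modifyIORef' ref (Map.insert k v)
    }

-- Pure implementation, usable as a reference and in tests.
pureStore :: Store (State (Map.Map String String))
pureStore = Store
  { storeGet = \k -> gets (Map.lookup k)
  , storePut = \k v -> modify' (Map.insert k v)
  }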

The avoidance of state is the main reason why Functional Programming becomes a necessity as the complexity of projects grows, and why solutions such as Haskell and Nix are an order of magnitude ahead of the competition in terms of reasoning power, reproducibility, and coping with the complexities of the real world in general.

Meaning

To write software is to explain possibly complex problems and their solutions in simple terms. Don't write code that can't be systematically reasoned about, described, and eventually mapped to concrete and lawful concepts.

Mathematical abstractions, equational reasoning and declarative approaches can help us come up with meaningful solutions, as well as communicate our ideas better.

Meaning, readability and maintainability are closely related, yet not implied by each other. You can have a solution that can be understood but is hard to read and maintain, or you can have a solution that is easy to read but hard to reason about, or you can have an implementation that's easy to reason about and read, but hard to modify when necessary. You can have nothing, or you can have it all. Good software has it all.

Type-safety

As much as possible, code should be type-safe. This means, for example, that a function from A to B should return a meaningful B for any A I give to it.

It is not always easy to achieve this to a full extent, so it is important that we explicitly highlight the places where this type-safety is missing—likely by thoroughly documenting the funny behavior, and either not exporting any of the “malfunctioning” parts as part of a public API, or exporting them with a terrible name such as unsafePerformIO.
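
For example, a handwavy Haskell sketch: Prelude's head does not return a meaningful value for every input it accepts, but we can recover type-safety either by reflecting the possibility of failure in the output type or by demanding a more precise input type.

import Data.List.NonEmpty (NonEmpty ((:|)))

-- Prelude's 'head :: [a] -> a' crashes on [], so it does not return a
-- meaningful value for every input it accepts.

-- Total: the possibility of failure is visible in the output type.
safeHead :: [a] -> Maybe a
safeHead []      = Nothing
safeHead (x : _) = Just x

-- Also total: the input type rules the bad case out altogether.
neHead :: NonEmpty a -> a
neHead (x :| _) = x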

Many times, the trick to achieving this type-safety is being able to identify the smallest mathematical structures that can effectively express the solution you are trying to write, and pair that together with meaningful descriptions of the concepts that concern your particular problem domain.

Compilers and computers don't need types to work. Instead, types are a tool allowing human beings to better reason about problems. Embrace this tool.

Composability

We should be able to combine things that have the same shape in a way that the shape and properties of the original parts are preserved. We should be able to build bigger things out of smaller things in a meaningful and type-safe way.
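
A small Haskell sketch of this (the Stats type is made up): if each part of a structure can be combined while preserving its shape, the whole structure composes too, and bigger values can be built out of smaller ones with mconcat.

import Data.Monoid (Sum (..))
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- A made-up aggregate: it composes because each of its parts does.
data Stats = Stats
  { hits   :: Sum Int               -- combines by addition
  , errors :: Map String (Sum Int)  -- combined pointwise below
  }

instance Semigroup Stats where
  Stats h1 e1 <> Stats h2 e2 =
    Stats (h1 <> h2) (Map.unionWith (<>) e1 e2)

instance Monoid Stats where
  mempty = Stats mempty Map.empty

-- Bigger things out of smaller things, with shape and properties preserved.
total :: [Stats] -> Stats
total = mconcat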

Documentation, types, code and tests

The entry point for any software project should be documentation giving a broad overview of what problem the software is trying to solve and what approach is being taken to solve it. In cases where the intended usage of the software is not immediately obvious, tutorials and examples should be provided.

As much as possible, we should be able to understand the intended usage of a library by reading the types of the things that are exposed as part of its public API. Documentation should be there to support types when the meaning of the exposed pieces is not immediately obvious, as well as when it is not clear how to compose them.

Executable code is a proof that there exists a valid implementation for the type assigned to a particular name. As such, the details of that code shouldn't be particularly important; only the fact that the code exists should be. This means that we should write our executable code keeping in mind that it is not as important as types or documentation in terms of conveying purpose. That said, any code that is written must be easy to understand and maintain, and the ease with which code can be read, understood and maintained, in that order, is much more important than the ease with which code can be written: we write code once, but we and our peers will need to read and understand it many times, and occasionally it will be necessary to modify it as well, so do optimize for those things.

It is important that when we do write executable code, we do it in a way that would cause the compilation of our software to fail if some of the assumptions it makes stop being true. This is another way of saying that we should embrace type-safety, as described before.
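
A common Haskell instance of this (a sketch, assuming the incomplete-patterns warning is turned into an error): when a new constructor is added to a sum type, every non-exhaustive pattern match stops compiling instead of crashing at runtime.

{-# OPTIONS_GHC -Wall -Werror=incomplete-patterns #-}

module Payment (PaymentMethod (..), fee) where

data PaymentMethod = Card | BankTransfer  -- imagine adding 'Cash' later

fee :: PaymentMethod -> Rational
fee Card         = 0.03
fee BankTransfer = 0.01
-- Adding 'Cash' above makes this definition fail to compile until the
-- new case is handled, rather than failing in production.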

Tests are the price we pay for having failed to express guarantees about our software in the type-system. As such, tests are to be avoided as much as possible. On the other hand, everything that we cannot guarantee in the type-system, or for which a mathematical proof can't be given, must be tested.

Standalone tools decoupled from all business rules, such as a function for adding two numbers, deserve to be tested individually. However, it's usually not useful to test helper functions, types or tools that are not particularly meaningful without knowledge of the business rules. For those cases, it is often sufficient to just test the public APIs that are at the boundaries of the business rules.

Distribution

It should be possible to build, install and run software in a predictable manner, with predictable outcomes, without any manual intervention.

Software should not rely on proprietary software, nor on cloud services that don't provide a way for setting up a local server that fully mimics their API and behavior.

Software must not make implicit assumptions about the environment where it runs. Instead, any special environment requirements must be explicitly communicated and made configurable.

It should be possible to fully replicate locally the production environment where the software is expected to run, albeit at a smaller scale, without any manual intervention.

Software deployment needs a holistic and deterministic understanding of the environment where it runs. Limited distribution solutions that can't guarantee this kind of understanding are insufficient.

Security

There are many issues concerning software security, and our software must avoid them all.

Always follow the principle of least privilege: In your software, in your distribution practices and across your team.

Enforce security through the type system. For example, by relying on a type-safe API to access array elements we can be certain we'll never see a memory violation issue related to this array. Another example is making use of the type system to annotate the meaning of otherwise opaque blobs of data, like in the case of using different types to differentiate an arbitrary blob of data from data that we've verified to be correct for our particular needs.
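
A sketch of that second example (the module and names are invented for illustration): the only way to obtain the "verified" type is to go through the check, so the type system tracks what has and hasn't been validated.

module Email
  ( RawEmail (..)
  , VerifiedEmail      -- constructor deliberately not exported
  , verifyEmail
  , verifiedEmailText
  ) where

-- An arbitrary blob of data coming from the outside world.
newtype RawEmail = RawEmail String

-- Data we have checked. Its constructor stays internal, so the type system
-- guarantees that every VerifiedEmail went through verifyEmail.
newtype VerifiedEmail = VerifiedEmail String

verifyEmail :: RawEmail -> Maybe VerifiedEmail
verifyEmail (RawEmail s)
  | '@' `elem` s = Just (VerifiedEmail s)  -- placeholder for a real check
  | otherwise    = Nothing

verifiedEmailText :: VerifiedEmail -> String
verifiedEmailText (VerifiedEmail s) = s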

Security credentials must never be written in the software's codebase; instead, they shall be provided by the environment where the software runs, and kept encrypted for as long as possible until deployment.
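
A minimal sketch of reading a credential from the environment rather than from the codebase (the variable name is made up):

import System.Environment (lookupEnv)

-- The API token never appears in the codebase; it is provided by the
-- environment in which the software runs.
getApiToken :: IO (Either String String)
getApiToken = do
  mtok <- lookupEnv "MYAPP_API_TOKEN"
  pure $ case mtok of
    Nothing  -> Left "MYAPP_API_TOKEN is not set"
    Just tok -> Right tok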

The software should follow well-known modern industry practices in its choice of cryptographic solutions. Unencrypted networking channels of communication with the software shall not be supported.

Our software should be careful to prevent sensitive data from lingering in memory longer than necessary. It is also desirable to adopt a Zero Knowledge approach to storing and processing users' data.

It is often very hard for software to excel in both security and usability matters, for example regarding credentials management or storage choices. When faced with such a choice, never sacrifice security for the sake of usability; instead, try to come up with creative ways to convey the importance of security and cryptography to your users, making the security experience as meaningful and pleasant as possible.

If the software requires users to authorize access to some data of their own on some third-party service, then the principle of least privilege applies, and the software must be very clear about what kind of access it needs, and why. The software must never ask for more permissions than it needs in order to support the features of the software to which the user has explicitly opted in.

One should assume that the software will run in a hostile and shared environment. This implies that the software should be defensive, should never make assumptions about the environment setup and resources that are available, and must not pollute said environment.

The software must never make assumptions about its users' security policies, and must never force users into a particular approach. On the contrary, users' security policies within and outside the software are to be respected; if they are insufficient, then the software shall communicate so, but never actively modify those policies.

Performance

Depending on the problem we are trying to solve, achieving as much performance as possible may or may not be necessary. Nevertheless, there are some performance guarantees that shall always be satisfied by all software.

User interfaces are expected to be immediately responsive to user actions, even when the software might be busy or blocked. Short-lived programs are expected to have a negligible start-up time.

As a starting point, software should use algorithms and data-structures with reasonable time and space complexities. These need not be the most efficient implementations, but they certainly should behave reasonably well for any reasonable input size.
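
For example, a Haskell sketch: Data.List.nub is quadratic, whereas deduplicating through a Set behaves reasonably well for any reasonable input size.

import qualified Data.Set as Set

-- Data.List.nub is O(n^2): fine for tiny lists, it degrades badly as input grows.
-- This variant keeps the first occurrence of each element, like nub, in O(n log n).
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _ [] = []
    go seen (x : xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs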

Ideally, software performance should scale at least linearly with more hardware resources. Nevertheless, software should not rely on significant hardware resources being available in order to be performant; instead, it should rely on good engineering.

Sophisticated performance improvements should only happen after the software effectively solves the problem it expects to solve. Simple performance improvements might happen as part of the initial implementation, as long as they don't hinder the understanding of it.

Type-safety and security might be sacrificed for performance, but never as part of a public API, and only as long as the more performant implementation continues to be meaningful, readable and maintainable.

Humans

Software is made out of human will, and it exists as long as humans care.

Humans are your contributors, and they deserve encouragement, involvement, gratitude. Humans are your users, and they deserve transparency and a good product. Humans are your stakeholders, and they deserve involvement, commitments and understanding. They all deserve respect as well as explanations when they are mistaken.

Legal

If you expect your software to be installed by users on their computers and other devices, then your software shall be libre. Proprietary software is invasive, restrictive, hard to deploy, and raises many security and privacy related questions. If you are thinking about deploying proprietary software to your users, change your business model and deploy software libre talking to your proprietary online service instead. When doing this, embrace a Zero Knowledge approach as much as possible so that users are not forced to trust you, and make sure the user can verify that any executable code you deliver from your proprietary service runs in a sandboxed environment respecting the principle of least privilege.

Accompany your software libre with a license that grants users the expected rights, and that protects your contributors as well as the existence and growth of the project going forward.

The software must be clear and upfront about any legal terms. Do not invent your own software libre license, pick one of the well-known licenses, likely the one that will be least surprising to your target audience.

Consider using a license with a patent retaliation clause if you are worried that your business or project might be hurt by software patents.

Readability and maintenance

Readable code means that, as humans, we find it easy to parse a piece of code and uncover its meaning. Compilers and interpreters don't care about the readability of code; readability is for human beings.

The input

It should be obvious how to use a piece of code by explicitly stating what inputs are expected. When the meaning of inputs is not immediately clear from their types, there needs to be accompanying documentation that further explains this input. A function should only take as much input as necessary, never more, and it should take this input in a precise representation for each acceptable concept.

It is recommended, for example, that if A and B are both different ways of expressing the same concept “foo”, then a function caring about “foo” should only accept one of A and B, not both. Nevertheless, at times it may also be convenient to support different input types through polymorphism, but one should be careful not to make the meaning of code less obvious or significantly less inferable when introducing polymorphism. Parametricity is a very powerful tool for reasoning about polymorphic code, use it.
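
For instance, a sketch (the types are invented for illustration): if a duration may arrive either as seconds or as milliseconds, pick one precise representation, convert at the boundary, and accept only that representation everywhere else.

-- Both milliseconds and seconds express the concept of a duration; pick one
-- precise representation for the functions that care about it.
newtype Seconds = Seconds Rational

-- Convert at the boundary, once.
secondsFromMilliseconds :: Rational -> Seconds
secondsFromMilliseconds ms = Seconds (ms / 1000)

-- The function caring about durations accepts exactly one representation.
doubleDelay :: Seconds -> Seconds
doubleDelay (Seconds s) = Seconds (s * 2)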

The output

Like with inputs, the output of a function must be obvious and clear. It should be possible to infer a function's output type from as few input types as possible. A function's output may be polymorphic as well, thus leaving it to the caller to decide what concrete type to use, which makes this function easier to use and compose in different scenarios. Parametricity is a very powerful tool for reasoning about polymorphic code, use it.

Importing and using names

At call sites, it should be obvious where names or bindings in scope are coming from. This ideally means that names are either well-known, in which case they don't deserve any special attention, or that they are used in a qualified manner so that the module exporting the name becomes obvious.

Alternatively, if the name is intended to be used in an unqualified manner, then the name must communicate in a clear manner its intended purpose and expected input types, and it should be explicitly imported at the call site so that readers of the code can look through the module imports to find an explicit mention of this name in order to learn about its origin. However, this way of bringing names into scope is not without its problems. In particular, even though you might think that names such as userAddress and printUserWithDefaults are unambiguous, readers of code using these names might not be as aware as you were at the time of writing about what type of “user” and “defaults” the code is talking about. Unqualified imports, unless well-known or ubiquitous, are best avoided.
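
A small sketch of both styles (the modules shown are just common examples):

-- Qualified imports: the origin of each name is obvious at the call site.
import qualified Data.Map.Strict as Map
import qualified Data.Text as Text

-- Unqualified, but with an explicit import list, so readers can trace a
-- name back to its module by looking at the imports.
import Data.Maybe (fromMaybe)

describe :: Map.Map Text.Text Int -> Text.Text -> Text.Text
describe counts name =
  Text.concat [name, Text.pack ": ", Text.pack (show n)]
  where
    n = fromMaybe 0 (Map.lookup name counts)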

There is a whole different category of names that deserves special attention: infix operators. To be blunt, infix operators should be avoided unless they are well-known, and even then they may best be avoided anyway unless their usage can be justified. One problem with infix operators is that they are never intended to be used in a qualified manner, and that is often not optimal as discussed before, even more so considering that infix operators can't possibly communicate their intended argument types, and that attempting to use infix operators in a qualified manner, when possible, defeats their purpose. Another problem with infix operators is that they bring their own operator precedence rules, and except for well-known infix operators, one needs to guess precedence rules both when writing the code the first time, as well as when reading it at a later time. Yet another problem with infix operators is that once you have two of them in the same (small) expression, parsing the expression and deriving meaning from it becomes too hard due to the non-obvious precedence rules each infix operator brings, and modifying these expressions over time becomes harder as well.

Haskell side note: Avoid the RecordWildCards extension as a means of bringing names into scope. Probably. This extension not only brings an unknown and implicit number of names into scope, but it also brings them with a different type than originally intended. For example, whereas in data X = X { a :: A } the name a has type X -> A, it will have type A when brought into scope using RecordWildCards. This, in combination with the fact that when reading code where a is brought into scope with RecordWildCards one can't even identify this situation before first exhausting all other scoping mechanisms, makes RecordWildCards a bad extension to use whenever readability and maintainability are among your goals, as they should be.
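
To make the side note concrete, a small sketch:

{-# LANGUAGE RecordWildCards #-}

data X = X { a :: Int, b :: String }

-- Here 'a' is the ordinary field accessor, of type X -> Int.
useAccessor :: X -> Int
useAccessor x = a x + 1

-- Here X{..} silently brings 'a' and 'b' into scope as plain values of
-- types Int and String; nothing at the use site reveals where they came from.
useWildcards :: X -> Int
useWildcards X{..} = a + length b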

Defining and exporting names

Modules exporting new names should be explicit about any names they export. This makes the origin of a name obvious and searchable in a codebase.

Be careful about code generation techniques that create new names. Finding the origin of these names, unless explicitly exported, can be an odyssey.

Haskell side note: Avoid TemplateHaskell for generating names as much as possible, except perhaps when maintenance of these names over time would be error-prone. TemplateHaskell has its own problems, such as long compilation times, unexpected compiler and linker issues in some scenarios and platforms, and an unstable and error-prone API that offers little understanding about what the generated code does. Using TemplateHaskell to generate code that is not particularly hard to get right if written by hand, even when it may consist mostly of boilerplate, is just not worth the cost.

As a side note, consider exporting some of your internals as well, even if you can't guarantee a type-safe usage for them, nor the same quality of documentation as for your public API. The reason for exporting internals is to allow your users to combine them in ways different than those you've thought about without forcing them to fork your software. Make it clear, however, that these internals are not supported and are not guaranteed to be as resilient as the public API.

Naming things

Top-level names should clearly state their purpose, and they should be designed either for qualified usage or for unqualified usage, not for both. For example, if used in an unqualified manner, address is a terribly bad name for a function returning the address of a person; a name like personAddress would be much better. Nevertheless, if address is actually intended to be used in a qualified manner as Person.address, then the name is optimal. Do not introduce top-level names for compositions like peopleAddress that would just apply Person.address to each of the elements of a list of person elements (i.e., “people”). On the contrary, users of Person.address should be encouraged to compose it with the appropriate higher-order functions themselves as they see fit.
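
Sketching the example above (the module layout and fields are hypothetical), as two small files shown together:

-- Person.hs
module Person (Person (..), Address (..)) where

data Address = Address { street :: String, city :: String }

-- 'address' is designed for qualified use: callers write Person.address.
data Person = Person { name :: String, address :: Address }

-- Caller.hs
module Caller (addresses) where

import qualified Person

-- No 'peopleAddress' helper is needed: compose at the use site instead.
addresses :: [Person.Person] -> [Person.Address]
addresses = map Person.address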

Haskell side note: In the case of “field accessor” functions such as address, many of which are often Optics, it is important that their types can be easily inferred, because otherwise whenever one tries to use them at the wrong place—usually just as a side-effect of modifying the code that uses this optic—figuring out what the type errors one gets mean will be an odyssey. In other words, you may want to avoid overloaded record fields.

Non top-level names are mostly an obstacle, as they increase the cognitive load for readers and maintainers of the code; they should be avoided as much as possible and inline expressions should be used instead. Nevertheless, some non top-level expressions should be given their own names whenever doing so aids the understanding of the code or improves its structure. Non top-level names shall exist within a limited scope occupying as little real estate as possible, and within this scope a short name should be used for them. For example, if you are introducing a name for an expression of type MonthlyVegetablesProvider that will only be relevant to the next 4 or 5 lines of code, then myMonthlyVegetablesProvider is a terribly bad name for that expression, whereas mvp or even p are optimal. Being clear about the meaning of things by explicitly stating their types is much more effective than using verbose names, both from a readability and a maintenance point of view.
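
A handwavy sketch of this (the types are placeholders invented only to make the example self-contained):

newtype Vegetable = Vegetable String
newtype MonthlyVegetablesProvider = MonthlyVegetablesProvider [Vegetable]
newtype Menu = Menu [Vegetable]

inSeason :: MonthlyVegetablesProvider -> MonthlyVegetablesProvider
inSeason = id  -- placeholder for some real filtering

weeklyMenu :: MonthlyVegetablesProvider -> Menu
weeklyMenu provider =
  -- 'p' is in scope for only a couple of lines; its explicit type, not a
  -- verbose name like 'myMonthlyVegetablesProvider', says what it is.
  let p :: MonthlyVegetablesProvider
      p = inSeason provider
      MonthlyVegetablesProvider vs = p
   in Menu (take 5 vs)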

Code structure

The basic structure of a solution should be obvious; one shouldn't need to spend more than a handful of seconds in order to effectively grasp an overview of the general approach that's being taken to solve a problem. It might not be immediately obvious how to achieve this, but the following guidelines should help.

Use well-known abstractions. For example, if implementing a recursive solution then use fold or the fix-point combinator instead of explicit recursion. If doing error handling or state passing, then use a monad. If you are statically establishing relationships between entities, then use applicatives or arrows. By doing this, readers and maintainers of your code need not spend any time familiarizing themselves with obscure abstractions before proceeding to ignore mostly irrelevant implementation details, as they should always be able to do.
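
A small sketch of the first suggestion:

-- Explicit recursion: readers must re-derive the traversal pattern themselves.
totalExplicit :: [Int] -> Int
totalExplicit []       = 0
totalExplicit (x : xs) = x + totalExplicit xs

-- Well-known abstraction: the fold names the pattern, so only the combining
-- function and the starting value need to be read.
totalFold :: [Int] -> Int
totalFold = foldr (+) 0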

Factor out expressions and types only as long as the factored-out parts are meaningful on their own. Wrong abstractions are worse than no abstraction at all, and they become a maintenance nightmare.

Visual cues

Functional programming languages, being expression-oriented, particularly lend themselves to a very rare form of programming where the visual shape of the program can tell you a lot about what's going on. For this to work, however, it is necessary to use real estate conservatively and consistently, and to respect the geometric shapes that show up as we code. Be warned, the following is very handwavy.

For example, using Haskell syntax, if we encounter an expression with the following geometrical shape:

1.  ___ _
2.     (__ __ $ __
3.        ____
4.        ______ __ ___ $ __
5.           ___ __ ___
6.        ___ __ _)
7.     (___ ____ _)
8.  ___

From this shape we can easily see that we can ignore lines 3 to 6 inclusive if we don't care about what line 2 says, and that lines 2 and 7 are arguments to the expression started at line 1. Here is another example:

1. ____ _ ____
2.    ( ___     (\_ -> __ _ _____)  *|*
3.      _____   (\_ -> __ _)        *|*
4.      __      (\_ -> ___ _ __)    *|*
5.      ____ __ (\_ -> _ + _)
6.    )

Of course it is not obvious what this code does, but one can still get some hints from its shape. For example, one could say that *|* might be some kind of associative infix operator, due to its symmetrical shape and deliberate vertical alignment, and that one likely doesn't need to fully understand all of lines 2 through 5 in order to grasp a general idea of what this code might be doing; instead, understanding just one of those lines should suffice, seeing as the shape of the arguments to *|* is repeated time and time again. Additionally, one could hypothesize that quite likely the *|* operator is used to somehow obtain some value that is eventually passed to those functions with shape (\_ -> ...) that stand out in the code.

However, for this visual aid to work effectively, it is important that real estate is used wisely. In particular, vertical alignment is quite important, as the last two examples illustrated, and vertical white-space (blank lines) should be avoided, as it breaks the flow that allowed us to skip lines 3 to 6 in our first example once the vertical real estate required by our code grows. If you feel that you need a blank line to separate concerns, write a comment, give something an explicit type, or factor out some code instead.

In Haskell, type signatures need to be easy to parse by humans in order to reduce the cognitive load. Luckily, Haskell gives some delimiters for this:

_______
  :: forall _ _ (_ :: _) _
  .  ( _________ _
     , _______ _ ~ ____ (___ _)
     , ______ ____ _ )
  => _____ __
  -> ________ (_____ _)
  -> (forall _. _____ _ -> ____)
  -> ___________
  -> _ ( ______ ________ _
       , _________________ _ ____ )

Notice how even if we have no idea what this function does nor have we read its type in full, we can easily recognize in the shape of its type signature that it takes 4 value arguments, that 3 constraints need to be satisfied, that it returns some sort of tuple and that higher rank types are somehow involved in this solution. If we don't make an effort to vertically align type signatures in a predictable and easy to parse manner, deriving meaning from the type signature becomes much harder:

_______ ::
  forall _ _ (_ :: _) _.
  (_________ _, _______ _ ~ ____ (___ _), ______ ____ _) =>
  _____ __ ->
  ________ (_____ _) ->
  (forall _. _____ _ -> ____) ->
  ___________ ->
  _ (______ ________ _, _________________ _ ____)

One last visual cue that can aid readability significantly is the choice of names for polymorphic type variables and their term-level representations. For example, m, f and g are good names for functor-like types; f, g and k are nice names for function values; a, b and x are good generic names for various types and values. When we pick these familiar names, we immediately save readers of our code from the burden of having to care about what these types really are, and instead they can focus on type constraints or documentation.
