k0001's handwavy guide to high quality software

This is a work in progress where I try to convey what high quality software means to me, and how to build it. Be warned: This is a very opinionated and handwavy guide.

Copyright Renzo Carbonara, 2016. This work is licensed under a Creative Commons Attribution 4.0 International License.

Basic principles

Code should be written with the following goals in mind.

Purity

The outcome of stateful computations where you don't control the initial state, or where state can be implicitly altered, is practically impossible to predict reliably. Avoid these kinds of computations as much as possible by pushing them to the boundaries of your software.

When relying on a stateful third-party service, it should be possible to replace any impure implementation of that service's interface with a pure implementation that can be used for reference and testing purposes.
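
A handwavy sketch of this idea, using a hypothetical record-of-functions interface for a key-value service (all names here are made up; in real code the impure implementation would talk to the actual service, an IORef merely stands in for it below):

import Control.Monad.Trans.State.Strict (State, gets, modify')
import Data.IORef (modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

-- A made-up interface to a stateful third-party key-value service.
data Store m = Store
  { storeGet :: String -> m (Maybe String)
  , storePut :: String -> String -> m ()
  }

-- Impure implementation, kept at the boundary of the program. In real
-- code this would talk to the actual service; an IORef stands in here.
ioStore :: IO (Store IO)
ioStore = do
  ref <- newIORef Map.empty
  pure Store
    { storeGet = \k -> Map.lookup k <$> readIORef ref
    , storePut = \k v -> modifyIORef' ref (Map.insert k v)
    }

-- Pure implementation, usable as a reference and in tests.
pureStore :: Store (State (Map.Map String String))
pureStore = Store
  { storeGet = \k -> gets (Map.lookup k)
  , storePut = \k v -> modify' (Map.insert k v)
  }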

The avoidance of state is the main reason why Functional Programming becomes a necessity as the complexity of projects grows, and why solutions such as Haskell and Nix are an order of magnitude ahead of the competition in terms of reasoning power, reproducibility, and coping with the complexities of the real world in general.

Meaning

To write software is to explain possibly complex problems and their solutions in simple terms. Don't write code that can't be systematically reasoned about, described, and eventually mapped to concrete and lawful concepts.

Mathematical abstractions, equational reasoning and declarative approaches can help us come up with meaningful solutions, as well as communicate our ideas better.

Meaning, readability and maintainability are closely related, yet not implied by each other. You can have a solution that can be understood but is hard to read and maintain, or you can have a solution that is easy to read but hard to reason about, or you can have an implementation that's easy to reason about and read, but hard to modify when necessary. You can have nothing, or you can have it all. Good software has it all.

Type-safety

As much as possible, code should be type-safe. This means, for example, that a function from A to B should return a meaningful B for any A I give to it.

It is not always easy to achieve this to a full extent, so it is important that we explicitly highlight the places where this type-safety is missing—likely by thoroughly documenting the funny behavior, and either not exporting any of the “malfunctioning” parts as part of a public API, or exporting them with a terrible name such as unsafePerformIO.
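
For example, a handwavy Haskell sketch: Prelude's head does not return a meaningful value for every input it accepts, but we can recover type-safety either by reflecting the possibility of failure in the output type or by demanding a more precise input type.

import Data.List.NonEmpty (NonEmpty ((:|)))

-- Prelude's 'head :: [a] -> a' crashes on [], so it does not return a
-- meaningful value for every input it accepts.

-- Total: the possibility of failure is visible in the output type.
safeHead :: [a] -> Maybe a
safeHead []      = Nothing
safeHead (x : _) = Just x

-- Also total: the input type rules the bad case out altogether.
neHead :: NonEmpty a -> a
neHead (x :| _) = x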

Many times, the trick to achieving this type-safety is being able to identify the smallest mathematical structures that can effectively express the solution you are trying to write, and pair that together with meaningful descriptions of the concepts that concern your particular problem domain.

Compilers and computers don't need types to work. Instead, types are a tool allowing human beings to better reason about problems. Embrace this tool.

Composability

We should be able to combine things that have the same shape in a way that the shape and properties of the original parts are preserved. We should be able to build bigger things out of smaller things in a meaningful and type-safe way.
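
A small Haskell sketch of this (the Stats type is made up): if each part of a structure can be combined while preserving its shape, the whole structure composes too, and bigger values can be built out of smaller ones with mconcat.

import Data.Monoid (Sum (..))
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- A made-up aggregate: it composes because each of its parts does.
data Stats = Stats
  { hits   :: Sum Int               -- combines by addition
  , errors :: Map String (Sum Int)  -- combined pointwise below
  }

instance Semigroup Stats where
  Stats h1 e1 <> Stats h2 e2 =
    Stats (h1 <> h2) (Map.unionWith (<>) e1 e2)

instance Monoid Stats where
  mempty = Stats mempty Map.empty

-- Bigger things out of smaller things, with shape and properties preserved.
total :: [Stats] -> Stats
total = mconcat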

Documentation, types, code and tests

The entry point for any software project should be documentation giving a broad overview of what problem the software is trying to solve and what approach is being taken to solve it. In cases where the intended usage of the software is not immediately obvious, tutorials and examples should be provided.

As much as possible, we should be able to understand the intended usage of a library by reading the types of the things that are exposed as part of its public API. Documentation should be there to support types when the meaning of the exposed pieces is not immediately obvious, as well as when it is not clear how to compose them.

Executable code is a proof that there exists a valid implementation for the type assigned to a particular name. As such, the details of that code shouldn't be particularly important; only the fact that the code exists should be. This means that we should write our executable code keeping in mind that it is not as important as types or documentation in terms of conveying purpose. That said, any code that is written must be easy to understand and maintain, and the ease with which code can be read, understood and maintained, in that order, is much more important than the ease with which code can be written: we write code once, but we and our peers will need to read and understand it many times, and occasionally it will be necessary to modify it as well, so do optimize for those things.

It is important that when we do write executable code, we do it in a way that would cause the compilation of our software to fail if some of the assumptions it makes stop being true. This is another way of saying that we should embrace type-safety, as described before.
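
A common Haskell instance of this (a sketch, assuming the incomplete-patterns warning is turned into an error): when a new constructor is added to a sum type, every non-exhaustive pattern match stops compiling instead of crashing at runtime.

{-# OPTIONS_GHC -Wall -Werror=incomplete-patterns #-}

module Payment (PaymentMethod (..), fee) where

data PaymentMethod = Card | BankTransfer  -- imagine adding 'Cash' later

fee :: PaymentMethod -> Rational
fee Card         = 0.03
fee BankTransfer = 0.01
-- Adding 'Cash' above makes this definition fail to compile until the
-- new case is handled, rather than failing in production.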

Tests are the price we pay for having failed to express guarantees about our software in the type-system. As such, tests are to be avoided as much as possible. On the other hand, everything that we cannot guarantee in the type-system, or for which a mathematical proof can't be given, must be tested.

Standalone tools decoupled from all business rules, such as a function for adding two numbers, deserve to be tested individually. However, it's usually not useful to test helper functions, types or tools that are not particularly meaningful without knowledge of the business rules. For those cases, it is often sufficient to just test the public APIs that are at the boundaries of the business rules.

Distribution

It should be possible to build, install and run software in a predictable manner, with predictable outcomes, without any manual intervention.

Software should not rely on proprietary software, nor on cloud services that don't provide a way for setting up a local server that fully mimics their API and behavior.

Software must not make implicit assumptions about the environment where it runs. Instead, any special environment requirements must be explicitly communicated and made configurable.

It should be possible to fully replicate locally the production environment where the software is expected to run, albeit at a smaller scale, without any manual intervention.

Software deployment needs a holistic and deterministic understanding of the environment where it runs. Limited distribution solutions that can't guarantee this kind of understanding are insufficient.

Security

There are many issues concerning software security, and our software must avoid them all.

Always follow the principle of least privilege: In your software, in your distribution practices and across your team.

Enforce security through the type system. For example, by relying on a type-safe API to access array elements we can be certain we'll never see a memory violation issue related to this array. Another example is making use of the type system to annotate the meaning of otherwise opaque blobs of data, like in the case of using different types to differentiate an arbitrary blob of data from data that we've verified to be correct for our particular needs.
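
A sketch of that second example (the module and names are invented for illustration): the only way to obtain the "verified" type is to go through the check, so the type system tracks what has and hasn't been validated.

module Email
  ( RawEmail (..)
  , VerifiedEmail      -- constructor deliberately not exported
  , verifyEmail
  , verifiedEmailText
  ) where

-- An arbitrary blob of data coming from the outside world.
newtype RawEmail = RawEmail String

-- Data we have checked. Its constructor stays internal, so the type system
-- guarantees that every VerifiedEmail went through verifyEmail.
newtype VerifiedEmail = VerifiedEmail String

verifyEmail :: RawEmail -> Maybe VerifiedEmail
verifyEmail (RawEmail s)
  | '@' `elem` s = Just (VerifiedEmail s)  -- placeholder for a real check
  | otherwise    = Nothing

verifiedEmailText :: VerifiedEmail -> String
verifiedEmailText (VerifiedEmail s) = s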

Security credentials must never be written in the software's codebase; instead, they shall be provided by the environment where the software runs, and kept encrypted for as long as possible until deployment.
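
A minimal sketch of reading a credential from the environment rather than from the codebase (the variable name is made up):

import System.Environment (lookupEnv)

-- The API token never appears in the codebase; it is provided by the
-- environment in which the software runs.
getApiToken :: IO (Either String String)
getApiToken = do
  mtok <- lookupEnv "MYAPP_API_TOKEN"
  pure $ case mtok of
    Nothing  -> Left "MYAPP_API_TOKEN is not set"
    Just tok -> Right tok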

The software should follow well-known modern industry practices in its choice of cryptographic solutions. Unencrypted networking channels of communication with the software shall not be supported.

Our software should be careful to prevent sensitive data from lingering in memory longer than necessary. It is also desirable to adopt a Zero Knowledge approach to storing and processing users' data.

It is often very hard for software to excel in both security and usability matters, for example regarding credentials management or storage choices. When faced with such a choice, never sacrifice security for the sake of usability; instead, try to come up with creative ways to convey the importance of security and cryptography to your users, making the security experience as meaningful and pleasant as possible.

If the software requires users to authorize access to some data of their own on some third-party service, then the principle of least privilege applies, and the software must be very clear about what kind of access it needs, and why. The software must never ask for more permissions than it needs in order to support the features of the software to which the user has explicitly opted in.

One should assume that the software will run in a hostile and shared environment. This implies that the software should be defensive, should never make assumptions about the environment setup and resources that are available, and must not pollute said environment.

The software must never make assumptions about its users' security policies, and must never force users into a particular approach. On the contrary, users' security policies within and outside the software are to be respected; if they are insufficient, then the software shall communicate so, but never actively modify those policies.

Performance

Depending on the problem we are trying to solve, achieving as much performance as possible may or may not be necessary. Nevertheless, there are some performance guarantees that shall always be satisfied by all software.

User interfaces are expected to be immediately responsive to user actions, even when the software might be busy or blocked. Short-lived programs are expected to have a negligible start-up time.

As a starting point, software should use algorithms and data-structures with reasonable time and space complexities. These need not be the most efficient implementations, but they certainly should behave reasonably well for any reasonable input size.
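
For example, a Haskell sketch: Data.List.nub is quadratic, whereas deduplicating through a Set behaves reasonably well for any reasonable input size.

import qualified Data.Set as Set

-- Data.List.nub is O(n^2): fine for tiny lists, it degrades badly as input grows.
-- This variant keeps the first occurrence of each element, like nub, in O(n log n).
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _ [] = []
    go seen (x : xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs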

Ideally, software performance should scale at least linearly with more hardware resources. Nevertheless, software should not rely on significant hardware resources being available in order to be performant; instead, it should rely on good engineering.

Sophisticated performance improvements should only happen after the software effectively solves the problem it expects to solve. Simple performance improvements might happen as part of the initial implementation, as long as they don't hinder the understanding of it.

Type-safety and security might be sacrificed for performance, but never as part of a public API, and only as long as the more performant implementation continues to be meaningful, readable and maintainable.

Humans

Software is made out of human will, and it exists as long as humans care.

Humans are your contributors, and they deserve encouragement, involvement, gratitude. Humans are your users, and they deserve transparency and a good product. Humans are your stakeholders, and they deserve involvement, commitments and understanding. They all deserve respect as well as explanations when they are mistaken.

Legal

If you expect your software to be installed by users on their computers and other devices, then your software shall be libre. Proprietary software is invasive, restrictive, hard to deploy, and raises many security and privacy related questions. If you are thinking about deploying proprietary software to your users, change your business model and deploy software libre talking to your proprietary online service instead. When doing this, embrace a Zero Knowledge approach as much as possible so that users are not forced to trust you, and make sure the user can verify that any executable code you deliver from your proprietary service runs in a sandboxed environment respecting the principle of least privilege.

Accompany your software libre with a license that grants users the expected rights, and that protects your contributors as well as the existence and growth of the project going forward.

The software must be clear and upfront about any legal terms. Do not invent your own software libre license, pick one of the well-known licenses, likely the one that will be least surprising to your target audience.

Consider using a license with a patent retaliation clause if you are worried that your business or project might be hurt by software patents.

Readability and maintenance

Readable code means that, as humans, we find it easy to parse a piece of code and uncover its meaning. Compilers and interpreters don't care about the readability of code; readability is for human beings.

The input

It should be obvious how to use a piece of code by explicitly stating what inputs are expected. When the meaning of inputs is not immediately clear from their types, there needs to be accompanying documentation that further explains this input. A function should only take as much input as necessary, never more, and it should take this input in a precise representation for each acceptable concept.

It is recommended, for example, that if A and B are both different ways of expressing the same concept “foo”, then a function caring about “foo” should only accept one of A and B, not both. Nevertheless, at times it may also be convenient to support different input types through polymorphism, but one should be careful not to make the meaning of code less obvious or significantly less inferable when introducing polymorphism. Parametricity is a very powerful tool for reasoning about polymorphic code, use it.
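
For instance, a sketch (the types are invented for illustration): if a duration may arrive either as seconds or as milliseconds, pick one precise representation, convert at the boundary, and accept only that representation everywhere else.

-- Both milliseconds and seconds express the concept of a duration; pick one
-- precise representation for the functions that care about it.
newtype Seconds = Seconds Rational

-- Convert at the boundary, once.
secondsFromMilliseconds :: Rational -> Seconds
secondsFromMilliseconds ms = Seconds (ms / 1000)

-- The function caring about durations accepts exactly one representation.
doubleDelay :: Seconds -> Seconds
doubleDelay (Seconds s) = Seconds (s * 2)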

The output

Like with inputs, the output of a function must be obvious and clear. It should be possible to infer a function's output type from as few input types as possible. A function's output may be polymorphic as well, thus leaving it to the caller to decide what concrete type to use, which makes this function easier to use and compose in different scenarios. Parametricity is a very powerful tool for reasoning about polymorphic code, use it.

Importing and using names

At call sites, it should be obvious where names or bindings in scope are coming from. This ideally means that names are either well-known, in which case they don't deserve any special attention, or that they are used in a qualified manner so that the module exporting the name becomes obvious.

Alternatively, if the name is intended to be used in an unqualified manner, then the name must communicate in a clear manner its intended purpose and expected input types, and it should be explicitly imported at the call site so that readers of the code can look through the module imports to find an explicit mention of this name in order to learn about its origin. However, this way of bringing names into scope is not without its problems. In particular, even though you might think that names such as userAddress and printUserWithDefaults are unambiguous, readers of code using these names might not be as aware as you were at the time of writing about what type of “user” and “defaults” the code is talking about. Unqualified imports, unless well-known or ubiquitous, are best avoided.
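
A small sketch of both styles (the modules shown are just common examples):

-- Qualified imports: the origin of each name is obvious at the call site.
import qualified Data.Map.Strict as Map
import qualified Data.Text as Text

-- Unqualified, but with an explicit import list, so readers can trace a
-- name back to its module by looking at the imports.
import Data.Maybe (fromMaybe)

describe :: Map.Map Text.Text Int -> Text.Text -> Text.Text
describe counts name =
  Text.concat [name, Text.pack ": ", Text.pack (show n)]
  where
    n = fromMaybe 0 (Map.lookup name counts)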

There is a whole different category of names that deserves special attention: infix operators. To be blunt, infix operators should be avoided unless they are well-known, and even then they may best be avoided anyway unless their usage can be justified. One problem with infix operators is that they are never intended to be used in a qualified manner, and that is often not optimal as discussed before, even more so considering that infix operators can't possibly communicate their intended argument types, and that attempting to use infix operators in a qualified manner, when possible, defeats their purpose. Another problem with infix operators is that they bring their own operator precedence rules, and except for well-known infix operators, one needs to guess precedence rules both when writing the code the first time, as well as when reading it at a later time. Yet another problem with infix operators is that once you have two of them in the same (small) expression, parsing the expression and deriving meaning from it becomes too hard due to the non-obvious precedence rules each infix operator brings, and modifying these expressions over time becomes harder as well.

Haskell side note: Avoid the RecordWildCards extension as a means of bringing names into scope. Probably. This extension not only brings an unknown and implicit number of names into scope, but it also brings them with a different type than originally intended. For example, whereas in data X = X { a :: A } the name a has type X -> A, it will have type A when brought into scope using RecordWildCards. This, in combination with the fact that when reading code where a is brought into scope with RecordWildCards one can't even identify this situation before first exhausting all other scoping mechanisms, makes RecordWildCards a bad extension to use whenever readability and maintainability are among your goals, as they should be.
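
To make the side note concrete, a small sketch:

{-# LANGUAGE RecordWildCards #-}

data X = X { a :: Int, b :: String }

-- Here 'a' is the ordinary field accessor, of type X -> Int.
useAccessor :: X -> Int
useAccessor x = a x + 1

-- Here X{..} silently brings 'a' and 'b' into scope as plain values of
-- types Int and String; nothing at the use site reveals where they came from.
useWildcards :: X -> Int
useWildcards X{..} = a + length b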

Defining and exporting names

Modules exporting new names should be explicit about any names they export. This makes the origin of a name obvious and searchable in a codebase.

Be careful about code generation techniques that create new names. Finding the origin of these names, unless explicitly exported, can be an odyssey.

Haskell side note: Avoid TemplateHaskell for generating names as much as possible, except perhaps when maintenance of these names over time would be error-prone. TemplateHaskell has its own problems, such as long compilation times, unexpected compiler and linker issues in some scenarios and platforms, and an unstable and error-prone API that offers little understanding about what the generated code does. Using TemplateHaskell to generate code that is not particularly hard to get right if written by hand, even when it may consist mostly of boilerplate, is just not worth the cost.

As a side note, consider exporting some of your internals as well, even if you can't guarantee a type-safe usage for them, nor the same quality of documentation as for your public API. The reason for exporting internals is to allow your users to combine them in ways different than those you've thought about without forcing them to fork your software. Make it clear, however, that these internals are not supported and are not guaranteed to be as resilient as the public API.

Naming things

Top-level names should clearly state their purpose, and they should be designed either for qualified usage or for unqualified usage, not for both. For example, if used in an unqualified manner, address is a terribly bad name for a function returning the address of a person; a name like personAddress would be much better. Nevertheless, if address is actually intended to be used in a qualified manner as Person.address, then the name is optimal. Do not introduce top-level names for compositions like peopleAddress that would just apply Person.address to each of the elements of a list of person elements (i.e., “people”). On the contrary, users of Person.address should be encouraged to compose it with the appropriate higher-order functions themselves as they see fit.
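
Sketching the example above (the module layout and fields are hypothetical), as two small files shown together:

-- Person.hs
module Person (Person (..), Address (..)) where

data Address = Address { street :: String, city :: String }

-- 'address' is designed for qualified use: callers write Person.address.
data Person = Person { name :: String, address :: Address }

-- Caller.hs
module Caller (addresses) where

import qualified Person

-- No 'peopleAddress' helper is needed: compose at the use site instead.
addresses :: [Person.Person] -> [Person.Address]
addresses = map Person.address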

Haskell side note: In the case of “field accessor” functions such as address, many of which are often Optics, it is important that their types can be easily inferred, because otherwise whenever one tries to use them at the wrong place—usually just as a side-effect of modifying the code that uses this optic—figuring out what the type errors one gets mean will be an odyssey. In other words, you may want to avoid overloaded record fields.

Non top-level names are mostly an obstacle, as they increase the cognitive load for readers and maintainers of the code; they should be avoided as much as possible and inline expressions should be used instead. Nevertheless, some non top-level expressions should be given their own names whenever doing so aids the understanding of the code or improves its structure. Non top-level names shall exist within a limited scope occupying as little real estate as possible, and within this scope a short name should be used for them. For example, if you are introducing a name for an expression of type MonthlyVegetablesProvider that will only be relevant to the next 4 or 5 lines of code, then myMonthlyVegetablesProvider is a terribly bad name for that expression, whereas mvp or even p are optimal. Being clear about the meaning of things by explicitly stating their types is much more effective than using verbose names, both from a readability and a maintenance point of view.
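
A handwavy sketch of this (the types are placeholders invented only to make the example self-contained):

newtype Vegetable = Vegetable String
newtype MonthlyVegetablesProvider = MonthlyVegetablesProvider [Vegetable]
newtype Menu = Menu [Vegetable]

inSeason :: MonthlyVegetablesProvider -> MonthlyVegetablesProvider
inSeason = id  -- placeholder for some real filtering

weeklyMenu :: MonthlyVegetablesProvider -> Menu
weeklyMenu provider =
  -- 'p' is in scope for only a couple of lines; its explicit type, not a
  -- verbose name like 'myMonthlyVegetablesProvider', says what it is.
  let p :: MonthlyVegetablesProvider
      p = inSeason provider
      MonthlyVegetablesProvider vs = p
   in Menu (take 5 vs)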

Code structure

The basic structure of a solution should be obvious; one shouldn't need to spend more than a handful of seconds in order to effectively grasp an overview of the general approach that's being taken to solve a problem. It might not be immediately obvious how to achieve this, but the following guidelines should help.

Use well-known abstractions. For example, if implementing a recursive solution then use fold or the fix-point combinator instead of explicit recursion. If doing error handling or state passing, then use a monad. If you are statically establishing relationships between entities, then use applicatives or arrows. By doing this, readers and maintainers of your code need not spend any time familiarizing themselves with obscure abstractions before proceeding to ignore mostly irrelevant implementation details, as they should always be able to do.
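
A small sketch of the first suggestion:

-- Explicit recursion: readers must re-derive the traversal pattern themselves.
totalExplicit :: [Int] -> Int
totalExplicit []       = 0
totalExplicit (x : xs) = x + totalExplicit xs

-- Well-known abstraction: the fold names the pattern, so only the combining
-- function and the starting value need to be read.
totalFold :: [Int] -> Int
totalFold = foldr (+) 0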

Factor out expressions and types only as long as the factored-out parts are meaningful on their own. Wrong abstractions are worse than no abstraction at all, and they become a maintenance nightmare.

Visual cues

Functional programming languages, being expression-oriented, particularly lend themselves to a very rare form of programming where the visual shape of the program can tell you a lot about what's going on. For this to work, however, it is necessary to use real estate conservatively and consistently, and to respect the geometric shapes that show up as we code. Be warned, the following is very handwavy.

For example, using Haskell syntax, if we encounter an expression with the following geometrical shape:

1.  ___ _
2.     (__ __ $ __
3.        ____
4.        ______ __ ___ $ __
5.           ___ __ ___
6.        ___ __ _)
7.     (___ ____ _)
8.  ___

From this shape we can easily see that we can ignore lines 3 to 6 inclusive if we don't care about what line 2 says, and that lines 2 and 7 are arguments to the expression started at line 1. Here is another example:

1. ____ _ ____
2.    ( ___     (\_ -> __ _ _____)  *|*
3.      _____   (\_ -> __ _)        *|*
4.      __      (\_ -> ___ _ __)    *|*
5.      ____ __ (\_ -> _ + _)
6.    )

Of course it is not obvious what this code does, but one can still get some hints from its shape. For example, one could say that *|* might be some kind of associative infix operator, due to its symmetrical shape and deliberate vertical alignment, and that one likely doesn't need to fully understand all of lines 2 through 5 in order to grasp a general idea of what this code might be doing; instead, understanding just one of those lines should suffice, seeing as the shape of the arguments to *|* is repeated time and time again. Additionally, one could hypothesize that quite likely the *|* operator is used to somehow obtain some value that is eventually passed to those functions with shape (\_ -> ...) that stand out in the code.

However, for this visual aid to work effectively, it is important that real estate is used wisely. In particular, vertical alignment is quite important, as the last two examples illustrated, and vertical white-space (blank lines) should be avoided, as it breaks the flow that allowed us to skip lines 3 to 6 in our first example once the vertical real estate required by our code grows. If you feel that you need a blank line to separate concerns, write a comment, give something an explicit type, or factor out some code instead.

In Haskell, type signatures need to be easy to parse by humans in order to reduce the cognitive load. Luckily, Haskell gives some delimiters for this:

_______
  :: forall _ _ (_ :: _) _
  .  ( _________ _
     , _______ _ ~ ____ (___ _)
     , ______ ____ _ )
  => _____ __
  -> ________ (_____ _)
  -> (forall _. _____ _ -> ____)
  -> ___________
  -> _ ( ______ ________ _
       , _________________ _ ____ )

Notice how even if we have no idea what this function does nor have we read its type in full, we can easily recognize in the shape of its type signature that it takes 4 value arguments, that 3 constraints need to be satisfied, that it returns some sort of tuple and that higher rank types are somehow involved in this solution. If we don't make an effort to vertically align type signatures in a predictable and easy to parse manner, deriving meaning from the type signature becomes much harder:

_______ ::
  forall _ _ (_ :: _) _.
  (_________ _, _______ _ ~ ____ (___ _), ______ ____ _) =>
  _____ __ ->
  ________ (_____ _) ->
  (forall _. _____ _ -> ____) ->
  ___________ ->
  _ (______ ________ _, _________________ _ ____)

One last visual cue that can aid readability significantly is the choice of names for polymorphic type variables and their term-level representations. For example, m, f and g are good names for functor-like types; f, g and k are nice names for function values; a, b and x are good generic names for various types and values. When we pick these familiar names, we immediately save readers of our code from the burden of having to care about what these types really are, and instead they can focus on type constraints or documentation.
