The following was written by Simon Korzunov
I decided to capture my thoughts in a document. The intent is for it to be a starting point for a discussion.
So here is my collection of thoughts/feedback on the freshly added testing doc (thanks @felixfbecker!).
In general I do agree with the stated goals, but I would love to simplify them and reduce their number. Just to recap:
- Our product and code works as intended when we ship it to customers.
- Our product and code doesn’t accidentally break as we make changes over time.
- Failures are clear about what is not working and give as many hints as possible to why it’s not working.
- Tests do not commonly need manual updates when the implementation is changed
- Tests are easy to write and maintain.
Proposed:
- Making changes with confidence
- Tests are an aid, not a burden
Assuming good intentions, we all want to ship high-quality software all the time, but how do we know that the quality is high? There are many different models/opinions/practices for how to achieve good quality, and most of them have good points all over the place. So I would love to step back from "how" and focus on "what" instead: we want to be confident that we are shipping high-quality software.
I personally believe in unit testing and the pyramid of tests, but I would love to rigorously apply the lens of "does this test give me confidence that we are shipping with high quality?" to all tests and testing practices.
What I like about this approach is that it allows many opinions and ideas to coexist. As an FP enthusiast it is easy for me to argue for "purity" and an "input -> output" approach to testing, but I feel like we need a shared goal to unite around rather than arguments over the "how to get there" details.
I think about writing tests as a tradeoff (or many tradeoffs): time spent writing them vs. confidence in correctness, test maintenance vs. speed of making a change. What is important about all of these tradeoffs is that they should collectively bring value to a specific team/company/product. It is important to remember that tests are written by people for people (or maybe even for future people?), and people are... different. This implies that the tradeoffs are context dependent, so there is no single "true way" of doing things. If we try to have a discussion about what a good test is in the first place, it is useful to anchor it around "is it helping us or holding us back, and what are the tradeoffs for us?".
With that, I would love to apply these two principles to several points from the document and provide context from my personal experience. Note that "personal" is important here: while I do have strong opinions on the subject, mine is one opinion/experience out of many on the team and should be treated as such.
I was once hired by a company to support a legacy C++ project that nobody wanted to touch. They "forgot" to mention at the interview that it was written by a "compiler" person who only knew assembler at the time; this project was his first(ish) experience using C++ coming from assembler. It was an engine that generated PDF reports out of data plus some custom layout logic (usually for finance reports). The heart of it was a file with a single function of 20,000 (yep, 20 thousand) lines of code and 13 goto statements. I was given the task of introducing changes to it and making sure that I didn't break anything in the process. Note that there was no concept of tests for this project.
After struggling solo for a while, my tech lead suggested using the concept of "snapshot" testing (I had no idea what that was). The idea was simple: handcraft custom layouts with predefined data that exercised different features, keeping each layout as atomic as possible (each exercising a single feature); generate PNGs out of them and check them into source control. The test would then produce a fresh PNG and compare it with the saved snapshot. Note that PNGs are super easy to compare, both for humans and machines.
After several weeks I produced enough snapshots (about 100 if I recall correctly) and started implementing changes.
In the end I was able to introduce required changes with only a few prod glitches along the way (or my memory is too kind to me).
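The snapshot flow described above can be sketched as follows. This is a minimal, framework-free illustration: `render` is a made-up stand-in for the real PDF/PNG engine, and the snapshot is a string rather than an image, but the record-then-compare mechanics are the same.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical stand-in for the real engine: layout + data -> rendered output.
// The actual project produced PNGs; a string keeps the sketch self-contained.
function render(layout: string, data: Record<string, number>): string {
  return Object.entries(data)
    .map(([key, value]) => `${layout}:${key}=${value}`)
    .join("\n");
}

// Core snapshot logic: the first run records the output,
// later runs compare fresh output against the saved file.
function checkSnapshot(
  dir: string,
  name: string,
  actual: string
): "created" | "match" | "mismatch" {
  const file = path.join(dir, `${name}.snap`);
  if (!fs.existsSync(file)) {
    fs.writeFileSync(file, actual);
    return "created";
  }
  return fs.readFileSync(file, "utf8") === actual ? "match" : "mismatch";
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "snapshots-"));
const output = render("invoice", { total: 100 });

console.log(checkSnapshot(dir, "invoice", output)); // created
console.log(checkSnapshot(dir, "invoice", output)); // match
console.log(checkSnapshot(dir, "invoice", render("invoice", { total: 101 }))); // mismatch
```

The key property is that a "mismatch" is only ever a question for a human: the machine detects the change, a person decides whether it is desired.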
Looking back, I think there were several key factors:
- I was able to treat the system as a blackbox. I was able to model it as "input -> output" system with no need to peek inside.
- The desired behavior was to never change the output. This is probably the most important aspect of all: from a business point of view we wanted to preserve the logic rather than evolve it. I will come back to this point later.
- Snapshots were easy for a human to compare/check. Within a second of looking at a PNG diff (there was a tool for that) I could tell what was wrong with the image; the tool even painted a red bounding rectangle around the difference.
- Snapshots never changed in batches. Of course, unless I broke something obvious :) That allowed me to identify impacted areas quickly and triage a problem with only a few images at a time.
So I would love to weigh enzyme snapshots against the points above (please feel free to share objections/your personal experience):
- In contrast to the blackbox point above, React components are core to our product, and hopefully where we spend a lot of our time and focus.
- The desired behavior is to always iterate on the product, not to freeze its output.
- Changes in snapshots are hard to classify as correct vs. broken. If there is an additional class or div, is it correct or wrong?
- In my experience with enzyme snapshots, the opposite of "snapshots never change in batches" is usually true. Example: an updated version of @reach/ui or of an internal reusable component will break a lot of tests. More importantly, it is not clear whether the change is desired or not.
Now back to the proposed goals:
They give confidence only if none of them break, but they don't give much when tests do fail. "Why?", "What changed?", "Is this OK?": these are the questions you will face with every failed test. So my point is that enzyme snapshot tests give you confidence only if your product doesn't evolve, which is probably the opposite of what we want.
Recently I read a tweet about how updating a component from a library broke 28 snapshot tests out of 50, and for each of them you need to answer "Why?", "What changed?", "Is this OK?".
So to reiterate: snapshot tests give a lot of value for parts of the product that change very infrequently, and less and less for more volatile ones.
Storybook + Percy snapshots score more favorably against the points above:
- Storybook tends to be used for small(ish) components -> more atomic.
- Percy captures images -> images don't care about nested divs and are easy for humans to compare.
I’m generally not a fan of using any extra tools for tests, such as sinon, expectObservable, and mocks in general.
While I do believe they are useful, they introduce a barrier for future changes and put pressure on present and future readers, e.g. they assume specific knowledge in order to understand what is actually going on. Will you remember what sinon does tomorrow? Will your future teammate have experience with sinon? Do all teammates know exactly what you know about expectObservable?
So in my practice this directly works against the “not a burden” statement.
I understand and respect the concept of “just learn it". Yes, none of these things are "rocket science", but it is useful to apply the perspective of a "learning budget": people can only learn a finite number of things within a given time frame. We should spend that budget wisely.
In my experience sinon and other tools can help clarify “why?" and hint at "what to do?” (thus reducing the burden), but I have yet to find a test that cannot be rewritten without them. If error messages are the concern, I think we should strive to add them to asserts directly rather than rely on sinon (maybe even with a lint rule?).
I do believe that for tests the property of local reasoning is extremely important, because it minimizes the time a developer spends between "broken test in CI" and "I understand what is going on". A lot of the time, people will break things with exactly zero context on your code and tests. Thus, it is useful to minimize the context required to understand a failed test, because it is likely that other people will need to fix it.
beforeEach()/afterEach() directly breaks local reasoning, because there is one more thing that happens before and after each test. A common reason for failed tests that use setup/teardown logic is stale state left over from the previous test; thus, when you see beforeEach()/afterEach(), you can never just look at the body of the test, but also have to assess all the tests in the suite plus the setup/teardown logic.
It is worth mentioning the idea that "it is more OK to copy/paste in tests"; the assumption here is that it may (or may not) improve the local reasoning property.