
@jordanorelli
Created April 27, 2021 17:07
on integration testing

when people say "integration testing", the feeling I get is that most people mean "unit tests that happen to perform i/o". is that the definition most people are using?

there's another definition, which is the one I learned when I first learned about unit testing, and which I have never seen anyone actually use: a unit test is an individual unit of testing, and "integration testing" is when you sequence the unit tests to create an integrated suite of tests. that is ... integration testing is when you integrate your unit tests, not when you test how your system integrates with another system. Those are distinct concepts! My suggestion here is not that the latter concept isn't valuable; it is valuable, it's just distinct, and I rarely see the first concept executed well.

For example, let's say you were testing some CRUD API and you wanted to test two things: the create and the update. The strategy that I most commonly witness is as follows (sketched in code just after this list):

  • create a unit test for your create action. Start with a fixed, known state (let's call it c0), then run the create. The system is now in some new state c1. Check the response to the create routine, and check that c1 is the state that you expect.
  • independently, create a unit test for your update action. Start with some known-good state (let's call it u0): the state of an existing object in a database. Creating this platonic starting state is itself new work. Run your update against u0, producing some new state (u1). Check the response to your update action and check that u1 is the new state that you expect.
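
In Go-flavored pseudocode, the isolated version looks something like this (every name here is hypothetical; it's just to make the shape of the strategy concrete):

func TestCreate(t *testing.T) {
    c0 := emptyState()                      // fixed, known starting state (c0)
    resp, c1 := runCreate(c0, newBookInput) // hypothetical helpers throughout
    checkCreateResponse(t, resp)            // check the response to the create routine
    checkState(t, c1, expectedAfterCreate)  // check that c1 is the state we expect
}

func TestUpdate(t *testing.T) {
    // u0 is the "platonic" starting state: built by hand, never produced by create
    u0 := fixtureWithExistingBook()
    resp, u1 := runUpdate(u0, updateInput)
    checkUpdateResponse(t, resp)
    checkState(t, u1, expectedAfterUpdate)
}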

That's all well and good, but you've now created a handful of new problems:

  • how do you define the success criteria of the create action (that is, the verification p such that p(c1) indicates that the test for create passes) without defining it in terms of the read action, in order to guarantee isolation of the things under test? What value is provided by testing the create action alone? Doesn't this create a new hazard where the verification logic of the create test can diverge from the actual logic of the read action?
  • how do you define the initial state for the update test (in this example, u0)? Is that not simply the result of the create action? The update action is now being tested off of a platonic starting state. How do you know that this platonic starting state is reachable by your system? Is it not the case that c1, the output of the create test, and u0, the input of the update test, should always be equal, or else your tests are invalid? If that state is reachable now, how do you ensure that it continues to be reachable as your system changes?
  • if your update is tested off of a platonic starting state that is not the exact output of the create action, you now have two problems: your update is not testing the state reached by the create routine, and you've created a new, false requirement that the update action be usable against a state that is not reachable by your system. You had to go through all the trouble of creating this state, which is new work, when the create action ... literally does that work. The value provided by the isolation has to be significantly greater than the cost of creating that state, otherwise you're just generating busywork.

anyway, this comes up a lot for me since my primary project is a stateful multiplayer server whose only job is to contain and communicate the state of a game. integration testing this thing is ... hard. curious what people do for integration testing at a conceptual level, not at a tools/language/library level. do other people also face the problem I'm facing, or do people find testing against platonic states relatively unproblematic, in which case it sounds more like I'm doing it wrong?


jmoiron commented Apr 27, 2021

My POV on this topic comes from working on data storage systems.

Of your 3 problems, the 3rd is definitely the worst and I do not find tests of that nature to be useful. If the system changes and those states are no longer possible, what is the test telling you? These types of things are not durable to change, and detecting failures due to changes is the entire point of having these tests.

I look at a data storage system as a set of axiomatic actions, usually with some higher-level actions built on top of those axioms. Since I can't just declare that create works because it is axiomatic, I typically try to build high confidence in it to "axiomize" it, at which point I'm happy to then define the rest of the system behaviour naturally in terms of those actions.

Isolation is useful because it tells you what went wrong, but I see it more as a desired thing than a required thing. It's more important to catch bugs than to write tests that are well isolated but do not catch them.

Despite that, you can usually find some isolation in things like create by expanding your coverage of its side effects. What is observable about the null state? How much of that changes?

Typically, I will want a list verb that is not defined in terms of read, which returns all of the state in a trivial way. Systems will also typically expose a count verb that efficiently counts the number of records, either by tracking it as a sequence or just by not having to return all of the data. I also usually expose some internal counters for telemetry purposes, and these can be useful signals for establishing the correctness of core operations: e.g. a create from the null state might increase the number of pages allocated, and a count might increase the number of queries received.

With those signals, there's now a lot more you can test about create, and some of those tests should fail in isolation: if create fails to increment the number of records, for example, your problem is probably not in list or read. You can define your null state: count(null) == 0, list(null) == [], read(any) == null, then run a create and ensure all of those signals changed appropriately. The strongest signal is the create -> read round trip, but you can use list as a backup check on data integrity. If list worked but read failed, the problem is probably in read.
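
In Go, that null-state check might look something like this (a sketch; newStore and the verb methods are hypothetical stand-ins for whatever the real system exposes):

func TestCreateFromNull(t *testing.T) {
    s := newStore() // hypothetical: a store in its null state

    // pin down the null state across every observable signal
    if n := s.Count(); n != 0 {
        t.Fatalf("count(null) = %d, want 0", n)
    }
    if items := s.List(); len(items) != 0 {
        t.Fatalf("list(null) = %v, want []", items)
    }

    // run the create, then check that every signal moved the way it should
    if err := s.Create("k", "v"); err != nil {
        t.Fatal(err)
    }
    if n := s.Count(); n != 1 {
        t.Fatalf("count after create = %d, want 1", n)
    }
    // the strongest signal: the create -> read round trip
    if got, ok := s.Read("k"); !ok || got != "v" {
        t.Fatalf("read after create = %q, want %q", got, "v")
    }
}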

Without list or count, it's potentially difficult to verify that create did not modify additional state in error, on top of the expected state modification, so these are actually really good for checking all of the bounds around your core operations.

Now, create's behaviour from null state has a bunch of tests of various strength. Some may be weak signals, and some are worryingly coupled to the implementation, but the goal is to define create so thoroughly that subsequent tests can take its correctness for granted. Later, if those details about the initial null state transition change, so long as the behaviour still works, the rest of your test suite will succeed and the problem should be pretty obvious.

If list and count end up being too expensive for your system, or you don't want to expose them, you don't have to ship them in the public API, but you can still use them internally to verify the exported API.


jordanorelli commented Apr 28, 2021

oh man am I glad to hear from you.

Without list or count, it's potentially difficult to verify that create did not modify additional state in error

Ah! yeah, this is a good example. You test that you got what you did want, but forget to test that nothing else happened, and so you may have unintended side effects. I've definitely had this problem before.

If I'm understanding correctly, you're basically creating a sort of proxy value of the system state and testing against that: a neutral frame of reference that is more simply described than the entire state of the system. So long as things look correct from that frame of reference, we're ok; we just test the validity of the frame of reference separately. That makes sense to me. I think that's a pretty good and general solution to this category of problem. It's also probably data you already want anyway, because for a lot of projects that aggregate state data doubles as your observability metrics.

For my project, one of the big challenges is that we have domain situations where we might have a sequence of many stateful steps. So a challenging situation that I had to test for a multiplayer game was:

  • a game creates a room (gameplay session) with profanity filtering
  • the game poses a question to the connected players, but does not have to specify that it wants to use profanity filtering for this question, because that was already indicated when we created the room. This means that to turn on profanity filtering we only have to update the logic of where we create the play session, not the logic everywhere we pose a question.
  • a player sends in a profane answer to the question. they receive a reply rejecting their answer because it is profane.
  • ensure that the game and other players do not see the profane submission. This ensures that we don't have to update the logic everywhere that we receive a player submission; just as in the case of profanity filtering not existing, all submissions are treated as accepted since the server accepted them.
  • the player can send a new, clean answer, which is accepted and seen by the game and other players.

This was really hard to test because each of the steps relies on the state created prior, so going in full-isolation mode, the process of creating the starting state for the last step wound up being a lot of work, especially since you have to model out whether a client is receiving notification of another client's activity. Eventually, modeling the beginning states became complex enough that creating these initial state values and maintaining them over time meant people weren't writing tests. Also, everything is in-memory, so I can run hundreds (thousands?) of tests in under a second because it's Go; any perceived performance benefit of test isolation was irrelevant.

Isolation is useful because it tells you what went wrong, but I see it more as a desired thing than a required thing. It's more important to catch bugs than to write tests that are well isolated but do not catch them.

yeahhh. We definitely had that problem in the past: well-isolated tests that weren't catching bugs.

I wound up writing a testing library for writing dependent tests: tests that depend on the "output" of other tests: https://github.com/jordanorelli/tea/ (the incr example is probably the most straightforward to understand)

I know it looks abandoned but ... I'm actually using it in production now; I have a graph of ~600 tests that block CI for deploying our multiplayer server, and that suite of tests creates http servers and establishes websockets and tests that when one client does something, other clients do (or do not) see those things. So I'm starting and stopping hundreds of http servers and creating hundreds of websocket connections. I'm hitting ulimit problems now because many tests involve at least 3 connections (the game and two players) so I'm opening over a thousand websockets when I run go test.

It works by creating a tree of tests, where each test is a value defined by the following interface:

type Test interface {
    Run(t *testing.T)
}

so, like, if you had three test types: startAPIServer, which tests that the API server starts and is reachable; createBook, which uses some create endpoint to create a book record; and getBook, which uses the API endpoint to get the record back:

func TestAPI(t *testing.T) {
    // the root of the tree: every test below depends on the server being up
    root := tea.New(&startAPIServer{})

    created := root.Child(&createBook{title: "everybody poops"})

    // before the create the book shouldn't exist; after it, it should
    root.Child(&getBook{title: "everybody poops", expectError: ErrNotFound})
    created.Child(&getBook{title: "everybody poops"})

    tea.Run(t, root)
}

The tests use Go sub-tests, so you get tree output using just go test with no additional tooling. If a test fails, all of its subtests are still printed in the tree but marked as skipped, so that when a test fails you know how many dependent tests were skipped. It uses struct tags to pass state data from one test to the next in a given sequence. Running a given test means re-running every test back to the root of the tree, so you write tests that ergonomically look like dependent tests, but they actually execute as isolated tests.
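
Loosely, the struct-tag hand-off might look like this (the tag names here are invented purely for illustration; the tea README documents the real convention):

// hypothetical illustration only: a field the parent test saves gets copied
// into the matching field of each dependent test before that test runs
type createBook struct {
    title  string
    BookID int `tea:"save"` // hypothetical tag: set during Run, handed to children
}

type getBook struct {
    title  string
    BookID int `tea:"load"` // hypothetical tag: filled in from the nearest ancestor
}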

Writing this library dramatically increased our test coverage and test productivity. I write the test types, and any developer can create new test cases and new test sequences very easily by just composing the types that I've defined. But now I have a new problem: maintaining the test library is work, and its output is very bad and very confusing. I'm at a point where I have to work on the test library itself, but I'm not sure if this is just a very silly concept and I should throw it out, or if I should double down on it.

But it's interesting that you're not just like, throwing down a testing library that you think solves this problem elegantly and instead describing how to hand-roll this because the entire category is a bit dicey. Everyone I've talked to about this describes a different ad-hoc way of dealing with it.


jmoiron commented Apr 28, 2021

oh man am I glad to hear from you.

😄 👍

For my project, one of the big challenges is that we have domain situations where we might have a sequence of many stateful steps. So a challenging situation that I had to test for a multiplayer game was: [...]
This was really hard to test because each of the steps relies on the state created prior, so going in full-isolation mode, the process of creating the starting state for the last step wound up being a lot of work, especially since you have to model out whether a client is receiving notification of another client's activity

I think it's totally reasonable to write out a test with all of your bullet points happening in sequence.

These kind of remind me of transaction isolation tests and rollback tests, where some state transformations get rejected and shouldn't be seen by other users. For these, I write tests more like stories or scenarios; the bullet points you listed would probably be comments in the test, and it would be long, but I think it's useful to test these behaviours as a sequence of user actions rather than as a collection of isolated behaviours under different state conditions. These tests are very easy to read, but they only test a very specific permutation of actions, so they miss a lot of bugs.

For this reason I split tests into ones that cover the "intent" of a feature and ones that cover its "implications" in the wider system. Intention tests tell the story of what a feature is supposed to do; they're mostly intended to get the feature from zero to working, and for future developers to understand the intentions of the new feature. Coverage tests are there to test how that feature interacts with the wider system. Randomized testing, especially property testing (though I usually do this without a framework), is really good at the second, but it makes for bad reading, and those tests are harder to write.
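
A frameworkless property test along those lines might look roughly like this (a sketch, assuming the usual fmt/math/rand imports; newStore and its verbs are hypothetical, and the map is the trivially-correct reference model):

func TestRandomizedOps(t *testing.T) {
    rng := rand.New(rand.NewSource(1)) // fixed seed so failures are reproducible
    s := newStore()                    // hypothetical system under test
    model := map[string]string{}       // trivially-correct in-memory model

    for i := 0; i < 10000; i++ {
        k := fmt.Sprintf("k%d", rng.Intn(50))
        switch rng.Intn(3) {
        case 0: // create/update
            v := fmt.Sprintf("v%d", i)
            s.Put(k, v)
            model[k] = v
        case 1: // delete
            s.Delete(k)
            delete(model, k)
        case 2: // read: the system must always agree with the model
            if got := s.Get(k); got != model[k] {
                t.Fatalf("op %d: get(%q) = %q, model says %q", i, k, got, model[k])
            }
        }
    }
}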

I wound up writing a testing library for writing dependent tests: tests that depend on the "output" of other tests: https://github.com/jordanorelli/tea/ (the incr example is probably the most straightforward to understand)
The tests use Go sub-tests, so you get tree output using just go test with no additional tooling. If a test fails, all of its subtests are still printed in the tree but marked as skipped, so that when a test fails you know how many dependent tests were skipped. It uses struct tags to pass state data from one test to the next in a given sequence. Running a given test means re-running every test back to the root of the tree, so you write tests that ergonomically look like dependent tests, but they actually execute as isolated tests.

Cool! So you basically write tests as an n-ary tree, and each path to a leaf gets executed separately, but you've only had to define each node once for the tree instead of once per path. That could be really good, especially if you have big N's and a lot of depth; decoupling the tree structure from the lexical code structure is what lets you support much larger N and greater depth. I also love that failure at a node can short circuit the rest of the path so I don't see an enormous amount of failure output when I break something fundamental.

I can also see such a tree getting unwieldy to manage and hard to follow. It reads like "small functions" code because each step is broken up spatially and you have to follow its descendants much less naturally.

I've never been a cucumber/convey fan but one benefit to its approach is that the paths it builds through the call tree read a lot like the linear stories I like to tell in my tests through comments.

It looks good for enumerating a lot of permutations at each level, but while this is good for coverage, I've found it is kinda mixed in terms of ROI for "catching bugs". This might be the problem you're running up against... having loads of paths gives you great coverage but the coverage of the average test is very low.

For a randomized test for transaction isolation, to continue my example... if you have a huge class of writes going through the same latch internally, then you can settle on the write that is the easiest to verify for your property test, and then verify that the other writes hit the latch to ensure their "coverage" doesn't get invalidated. This saves a lot of time and a lot of code, and while it is technically a sacrifice in coverage, the quality of the test suite is still really high, and the class of bugs that could remain is really small.
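
A rough sketch of the latch-check half of that, with a hypothetical test-only counter exposed on the internal latch:

func TestWritesShareLatch(t *testing.T) {
    db := newTestDB()                // hypothetical constructor
    before := db.latchAcquisitions() // hypothetical counter on the internal latch

    db.Update("k", "v") // each write verb should route through the same latch...
    db.Delete("k")

    if got := db.latchAcquisitions() - before; got != 2 {
        // ...so the property test over one verb keeps covering the others
        t.Fatalf("want 2 latch acquisitions, got %d", got)
    }
}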

A lot of stateful systems have things like this in them, because their verbs are built on top of each other, and I'd say that the single property test + latch check gives me a higher degree of confidence in the system than a hand-built tree of all possible writes through whatever sequence of events that I imagined when writing the test, because the property test will run sequences that I didn't anticipate.

But it's interesting that you're not just like, throwing down a testing library that you think solves this problem elegantly and instead describing how to hand-roll this because the entire category is a bit dicey. Everyone I've talked to about this describes a different ad-hoc way of dealing with it.

Maybe there is one out there, but I don't know about it. I've not seen many testing libraries or testing frameworks that are about providing you with a tailored approach to testing a specific kind of system. Even things like Quickcheck and Hypothesis, which are probably the closest I know of, don't really tell you how to approach this. They give you a generalized approach towards producing inputs and checking outputs and are therefore still kind of "pure".

There's also this consistent push from people from various angles (infra, FP advocates, etc) for everything to be "stateless" which.. yeah, fine, stateless and immutable things are easier to reason about and preferable where possible, but also the world is stateful and everything in computing is built on top of state, so we need to be honest and admit that we're offloading those problems and they're still important to solve instead of just avoiding them.

There's probably a whole lot of cognitive bias that is going on.

  • People think their problems are special when they aren't.
  • Projects involving state storage & transition tend to be big and long-lasting, so the same person doesn't see a lot of them during their career to find the patterns in them.

And finally:

That being said, I think that, as engineers, we tend to discount the complexity we build ourselves vs. complexity we need to learn.


jordanorelli commented Apr 28, 2021

write tests as an n-ary tree, and each path to a leaf gets executed separately, but you've only had to define each node once for the tree instead of once per path.

yes, exactly. It's called "tea" because you're "reading the tea leaves" to tell your fortune.

I also love that failure at a node can short circuit the rest of the path so I don't see an enormous amount of failure output when I break something fundamental.

yeah, one of the shortcomings of using Go's sub-tests natively is that if your test fails and exits early, the sub-tests are never created. So a pass might be like "300 tests passed", and a failure might be like "150 tests passed, and 1 test failed", when what actually happened was "150 tests passed, 1 test failed, and 149 tests were skipped". Although Go's sub-tests allow you to mark tests as skipped explicitly, the ergonomics of doing so mean that it's very easy to mess that up. tea handles that for you automatically.
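
For example, with plain sub-tests (create and read are hypothetical helpers):

func TestSequence(t *testing.T) {
    t.Run("create", func(t *testing.T) {
        if err := create(); err != nil {
            // Fatal stops this test function here, so the "read" sub-test below
            // is never even registered: it shows up as neither failed nor skipped
            t.Fatal(err)
        }
        t.Run("read", func(t *testing.T) {
            if err := read(); err != nil {
                t.Fatal(err)
            }
        })
    })
}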

I can also see such a tree getting unwieldy to manage and hard to follow. It reads like "small functions" code because each step is broken up spatially and you have to follow its descendants much less naturally.

this is a massive problem we have now with tea. Writing tests is super easy, but looking back at the tree and adding tests to a large tree that already exists is nightmarishly confusing. I have to do some work to improve the ergonomics of larger test graphs.

I've never been a cucumber/convey fan

yeah so one of the problems that convey has is that because a test accesses the side-effects of its ancestors via closures, and stack frames always have a single parent, a given test can only appear along a single path. Since tea uses structs and struct fields to persist the runtime environment from test to test instead of stack frames and closures, tea has no such constraint. For example:

func TestCreateBook(t *testing.T) {
    // since you're creating a test -plan- and executing it in separate phases,
    // you can manipulate the plan arbitrarily -before- running it. This function takes
    // a node in a test plan as its input, and adds to it a bunch of children.
    addChildren := func(root *tea.Tree) {
        root.Child(&getBook{title: "the giving tree", expectError: ErrNotFound})

        withBook := root.Child(&createBook{title: "the giving tree"})
        withBook.Child(&getBook{title: "the giving tree"})
    }

    sql := tea.New(&startSQLServer{})
    mem := tea.New(&startMemServer{})

    addChildren(sql)
    addChildren(mem)

    tea.Run(t, sql)
    tea.Run(t, mem)
}

You can do that now and it works. The entire tree is a data structure that can be manipulated arbitrarily, so you can adopt tea easily into projects that use table-driven tests.
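
e.g. something like this should work today, reusing the types from earlier (a sketch, not a snippet from the repo):

func TestBooks(t *testing.T) {
    root := tea.New(&startAPIServer{})

    // build the plan from a table: one create -> get chain per title
    for _, title := range []string{"everybody poops", "the giving tree"} {
        root.Child(&createBook{title: title}).Child(&getBook{title: title})
    }

    tea.Run(t, root)
}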

Also any two equal test values are equivalent. I think this makes them "referentially transparent" but I've never done FP so I dunno. This:

a := tea.New(&A{})
a.Child(&B{X: 1}).Child(&C{Z: 10})
a.Child(&B{X: 2}).Child(&C{Z: 10})

Is exactly the same thing as this:

a := tea.New(&A{})
c := &C{Z: 10}
a.Child(&B{X: 1}).Child(c)
a.Child(&B{X: 2}).Child(c)

It doesn't matter that they're pointers to the same struct, because tea doesn't actually use that struct: the value is only treated as a template, and it's copied before it's ever used. The template value is never actually mutated; we create a new value to mutate in order to ensure isolation. This works presently and I rely on it.
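
The copying step could be as simple as a reflective shallow copy; a sketch of the idea (not tea's actual internals; needs import "reflect"):

// clone makes a fresh copy of a template test value so the user's original
// is never mutated (assumes every test is a pointer to a struct).
func clone(test Test) Test {
    v := reflect.ValueOf(test).Elem() // the template struct behind the pointer
    c := reflect.New(v.Type())        // a brand-new zero value of the same type
    c.Elem().Set(v)                   // shallow-copy the template's fields into it
    return c.Interface().(Test)
}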

I've been working on making it so that you can combine nodes in the plan to treat it as a DAG instead of a tree. So long as there are no cycles you can always break the DAG apart into its component paths. E.g., that prior example would hypothetically turn into this:

func TestCreateBook(t *testing.T) {
    sql := tea.New(&startSQLServer{})
    mem := tea.New(&startMemServer{})

    allDBs := sql.And(mem)
    allDBs.Child(&getBook{title: "the giving tree", expectError: ErrNotFound})
    withBook := allDBs.Child(&createBook{title: "the giving tree"})
    withBook.Child(&getBook{title: "the giving tree"})

    tea.Run(t, allDBs)
}

(that's not implemented yet though.)

things like Quickcheck and Hypothesis, which are probably the closest I know of, don't really tell you how to approach this.

oh cool these weren't on my radar, thanks for mentioning them, they'll be good prior art to look at.

There's also this consistent push from people from various angles (infra, FP advocates, etc) for everything to be "stateless" which.. yeah, fine, stateless and immutable things are easier to reason about and preferable where possible, but also the world is stateful and everything in computing is built on top of state, so we need to be honest and admit that we're offloading those problems and they're still important to solve instead of just avoiding them.

yeahhhh I encounter this -a lot-. My project serves only a single purpose: to handle the state management so that other systems don't have to. So much conventional wisdom is poorly suited to projects of this nature because usually people are just using a database, but what I'm making has many similarities to databases.

anyway thanks for the feedback, it sounds like making a library like this isn't raising all sorts of red flags to you.
