## onyx-jepsen-walkthrough
Jepsen
* client w/# of threads running on test machine
* nemesis - orchestration of the faults on the nodes
* operations that....


onyx-peers.jobs.basic-test
* jobs
* no nemesis
* close
* read ledger back
* invoke client manually on each client
* run checker on the events
  - which properties have held over the test
* write to bookkeeper ledgers
* close the ledgers
* wait for completion of jobs
* read peer log back
* pass full history

Checker
* takes test-setup, peer-config, #peers, #jobs
* look at history of all events run on client
* is that history valid?
* plays back log
* count # peers on replica
* check pulse of peers in ZooKeeper
* check invariants to make sure they are all true
* job-invariants
  - check whether there was an exception for the job
  - looks up all ledgers that were read back
  - checks results that they are from the correct job
    - so segment did not go through the wrong job
    - values that were written were the ones that were read
  - the Checker knows which invariants should be applied per job
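
The job invariants above can be sketched as a small pure function over the history that was read back. This is only an illustration in Python, not the project's checker: `JobResult`, `check_job_invariants`, `check`, and the result keys are hypothetical names, and the data shapes are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class JobResult:
    """Hypothetical summary of one job, assembled from the read-back ledgers."""
    job_id: str
    exception: Optional[str]           # None if the job raised nothing
    written: List[int]                 # values the clients wrote for this job
    read_back: List[Tuple[str, int]]   # (job id that produced it, value)


def check_job_invariants(job: JobResult) -> dict:
    """Evaluate each invariant separately so a failure names the broken property."""
    return {
        "no-exception": job.exception is None,
        # No segment went through the wrong job.
        "correct-job": all(jid == job.job_id for jid, _ in job.read_back),
        # The values written are exactly the values read (order ignored).
        "written-equals-read": sorted(job.written)
        == sorted(value for _, value in job.read_back),
    }


def check(jobs: List[JobResult]) -> dict:
    """The history is valid only if every invariant holds for every job."""
    per_job = {job.job_id: check_job_invariants(job) for job in jobs}
    valid = all(all(result.values()) for result in per_job.values())
    return {"valid?": valid, "jobs": per_job}
```

Returning one boolean per invariant, rather than a single pass/fail flag, mirrors the "which properties have held over the test" question in the notes.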

onyx-jepsen.onyx-basic-test
* :random-halves nemesis (covered in blog)
* :awake-ms - ms peers will be connected before partitions occur
* generator
  - generator to set frequency at which jobs are submitted
  - staggered at 1/10 s - so bursts, then nothing
  - start-stop-nemesis-seq - partitions halves, stops, starts, etc
* jepsen-test
  - os, client, generator, nemesis, etc
  - checker is always the same (adds different invariants based on the job)
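
To make the timing concrete, here is a rough Python sketch of the two schedules described above: staggered job submission and the alternating nemesis sequence. It is not the Jepsen generator API. `staggered_times`, `start_stop_nemesis_seq`, `partition_ms`, and `total_ms` are hypothetical, and the uniform delay with mean `dt` is an assumption about how the staggering behaves.

```python
import random


def staggered_times(n, dt, rng):
    """Return n submission times (seconds) with random gaps averaging dt.

    Gaps are drawn uniformly from [0, 2*dt], so some operations arrive in
    bursts and others after a pause.
    """
    t = 0.0
    times = []
    for _ in range(n):
        t += rng.uniform(0, 2 * dt)
        times.append(t)
    return times


def start_stop_nemesis_seq(awake_ms, partition_ms, total_ms):
    """Return (time_ms, op) pairs: stay connected for awake_ms, partition for
    partition_ms, heal, and repeat until total_ms is reached."""
    events = []
    t = awake_ms
    op = "start"
    while t < total_ms:
        events.append((t, op))
        # After a start the partition lasts partition_ms; after a stop the
        # cluster stays healed for awake_ms.
        t += partition_ms if op == "start" else awake_ms
        op = "stop" if op == "start" else "start"
    return events
```

For example, `start_stop_nemesis_seq(100, 200, 700)` yields a start at 100 ms, a stop at 300 ms, a start at 400 ms, and a stop at 600 ms.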

onyx-jepsen.onyx-client
* sets up client with all these parameters (ledger-handle, etc)
* 5 clients, see different events, but have a full history of what happened to the cluster
* client gets client events
* nemesis gets nemesis events
* end
  - heal network
  - read ledger
  - read jobs?
  - then Checker

onyx-jepsen.simple-job
* build-job
  - each client has ledger id
    - one thread writing to one ledger
  - when a client sees a submit-job event, it builds the job based on the client ledger
    - generate a job based on those ledgers
    - add-read-ledgers
  - write ledger per client
    - one task per ledger - with ledger id (eg. read-ledger-5)
    - creates links in workflow
* simple-job will end up looking like the visualization in the blog post
* 5 clients that can write to bookkeeper ledger, can create jobs
  - then read linear job history
  - checker checks
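
The shape of `build-job` can be illustrated with a short Python sketch: one read task per ledger, each linked into the workflow. Only the `read-ledger-<id>` naming comes from the notes; the function signature and the downstream task name `persist` are hypothetical.

```python
def build_job(ledger_ids, downstream="persist"):
    """Build one read task per client ledger and link each to a shared task.

    `downstream` is a placeholder name for whatever task consumes the
    segments; the real job may use a different task or several.
    """
    tasks = [f"read-ledger-{ledger_id}" for ledger_id in ledger_ids]
    # Workflow edges are (from-task, to-task) pairs.
    workflow = [(task, downstream) for task in tasks]
    return {"tasks": tasks, "workflow": workflow}
```

With five clients, `build_job([1, 2, 3, 4, 5])` produces five read tasks fanning into a single downstream task.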

Orchestration
* docker images customized for Onyx with ZooKeeper installed
* 5 nodes
* adding to container saves time in test setup
* upload uberjarred peers
* run-peer script
  - starts jar
  - launch-prod-peers entrypoint
  - with aeron settings to make sure aeron doesn't die during startup

onyx-jepsen.onyx-aggregation-test
* also add a segment with a unique ID
  - random age and event ID
* add window-job
* build-window-state-job
  - window on annotate job which is global
  - uses conj aggregation
  - task has a uniqueness key
  - shouldn't see a segment twice, but should see all segments
  - when a peer is partitioned away, the segment should be seen on a new peer
  - trigger - every one element - only written at :job-completed event
    - add entry to bookkeeper ledger

* window-state-job-invariants
  - at playback, look for bookkeeper entries, finds trigger ledger
  - does it have all my results? only once?
  - then all the invariants from the simple test
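
The "all my results, only once" question amounts to comparing the segment IDs that were added against the IDs found in the trigger ledger. A minimal Python sketch, assuming the final window state can be flattened into a list of IDs; `check_window_state` and its result keys are hypothetical names.

```python
from collections import Counter


def check_window_state(added_ids, trigger_entries):
    """Compare the IDs submitted with the IDs found in the trigger ledger."""
    counts = Counter(trigger_entries)
    expected = set(added_ids)
    missing = sorted(expected - set(counts))                    # never seen
    duplicated = sorted(i for i, c in counts.items() if c > 1)  # seen twice or more
    unexpected = sorted(set(counts) - expected)                 # never submitted
    return {
        "valid?": not (missing or duplicated or unexpected),
        "missing": missing,
        "duplicated": duplicated,
        "unexpected": unexpected,
    }
```

Reporting missing and duplicated IDs separately distinguishes lost segments from a uniqueness-key failure.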

Future work:
* granular trigger work
* improve the testing of the windows
* does it do the trigger - only once?

Outputs
* store directory - the directory per test, timestamped
* copy any logs as well as stdout
* then (by hand) add issues to onyx-issues-log.txt along with the timestamp and directories for context
  - jepsen history: results.edn - all kinds of info in there
  - onyx.log - use timestamps to correlate
  - flight recorder files

Testing workflow
* ./scripts/start-containers.sh
  - starts docker-in-docker
* then run the test script (in README)
* wait for it to finish
* it reports success/failure
* then /store directory is shared with localhost
  - so look at those files to see what happened
* future: maybe make the test running a single step, which would help with CI