@bridgethillyer
Created March 29, 2016 19:16
Jepsen
* client w/# of threads running on test machine
* nemesis - orchestration of the faults on the nodes
* operations that....
onyx-peers.jobs.basic-test
* jobs
* no nemesis
* close
* read ledger back
* invoke client manually on each client
* run checker on the events
- which properties have held over the test
* write to bookkeeper ledgers
* close the ledgers
* wait for completion of jobs
* read peer log back
* pass full history
Checker
* takes test-setup, peer-config, #peers, #jobs
* look at history of all events run on client
* is that history valid?
* plays back log
* count # peers on replica
* check pulse of peers in ZooKeeper
* check invariants to make sure they are all true
* job-invariants
- check whether there was an exception for the job
- looks up all ledgers that were read back
- checks results that they are from the correct job
- so segment did not go through the wrong job
- values that were written were the ones that were read
- the Checker knows which invariants should be applied per job
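
A minimal sketch of the job-invariants idea above, in Python for illustration only (the real checker is part of the Clojure test suite and is not reproduced here). The function name, the dict shapes and the invariant names are assumptions made for this example. It compares set membership only, since these notes do not say whether duplicates are allowed in the simple test.

```python
def check_job_invariants(written, read_back, exception=None):
    """Evaluate per-job invariants over hypothetical test data.

    written / read_back: dict mapping job id -> list of values
    (hypothetical shape). Returns a dict of invariant name -> bool.
    """
    return {
        # the job did not throw
        "no-exception": exception is None,
        # nothing read back for a job came from a different job
        "correct-job": all(
            set(values) <= set(written.get(job, []))
            for job, values in read_back.items()
        ),
        # everything written for a job was also read back
        "all-values-read": all(
            set(values) <= set(read_back.get(job, []))
            for job, values in written.items()
        ),
    }


def valid(results):
    """The history is valid only if every invariant held."""
    return all(results.values())
```

Returning one flag per invariant, rather than a single boolean, mirrors the note that the checker reports which properties held over the test.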
onyx-jepsen.onyx-basic-test
* :random-halves nemesis (covered in blog)
* :awake-ms - ms peers will be connected before partitions occur
* generator
- generator to set frequency at which jobs are submitted
- staggered at 1/10 s - so bursts, then nothing
- start-stop-nemesis-seq - partitions halves, stops, starts, etc
* jepsen-test
- os, client, generator, nemesis, etc
- checker is always the same (adds different invariants based on the job)
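
To make the start/stop nemesis schedule concrete, here is a rough sketch in Python (illustration only, not the Jepsen generator API). `awake_ms` corresponds to `:awake-ms` above; `partition_ms`, `healed_ms` and `cycles` are hypothetical parameters added for the example.

```python
def start_stop_nemesis_seq(awake_ms, partition_ms, healed_ms, cycles):
    """Return a list of (time_ms, op) pairs for the nemesis.

    The cluster stays fully connected for awake_ms, then the nemesis
    alternates: "start" (partition the halves) and "stop" (heal).
    """
    events = []
    t = awake_ms
    for _ in range(cycles):
        events.append((t, "start"))  # partition begins
        t += partition_ms
        events.append((t, "stop"))   # network healed
        t += healed_ms
    return events
```

For example, `start_stop_nemesis_seq(1000, 200, 300, 2)` gives a start at 1000 ms, a stop at 1200 ms, then a second start/stop pair at 1500 ms and 1700 ms.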
onyx-jepsen.onyx-client
* sets up client with all these parameters (ledger-handle, etc)
* 5 clients, see different events, but have a full history of what happened to the cluster
* client gets client events
* nemesis gets nemesis events
* end
- heal network
- read ledger
- read jobs?
- then Checker
onyx-jepsen.simple-job
* build-job
- each client has ledger id
- one thread writing to one ledger
- when a client sees a submit-job event, it builds the job based on the client ledger
- generate a job based on those ledgers
- add-read-ledgers
- write ledger per client
- one task per ledger - with ledger id (eg. read-ledger-5)
- creates links in workflow
* simple-job will end up looking like the visualization in the blog post
* 5 clients that can write to bookkeeper ledger, can create jobs
- then read linear job history
- checker checks
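
A sketch of how build-job might assemble one reader task per ledger, written in Python for illustration. The `read-ledger-N` naming comes from these notes and the workflow is a list of `[from, to]` edges as in Onyx; the dict layout and the downstream task name `persist` are hypothetical stand-ins, since the actual task graph is only shown in the blog post.

```python
def add_read_ledgers(job, ledger_ids, downstream):
    """Add one input task per ledger and link it into the workflow."""
    for ledger_id in ledger_ids:
        name = f"read-ledger-{ledger_id}"
        # catalog entry for the reader (hypothetical keys)
        job["catalog"].append({"name": name, "ledger-id": ledger_id})
        # workflow edge: reader -> downstream task
        job["workflow"].append([name, downstream])
    return job


def build_job(ledger_ids, downstream="persist"):
    """Build a job description from the client ledger ids."""
    job = {"workflow": [], "catalog": [{"name": downstream}]}
    return add_read_ledgers(job, ledger_ids, downstream)
```

With five clients, `build_job([1, 2, 3, 4, 5])` would yield five reader tasks that all feed the same downstream task.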
Orchestration
* docker images customized for Onyx with ZooKeeper installed
* 5 nodes
* adding to container saves time in test setup
* upload uberjarred peers
* run-peer script
- starts jar
- launch-prod-peers entrypoint
- with aeron settings to make sure aeron doesn't die during startup
onyx-jepsen.onyx-aggregation-test
* also add a segment with a unique ID
- random age and event ID
* add window-job
* build-window-state-job
- window on annotate job which is global
- uses conj aggregation
- task has a uniqueness key
- shouldn't see a segment twice, but should see all segments
- peer partitioned by - segment should be seen on new peer
- trigger - every one element - only written at :job-completed event
- add entry to bookkeeper ledger
* window-state-job-invariants
- at playback, look for bookkeeper entries, finds trigger ledger
- does it have all my results? only once?
- then all the invariants from the simple test
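
The "all my results, only once" check can be sketched as follows, in Python for illustration only. It assumes the final window state read from the trigger ledger is a list of segments that each carry a unique `id`; that shape and the function name are assumptions, not the actual invariant code.

```python
from collections import Counter


def check_window_state(expected_ids, final_state):
    """Check that every expected segment appears exactly once.

    expected_ids: ids of the segments the clients wrote.
    final_state: segments in the window state at job completion
                 (hypothetical shape: list of dicts with an "id").
    """
    counts = Counter(segment["id"] for segment in final_state)
    missing = sorted(set(expected_ids) - set(counts))
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    unexpected = sorted(set(counts) - set(expected_ids))
    return {
        "missing": missing,        # written but never aggregated
        "duplicated": duplicated,  # seen more than once despite the uniqueness key
        "unexpected": unexpected,  # aggregated but never written
        "valid": not (missing or duplicated or unexpected),
    }
```

Reporting the offending ids, not just a pass/fail flag, makes it easier to correlate a failure with the logs described under Outputs.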
Future work:
* granular trigger work
* improve the testing of the windows
* does it do the trigger - only once?
Outputs
* store directory - the directory per test, timestamped
* copy any logs as well as stdout
* then (by hand) add issues to onyx-issues-log.txt along with timestamp, directories for context
- jepsen history: results.edn - all kinds of info in there
- onyx.log - use timestamps to correlate
- flight recorder files
Testing workflow
* ./scripts/start-containers.sh
- starts docker-in-docker
* then run the test script (in README)
* wait for it to finish
* it reports success/failure
* then /store directory is shared with localhost
- so look at those files to see what happened
* future: maybe make the test running a single step, which would help with CI