
@jonludlam
Created September 29, 2015 14:43
XenVM testing plan
The purpose of this part of the CAR is to improve the reliability and
robustness of the XenVM component of the Thin LVHD feature. This will
be achieved by three main activities: expanding the existing
dev tests, implementing some new features, and applying some formal
methods to prove that models of how parts of xenvmd work are correct.
Expanding dev tests:
There are already many dev tests being run on every single build and
pull request going into xenvm. Currently, these mainly cover the
functionality of xenvmd and xenvm, and only partly cover the
activities of the local allocator. In order to increase the coverage
we propose to do the following:
- Extend the mock device-mapper component such that it can be used
between processes.
Using the real device-mapper is limiting, in that udev becomes
involved and is a source of delays. Additionally, using the single
system device-mapper means we can't do multi-host testing: by
extending the mock to work between processes, we can simulate a
pool of as many hosts as we like using only one real host. This is a
small amount of work to change the mock to use 'read-modify-write'
with filesystem locks on each call rather than keeping state in
memory as it currently does (a sketch of this locking scheme appears
after this list).
- Functorize high-level logic over the lower-level modules.
This is a neat trick that we already use today, but it can be
extended. The idea is to make mock modules that simulate parts of
the code. As a concrete example, we can functorize over the
'shared-block-ring' code in order to use a more convenient on-disk
layout of the messages, so that each message sent over the ring
becomes a file on disk that is easily examined. This would be
particularly useful in testing invariants over the set of all
messages sent over the ring, as in the 'real' shared-block-ring the
messages get overwritten. Another example is functorizing over the
'Time' module, so that the current 5-second poll interval can be
made much shorter, much longer or randomized in order to make the
tests run more quickly or to explore differences in thread
interleaving (a sketch of this appears after this list).
- Implement new component tests:
Once the above two are completed, the new functionality will be used
to enable the following tests:
- Stress test
This would be a multi-host, many LV test that would quickly
simulate thousands of VDIs across 16 hosts, with lots of
allocations.
- Restart tests
This would test restarts of xenvmd and the local allocator. Both
are built to be 'crash-only software' (not as bad as it sounds,
honest!) https://en.m.wikipedia.org/wiki/Crash-only_software. The
idea is to put FIST points in to cause an exit at particularly
important points and to verify that operations either succeed or
fail cleanly, without leaving the system in an intermediate state
(a sketch of a FIST point appears after this list).
- Invariant post-processing
Having written out the journals and rings into easily readable
files, we can post-process them to check that the required
invariants do indeed hold. These are statements such as:
'For all FreeAllocation messages sent from xenvmd to all
local allocators, the blocks allocated must be unique unless
the messages have the same generation count'
(a checker for this invariant is sketched after this list).
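
The read-modify-write scheme for the mock device-mapper could look
something like the following OCaml sketch; the function and the
string-based state serialization are hypothetical, not the actual
mock's API:

  (* Sketch: keep the mock device-mapper state in a file and serialise
     every call with an exclusive lock, so that several processes
     (simulated hosts) share one consistent view of the state. *)
  let with_state_file path (f : string -> string) =
    let fd = Unix.openfile path [ Unix.O_RDWR; Unix.O_CREAT ] 0o644 in
    Unix.lockf fd Unix.F_LOCK 0;  (* lock the whole file; blocks until free *)
    let len = (Unix.fstat fd).Unix.st_size in
    let buf = Bytes.create len in
    let rec read_all off =
      if off < len then begin
        let n = Unix.read fd buf off (len - off) in
        if n > 0 then read_all (off + n)
      end
    in
    read_all 0;                              (* read *)
    let state' = f (Bytes.to_string buf) in  (* modify *)
    ignore (Unix.lseek fd 0 Unix.SEEK_SET);
    Unix.ftruncate fd 0;
    ignore (Unix.write_substring fd state' 0 (String.length state'));
    Unix.lockf fd Unix.F_ULOCK 0;            (* write, then unlock *)
    Unix.close fd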
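
The 'Time' functorization might look like this minimal sketch; the
module names and signatures are invented, and the real code (which is
Lwt-based) would differ in the details:

  (* Sketch: the poll loop is a functor over a clock, so production
     uses the real 5s interval while tests substitute their own. *)
  module type TIME = sig
    val sleep : float -> unit
  end

  module PollLoop (T : TIME) = struct
    (* Run [poll] forever, pausing [interval] seconds between calls. *)
    let rec run ~interval poll =
      poll ();
      T.sleep interval;
      run ~interval poll
  end

  (* Production: the real clock. *)
  module Prod = PollLoop (struct let sleep = Unix.sleepf end)

  (* Tests: no delay at all, or a randomized delay to explore
     different thread interleavings. *)
  module Fast = PollLoop (struct let sleep _ = () end)
  module Randomized = PollLoop (struct
    let sleep max = Unix.sleepf (Random.float max)
  end)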
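
For the restart tests, a FIST point amounts to little more than a
guarded exit. This sketch is modelled on xapi's marker-file
convention; the path, helper name and placement are all assumptions:

  (* Sketch of a fault-injection (FIST) point: if a tester has created
     the marker file for this point, crash on the spot. *)
  let fist_point name =
    if Sys.file_exists ("/tmp/fist_" ^ name) then begin
      Printf.eprintf "FIST: exiting at %s\n%!" name;
      exit 1
    end

  (* Example placement in an allocation path: the restart test then
     checks that the journalled operation either completed or is
     recovered cleanly on restart, never left half-done. *)
  let allocate ~journal_write ~apply =
    journal_write ();                  (* persist the intent first *)
    fist_point "after_journal_write";  (* simulated crash between steps *)
    apply ()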
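
A post-processing check for the invariant quoted above could look
like this; the message record is invented for illustration:

  (* Sketch: scan all FreeAllocation messages and fail if any block is
     handed out twice by messages with different generation counts. *)
  type free_allocation = {
    generation : int;               (* re-sent messages share this *)
    blocks : (int64 * int64) list;  (* (start, length) extents *)
  }

  let check_unique (msgs : free_allocation list) =
    let seen = Hashtbl.create 1024 in
    List.iter (fun m ->
      List.iter (fun block ->
        match Hashtbl.find_opt seen block with
        | Some g when g <> m.generation ->
          failwith "invariant violated: block allocated twice"
        | _ -> Hashtbl.replace seen block m.generation)
        m.blocks)
      msgs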
Targeted Formal Methods
We already have a Promela model of the shared-block-ring
suspend/resume protocol, and a model of a previous (broken) version of
the xenvmd -> local allocator messages. We would like to spend some
time examining some of the more critical aspects of the system to try
to find any other lurking issues. This is to be done alongside the
functorization work outlined above in order to simplify the logic in
xenvmd and the local allocator such that it is more obviously
performing the same logic as the models are testing.
As a team, we have limited exposure to these methods, so it's hard to
predict how long this will take and what the benefit will be.
However, I believe knowledge of these methods will be very beneficial
not only to Thin LVHD but to XenServer Engineering as a whole. I
suggest our approach should be to time-box the CP ticket for this
aspect of the work.
New feature implementation
- Watchdog. Xapi has a watchdog that makes sure xapi is running and
restarts it if it crashes. Since both xenvmd and the local allocator
were designed to be crash-only, a watchdog for them is almost trivial
to implement (a sketch follows this list).
- PV resize. I'm unsure whether this should go in this CAR or whether
it should just be done under the CA ticket. We need to implement PV
resize in the underlying LVM library, mirage-block-volume.
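
A watchdog for a crash-only daemon can be as simple as the following
sketch; the binary path is a placeholder, and a real version would
also want rate-limiting and proper logging:

  (* Sketch: supervise a crash-only daemon, restarting it on any exit
     other than a clean shutdown. *)
  let rec watchdog prog args =
    match Unix.fork () with
    | 0 -> Unix.execv prog (Array.append [| prog |] args)
    | pid ->
      let _, status = Unix.waitpid [] pid in
      (match status with
       | Unix.WEXITED 0 -> ()  (* clean shutdown: stop supervising *)
       | _ ->
         prerr_endline "daemon exited unexpectedly; restarting";
         watchdog prog args)

  let () = watchdog "/usr/sbin/xenvmd" [||]  (* path is a placeholder *)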
@simonjbeaumont

Looks good!

In the New feature implementation section, can we make xenvmd and the local-allocator a service or is this too much work because we currently have one per SR?
