locking a machine

first, use:

teuthology-lock --brief

to see if you already have a machine locked. if not, use:

teuthology-lock --lock-many 1 --machine-type smithi

to lock a machine.

  • in theory "smithi" could be replaced with other machine types (e.g. "mira", "gibba")
  • not specifying the OS will increase the chance of locking a machine. you can later on reimage the machine with whatever OS you need (an example of locking with a specific OS up front is shown after this list). e.g. to get a CentOS 9 OS on an already locked machine:
teuthology-reimage -v --os-type centos --os-version 9.stream <hostname>
  • reimaging does not complete properly, and you would see a message that looks like this (for RGW test suites):
paramiko.ssh_exception.BadHostKeyException: Host key for server 'smithi191.front.sepia.ceph.com' does not match: got 'AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIiEnmRAHiRzxJ8A8VHbp6Sfj/cZlObX5agO2bSneMsIjVB9gBU+F8yqw+ZMthTf+dL2AuUJ1zqRBifpjSXRuzY=', expected 'AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEOfQy5E9ekjRHzGsi3vO8EdY9oIhotS67hhc/7DEbu5Y44D3wVb9UzeT+mOyxULkTif20vMskwMezi+mNhFgR4='

you can stop the process at that point (e.g. Ctrl-C), as the reimaging itself was already done. copy the first hash (the one inside the single quotes after the word "got") into orig.config.yaml under the "targets" section. for the above case it would look like this:

targets:
  smithi191.front.sepia.ceph.com: ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIiEnmRAHiRzxJ8A8VHbp6Sfj/cZlObX5agO2bSneMsIjVB9gBU+F8yqw+ZMthTf+dL2AuUJ1zqRBifpjSXRuzY=
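
alternatively, you can fetch the current host key with ssh-keyscan (standard OpenSSH tooling) instead of copying it from the error message; its output ("<hostname> ecdsa-sha2-nistp256 <key>") matches the format of the "targets" entry:

ssh-keyscan -t ecdsa smithi191.front.sepia.ceph.com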
  • when no reimaging is needed, running the test for the first time would give a similar error. you should stop the test, copy the hash from the error message into orig.config.yaml, and re-run the test
  • once you don't need the machine, please unlock:
teuthology-lock --unlock <hostname>
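
if you do need the machine to come up with a specific OS in the first place, teuthology-lock should also accept the OS options directly (assuming a reasonably recent teuthology), e.g.:

teuthology-lock --lock-many 1 --machine-type smithi --os-type centos --os-version 9.stream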

preparations for the test

  • for ansible to work, you must be able to non-interactively ssh to the machine. to do that, ssh to the machine:
ssh <hostname>

and select "yes". if you already sshed to that machine in the past, you will have to delete the old lines from ~/.ssh/known_hosts referencing this machine. make sure thta all relevant lines are deleted by calling ssh <hostname> and making sure that there is no interactive step.

  • make sure there exists a directory called "archive_dir"
  • the file that controls the execution of the test is orig.config.yaml. this is an example:
targets:
  <hostname>: ecdsa-sha2-nistp256 <hash from reimaging>
archive_path: <full path to archive dir>
verbose: true
interactive-on-error: true
## wait_for_scrub being false makes locked runs go a lot faster
wait_for_scrub: false
owner: scheduled_<user>@teuthology
kernel:
  kdb: true
  sha1: distro
overrides:
  admin_socket:
    branch: <branch name>
  ceph:
    conf:
      client:
        debug rgw: 20
        rgw crypt require ssl: false
        rgw crypt s3 kms backend: testing
        rgw crypt s3 kms encryption keys: testkey-1=YmluCmJvb3N0CmJvb3N0LWJ1aWxkCmNlcGguY29uZgo=
          testkey-2=aWIKTWFrZWZpbGUKbWFuCm91dApzcmMKVGVzdGluZwo=
        rgw d3n l1 datacache persistent path: /tmp/rgw_datacache/
        rgw d3n l1 datacache size: 10737418240
        rgw d3n l1 local datacache enabled: true
        rgw enable ops log: true
        rgw lc debug interval: 10
        rgw torrent flag: true
        setgroup: ceph
        setuser: ceph
      mgr:
        debug mgr: 20
        debug ms: 1
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        bdev async discard: true
        bdev enable discard: true
        bluestore allocator: bitmap
        bluestore block size: 96636764160
        bluestore fsck on mount: true
        debug bluefs: 1/20
        debug bluestore: 1/20
        debug ms: 1
        debug osd: 20
        debug rocksdb: 4/10
        mon osd backfillfull_ratio: 0.85
        mon osd full ratio: 0.9
        mon osd nearfull ratio: 0.8
        osd failsafe full ratio: 0.95
        osd objectstore: bluestore
    flavor: default
    fs: xfs
    log-ignorelist:
    - \(MDS_ALL_DOWN\)
    - \(MDS_UP_LESS_THAN_MAX\)
    - \(PG_AVAILABILITY\)
    - \(PG_DEGRADED\)
    wait-for-scrub: false
  ceph-deploy:
    bluestore: true
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        osd default pool size: 2
      osd:
        bdev async discard: true
        bdev enable discard: true
        bluestore block size: 96636764160
        bluestore fsck on mount: true
        debug bluefs: 1/20
        debug bluestore: 1/20
        debug rocksdb: 4/10
        mon osd backfillfull_ratio: 0.85
        mon osd full ratio: 0.9
        mon osd nearfull ratio: 0.8
        osd failsafe full ratio: 0.95
        osd objectstore: bluestore
  install:
    ceph:
      flavor: default
      sha1: <sha1 of branch on ceph-ci>
  openssl_keys:
    rgw.client.0:
      ca: root
      client: client.0
      embed-key: true
    root:
      client: client.0
      cn: teuthology
      install:
      - client.0
      key-type: rsa:4096
  rgw:
    client.0:
      ssl certificate: rgw.client.0
    compression type: random
    datacache: true
    datacache_path: /tmp/rgw_datacache
    ec-data-pool: false
    frontend: beast
    storage classes: LUKEWARM, FROZEN
  s3tests:
    force-branch: ceph-master
  selinux:
    whitelist:
    - scontext=system_u:system_r:logrotate_t:s0
  thrashosds:
    bdev_inject_crash: 2
    bdev_inject_crash_probability: 0.5
  workunit:
    branch: <branch name>
    sha1: <sha1 of branch on ceph-ci>
roles:
- - mon.a
  - mon.b
  - mgr.x
  - osd.0
  - osd.1
  - client.0
repo: https://github.com/ceph/ceph-ci.git
sha1: <sha1 of branch on ceph-ci>
suite_branch: <branch name of test suite>
suite_relpath: qa
suite_repo: <repo for test suite>
tasks:
- install:
    extra_system_packages:
      deb:
      - s3cmd
      rpm:
      - s3cmd
- ceph: null
- openssl_keys: null
- rgw:
    client.0: null
- <name of test suite>:
    client.0:
      rgw_server: client.0

Notes

  • values marked inside angle brackets (<>) should be filled in
  • shaman builds take time and are not needed when only the test code is changed. this is why suite_repo usually points to your fork of the ceph repo, and not to ceph-ci
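
for the <sha1 of branch on ceph-ci> placeholders, one way to get the sha1 of a branch that was pushed to ceph-ci is to query the remote directly:

git ls-remote https://github.com/ceph/ceph-ci.git refs/heads/<branch name>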

run the test

teuthology -v --archive archive_dir orig.config.yaml

  • during the setup of the test, the step that takes the most time is the "ansible" one. to make sure that progress is being made during that step, track the ansible log file: ~/archive_dir/ansible.log (see the example at the end of this section)
  • the test will stop on error, so that the machine where it runs can be used for debugging
  • the test log would be printed to the terminal, and also to ~/archive_dir/teuthology.log
  • once debugging is done, hit Ctrl-D to do the cleanup
  • due to an issue with the cleanup process, when rerunning the test the machine has to be reimaged, otherwise the following error is likely to happen:
/dev/vg_nvme: already exists in filesystem

after reimaging, see the instructions above on "known_hosts" cleanup.
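
as mentioned above, the easiest way to follow progress is to track the log files, e.g. in a second terminal:

tail -f ~/archive_dir/ansible.log ~/archive_dir/teuthology.log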

Debugging RGW

  • install debuginfo
sudo dnf -y debuginfo-install ceph-radosgw-19.0.0
  • attach gdb (a sample session is sketched after this list):
sudo gdb /usr/bin/radosgw -p $(pgrep radosgw)
  • rgw log
tail -f /var/log/ceph/rgw.ceph.client.0.log
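
a minimal gdb session, once attached, might look like this (the breakpoint location is only a hypothetical placeholder, replace it with the file:line you are interested in):

(gdb) info threads                 # list the radosgw threads
(gdb) thread apply all bt          # backtrace of every thread
(gdb) break rgw_op.cc:100          # hypothetical location, use your own file:line
(gdb) continue                     # let radosgw run until the breakpoint is hit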


Run Notification Test

  • switch to ubuntu user:
sudo su - ubuntu
  • run the test:
BNTESTS_CONF=/home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/bn-tests.client.0.conf /home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/virtualenv/bin/python -m nose -s /home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/test_bn.py -v -a '!kafka_test,!amqp_test,!amqp_ssl_test,!kafka_security_test,!modification_required,!manual_test'
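
to run a single test instead of the whole suite, nose also accepts a file:function selector; for example (the test function name is a placeholder to fill in):

BNTESTS_CONF=/home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/bn-tests.client.0.conf /home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/virtualenv/bin/python -m nose -s /home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/test_bn.py:<name of test function> -v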
