@tanabarr
Last active October 31, 2019 09:26
Run DAOS pool test suite with OFI PSM2 provider, DCPM modules and NVMe SSDs.
CONFIG:
name: daos_server # map to -g daos_server
port: 10001 # mgmt port
provider: ofi+psm2 # map to CRT_PHY_ADDR_STR=ofi+psm2
nr_hugepages: 4096
control_log_mask: DEBUG
control_log_file: /tmp/daos_control.log
transport_config:
  allow_insecure: true
# single server instance per config file for now
servers:
-
  targets: 11
  first_core: 1
  nr_xs_helpers: 1
  fabric_iface: ib0
  fabric_iface_port: 31416
  log_mask: ERR
  log_file: /tmp/server.log
  # Environment variable values should be supplied without encapsulating quotes.
  env_vars: # influence DAOS IO Server behaviour by setting env variables
  - CRT_TIMEOUT=2000
  - CRT_CREDIT_EP_CTX=0
  - PSM2_MULTI_EP=1
  - CRT_CTX_SHARE_ADDR=1
  - CRT_CTX_NUM=8
  # Storage definitions
  scm_mount: /mnt/daos # map to -s /mnt/daos
  scm_class: dcpm
  scm_list: [/dev/pmem0]
  bdev_class: nvme
  bdev_list: ["0000:5e:00.0", "0000:5f:00.0"] # generate regular nvme.conf
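The config comment says env_vars values should be supplied without encapsulating quotes. A minimal sketch of a pre-flight check for that rule follows; parse_env_vars is a hypothetical helper written for this note, not part of DAOS, and the entries are copied from the config above.

```python
def parse_env_vars(entries):
    """Split KEY=VALUE strings into a dict, rejecting quoted values."""
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition("=")
        if not sep or not key:
            raise ValueError(f"not a KEY=VALUE entry: {entry!r}")
        # A value wrapped in matching single or double quotes breaks the rule.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            raise ValueError(f"value for {key} must not be quoted: {entry!r}")
        parsed[key] = value
    return parsed


# Entries copied from the env_vars list in the config above.
env = parse_env_vars([
    "CRT_TIMEOUT=2000",
    "CRT_CREDIT_EP_CTX=0",
    "PSM2_MULTI_EP=1",
    "CRT_CTX_SHARE_ADDR=1",
    "CRT_CTX_NUM=8",
])
print(env["CRT_CTX_NUM"])
```

An entry such as CRT_TIMEOUT="2000" would raise ValueError here, which makes the mistake visible before the server is started.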
SERVER:
ssh intel-2 "rm -rf /mnt/daos/*"
rm -rf /mnt/daos/*
Running the server on both hosts (-np 2) with the client on the second host, daos_test runs successfully with the SCM pmem device file and NVMe SSDs specified in the server config file.
When SSDs were specified in the config file, the tests initially failed with -1003; the fix was to double the pool SCM/NVMe sizes used during the test. See below for more details.
[root@intel-1 daos]# orterun --map-by node --mca btl tcp,self --mca oob tcp -np 2 -H intel-1,intel-2 --allow-run-as-root --report-uri /shared/urifile daos_server -o /shared/daos_server_psm2_intel-2.yml
# orterun --map-by node --mca btl tcp,self --mca oob tcp -np 1 -H intel-1 --allow-run-as-root daos_server -a /shared/uri -o /shared/daos_server_psm2_intel-2.yml
# orterun -np 1 -H intel-1 --allow-run-as-root daos_server -a /shared/uri -o /shared/daos_server_psm2_intel-2.yml
The following lines should appear on standard output after a successful format:
daos_io_server:0 DAOS I/O server (v0.6.0) process 121715 started on rank 1 (out of 2) with 11 target, 1 helper XS per target, firstcore 1, host intel-2.
daos_io_server:0 DAOS I/O server (v0.6.0) process 62470 started on rank 0 (out of 2) with 11 target, 1 helper XS per target, firstcore 1, host intel-1.
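To check this automatically, the "started on rank" lines can be matched against the expected rank count. The sketch below assumes the message format stays exactly as in the two lines above; the regular expression and function names are illustrative only.

```python
import re

# Pattern derived from the two sample lines above (an assumption about the format).
STARTED = re.compile(
    r"process (?P<pid>\d+) started on rank (?P<rank>\d+) "
    r"\(out of (?P<size>\d+)\).*host (?P<host>[\w.-]+)\."
)


def started_ranks(output):
    """Return ({rank: host}, group size) for every 'started on rank' line."""
    ranks = {}
    size = None
    for line in output.splitlines():
        m = STARTED.search(line)
        if m:
            ranks[int(m["rank"])] = m["host"]
            size = int(m["size"])
    return ranks, size


def all_ranks_started(output):
    """True when ranks 0..size-1 have each reported a start line."""
    ranks, size = started_ranks(output)
    return size is not None and set(ranks) == set(range(size))


SAMPLE = (
    "daos_io_server:0 DAOS I/O server (v0.6.0) process 121715 started on rank 1 "
    "(out of 2) with 11 target, 1 helper XS per target, firstcore 1, host intel-2.\n"
    "daos_io_server:0 DAOS I/O server (v0.6.0) process 62470 started on rank 0 "
    "(out of 2) with 11 target, 1 helper XS per target, firstcore 1, host intel-1.\n"
)
print(all_ranks_started(SAMPLE))
```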
CLIENT:
<start server as above>
[root@intel-2 daos]# daos_shell -l intel-1:10001,intel-2:10001 storage format -f -i
[root@intel-2 daos]# daos_agent -i &
[root@intel-ap2 daos]# orterun -np 1 --ompi-server file:/shared/urifile --allow-run-as-root --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 -x POOL_SCM_SIZE=8G -x POOL_NVME_SIZE=16G daos_test -p
# orterun -np 1 --ompi-server file:/shared/urifile --allow-run-as-root --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 daos_test -p
# orterun -np 1 --ompi-server file:/shared/urifile --allow-run-as-root --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 -x SCM_POOL_SIZE=1G -x NVME_POOL_SIZE=0G daos_test -p
# orterun -np 1 -x OFI_INTERFACE=ib0 -x CRT_ATTACH_INFO_PATH=/shared/uri -x DAOS_SINGLETON_CLI=1 -x CRT_PHY_ADDR_STR=ofi+psm2 --allow-run-as-root daos_test -p
# orterun -np 1 --ompi-server file:/shared/uri --allow-run-as-root -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 daos_test -p
Troubleshooting the ERROR seen when NVMe devices are specified (resolved):
=================
DAOS pool tests..
=====================
[==========] Running 10 test(s).
setup: creating pool, SCM size=4 GB, NVMe size=8 GB
daos_pool_create failed, rc: -1003
state not set, likely due to group-setup issue
[==========] 0 test(s) run.
[ FAILED ] GROUP SETUP
[ ERROR ] Pool tests
[ PASSED ] 0 test(s).
============ Summary src/tests/suite/daos_test.c
ERROR, 1 TEST(S) FAILED
This is because SPDK cannot access hugepages, which are not being released as expected (even after `daos_server storage prepare -n --reset`):
[root@intel-2 daos]# grep Huge /proc/meminfo
AnonHugePages: 1284096 kB
HugePages_Total: 1024
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
The system needs a reboot to resolve this.
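The exhausted state shown above (HugePages_Total non-zero, HugePages_Free at 0) can be detected before starting the server. This is a minimal sketch that parses /proc/meminfo text; the helper names are hypothetical and the sample is the output captured above.

```python
def parse_hugepages(meminfo_text):
    """Extract HugePages_* counters and Hugepagesize (kB) from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and (name.startswith("HugePages_") or name == "Hugepagesize"):
            fields[name] = int(rest.split()[0])
    return fields


def hugepages_exhausted(fields):
    """True when hugepages are reserved but none of them are free."""
    return fields.get("HugePages_Total", 0) > 0 and fields.get("HugePages_Free", 0) == 0


SAMPLE = """\
AnonHugePages: 1284096 kB
HugePages_Total: 1024
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
"""
fields = parse_hugepages(SAMPLE)
print(hugepages_exhausted(fields))
# On a live Linux host: parse_hugepages(open("/proc/meminfo").read())
```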
The SCM/NVMe pool sizes also need to be doubled in the test by setting the relevant environment variables (POOL_SCM_SIZE and POOL_NVME_SIZE) on the client command line.
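As a sketch of the doubling step, the helper below builds the orterun `-x` arguments from the sizes reported in the failing run (SCM 4 GB, NVMe 8 GB). The function name is hypothetical; the variable names and the `G` suffix are taken from the working client command above.

```python
def doubled_pool_env(scm_gb=4, nvme_gb=8):
    """Return orterun '-x NAME=VALUE' arguments with both pool sizes doubled."""
    sizes = {"POOL_SCM_SIZE": scm_gb * 2, "POOL_NVME_SIZE": nvme_gb * 2}
    args = []
    for name, gb in sizes.items():
        args += ["-x", f"{name}={gb}G"]
    return args


print(" ".join(doubled_pool_env()))
# -x POOL_SCM_SIZE=8G -x POOL_NVME_SIZE=16G
```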
After the fix is applied, the test outputs the following:
=================
DAOS pool tests..
=====================
[==========] Running 10 test(s).
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool 1e503e65-30b7-4e6a-94ed-475e09e1d5b5
[ RUN ] POOL1: connect to non-existing pool
[ OK ] POOL1: connect to non-existing pool
[ RUN ] POOL2: connect/disconnect to pool
rank 0 connecting to pool synchronously ... success
rank 0 querying pool info... success
rank 0 disconnecting from pool synchronously ... rank 0 success
[ OK ] POOL2: connect/disconnect to pool
[ RUN ] POOL3: connect/disconnect to pool (async)
rank 0 connecting to pool asynchronously ... success
rank 0 querying pool info... success
rank 0 disconnecting from pool asynchronously ... rank 0 success
[ OK ] POOL3: connect/disconnect to pool (async)
[ RUN ] POOL4: pool handle local2global and global2local
rank 0 connecting to pool synchronously ... success
rank 0 querying pool info... success
rank 0 call local2global on pool handlesuccess
rank 0 broadcast global pool handle ...success
rank 0 disconnecting from pool synchronously ... rank 0 success
[ OK ] POOL4: pool handle local2global and global2local
[ RUN ] POOL5: exclusive connection
SUBTEST 1: other connections already exist; shall get -1012
establishing a non-exclusive connection
trying to establish an exclusive connection
disconnecting the non-exclusive connection
SUBTEST 2: no other connections; shall succeed
establishing an exclusive connection
SUBTEST 3: shall prevent other connections (-1012)
trying to establish a non-exclusive connection
disconnecting the exclusive connection
[ OK ] POOL5: exclusive connection
[ RUN ] POOL6: exclude targets and query pool info
Skip it for now, because CaRT can't support subgroup membership, excluding a node w/o killing it will cause IV issue.
[ OK ] POOL6: exclude targets and query pool info
[ RUN ] POOL7: set/get/list user-defined pool attributes (sync)
setup: connecting to pool
connected to pool, ntarget=22
setting pool attributes synchronously ...
listing pool attributes synchronously ...
Verifying Total Name Length..
Verifying Small Name..
Verifying All Names..
getting pool attributes synchronously ...
Verifying Name-Value (A)..
Verifying Name-Value (B)..
Verifying with NULL buffer..
[ OK ] POOL7: set/get/list user-defined pool attributes (sync)
[ RUN ] POOL8: set/get/list user-defined pool attributes (async)
setting pool attributes asynchronously ...
listing pool attributes asynchronously ...
Verifying Total Name Length..
Verifying Small Name..
Verifying All Names..
getting pool attributes asynchronously ...
Verifying Name-Value (A)..
Verifying Name-Value (B)..
Verifying with NULL buffer..
[ OK ] POOL8: set/get/list user-defined pool attributes (async)
[ RUN ] POOL9: pool reconnect after daos re-init
connected to pool, ntarget=22
[ OK ] POOL9: pool reconnect after daos re-init
[ RUN ] POOL10: pool create with properties and query
create pool with properties, and query it to verify.
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool fbdf1810-609d-4bd7-b4a0-2c2428f28654
setup: connecting to pool
connected to pool, ntarget=22
ACL prop matches expected defaults
[ OK ] POOL10: pool create with properties and query
[==========] 10 test(s) run.
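For long runs it can help to summarise the status lines rather than read them by eye. The sketch below counts the bracketed RUN/OK/FAILED/ERROR markers as they appear in the output above; it is an illustrative parser under that assumption, not a tool shipped with DAOS, and the samples are excerpts of the two runs shown in this gist.

```python
import re

# Matches status lines such as "[ OK ] POOL1: ..." with any padding inside the brackets.
STATUS = re.compile(r"^\[\s*(RUN|OK|FAILED|ERROR)\s*\]\s*(.*)$")


def summarize(output):
    """Count status markers and collect the names attached to failures."""
    counts = {"RUN": 0, "OK": 0, "FAILED": 0, "ERROR": 0}
    failed = []
    for line in output.splitlines():
        m = STATUS.match(line.strip())
        if not m:
            continue  # progress messages and separator lines are ignored
        status, name = m.groups()
        counts[status] += 1
        if status in ("FAILED", "ERROR"):
            failed.append(name)
    return counts, failed


PASSING = """\
[ RUN ] POOL1: connect to non-existing pool
[ OK ] POOL1: connect to non-existing pool
[ RUN ] POOL2: connect/disconnect to pool
rank 0 connecting to pool synchronously ... success
[ OK ] POOL2: connect/disconnect to pool
"""
FAILING = """\
[==========] 0 test(s) run.
[ FAILED ] GROUP SETUP
[ ERROR ] Pool tests
[ PASSED ] 0 test(s).
"""
print(summarize(PASSING))
print(summarize(FAILING))
```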