Last active
October 31, 2019 09:26
Run DAOS pool test suite with OFI PSM2 provider, DCPM modules and NVMe SSDs.
CONFIG:
name: daos_server # map to -g daos_server
port: 10001 # mgmt port
provider: ofi+psm2 # map to CRT_PHY_ADDR_STR=ofi+psm2
nr_hugepages: 4096
control_log_mask: DEBUG
control_log_file: /tmp/daos_control.log
transport_config:
  allow_insecure: true
# single server instance per config file for now
servers:
-
  targets: 11
  first_core: 1
  nr_xs_helpers: 1
  fabric_iface: ib0
  fabric_iface_port: 31416
  log_mask: ERR
  log_file: /tmp/server.log
  # Environment variable values should be supplied without encapsulating quotes.
  env_vars: # influence DAOS IO Server behaviour by setting env variables
  - CRT_TIMEOUT=2000
  - CRT_CREDIT_EP_CTX=0
  - PSM2_MULTI_EP=1
  - CRT_CTX_SHARE_ADDR=1
  - CRT_CTX_NUM=8
  # Storage definitions
  scm_mount: /mnt/daos # map to -s /mnt/daos
  scm_class: dcpm
  scm_list: [/dev/pmem0]
  bdev_class: nvme
  bdev_list: ["0000:5e:00.0", "0000:5f:00.0"] # generate regular nvme.conf
SERVER:
ssh intel-2 "rm -rf /mnt/daos/*"
rm -rf /mnt/daos/*
With the server running on both hosts (-np 2) and the client on the second host, daos_test runs successfully with the SCM pmem device file and NVMe SSDs specified in the server config file.
When SSDs were specified in the config file, the tests initially failed with -1003; the fix was to double the pool SCM/NVMe sizes used during the test. See the troubleshooting notes below for details.
[root@intel-1 daos]# orterun --map-by node --mca btl tcp,self --mca oob tcp -np 2 -H intel-1,intel-2 --allow-run-as-root --report-uri /shared/urifile daos_server -o /shared/daos_server_psm2_intel-2.yml
# orterun --map-by node --mca btl tcp,self --mca oob tcp -np 1 -H intel-1 --allow-run-as-root daos_server -a /shared/uri -o /shared/daos_server_psm2_intel-2.yml
# orterun -np 1 -H intel-1 --allow-run-as-root daos_server -a /shared/uri -o /shared/daos_server_psm2_intel-2.yml
The following lines should appear on standard output after a successful format:
daos_io_server:0 DAOS I/O server (v0.6.0) process 121715 started on rank 1 (out of 2) with 11 target, 1 helper XS per target, firstcore 1, host intel-2.
daos_io_server:0 DAOS I/O server (v0.6.0) process 62470 started on rank 0 (out of 2) with 11 target, 1 helper XS per target, firstcore 1, host intel-1.
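The check for the two startup lines above can be scripted. The following is a minimal sketch, assuming the server's standard output has been captured to a file; the function names and the capture file are hypothetical and not part of DAOS.

```shell
# Count "started on rank" lines in captured daos_server output.
# $1: path to a file holding the captured standard output (hypothetical).
count_started_ranks() {
    grep -c 'daos_io_server:[0-9]* DAOS I/O server .* started on rank [0-9]*' "$1"
}

# Succeed only when the number of started ranks equals the expected count.
# $1: captured output file, $2: expected rank count (2 for the -np 2 run).
all_ranks_started() {
    local started
    started=$(count_started_ranks "$1")
    [ "$started" -eq "$2" ]
}
```

For the two-host run this would be called as all_ranks_started /path/to/captured_output 2, where the path is whatever file the output was redirected to.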
CLIENT:
<start server as above>
[root@intel-2 daos]# daos_shell -l intel-1:10001,intel-2:10001 storage format -f -i
[root@intel-2 daos]# daos_agent -i &
[root@intel-ap2 daos]# orterun -np 1 --ompi-server file:/shared/urifile --allow-run-as-root --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 -x POOL_SCM_SIZE=8G -x POOL_NVME_SIZE=16G daos_test -p
# orterun -np 1 --ompi-server file:/shared/urifile --allow-run-as-root --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 daos_test -p
# orterun -np 1 --ompi-server file:/shared/urifile --allow-run-as-root --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 -x SCM_POOL_SIZE=1G -x NVME_POOL_SIZE=0G daos_test -p
# orterun -np 1 -x OFI_INTERFACE=ib0 -x CRT_ATTACH_INFO_PATH=/shared/uri -x DAOS_SINGLETON_CLI=1 -x CRT_PHY_ADDR_STR=ofi+psm2 --allow-run-as-root daos_test -p
# orterun -np 1 --ompi-server file:/shared/uri --allow-run-as-root -x OFI_INTERFACE=ib0 -x CRT_PHY_ADDR_STR=ofi+psm2 daos_test -p
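The first client command passes POOL_SCM_SIZE=8G and POOL_NVME_SIZE=16G, which are double the 4 GB / 8 GB sizes reported by the failing run in the troubleshooting notes. As a rough sketch of that arithmetic, the helper below doubles a size string; double_size is a hypothetical name used only for illustration, and it assumes sizes are written as an integer followed by a single unit letter.

```shell
# Double a size string such as "4G" -> "8G".
# Assumes an integer followed by one unit letter (hypothetical helper).
double_size() {
    local size="$1"
    local num="${size%[A-Za-z]}"   # numeric part, e.g. 4
    local unit="${size#"$num"}"    # unit suffix, e.g. G
    echo "$(( num * 2 ))${unit}"
}

# Sizes reported by the failing run: SCM size=4 GB, NVMe size=8 GB.
export POOL_SCM_SIZE="$(double_size 4G)"
export POOL_NVME_SIZE="$(double_size 8G)"
echo "POOL_SCM_SIZE=${POOL_SCM_SIZE} POOL_NVME_SIZE=${POOL_NVME_SIZE}"
# prints: POOL_SCM_SIZE=8G POOL_NVME_SIZE=16G
```

The resulting values match the -x POOL_SCM_SIZE=8G -x POOL_NVME_SIZE=16G arguments in the working command.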
Troubleshooting ERROR when NVMe devices specified (resolved):
=================
DAOS pool tests..
=====================
[==========] Running 10 test(s).
setup: creating pool, SCM size=4 GB, NVMe size=8 GB
daos_pool_create failed, rc: -1003
state not set, likely due to group-setup issue
[==========] 0 test(s) run.
[ FAILED ] GROUP SETUP
[ ERROR ] Pool tests
[ PASSED ] 0 test(s).
============ Summary src/tests/suite/daos_test.c
ERROR, 1 TEST(S) FAILED
This is because SPDK cannot access hugepages, which are not being released as expected (even after `daos_server storage prepare -n --reset`):
[root@intel-2 daos]# grep Huge /proc/meminfo
AnonHugePages: 1284096 kB
HugePages_Total: 1024
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
The system needs a reboot to resolve this.
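The condition shown in the meminfo output above (hugepages configured, none free) can be detected before starting the server. This is a small sketch rather than anything DAOS provides; the function names are hypothetical, and the optional file argument exists only so the check can be run against saved output.

```shell
# Read one counter from meminfo-formatted text.
# $1: field name (e.g. HugePages_Free), $2: file (defaults to /proc/meminfo).
hugepages_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Succeed when hugepages are configured but none are free,
# which is the state shown in the grep output above.
hugepages_exhausted() {
    local total free
    total=$(hugepages_field HugePages_Total "${1:-/proc/meminfo}")
    free=$(hugepages_field HugePages_Free "${1:-/proc/meminfo}")
    [ "${total:-0}" -gt 0 ] && [ "${free:-0}" -eq 0 ]
}
```

Calling hugepages_exhausted with no argument reads /proc/meminfo on the local host; a zero exit status would indicate the stuck state described here.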
The pool sizes for SCM/NVMe also need to be doubled in the test by setting the relevant environment variables on the client command line.
After the fix is applied, the test outputs the following:
=================
DAOS pool tests..
=====================
[==========] Running 10 test(s).
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool 1e503e65-30b7-4e6a-94ed-475e09e1d5b5
[ RUN ] POOL1: connect to non-existing pool
[ OK ] POOL1: connect to non-existing pool
[ RUN ] POOL2: connect/disconnect to pool
rank 0 connecting to pool synchronously ... success
rank 0 querying pool info... success
rank 0 disconnecting from pool synchronously ... rank 0 success
[ OK ] POOL2: connect/disconnect to pool
[ RUN ] POOL3: connect/disconnect to pool (async)
rank 0 connecting to pool asynchronously ... success
rank 0 querying pool info... success
rank 0 disconnecting from pool asynchronously ... rank 0 success
[ OK ] POOL3: connect/disconnect to pool (async)
[ RUN ] POOL4: pool handle local2global and global2local
rank 0 connecting to pool synchronously ... success
rank 0 querying pool info... success
rank 0 call local2global on pool handlesuccess
rank 0 broadcast global pool handle ...success
rank 0 disconnecting from pool synchronously ... rank 0 success
[ OK ] POOL4: pool handle local2global and global2local
[ RUN ] POOL5: exclusive connection
SUBTEST 1: other connections already exist; shall get -1012
establishing a non-exclusive connection
trying to establish an exclusive connection
disconnecting the non-exclusive connection
SUBTEST 2: no other connections; shall succeed
establishing an exclusive connection
SUBTEST 3: shall prevent other connections (-1012)
trying to establish a non-exclusive connection
disconnecting the exclusive connection
[ OK ] POOL5: exclusive connection
[ RUN ] POOL6: exclude targets and query pool info
Skip it for now, because CaRT can't support subgroup membership, excluding a node w/o killing it will cause IV issue.
[ OK ] POOL6: exclude targets and query pool info
[ RUN ] POOL7: set/get/list user-defined pool attributes (sync)
setup: connecting to pool
connected to pool, ntarget=22
setting pool attributes synchronously ...
listing pool attributes synchronously ...
Verifying Total Name Length..
Verifying Small Name..
Verifying All Names..
getting pool attributes synchronously ...
Verifying Name-Value (A)..
Verifying Name-Value (B)..
Verifying with NULL buffer..
[ OK ] POOL7: set/get/list user-defined pool attributes (sync)
[ RUN ] POOL8: set/get/list user-defined pool attributes (async)
setting pool attributes asynchronously ...
listing pool attributes asynchronously ...
Verifying Total Name Length..
Verifying Small Name..
Verifying All Names..
getting pool attributes asynchronously ...
Verifying Name-Value (A)..
Verifying Name-Value (B)..
Verifying with NULL buffer..
[ OK ] POOL8: set/get/list user-defined pool attributes (async)
[ RUN ] POOL9: pool reconnect after daos re-init
connected to pool, ntarget=22
[ OK ] POOL9: pool reconnect after daos re-init
[ RUN ] POOL10: pool create with properties and query
create pool with properties, and query it to verify.
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool fbdf1810-609d-4bd7-b4a0-2c2428f28654
setup: connecting to pool
connected to pool, ntarget=22
ACL prop matches expected defaults
[ OK ] POOL10: pool create with properties and query
[==========] 10 test(s) run.