CephFS Kernel Client Cannot Read & Write at the Same Time

This gist adds images for context to a question posted to the ceph-users mailing list.

We recently discovered that our CephFS mount appeared to be halting reads whenever writes were being synced to the Ceph cluster, to the point that it affected applications. First, some details about the host:

$ uname -r
4.16.13-041613-generic

$ egrep 'xfs|ceph' /proc/mounts
192.168.1.115:6789,192.168.1.116:6789,192.168.1.117:6789:/ /cephfs ceph rw,noatime,name=cephfs,secret=<hidden>,rbytes,acl,wsize=16777216 0 0
/dev/mapper/tst01-lvidmt01 /rbd_xfs xfs rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1024,noquota 0 0
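(For reference, a mount invocation roughly like the following produces the CephFS entry above; the secretfile path shown here is a placeholder, since the actual secret is hidden in /proc/mounts:)

$ sudo mount -t ceph 192.168.1.115:6789,192.168.1.116:6789,192.168.1.117:6789:/ /cephfs \
      -o name=cephfs,secretfile=/etc/ceph/cephfs.secret,noatime,rbytes,acl,wsize=16777216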

$ ceph -v
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: net6
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

Slave Interface: net8
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: e4:1d:2d:17:71:e1
Slave queue ID: 0

Slave Interface: net6
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: e4:1d:2d:17:71:e0
Slave queue ID: 0

We had CephFS mounted alongside an XFS filesystem built from 16 RBD images aggregated under LVM; these were our two storage targets. The host's link to the Ceph cluster is a mode 6 (balance-alb) 2x10GbE bond (bond1 above).
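Assembling such a volume looks roughly like the following. The pool name, image names, and sizes are placeholders (the VG/LV names match the /dev/mapper path in the mount output above), and our actual stripe parameters for lvcreate/mkfs.xfs are omitted, so treat this as a sketch rather than our exact commands:

$ for i in $(seq -w 1 16); do
>     rbd create rbd/tstimg$i --size 1T
>     sudo rbd map rbd/tstimg$i
> done
$ sudo pvcreate /dev/rbd{0..15}
$ sudo vgcreate tst01 /dev/rbd{0..15}
$ sudo lvcreate -n lvidmt01 -l 100%FREE tst01
$ sudo mkfs.xfs /dev/tst01/lvidmt01
$ sudo mount /dev/mapper/tst01-lvidmt01 /rbd_xfs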

We started capturing network counters for the Ceph cluster connection (bond1) on the host using ifstat at its most granular setting, 0.1 (a sample every tenth of a second). We then ran various overlapping read and write operations in separate shells on the same host to sample how each of our means of accessing Ceph handled the contention. We converted the ifstat output to CSV and inserted it into a spreadsheet to visualize the network activity.
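Concretely, the capture and the overlapping I/O looked something like this; the file names and dd parameters are illustrative rather than our exact workload:

# shell 1: sample bond1 every tenth of a second, with timestamps
$ ifstat -i bond1 -t 0.1 > bond1.log

# shell 2: a long sequential read from CephFS
$ dd if=/cephfs/bigfile of=/dev/null bs=4M

# shell 3: an overlapping sequential write to CephFS
$ dd if=/dev/zero of=/cephfs/writetest bs=4M count=2048

# afterwards: turn the whitespace-separated samples into CSV
# (the first two lines of ifstat output are headers)
$ awk 'NR > 2 {print $1 "," $2 "," $3}' bond1.log > bond1.csv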

We found that the CephFS kernel mount did indeed appear to pause ongoing reads when writes were being flushed from the page cache to the Ceph cluster.

We wanted to see if we could make this more pronounced, so we added a tc filter to the interface and re-ran our tests. This yielded much lengthier delay periods in the reads while the writes were more slowly flushed from the page cache to the Ceph cluster.
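For illustration, a token-bucket qdisc of the following shape is one common way to throttle egress on the bond; the rate and buffer values here are placeholders, not our exact settings:

$ sudo tc qdisc add dev bond1 root tbf rate 2gbit burst 256kb latency 50ms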

A more restrictive tc filter produced much lengthier delays of our reads.
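Tightening the throttle is just a change of rate (again, placeholder values):

$ sudo tc qdisc change dev bond1 root tbf rate 500mbit burst 256kb latency 50ms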

When we tested the same I/O on the RBD-backed XFS file system on the same host, we found a very different pattern. The reads seemed to be given priority over the write activity, but the writes were only slowed, not halted.

Finally, we tested overlapping SMB client reads and writes against a Samba share backed by the userspace libcephfs-based vfs_ceph module. In this case, while raw throughput was lower than that of the kernel client, the reads and writes did not interrupt each other at all.
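A vfs_ceph share definition of roughly this shape is what we mean; the share name, user_id, and paths below are placeholders, not our production config:

[cephfs]
    path = /
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    # commonly recommended with vfs_ceph, since I/O bypasses the kernel client
    kernel share modes = no
    read only = no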


Is this expected behavior for the CephFS kernel driver? Can a CephFS kernel client really not read from and write to the file system simultaneously?
