The way to retrieve data from a socket in Seastar leads through the argument-less data_source_impl::get() method. It returns an instance of temporary_buffer<...> which, for the sake of this document, can be seen as just a pointer-to-data plus data-length. This means the decision about both data location and chunk size belongs to the network stack; a client of the interface has no control over these factors.
Moreover, if copying memory must be avoided, the client cannot easily regain this responsibility, as the decision about e.g. the payload size of a network frame naturally belongs to the network peer and the network layer in general. All an application can do to handle such fragmented payloads is use a scatter-gather list.
class data_source_impl {
public:
    virtual ~data_source_impl() {}
    virtual future<temporary_buffer<char>> get() = 0;
    virtual future<temporary_buffer<char>> skip(uint64_t n);
    virtual future<> close() { return make_ready_future<>(); }
};
This design suits DPDK. However, the POSIX stack in many cases is obligated to actually copy the memory to preserve isolation between kernel- and user-space. That is, as there is no easy way to guarantee that chunks coming from the NIC are a multiple of PAGE_SIZE in size and aligned to page boundaries, remapping memory cannot be performed in all cases. memcpy can be costly, but the way a single copy operation is performed may help avoid further copies. For instance, composing multiple payload chunks into a logically contiguous and properly aligned area can allow the kernel to just remap memory when e.g. passing the data to a block device through io_submit(2). This might be the reason behind the page-alignment requirements for data segments of Ceph's on-wire protocol v2.
The POSIX network stack of Seastar had to adapt the operating-system-offered ::read(int fd, void *buf, size_t count) to the data_source::get()-driven interface. posix_data_source_impl always reads into a fixed-size prefetch buffer (posix_data_source_impl::_buf_size is set to 8192 by default). This minimizes the number of read(2) syscalls for tiny chunks. When a chunk is larger than the buffer size, the request is served with multiple calls to get(), which translates into many read(2) syscalls.
Moreover, when an application uses input_stream<CharType>::read_exactly(size_t n) (the method operates on top of data_source::get()), those multiple chunks must be copied to compose the flat output buffer. This is mandatory because by calling read_exactly() the application declares it wants a single, contiguous output buffer. The problem becomes especially visible while filling crimson-osd with data: profiling of 32k writes shows that around 60% of all cycles were burnt just serving page faults induced by the memory copy operation.
Also, the output buffer is currently not aligned at a page boundary.
Extend the network interface with the concept of a voluntary input buffer factory. Its voluntariness would manifest in that:
- an implementation of the network stack could use the factory but would not be obliged to (the native stack would ignore the factory entirely),
- a client could provide a factory implementation but would not be forced to.
The factory is supposed to handle the following operations on a per-input_stream basis:
- estimation of the size of the next chunk to read,
- memory allocation for that chunk.
In the case of the POSIX stack this can be seen as delegating control over posix_data_source_impl::_buf_size to the application. The application could provide the size based on its internal knowledge and policy. If the estimation is correct, data_source_impl::get() would deliver the payload as a single, contiguous chunk aligned at a page boundary, without any extra user-space-to-user-space copying.
- Consume the network payload chunks with future<> input_stream::consume(Consumer& c) to form a scatter-gather list (possibly just ceph::bufferlist) that would be passed over the entire IO path – from the network layer to storage. This part targets the DPDK/SPDK use case.
- Consider preserving the 4 KB alignment hint for the place where the first memory copy/translation (e.g. encryption or compression) takes place to form a contiguous, aligned buffer.
- Implement the input buffer factory in a way that allows issuing one read(2) syscall per protocol v2 message on average. It might be possible to read 1) the segments and epilogue of the current v2 frame and 2) the prologue of the next frame (if it is available) with the same syscall.
Could you point out which line of seastar::input_stream<char>::read_exactly() leads to the 58.05% page-fault overhead? Previously I thought it was https://github.com/ceph/seastar/blob/8188d3b8b7091db9227824791d022d64c98193e9/core/iostream-impl.hh#L190, but that seems not to be true, because posix_data_source_impl::get() also calls the same temporary_buffer<char>(_buf_size) and there is no page-fault overhead there according to your profiling.
I don't quite agree with this part.
For smaller block sizes like 4K or 256B, the prefetch version (minimum system calls, but requiring a user-space-to-user-space copy) is still much faster. My evaluation (https://docs.google.com/spreadsheets/d/1WygP0QgxASzdIupQlEBX8q5m1LWxiv6IcyggLELTT1g/edit#gid=0) implies that the minimum-user-space-to-user-space-copy version could only be faster when the block size is larger than ~64K.
I think this implies that a system call is much more expensive than a memory copy. We still need an implementation that prefetches with an optimal number of system calls. The extra user-space-to-user-space copy can be avoided when the buffer to read is larger than the prefetched buffer.