Efficient reading from network socket in Seastar

Current state

The way to retrieve data from a socket in Seastar leads through the argument-less data_source_impl::get() method. It returns an instance of temporary_buffer<...> which, for the sake of this document, can be seen simply as a pointer-to-data plus a data length. This means the decision about both data location and chunk size belongs to the network stack; the client of the interface has no control over these factors.

Moreover, if copying memory must be avoided, the client cannot easily regain this responsibility, as the decision about e.g. the payload size of a network frame naturally belongs to the network peer and the network layer in general. All the application can do to handle those fragmented payloads is use a scatter-gather list.

class data_source_impl {
public:
    virtual ~data_source_impl() {}
    // The stack decides both where the chunk lives and how big it is.
    virtual future<temporary_buffer<char>> get() = 0;
    virtual future<temporary_buffer<char>> skip(uint64_t n);
    virtual future<> close() { return make_ready_future<>(); }
};

This design suits DPDK. However, the POSIX stack in many cases is obligated to actually copy the memory to preserve isolation between kernel- and user-space. That is, as there is no easy way to guarantee that chunks coming from the NIC have sizes that are multiples of PAGE_SIZE and are aligned to page boundaries, remapping memory cannot be performed in all cases.

The need for alignment

memcpy can be costly, but the way a single copy operation is performed may help avoid further copies. For instance, composing multiple payload chunks into a logically contiguous and properly aligned area can allow the kernel to simply remap memory when e.g. passing the data to a block device through io_submit(2). This might be the reason behind the page-alignment requirements for data segments of Ceph's on-wire protocol v2.
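
A minimal sketch of this idea (not actual Seastar code, although temporary_buffer<>::aligned() is a real Seastar factory function): composing the received chunks into a single page-aligned buffer means paying for one copy now so that the result can later be handed to the block layer without another one.

#include <seastar/core/temporary_buffer.hh>
#include <algorithm>
#include <vector>

static constexpr size_t page_size = 4096;

seastar::temporary_buffer<char>
flatten_aligned(const std::vector<seastar::temporary_buffer<char>>& chunks,
                size_t total) {
    // One allocation, aligned to a page boundary.
    auto out = seastar::temporary_buffer<char>::aligned(page_size, total);
    size_t off = 0;
    for (auto& c : chunks) {
        // One copy per chunk; had the stack written the payload here in
        // the first place, this loop would disappear.
        std::copy_n(c.get(), c.size(), out.get_write() + off);
        off += c.size();
    }
    return out;
}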

The problem

The POSIX network stack of Seastar had to adapt the operating-system-offered ::read(int fd, void *buf, size_t count) to the data_source::get()-driven interface. posix_data_source_impl always reads into a fixed-size prefetch buffer (posix_data_source_impl::_buf_size is set to 8192 by default). This minimizes the number of read(2) syscalls for tiny chunks. When the chunk is larger than the buffer size, the request is served with multiple calls to get(), which translates into many read(2) syscalls.
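
The pattern looks roughly like this (an illustrative sketch of the behaviour described above, not the actual posix_data_source_impl source; pollable_fd::read_some() is real Seastar API):

#include <seastar/core/reactor.hh>
#include <seastar/core/temporary_buffer.hh>
#include <utility>

class prefetching_source {
    seastar::pollable_fd _fd;
    size_t _buf_size = 8192;        // the fixed prefetch size
public:
    explicit prefetching_source(seastar::pollable_fd fd)
        : _fd(std::move(fd)) {}

    seastar::future<seastar::temporary_buffer<char>> get() {
        // Every get() allocates a fresh prefetch buffer and issues one
        // read(2); a payload larger than _buf_size therefore arrives as
        // a series of <= 8192-byte chunks, one syscall each.
        seastar::temporary_buffer<char> buf(_buf_size);
        return _fd.read_some(buf.get_write(), _buf_size).then(
            [buf = std::move(buf)] (size_t n) mutable {
                buf.trim(n);        // shrink to what read(2) returned
                return std::move(buf);
            });
    }
};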

Moreover, when an application uses input_stream<CharType>::read_exactly(size_t n) (the method operates on top of data_source::get()), those multiple chunks must be copied to compose the flat output buffer – this is mandatory, as by calling read_exactly() the application declares it wants a single, contiguous output buffer. The problem becomes especially visible while filling crimson-osd with data. Profiling of 32k writes shows that around 60% of all cycles were burnt just to serve page faults induced by the memory copy operation.

Also, the output buffer is currently not aligned at a page boundary.
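
A hedged sketch of that copying path (not the actual iostream-impl.hh code linked in the comments below) makes both problems visible: every chunk is memcpy()ed into one flat buffer, and nothing about the plain temporary_buffer<char>(n) allocation guarantees page alignment:

#include <seastar/core/do_with.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/loop.hh>
#include <seastar/core/temporary_buffer.hh>
#include <cstring>

seastar::future<seastar::temporary_buffer<char>>
read_exactly_sketch(seastar::input_stream<char>& in, size_t n) {
    return seastar::do_with(seastar::temporary_buffer<char>(n), size_t{0},
        [&in, n] (seastar::temporary_buffer<char>& out, size_t& done) {
            return seastar::repeat([&in, &out, &done, n] {
                return in.read_up_to(n - done).then(
                    [&out, &done, n] (seastar::temporary_buffer<char> chunk) {
                        // The extra user-space-to-user-space copy.
                        std::memcpy(out.get_write() + done,
                                    chunk.get(), chunk.size());
                        done += chunk.size();
                        // Stop when complete (or early, on EOF).
                        return done == n || chunk.empty()
                            ? seastar::stop_iteration::yes
                            : seastar::stop_iteration::no;
                    });
            }).then([&out] {
                return std::move(out);
            });
        });
}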

Proposed solution

At the Seastar side

Extend the network interface with the concept of a voluntary input buffer factory. Its voluntariness would manifest in two ways:

  • the implementation of a network stack could use the factory but would not be obliged to do so (the native stack would ignore the factory entirely),
  • the client could provide the factory implementation but would not be forced to do so.

The factory is supposed to handle the following operations on a per-input_stream basis:

  • estimation of the size of the next chunk to read,
  • memory allocation for this chunk.

In the case of the POSIX stack, this can be seen as delegating control over posix_data_source_impl::_buf_size to the application. The application could provide the size based on its internal knowledge and policy. If the estimation is correct, data_source_impl::get() would provide the payload as a single, contiguous chunk aligned at a page boundary, without the need for any extra user-space-to-user-space copying.
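
One possible shape for such a factory (all names here are hypothetical – this is a sketch of the proposal, not an existing Seastar interface):

#include <seastar/core/temporary_buffer.hh>
#include <cstddef>

class input_buffer_factory {
public:
    virtual ~input_buffer_factory() = default;
    // How many bytes the application expects next, e.g. the remaining
    // length of the current protocol v2 frame.
    virtual size_t next_read_size_hint() = 0;
    // Where those bytes should land; an implementation aiming at
    // zero-copy io_submit() would hand out a page-aligned buffer here.
    virtual seastar::temporary_buffer<char> allocate(size_t size) = 0;
};

A POSIX data source honouring the factory could then replace its fixed-size prefetch buffer with something along the lines of:

auto size = _factory ? _factory->next_read_size_hint() : _buf_size;
auto buf  = _factory ? _factory->allocate(size)
                     : seastar::temporary_buffer<char>(size);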

At the crimson-osd side

  • Consume the network payload chunks with future<> input_stream::consume(Consumer& c) to form a scatter-gather list (possibly just a ceph::bufferlist) that would be passed over the entire IO path – from the network layer to storage. This part targets the DPDK/SPDK use case; see the Consumer sketch after this list.
  • Consider preserving the 4 KB alignment hint for the place where the first memory copy/translation (e.g. encryption or compression) takes place to form a contiguous, aligned buffer.
  • Implement the input buffer factory in a way that allows issuing one read(2) syscall per protocol v2 message on average. It might be possible to read 1) the segments and epilogue of the current v2 frame and 2) the prologue of the next frame (if it is available) with the same syscall.
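
A hedged sketch of such a Consumer, using a plain std::vector as a stand-in for ceph::bufferlist; the consumption_result-based Consumer concept is real Seastar API, while the message-length tracking is illustrative:

#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>
#include <vector>

struct sg_consumer {
    std::vector<seastar::temporary_buffer<char>> chunks;  // the SG list
    size_t remaining;  // payload bytes still missing for this message

    seastar::future<seastar::consumption_result<char>>
    operator()(seastar::temporary_buffer<char> buf) {
        using result = seastar::consumption_result<char>;
        if (buf.size() >= remaining) {
            // Keep what belongs to this message; hand the rest back to
            // the stream without flattening anything.
            chunks.emplace_back(buf.share(0, remaining));
            buf.trim_front(remaining);
            remaining = 0;
            return seastar::make_ready_future<result>(
                seastar::stop_consuming<char>(std::move(buf)));
        }
        remaining -= buf.size();
        chunks.emplace_back(std::move(buf));
        return seastar::make_ready_future<result>(
            seastar::continue_consuming{});
    }
};
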
@cyx1231st

Profiling of 32k writes shows that around 60% of all cycles were burnt just to serve page faults induced by the memory copy operation.

Could you point out which line of seastar::input_stream<char>::read_exactly() leads to the 58.05% page-fault overhead?

Previously I thought it was https://github.com/ceph/seastar/blob/8188d3b8b7091db9227824791d022d64c98193e9/core/iostream-impl.hh#L190. But that seems not to be true, because posix_data_source_impl::get() also calls the same temporary_buffer<char>(_buf_size), and there's no page-fault overhead there according to your profiling.

Implement the input buffer factory in a way that allows to issue one read(2) syscall per protocol v2 message on average.

I don't quite agree with this part.

For smaller block sizes like 4K or 256B, the prefetch version (with a minimum of system calls, but requiring a user-space-to-user-space copy) is still much faster. My evaluation (https://docs.google.com/spreadsheets/d/1WygP0QgxASzdIupQlEBX8q5m1LWxiv6IcyggLELTT1g/edit#gid=0) implies that the minimum-user-space-to-user-space-copy version could only be faster when the block size is larger than ~64K.

I think this implies that a system call is much more expensive than a memory copy. We still need an implementation that issues the optimal number of system calls to do the prefetch. And the extra user-space-to-user-space copy can be avoided when the buffer to read is larger than the prefetch buffer.

@rzarzynski (Author) commented May 30, 2019

Could you point out which line of seastar::input_stream::read_exactly() leads to the 58.05% page-fault overhead?

In my understanding, that's this line: https://github.com/ceph/seastar/blob/8188d3b8b7091db9227824791d022d64c98193e9/core/iostream-impl.hh#L148. Possibly the compiler was able to inline the call to read_exactly_part from read_exactly.

I think this implies that a system call is much more expensive than a memory copy. We still need an implementation that issues the optimal number of system calls to do the prefetch. And the extra user-space-to-user-space copy can be avoided when the buffer to read is larger than the prefetch buffer.

Yeah, I think you're right here – we might need a somewhat more complex policy that preserves the prefetch feature for tiny chunks. Still, the application-provided input buffer factory looks flexible enough to accommodate that.
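
Such a policy could be as simple as the following sketch (the names and the 64K threshold, taken from the evaluation above, are illustrative):

constexpr size_t prefetch_size = 8192;      // keeps the syscall count low
constexpr size_t direct_threshold = 65536;  // ~64K, per the evaluation

size_t next_read_size_hint(size_t expected_payload) {
    if (expected_payload >= direct_threshold) {
        return expected_payload;  // one big read(2), no extra copy
    }
    return prefetch_size;         // tiny chunks: prefetch, then copy
}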

@ronen-fr

A question to both of you: the latest kernel already supports io_uring, and Seastar is going to use it (as far as I understand from Avi Kivity's words, but also from the fact that it makes perfect sense). So we can expect almost no syscalls on that side.
Am I correct in assuming that our end goal should be something like this:

a request arrives from the network and is scattered into aligned buffers:
-- if from a regular socket: an io_uring request that places the header in one small buffer and "all the rest" (yes – I know that it may include more than one segment, and even more calls) in an aligned buffer (see below marked (*))
-- if from DPDK: pre-arranged in the RDMA buffers to have multiple targets per segment
Our logic then creates a set of Seastar 'writes' that will use the (still scattered) buffers directly. No system calls in the application code and – if Seastar is using io_uring efficiently – almost no syscalls there, too?
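
A hedged sketch of that split read, using plain readv(2) as a stand-in for the equivalent io_uring submission (IORING_OP_READV); the sizes here are illustrative:

#include <sys/uio.h>
#include <stdlib.h>

enum { HDR_SIZE = 32, PAGE = 4096, PAYLOAD_MAX = 16 * PAGE };

ssize_t read_frame(int fd, char* hdr, void** payload_out) {
    void* payload = nullptr;
    if (posix_memalign(&payload, PAGE, PAYLOAD_MAX) != 0) {
        return -1;
    }
    struct iovec iov[2] = {
        { hdr, HDR_SIZE },        // small header buffer
        { payload, PAYLOAD_MAX }, // page-aligned payload buffer
    };
    ssize_t n = readv(fd, iov, 2);  // one syscall fills both targets
    *payload_out = payload;
    return n;
}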
