Efficient reading from network socket in Seastar

Current state

The way to retrieve data from a socket in Seastar leads through the argument-less data_source_impl::get() method. It returns an instance of temporary_buffer<...> which, for the sake of this document, can be seen simply as a pointer-to-data plus a data length. This means the decision about both data location and chunk size belongs to the network stack; the client of the interface has no control over these factors.

Moreover, if copying memory must be avoided, the client cannot easily regain this responsibility, as the decision about e.g. the payload size of a network frame naturally belongs to the network peer and the network layer in general. All the application can do to handle those fragmented payloads is use a scatter-gather list.

class data_source_impl {
public:
    virtual ~data_source_impl() {}
    // The stack decides both where the chunk lives and how big it is.
    virtual future<temporary_buffer<char>> get() = 0;
    virtual future<temporary_buffer<char>> skip(uint64_t n);
    virtual future<> close() { return make_ready_future<>(); }
};

This design suits DPDK. However, the POSIX stack in many cases is obligated to actually copy the memory to preserve isolation between kernel- and user-space. That is, as there is no easy way to guarantee that chunks coming from the NIC have sizes that are multiples of PAGE_SIZE and are aligned to page boundaries, remapping memory cannot be performed in all cases.

The need for alignment

memcpy can be costly, but the way a single copy operation is performed may help avoid further copies. For instance, composing multiple payload chunks into a logically contiguous and properly aligned area can allow the kernel to simply remap memory when e.g. passing the data to a block device through io_submit(2). This might be the reason behind the page-alignment requirements for data segments of Ceph's on-wire protocol v2.
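
A minimal sketch of this idea (not actual Seastar code, although temporary_buffer<>::aligned() is a real Seastar factory function): composing the received chunks into a single page-aligned buffer means paying for one copy now so that the result can later be handed to the block layer without another one.

#include <seastar/core/temporary_buffer.hh>
#include <algorithm>
#include <vector>

static constexpr size_t page_size = 4096;

seastar::temporary_buffer<char>
flatten_aligned(const std::vector<seastar::temporary_buffer<char>>& chunks,
                size_t total) {
    // One allocation, aligned to a page boundary.
    auto out = seastar::temporary_buffer<char>::aligned(page_size, total);
    size_t off = 0;
    for (auto& c : chunks) {
        // One copy per chunk; had the stack written the payload here in
        // the first place, this loop would disappear.
        std::copy_n(c.get(), c.size(), out.get_write() + off);
        off += c.size();
    }
    return out;
}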

The problem

The POSIX network stack of Seastar had to adapt the operating-system-offered ::read(int fd, void *buf, size_t count) to the data_source::get()-driven interface. posix_data_source_impl always reads into a fixed-size prefetch buffer (posix_data_source_impl::_buf_size is set to 8192 by default). This minimizes the number of read(2) syscalls for tiny chunks. When the chunk is larger than the buffer size, the request is served with multiple calls to get(), which translates into many read(2) syscalls.
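
The pattern looks roughly like this (an illustrative sketch of the behaviour described above, not the actual posix_data_source_impl source; pollable_fd::read_some() is real Seastar API):

#include <seastar/core/reactor.hh>
#include <seastar/core/temporary_buffer.hh>
#include <utility>

class prefetching_source {
    seastar::pollable_fd _fd;
    size_t _buf_size = 8192;        // the fixed prefetch size
public:
    explicit prefetching_source(seastar::pollable_fd fd)
        : _fd(std::move(fd)) {}

    seastar::future<seastar::temporary_buffer<char>> get() {
        // Every get() allocates a fresh prefetch buffer and issues one
        // read(2); a payload larger than _buf_size therefore arrives as
        // a series of <= 8192-byte chunks, one syscall each.
        seastar::temporary_buffer<char> buf(_buf_size);
        return _fd.read_some(buf.get_write(), _buf_size).then(
            [buf = std::move(buf)] (size_t n) mutable {
                buf.trim(n);        // shrink to what read(2) returned
                return std::move(buf);
            });
    }
};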

Moreover, when an application uses input_stream<CharType>::read_exactly(size_t n) (the method operates on top of data_source::get()), those multiple chunks must be copied to compose the flat output buffer – this is mandatory, as by calling read_exactly() the application declares it wants a single, contiguous output buffer. The problem becomes especially visible while filling crimson-osd with data. Profiling of 32k writes shows that around 60% of all cycles were burnt just to serve page faults induced by the memory copy operation.

Also, the output buffer is currently not aligned at a page boundary.
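
A hedged sketch of that copying path (not the actual iostream-impl.hh code linked in the comments below) makes both problems visible: every chunk is memcpy()ed into one flat buffer, and nothing about the plain temporary_buffer<char>(n) allocation guarantees page alignment:

#include <seastar/core/do_with.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/loop.hh>
#include <seastar/core/temporary_buffer.hh>
#include <cstring>

seastar::future<seastar::temporary_buffer<char>>
read_exactly_sketch(seastar::input_stream<char>& in, size_t n) {
    return seastar::do_with(seastar::temporary_buffer<char>(n), size_t{0},
        [&in, n] (seastar::temporary_buffer<char>& out, size_t& done) {
            return seastar::repeat([&in, &out, &done, n] {
                return in.read_up_to(n - done).then(
                    [&out, &done, n] (seastar::temporary_buffer<char> chunk) {
                        // The extra user-space-to-user-space copy.
                        std::memcpy(out.get_write() + done,
                                    chunk.get(), chunk.size());
                        done += chunk.size();
                        // Stop when complete (or early, on EOF).
                        return done == n || chunk.empty()
                            ? seastar::stop_iteration::yes
                            : seastar::stop_iteration::no;
                    });
            }).then([&out] {
                return std::move(out);
            });
        });
}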

Proposed solution

At the Seastar side

Extend the network interface with the concept of a voluntary input buffer factory. Its voluntariness would manifest in two ways:

  • the implementation of a network stack could use the factory but would not be obliged to do so (the native stack would ignore the factory entirely),
  • the client could provide the factory implementation but would not be forced to do so.

The factory is supposed to handle the following operations on a per-input_stream basis:

  • estimation of the size of the next chunk to read,
  • memory allocation for this chunk.

In the case of the POSIX stack, this can be seen as delegating control over posix_data_source_impl::_buf_size to the application. The application could provide the size based on its internal knowledge and policy. If the estimation is correct, data_source_impl::get() would provide the payload as a single, contiguous chunk aligned at a page boundary, without the need for any extra user-space-to-user-space copying.
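
One possible shape for such a factory (all names here are hypothetical – this is a sketch of the proposal, not an existing Seastar interface):

#include <seastar/core/temporary_buffer.hh>
#include <cstddef>

class input_buffer_factory {
public:
    virtual ~input_buffer_factory() = default;
    // How many bytes the application expects next, e.g. the remaining
    // length of the current protocol v2 frame.
    virtual size_t next_read_size_hint() = 0;
    // Where those bytes should land; an implementation aiming at
    // zero-copy io_submit() would hand out a page-aligned buffer here.
    virtual seastar::temporary_buffer<char> allocate(size_t size) = 0;
};

A POSIX data source honouring the factory could then replace its fixed-size prefetch buffer with something along the lines of:

auto size = _factory ? _factory->next_read_size_hint() : _buf_size;
auto buf  = _factory ? _factory->allocate(size)
                     : seastar::temporary_buffer<char>(size);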

At the crimson-osd side

  • Consume the network payload chunks with future<> input_stream::consume(Consumer& c) to form a scatter-gather list (possibly just a ceph::bufferlist) that would be passed over the entire IO path – from the network layer to storage. This part targets the DPDK/SPDK use case; see the Consumer sketch after this list.
  • Consider preserving the 4 KB alignment hint for the place where the first memory copy/translation (e.g. encryption or compression) takes place to form a contiguous, aligned buffer.
  • Implement the input buffer factory in a way that allows issuing one read(2) syscall per protocol v2 message on average. It might be possible to read 1) the segments and epilogue of the current v2 frame and 2) the prologue of the next frame (if it is available) with the same syscall.
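
A hedged sketch of such a Consumer, using a plain std::vector as a stand-in for ceph::bufferlist; the consumption_result-based Consumer concept is real Seastar API, while the message-length tracking is illustrative:

#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>
#include <vector>

struct sg_consumer {
    std::vector<seastar::temporary_buffer<char>> chunks;  // the SG list
    size_t remaining;  // payload bytes still missing for this message

    seastar::future<seastar::consumption_result<char>>
    operator()(seastar::temporary_buffer<char> buf) {
        using result = seastar::consumption_result<char>;
        if (buf.size() >= remaining) {
            // Keep what belongs to this message; hand the rest back to
            // the stream without flattening anything.
            chunks.emplace_back(buf.share(0, remaining));
            buf.trim_front(remaining);
            remaining = 0;
            return seastar::make_ready_future<result>(
                seastar::stop_consuming<char>(std::move(buf)));
        }
        remaining -= buf.size();
        chunks.emplace_back(std::move(buf));
        return seastar::make_ready_future<result>(
            seastar::continue_consuming{});
    }
};
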
@cyx1231st

Profiling of 32k writes shows that around 60% of all cycles were burnt just to serve page faults induced by the memory copy operation.

Could you point out which line of seastar::input_stream<char>::read_exactly() leads to the 58.05% page-fault overhead?

Previously I thought it was https://github.com/ceph/seastar/blob/8188d3b8b7091db9227824791d022d64c98193e9/core/iostream-impl.hh#L190. But that seems not to be true, because posix_data_source_impl::get() also calls the same temporary_buffer<char>(_buf_size), and there's no page-fault overhead there according to your profiling.

Implement the input buffer factory in a way that allows to issue one read(2) syscall per protocol v2 message on average.

I don't quite agree with this part.

For smaller block sizes like 4K or 256B, the prefetch version (with a minimum of system calls, but requiring a user-space-to-user-space copy) is still much faster. My evaluation (https://docs.google.com/spreadsheets/d/1WygP0QgxASzdIupQlEBX8q5m1LWxiv6IcyggLELTT1g/edit#gid=0) implies that the minimum-user-space-to-user-space-copy version could only be faster when the block size is larger than ~64K.

I think this implies that a system call is much more expensive than a memory copy. We still need an implementation that issues the optimal number of system calls to do the prefetch. And the extra user-space-to-user-space copy can be avoided when the buffer to read is larger than the prefetch buffer.

@rzarzynski (Author) commented May 30, 2019

Could you point out which line of seastar::input_stream::read_exactly() leads to the 58.05% page-fault overhead?

In my understanding, that's this line: https://github.com/ceph/seastar/blob/8188d3b8b7091db9227824791d022d64c98193e9/core/iostream-impl.hh#L148. Possibly the compiler was able to inline the call to read_exactly_part from read_exactly.

I think this implies that a system call is much more expensive than a memory copy. We still need an implementation that issues the optimal number of system calls to do the prefetch. And the extra user-space-to-user-space copy can be avoided when the buffer to read is larger than the prefetch buffer.

Yeah, I think you're right here – we might need a somewhat more complex policy that preserves the prefetch feature for tiny chunks. Still, the application-provided input buffer factory looks flexible enough to accommodate that.
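
Such a policy could be as simple as the following sketch (the names and the 64K threshold, taken from the evaluation above, are illustrative):

constexpr size_t prefetch_size = 8192;      // keeps the syscall count low
constexpr size_t direct_threshold = 65536;  // ~64K, per the evaluation

size_t next_read_size_hint(size_t expected_payload) {
    if (expected_payload >= direct_threshold) {
        return expected_payload;  // one big read(2), no extra copy
    }
    return prefetch_size;         // tiny chunks: prefetch, then copy
}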

@ronen-fr

A question to both of you: the latest kernel already supports io_uring, and Seastar is going to use it (as far as I understand from Avi Kivity's words, but also from the fact that it makes perfect sense). So we can expect almost no syscalls on that side.
Am I correct in assuming that our end goal should be something like this:

a request arrives from the network and is scattered into aligned buffers:
-- if from a regular socket: an io_uring request that places the header in one small buffer and "all the rest" (yes – I know that it may include more than one segment, and even more calls) in an aligned buffer (see below marked (*))
-- if from DPDK: pre-arranged in the RDMA buffers to have multiple targets per segment
Our logic then creates a set of Seastar 'writes' that will use the (still scattered) buffers directly. No system calls in the application code and – if Seastar is using io_uring efficiently – almost no syscalls there, too?
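
A hedged sketch of that split read, using plain readv(2) as a stand-in for the equivalent io_uring submission (IORING_OP_READV); the sizes here are illustrative:

#include <sys/uio.h>
#include <stdlib.h>

enum { HDR_SIZE = 32, PAGE = 4096, PAYLOAD_MAX = 16 * PAGE };

ssize_t read_frame(int fd, char* hdr, void** payload_out) {
    void* payload = nullptr;
    if (posix_memalign(&payload, PAGE, PAYLOAD_MAX) != 0) {
        return -1;
    }
    struct iovec iov[2] = {
        { hdr, HDR_SIZE },        // small header buffer
        { payload, PAYLOAD_MAX }, // page-aligned payload buffer
    };
    ssize_t n = readv(fd, iov, 2);  // one syscall fills both targets
    *payload_out = payload;
    return n;
}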
