@cyx1231st
Last active May 30, 2019 06:57

Efficient reading from network socket in Seastar

Our requirements for socket read

a) The buffer can be a scatter-gather list internally.

b) The buffer must be contiguous.

c) The buffer should be contiguous with extra alignment and/or length requirements.

My evaluations

Evaluation (https://docs.google.com/spreadsheets/d/1WygP0QgxASzdIupQlEBX8q5m1LWxiv6IcyggLELTT1g/edit#gid=0) comparing the approach with the fewest system calls (prefetch) against the approach with the fewest copies (exact, implemented in https://github.com/cyx1231st/seastar/commit/d00a866bbfcd78bf0d99e0c2f14930f48205ebaa):

Case 1) When the block size is small (< ~64K): prefetch beats the exact version because it issues the fewest system calls, even though it adds some user-space-to-user-space copying.

Case 2) When the block size is large (> ~64K): both versions issue a similar number of system calls, so the exact version wins because it does no user-space-to-user-space copying.

This implies:

  • A system call is much more expensive than a memory copy.
  • When the block size is large, likely larger than the prefetch buffer, we can minimize user-space-to-user-space copying.

Proposed change with minimal impact

  • The current implementation of input_stream<CharType>::consume() with our bufferlist_consumer already meets requirement a), so no change is needed.
  • We can introduce a new interface, read_exactly2(size_t read_len, __le16 alignment=sizeof(void *), size_t extra_len=0), to implement our specific requirements with optimizations.
    • For b), data_source::get() can be reused with prefetch. If it cannot return the entire buffer we need, we copy what it returned into our own buffer and use input_stream<CharType>::read_exactly_part2() to fill the rest.
    • For c), we can still use data_source::get() with prefetch and copy its content into our own buffer. If that is not the entire buffer we need, we use input_stream<CharType>::read_exactly_part2() to fill the rest.
  • An implementation of input_stream<CharType>::read_exactly_part2() exists in https://github.com/cyx1231st/seastar/commit/d00a866bbfcd78bf0d99e0c2f14930f48205ebaa. It requires a new interface, data_source_impl::get2(char* buf, size_t size), which passes a caller-supplied buffer to the data_source_impl. For DPDK, I think it requires user-space-to-user-space copying to fill that specially allocated buffer.
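The control flow proposed above can be sketched without Seastar. This is only an illustration under stated assumptions: mock_source and mock_stream are hypothetical stand-ins (not Seastar's data_source_impl or input_stream), everything is synchronous instead of future-based, the alignment parameter is a plain size_t rather than __le16, and the `calls` counter merely simulates read system calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a data source; NOT Seastar's data_source_impl.
class mock_source {
    std::string _wire;     // bytes waiting "on the socket"
    std::size_t _pos = 0;
    std::size_t _prefetch; // upper bound on bytes returned by one get()
public:
    std::size_t calls = 0; // simulated read calls, for counting only

    mock_source(std::string wire, std::size_t prefetch)
        : _wire(std::move(wire)), _prefetch(prefetch) {}

    // Plays the role of get(): returns a source-owned prefetch buffer.
    std::string get() {
        ++calls;
        std::size_t n = std::min(_prefetch, _wire.size() - _pos);
        std::string out = _wire.substr(_pos, n);
        _pos += n;
        return out;
    }

    // Plays the role of the proposed get2(): fills the caller's buffer.
    std::size_t get2(char* buf, std::size_t size) {
        ++calls;
        std::size_t n = std::min(size, _wire.size() - _pos);
        std::memcpy(buf, _wire.data() + _pos, n);
        _pos += n;
        return n;
    }
};

struct free_deleter {
    void operator()(char* p) const { std::free(p); }
};
using aligned_buf = std::unique_ptr<char, free_deleter>;

// Hypothetical stand-in for the input stream; NOT Seastar's input_stream.
class mock_stream {
    mock_source& _src;
    std::string _buf; // prefetched bytes not yet handed to a caller
public:
    explicit mock_stream(mock_source& src) : _src(src) {}

    // Sketch of read_exactly2(). `alignment` must be a power of two.
    // Returns nullptr on EOF or allocation failure.
    aligned_buf read_exactly2(std::size_t read_len,
                              std::size_t alignment = sizeof(void*),
                              std::size_t extra_len = 0) {
        // std::aligned_alloc wants a size that is a multiple of alignment.
        std::size_t cap = std::max<std::size_t>(read_len + extra_len, 1);
        cap = (cap + alignment - 1) / alignment * alignment;
        aligned_buf out(static_cast<char*>(std::aligned_alloc(alignment, cap)));
        if (!out) {
            return nullptr;
        }
        if (_buf.empty()) {
            _buf = _src.get(); // one prefetching read
        }
        // Copy whatever the prefetch already holds (the extra user-space copy).
        std::size_t done = std::min(read_len, _buf.size());
        std::memcpy(out.get(), _buf.data(), done);
        _buf.erase(0, done);
        // Fill the rest in place, the role of read_exactly_part2().
        while (done < read_len) {
            std::size_t n = _src.get2(out.get() + done, read_len - done);
            if (n == 0) {
                return nullptr; // EOF before read_len bytes arrived
            }
            done += n;
        }
        return out;
    }
};
```

With a 16-byte prefetch, a 10-byte request is served by one get() and leaves 6 bytes buffered. A following 50-byte request copies those 6 bytes and reads the remaining 44 directly into the aligned buffer with a single get2() call, so only the prefetched part is copied twice.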
@rzarzynski

rzarzynski commented May 30, 2019

a) The buffer can be a scatter-gather list internally.

Sure, this suits DPDK/SPDK but at the moment ProtocolV2 uses read_exactly() on the hot path (ProtocolV2::read_frame_payload).

c) The buffer should be contiguous with extra alignment and/or length requirements.

This suits kernel networking and kernel bdev but is contrary to the needs of DPDK/SPDK. The problem is that we need to support both stacks effectively. I could imagine crimson-osd switching between bufferlist_consumer and read_exactly2, but it would need to be aware of which stack it deals with. ADDED: The input buffer factory has been proposed to abstract those differences.

Case 2) When block-size is larger (>~64K): this case, the number of system-calls would be at the similar level for both prefetch and exact version.

As I understand it, for a 4 MB chunk we would have memcpy interleaved with plenty of syscalls: 4096 KB / 8 KB = 512.
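The arithmetic can be written as a small helper. This is a sketch that assumes every read returns at most a fixed number of bytes (8 KB here, matching the figure above); reads_needed is a hypothetical name, and per_read must be greater than zero.

```cpp
#include <cassert>
#include <cstddef>

// Reads needed to move `chunk` bytes when one read returns at most `per_read`.
constexpr std::size_t reads_needed(std::size_t chunk, std::size_t per_read) {
    return (chunk + per_read - 1) / per_read; // ceiling division
}

constexpr std::size_t KiB = 1024;
static_assert(reads_needed(4096 * KiB, 8 * KiB) == 512, "4 MB in 8 KB reads");
```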

@rzarzynski

Evaluation(https://docs.google.com/spreadsheets/d/1WygP0QgxASzdIupQlEBX8q5m1LWxiv6IcyggLELTT1g/edit#gid=0) to compare optimal less-system-calls(prefetch) vs optimal less-copy(exact, with cyx1231st/seastar@d00a866):

Case 1) When block-size is smaller (<~64K): prefetch is better than the exact version due to minimum system-calls, even with some extra user-space-to-user-space copying.

I think there might be a problem with the comparison. IIUC, handling one v2 frame with the exact policy required six read(2) syscalls (1 for the preamble, 4 for the 4 segments, 1 for the epilogue). The idea is to minimize that number as much as possible.
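To make the per-frame count concrete, here is a rough model. It assumes every frame part arrives whole in a single read under the exact policy and that each prefetching read returns a full buffer; the `frame` struct, the function names, and the byte sizes in the usage note are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical layout of one v2 frame: preamble, 4 segments, epilogue.
struct frame {
    std::size_t preamble;
    std::array<std::size_t, 4> segments;
    std::size_t epilogue;

    std::size_t total() const {
        return preamble
             + std::accumulate(segments.begin(), segments.end(), std::size_t{0})
             + epilogue;
    }
};

// Exact policy: one read per frame part, i.e. 1 + 4 + 1.
std::size_t exact_reads(const frame& f) {
    return 1 + f.segments.size() + 1;
}

// Prefetch policy: reads are bounded by the frame size, not the part count.
std::size_t prefetch_reads(const frame& f, std::size_t prefetch) {
    return (f.total() + prefetch - 1) / prefetch; // ceiling division
}
```

For a made-up frame such as `frame f{32, {64, 128, 256, 512}, 17}`, exact_reads(f) is 6 while prefetch_reads(f, 8192) is 1, which is the gap the comparison would need to account for.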

@ronen-fr

A question to both of you: the latest kernel already supports io_uring. Seastar will presumably use it (as far as I understand from Avi Kivity's words, and also because it makes perfect sense). So we can expect almost no syscalls on that side.
Am I correct in assuming that our end goal should be something like this:

  • a request arrives from the network, and is scattered into aligned buffers:
    -- if from a regular socket: an io_uring request that places the header in one small buffer, and "all the rest" (yes - I know that it may include more than one segment, and even more calls) on an aligned buffer (see below marked (*))
    -- if from DPDK: pre-arranged in the RDMA buffers to have multiple targets per segment

Our logic then creates a set of seastar 'writes' that will use the (still scattered) buffers directly. No system call on the application code, and - if seastar is using io_uring efficiently, almost no syscalls there, too?

@cyx1231st
Author

Sure, this suits DPDK/SPDK but at the moment ProtocolV2 uses read_exactly() on the hot path (ProtocolV2::read_frame_payload).

Yes, ProtocolV2::read_frame_payload is not related to case a); it falls under case c) of our requirements.

This suits kernel networking and kernel bdev but is contrary to the DPDK/SPDK's needs. The problem is we need to support both stacks effectively. I could imagine crimson-osd switching between bufferlist_consumer and read_exactly2 but it would need to be aware which stack it deals with. ADDED: The input buffer factory has been proposed to abstract those differences.

I think it is still possible to provide a uniform implementation at the input_stream level that is unaware of the different stacks, and to hide the differences at the data_source_impl level behind a new "data_source_impl::get2(char* buf, size_t size)" interface for native_connected_socket_impl and posix_data_source_impl, as explained in the section "Proposed change with minimal impact".

In my understanding for 4 MB chunk we would have memcpy interleaved with plenty of syscalls as 4096 KB / 8 KB = 512.

Agreed. What I mean is that for larger block sizes the "exact version" is faster because its number of syscalls is at the same level as (though still higher than) that of the "prefetch version", and it does no user-space-to-user-space copying.

I think there might be a problem with the comparison. IIUC handling one v2 frame with the exact policy was requiring six read(2) syscalls (1 for preamble, 4 for the 4 segments, 1 for epilogue). The idea is to minimize the number as much as possible.

Agreed, the comparison is not accurate and the implementation is preliminary. My proposal would build on it in order to:

  • Issue even fewer syscalls for smaller block sizes, at the cost of some copies introduced by prefetching.
  • Reduce the impact on the existing code, because the input_stream level needs no awareness of the different stacks.
