a) The buffer can be a scatter-gather list internally.
b) The buffer must be contiguous.
c) The buffer should be contiguous with extra alignment and/or length requirements.
Evaluation (https://docs.google.com/spreadsheets/d/1WygP0QgxASzdIupQlEBX8q5m1LWxiv6IcyggLELTT1g/edit#gid=0) comparing the optimal fewer-system-calls approach (prefetch) with the optimal fewer-copies approach (exact, implemented in https://github.com/cyx1231st/seastar/commit/d00a866bbfcd78bf0d99e0c2f14930f48205ebaa):
Case 1) When the block size is small (< ~64K): prefetch is better than the exact version because it minimizes system calls, even with some extra user-space-to-user-space copying.
Case 2) When the block size is large (> ~64K): the number of system calls is at a similar level for both the prefetch and the exact version. The exact version is better here because it does no user-space-to-user-space copying.
This implies:
- A system call is much more expensive than a memory copy.
- When the block size is large, likely larger than the prefetch size, we can minimize user-space-to-user-space copying.
- The current implementation of `input_stream<CharType>::consume()` with our `bufferlist_consumer` already meets our requirement a), so there is no need to change it.
- We can introduce a new interface `read_exactly2(size_t read_len, __le16 alignment=sizeof(void *), size_t extra_len=0)` to implement our specific requirements with optimizations.
  - For b), `data_source::get()` can be reused with prefetch. If it cannot read out the entire buffer we need, we can copy the content to our own buffer and use `input_stream<CharType>::read_exactly_part2()` to fill the rest.
  - For c), we can still use `data_source::get()` with prefetch and copy the content to our own buffer. If that is not the entire buffer we need, we can use `input_stream<CharType>::read_exactly_part2()` to fill the rest.
- As for `input_stream<CharType>::read_exactly_part2()`, there is an implementation in https://github.com/cyx1231st/seastar/commit/d00a866bbfcd78bf0d99e0c2f14930f48205ebaa. It requires a new interface `data_source_impl::get2(char* buf, size_t size)` so that the caller can provide the input buffer to the `data_source_impl`. In DPDK, I think it requires user-space-to-user-space copying to fill that specially allocated buffer.
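To make the b)/c) path above concrete, here is a minimal sketch in plain C++ without Seastar. Everything in it is hypothetical: `mock_source`, `aligned_buf`, and the bodies of `get()`, `get2()` and `read_exactly2()` are stand-ins made up for illustration, not the code from the linked commit. It always copies the first prefetched chunk, which is the c) behaviour; for b) that copy could be skipped when the first `get()` already covers `read_len`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for a data source; this is NOT the Seastar API.
struct mock_source {
    std::string data;            // bytes waiting "in the kernel"
    std::size_t prefetch = 8192; // size of one prefetching read
    std::size_t pos = 0;         // read cursor
    std::size_t reads = 0;       // simulated system calls

    // Analogue of data_source::get(): one read into a source-owned buffer.
    std::string get() {
        ++reads;
        std::size_t n = std::min(prefetch, data.size() - pos);
        std::string out = data.substr(pos, n);
        pos += n;
        return out;
    }

    // Analogue of the proposed get2(buf, size): one read straight into the
    // caller's buffer. The memcpy here models the kernel-to-user copy.
    std::size_t get2(char* buf, std::size_t size) {
        ++reads;
        std::size_t n = std::min(size, data.size() - pos);
        std::memcpy(buf, data.data() + pos, n);
        pos += n;
        return n;
    }
};

// Contiguous buffer with an aligned start; move-only because of unique_ptr.
struct aligned_buf {
    std::unique_ptr<char[]> storage; // over-allocated backing store
    char* ptr = nullptr;             // aligned start inside storage
    std::size_t len = 0;             // payload bytes actually filled
    std::size_t copied = 0;          // bytes moved by user-space memcpy
};

// Hypothetical sketch of read_exactly2(); `alignment` must be a power of two.
aligned_buf read_exactly2(mock_source& src, std::size_t read_len,
                          std::size_t alignment = sizeof(void*),
                          std::size_t extra_len = 0) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    aligned_buf out;
    std::size_t space = read_len + extra_len + alignment;
    out.storage.reset(new char[space]);
    void* p = out.storage.get();
    out.ptr = static_cast<char*>(
        std::align(alignment, read_len + extra_len, p, space));
    assert(reinterpret_cast<std::uintptr_t>(out.ptr) % alignment == 0);

    // Step 1: one prefetching read, then copy what we need into our buffer.
    std::string first = src.get();
    out.len = std::min(first.size(), read_len);
    std::memcpy(out.ptr, first.data(), out.len);
    out.copied = out.len;
    // Hand back any surplus; a real stream would keep it buffered instead.
    src.pos -= first.size() - out.len;

    // Step 2: fill the rest directly, with no further user-space copy.
    while (out.len < read_len) {
        std::size_t n = src.get2(out.ptr + out.len, read_len - out.len);
        if (n == 0) break; // end of stream
        out.len += n;
    }
    return out;
}
```

With a 20-byte payload and an 8-byte prefetch, this sketch does one `get()` plus one `get2()`, and only the first 8 bytes go through a user-space copy.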
Sure, this suits DPDK/SPDK but at the moment `ProtocolV2` uses `read_exactly()` on the hot path (`ProtocolV2::read_frame_payload`).
This suits kernel networking and kernel bdev but is contrary to the DPDK/SPDK's needs. The problem is we need to support both stacks effectively. I could imagine `crimson-osd` switching between `bufferlist_consumer` and `read_exactly2` but it would need to be aware which stack it deals with. ADDED: The input buffer factory has been proposed to abstract those differences.
In my understanding, for a 4 MB chunk we would have `memcpy` interleaved with plenty of syscalls, as `4096 KB / 8 KB = 512`.
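The arithmetic above can be written down as a tiny helper. This is only a sketch: `reads_needed` is a hypothetical name, and it assumes that 8 KB is the maximum number of bytes a single read returns.

```cpp
#include <cstddef>

// Number of reads needed to pull `chunk` bytes when each read returns at
// most `per_read` bytes (ceiling division). Hypothetical helper.
constexpr std::size_t reads_needed(std::size_t chunk, std::size_t per_read) {
    return (chunk + per_read - 1) / per_read;
}

constexpr std::size_t KB = 1024;

// 4 MB chunk read 8 KB at a time: 4096 KB / 8 KB = 512 reads.
static_assert(reads_needed(4096 * KB, 8 * KB) == 512, "4 MB in 8 KB reads");
```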