@thesjg
Created April 4, 2011 05:14
[21:27] <@dillon> the current implementation is pretty bad, though it is using VOP_READ so there should be read-ahead by virtue of it going through the filesystem
[21:27] <@vsrinivas> also it's mplocked.
[21:28] <@vsrinivas> actually it looks like it's not; the mplock is just around the fd work.
[21:28] <@thesjg> so feel free to drop all your insight on me and i'll make note of it
[21:28] <@thesjg> but i'm not prepared to do it quite yet
[21:28] <@dillon> it shouldn't need that mplock since it's holding the fp
[21:29] <@vsrinivas> ummm -- in the core loop it ssb_lock()s;
[21:29] <@vsrinivas> tcp output needs the ssb lock also iirc to read from that queue.
[21:29] <@dillon> w/ regards to the rest I think it can pass a larger value in the ioflags it passes to VOP_READ ... figure out what the maximum sequential heuristic value is there and pass that
[21:29] <@dillon> (in the upper 16 bits. See line 1694)
[21:30] <@vsrinivas> so sendfile will copy all the 4k pages into the socket to send, but the socket can't go until ssb_lock is released?
[21:30] <@dillon> right now it looks like it is passing a fairly small value and not reading ahead much
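To make the ioflags suggestion concrete, here is a minimal sketch (not the actual kern/uipc_syscalls.c code), assuming DragonFly carries the usual BSD IO_SEQMAX/IO_SEQSHIFT definitions from <sys/vnode.h>; sf_issue_read(), vp, auio and cred are placeholder names standing in for sendfile's own variables.

/*
 * Sketch: ask the filesystem for its most aggressive read-ahead by
 * encoding the maximum sequential heuristic in the upper 16 bits of
 * the ioflag argument, instead of the small per-chunk value the
 * current sendfile loop passes.
 */
static int
sf_issue_read(struct vnode *vp, struct uio *auio, struct ucred *cred)
{
	int ioflags = IO_SEQMAX << IO_SEQSHIFT;

	return (VOP_READ(vp, auio, ioflags, cred));
}

Any flags the existing code already ORs into ioflags (the code around line 1694 that dillon points at) would be kept alongside the heuristic.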
[21:31] <@dillon> jeeze is it holding the sockbuf locked through the VOP_READ ?
[21:31] <@vsrinivas> dillon: is that accurate (ssb_lock excludes tcp output path?)
[21:31] <@vsrinivas> yes.
[21:31] <@dillon> it certainly does not need to do that
[21:31] <@dillon> yah, that will mess up the tcp stack
[21:32] <@dillon> it can definitely release the lock around the VOP_READ, and probably other places too
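A hedged sketch of the restructuring dillon describes: the blocking VOP_READ() moves outside the sendspace lock so the TCP output path can keep draining the sockbuf while the read sleeps on disk I/O. ssb_lock()/ssb_unlock() and so_pru_send() are the primitives named in the conversation, but the argument lists here are approximate and sf_build_mbuf_chain() is a purely illustrative placeholder.

for (;;) {
	/* 1. read the next chunk of the file; may sleep on disk I/O */
	error = VOP_READ(vp, &auio, ioflags, cred);
	if (error)
		break;

	/* 2. wrap the freshly read pages in sf_bufs/mbufs (elided) */
	m = sf_build_mbuf_chain(/* ... */);

	/*
	 * 3. take the sockbuf lock (if it is needed at all -- dillon
	 *    questions that just below) only around the send, never
	 *    across the read.
	 */
	ssb_lock(&so->so_snd, M_WAITOK);
	error = so_pru_send(so, 0, m, NULL, NULL, curthread);
	ssb_unlock(&so->so_snd);
	if (error)
		break;
}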
[21:32] <@vsrinivas> also -- it's currently holding the vm token -- but iirc vm_page_lookup(obj ...) needs the vmobj token to be held?
[21:32] <@dillon> I'm not sure it needs the ssb_lock at all, why does it get it ?
[21:32] <@dillon> it's holding it around so_pru_send
[21:32] <@vsrinivas> to quote the comment: '* Protect against multiple writers to the socket.'
[21:33] <@thesjg> um
[21:33] <@dillon> ah
[21:33] <@dillon> it won't mess up the protocol side then
[21:33] <@dillon> sends are messaged so the userland side doesn't queue to the sockbuf directly
[21:34] <@vsrinivas> wha?
[21:34] <@vsrinivas> } else if (vm_page_sleep_busy(pg, TRUE, "sfpbsy")) {
[21:34] <@vsrinivas> lwkt_reltoken(&vm_token);
[21:34] <@dillon> still it probably doesn't need the lock, there is no expectation of atomicity if multiple threads are running writes to the same socket via sendfile()
[21:34] <@vsrinivas> that's a sad bit of code. sleep; wakeup to get the vm token. release the token.
[21:34] <@dillon> sendfile is manually manipulating VM pages. VM pages that are in a PG_BUSY or m->busy state are in flux
[21:34] <@dillon> that case is going to be fairly rare
[21:35] <@dillon> if it catches up to the read-ahead it will hit soft-busied pages whose reads are in progress from prior VOP_READ readaheads
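For readers without the source open, here is a hedged paraphrase of the loop vsrinivas is quoting: under the global vm_token the code looks each page up in the file's VM object and, if the page is busy because a prior read-ahead is still in flight, sleeps, drops the token and retries from scratch. The shape below is illustrative, not a copy of kern/uipc_syscalls.c.

retry:
	lwkt_gettoken(&vm_token);
	pg = vm_page_lookup(obj, pindex);
	if (pg != NULL && vm_page_sleep_busy(pg, TRUE, "sfpbsy")) {
		/*
		 * PG_BUSY or m->busy: read-ahead I/O is still in flux,
		 * so the sleep wakes us up only to drop the token and
		 * start the lookup over.
		 */
		lwkt_reltoken(&vm_token);
		goto retry;
	}
	/*
	 * pg == NULL: not resident, fall back to the VOP_READ path (elided).
	 * Otherwise the page is stable: wrap it in an sf_buf and send it.
	 */
	lwkt_reltoken(&vm_token);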
[21:36] <@dillon> it should be doing read-ahead ok, even though the implementation is old, just fix the parameter sent to VOP_READ()
[21:36] <@dillon> you can test by sendfile'ing a file from a freshly mounted (uncached) filesystem and seeing if it maxes the disk's sequential throughput out
[21:36] <@thesjg> ok, sounds easy enough
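A hedged userland sketch of that test: sendfile(2) one large file from the freshly mounted filesystem to a TCP sink (for example nc -l on another machine) and report the throughput. Only the BSD sendfile(2) call itself is the point; the argument handling and error reporting are just enough to run the measurement.

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/time.h>
#include <err.h>
#include <fcntl.h>
#include <netdb.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct addrinfo hints, *res;
	struct timeval t0, t1;
	off_t sbytes = 0;
	double secs;
	int fd, s;

	if (argc != 4)
		errx(1, "usage: %s file host port", argv[0]);
	if ((fd = open(argv[1], O_RDONLY)) < 0)
		err(1, "open %s", argv[1]);

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;
	if (getaddrinfo(argv[2], argv[3], &hints, &res) != 0)
		errx(1, "getaddrinfo %s:%s", argv[2], argv[3]);
	s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
	if (s < 0 || connect(s, res->ai_addr, res->ai_addrlen) < 0)
		err(1, "connect");

	gettimeofday(&t0, NULL);
	/* nbytes == 0: send from offset 0 to EOF; sbytes counts what went out */
	if (sendfile(fd, s, 0, 0, NULL, &sbytes, 0) < 0)
		err(1, "sendfile");
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%jd bytes in %.2f sec = %.1f MB/s\n",
	    (intmax_t)sbytes, secs, sbytes / secs / (1024.0 * 1024.0));
	return (0);
}

If the read-ahead is doing its job, a single run against an uncached file should land near the disk's sequential transfer rate.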
[21:38] <@vsrinivas> okay; to do this sorta better.... would it make sense to start by building sfbufs for a ton of pages first.... then interleave piping sfbufs+mbufs into the so_send path (returning to sfbuf'ing when the sosend path is draining well)?
[21:38] <@vsrinivas> something like double-buffering between the two?
[21:39] <@dillon> no, the VOP_READ it issues is definitely correct, you want to do that because the filesystem will implement the read-ahead for you
...
[21:56] <@dillon> hmm. you know, it occurs to me that a lot of read-ahead wouldn't be beneficial anyway, since it is going to keep the sockbuf full
[21:57] <@thesjg> the problem is it doesn't keep the sockbuf full
[21:57] <@thesjg> once you get 50-60 connections being sendfile(2)'d to, each one a different file, the disk is so busy doing seeks the sockbufs never fill
[21:57] <@thesjg> if you are doing more readahead you lessen the # of seeks the disk has to bear
[21:58] <@dillon> hmm. That would be a hysteresis issue in cluster_read() I would guess
[21:58] <@dillon> something to definitely look into. Even though cluster_read() does read-ahead it can wind up nickel-and-diming the disk
[21:58] <@thesjg> well, i haven't tested it on dragonfly but at least that's the issue i have hit time and again for years and years on freebsd
[21:58] <@dillon> e.g. if you read at offset 0 it reads, say, 128KB
[21:59] <@dillon> when you read() at offset 4096 it reads ahead to the 132K mark, reading just 4K more
[21:59] <@dillon> that could cause the disk to thrash
[21:59] <@thesjg> ah, sure, i see
[21:59] <@dillon> there is a read-ahead mark, B_RAM, that's supposed to deal with this
[22:00] <@dillon> you could add some debugging to print out the offsets B_RAM is being set at to make sure the read-ahead is working on properly sized chunks
[22:00] <@dillon> i.e. B_RAM should be set about half way through the read-ahead if the buffer cache buffer is not B_CACHE
[22:01] <@dillon> if the buffer cache buffer is cached (B_CACHE) then the next read-ahead is triggered when a B_RAM mark is found, setting the next B_RAM mark ahead another X KB.
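A hedged sketch of that instrumentation: wherever vfs_cluster.c sets B_RAM on a buffer (the exact spot and surrounding variables are not reproduced here), log the logical offset and size of the buffer receiving the mark so the spacing of the read-ahead marks can be eyeballed against the expected chunk size.

	bp->b_flags |= B_RAM;
	kprintf("B_RAM set at loffset %jd (bcount %d)\n",
		(intmax_t)bp->b_loffset, (int)bp->b_bcount);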
[22:01] <@vsrinivas> (B_CACHE is set when? also is there a bflag for swapcache reads?)
[22:01] <@vsrinivas> ((anyone got a chance to look at the mpipe ctor path?
[22:01] <@dillon> B_CACHE means a buffer cache buffer has valid data
[22:01] <@vsrinivas> otherwise, i'll commit :D))
[22:01] <@vsrinivas> okay.
[22:01] <@dillon> in any case, worth testing to make sure it is working properly.
[22:02] <@vsrinivas> ((yep. already ran a vkernel with it))
[22:02] <@dillon> Lots of parallel sendfile()'s will thrash the disk but still should get good throughput if the read-ahead is sufficient
[22:02] <@thesjg> once i get another piece of hardware here running DF some of this testing will get easier
[22:02] <@vsrinivas> the ssb around it should definitely be fixed though.
[22:02] <@vsrinivas> *ssb lock