@dormando
Last active August 29, 2015 14:01
varnish v4.0.0
TL;DR: increasing thread_pools lowers request rates, partly due to WRK_SumStat lock waits. Otherwise, locks around vsl_get() and vfprintf into the VSL account for 30-50% of overall time.
Machine: 2x 8-core 2.9GHz Intel CPUs.
Test:
Client threads: 200
Reqs per conn: 1000
thread_pools=1: 390,619r/s
thread_pools=2: 374,610r/s
thread_pools=8: 349,200r/s
thread_pools=16: 343,486r/s
102 __lll_lock_wait,_L_lock_858,pthread_mutex_lock,Lck__Lock,WRK_SumStat,http1_cleanup,HTTP1_Session,ses_req_pool_task,SES_pool_accept_task
102 threads piling on WRK_SumStat lock.
thread_pools=16, thread_stats_rate=5000: 343k again
136 __lll_lock_wait,_L_lock_858,pthread_mutex_lock,Lck__Lock,WRK_SumStat,http1_cleanup,HTTP1_Session,ses_req_pool_task,SES_pool_accept_task
thread_stats_rate=50000: still 343k
17 __lll_lock_wait,_L_lock_858,pthread_mutex_lock,Lck__Lock,WRK_SumStat,http1_cleanup,HTTP1_Session,ses_req_pool_task,SES_pool_accept_task
Fewer threads piling up.
Perf still shows 20% CPU in _raw_spin_lock. Might be VSL locks; let's dig in a bit.
thread_pools=1,thread_stats_rate=50000: 380,000r/s
same with thread_stats_rate set to default.
thread_pools=1,vsl_mask=-VCL_trace,-WorkThread,-Hash,-ReqHeader,-RespHeader,-VCL_call,-VCL_return,-Timestamp,-Debug: 422k-437k.
Lots of time spent in _raw_spin_lock now (27%). Less, but still significant, time spent in vprintf (13%).
threads are sitting in:
25 __lll_lock_wait,_L_lock_858,pthread_mutex_lock,vsl_get,VSL_Flush,http1_cleanup,HTTP1_Session,ses_req_pool_task,SES_pool_accept_task
5 __lll_lock_wait,_L_lock_858,pthread_mutex_lock,Lck__Lock,WRK_SumStat,http1_cleanup,HTTP1_Session,ses_req_pool_task,SES_pool_accept_task
plus:
1 __lll_lock_wait,_L_lock_858,pthread_mutex_lock,vsl_get,vslr,VSL,http1_dissect,req=0x16f24f0),ses_req_pool_task,SES_pool_accept_task
at least 50 of these, with varying req= args waiting on the same lock in vsl_get.
vsl_get is:
/*--------------------------------------------------------------------
* Reserve bytes for a record, wrap if necessary
*/
... a global lock around pulling bytes from the shmlog. So I doubt this'll go much faster unless there's a way to shrink the amount of writing into the VSL.
Let's kill even more:
param.set vsl_mask -VCL_trace,-WorkThread,-Hash,-ReqHeader,-RespHeader,-VCL_call,-VCL_return,-Timestamp,-Debug,-Begin,-ReqStart,-ReqMethod,-ReqURL,-ReqProtocol,-Hit,-RespProtocol,-RespStatus,-RespResponse,-RespUnset,-ReqAcct,-End
497k, but there's CPU idle now; the load generator is maxed.
Adding a second host isn't behaving now, will have to tune the test.
Next up... pass/miss/etc/maybe?