Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@pnc
Last active June 27, 2022 10:56
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pnc/b8ccc7c9e633258294be22be56e82bb7 to your computer and use it in GitHub Desktop.
Save pnc/b8ccc7c9e633258294be22be56e82bb7 to your computer and use it in GitHub Desktop.
Debugging poolboy checkout errors

Just about the only thing that'll crash the Erlang VM is running out of memory. Although processes themselves are cheap, a good way to run out of memory is to start an unbounded number of processes from an external signal (such as one per web request), and then make each of those processes hold onto some more expensive resource. If you're not careful about cleanup, you can also leak processes, which compounds the problem. Anyway, nobody's perfect, so to avoid these cases, we use the poolboy library in a bunch of places to handle process pools that handle cleanup and limit the total amount of concurrency to some reasonable amount.

When:

  1. all a pool's workers are checked out (by actual organic load/concurrency)
  2. some process is leaking pool workers by not checking them back in (rude)
  3. the pool supervisor isn't responding (it's deadlocked, bugs, etc.)

you'll see "checkout" timeouts (waited longer than the timeout to check out a worker from the pool):

Ranch listener http_listener terminated with reason: {timeout,{gen_server,call,[somebody_ldap,{checkout,#Ref<0.0.182714369.260050>,true},5000]}}
Jun 01 14:24:15 squid-fortress-production squid-fortress:  18:24:15.686 [error] Ranch listener http_listener terminated with reason: [{reason,{timeout,{gen_server,call,[somebody_ldap,{checkout,#Ref<0.0.182714369.260407>,true},5000]}}},{mfa,{authentication_handler,handle,2}},{stacktrace,[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},{poolboy,checkout,3,[{file,"src/poolboy.erl"},{line,55}]},{poolboy,transaction,3,[{file,"src/poolboy.erl"},{line,74}]},{authentication_handler,authenticate_request,3,[{file,"src/authentication_handler.erl"},{line,91}]},{authentication_handler,handle,2,[{file,"src/authentication_handle..."},...]},...]},...]

To get the number of available and checked out workers, run:

> poolboy:status(somebody_ldap)
{ready,39,0,1}

where somebody_ldap is the registered name of the pool in question, shown in the error message as the first argument to gen_server:call. The return value is {status, Available, Overflow, CheckedOut}. Overflow is how many workers beyond the normal pool size have been allocated, I think, but read the poolboy docs because I still don't understand what counts as overflow.

If poolboy:status command times out, the pool supervisor is locked up and you need to do something drastic (like restart the server or inspect the supervisor process itself.)

If you see something like {full,0,0,10}, all the workers are checked out.

You can find out what each worker is doing by running:

> lists:map(fun({_, Pid, _, _}) -> process_info(Pid, current_stacktrace) end, gen_server:call(somebody_ldap, get_all_workers)).
[{current_stacktrace,[{eldap,recv,1,
                             [{file,"eldap.erl"},{line,1027}]},
                      {ldap_authentication_adapter,login,5,
                                                   [{file,"src/ldap_authentication_adapter.erl"},{line,16}]},
                      {somebody_authentication_adapter,login,5,
                                                     [{file,"src/somebody_authentication_adapter.erl"},{line,43}]},
                      {authentication_server,handle_call,3,
                                             [{file,"src/authentication_server.erl"},{line,124}]},
                      {gen_server,try_handle_call,4,
                                  [{file,"gen_server.erl"},{line,629}]},
                      {gen_server,handle_msg,5,
                                  [{file,"gen_server.erl"},{line,661}]},
                      {proc_lib,init_p_do_apply,3,
                                [{file,"proc_lib.erl"},{line,240}]}]}, …]

In this case, the worker is stuck receiving data from the LDAP server over a (presumably defunct) TCP connection.

If you drop the second argument to erlang:process_info, you can see a bit more information, like the process's dictionary, which may contain useful info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment