informatimago/bt-ccl-debug.org

## bt-ccl-debug.org

      
    Raw
  

              bt-ccl-debug.org
            
          
    Debugging a dead-lock in a program using bordeaux-threads in ccl.

We have a program with three threads:
cl-user> (list-threads)
 1) #<process Telnet REPL Client #1 DOWN LAYER(22) [semaphore wait] #x302002A4903D>
 2) #<process Telnet REPL Client #1(21) [semaphore wait] #x302002A469FD>
 3) #<process Telnet REPL Server(20) [Active] #x30200291EB5D>
 4) …

The server thread listens to connections and forks client threads for
  accepted connections.
The client thread forks a down layer thread that loops reading bytes
  from the client socket, and forwarding them up the protocol layers, up
  to a buffer in a TELNET-STREAM Gray stream.
The client thread then goes on into a REPL loop using the
  TELNET-STREAM Gray stream as *TERMINAL-IO*.  Writing back to
  *TERMINAL-IO* goes down to the client socket in this client thread.
Unfortunately, when sending a byte to the upper layer, the down layer
  thread hangs waiting for the stream-lock.  Who has locked this stream?
Neither ccl nor bordeaux-threads are very helpful in debugging that…
Recording the thread and function that holds an lock

What we’d want, is to know what threads are holding a lock.  So we will
  implement a macro shadowing  BT:WITH-LOCK-HELD, to record that
  information into a weak hash-table.  Happily, ccl has native weak
  hash-tables so we don’t have to use
  com.informatimago.clext.closer-weak.
In addition to recording the current thread, we also get the name of
  the caller function from ccl:backtrace-as-list.
We use a PRINT-OBJECT :AROUND method to print the locking threads
  when it’s available
Then when the dead-lock occurs, we can have a look at the status of
  the various locks, and notice immediately our culprit stream lock that
  is held by not once but TWICE by the same thread!  ccl only has
  recursive locks, and bt:with-lock-held uses the native locking
  mechanism, which is a recursive lock.  But now, it is clear what
  functions are involved in this double locking and the solution will be
  obvious: split input-buffer-fetch-octet into an inner function that
  assumes the lock is already held, and use this in %stream-read-char.
  Problem solved.
(map nil 'print (com.informatimago.common-lisp.cesarum.utility:hash-table-entries *status*))

(#<recursive-lock "down-layer" [ptr @ #x605080] #x3020028D45FD>) 
(#<recursive-lock "Telnet REPL Server Lock" [ptr @ #x10D880] #x30200279729D>) 
(#<recursive-lock "telnet-stream" :status ((:locking 
                                            (funcall "#<STANDARD-METHOD COM.INFORMATIMAGO.CLEXT.TELNET.STREAM::INPUT-BUFFER-FETCH-OCTET (COM.INFORMATIMAGO.CLEXT.TELNET.STREAM:TELNET-STREAM T)>" "#<TELNET-STREAM  #x3020028D75CD>" "NIL")
                                            #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>)
                                           (:locking
                                            (com.informatimago.clext.telnet.stream::%stream-read-char "#<TELNET-STREAM #x3020028D75CD>" "NIL")
                                            #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>)) #x3020028D74BD> 
                  (:locking
                   (funcall "#<STANDARD-METHOD COM.INFORMATIMAGO.CLEXT.TELNET.STREAM::INPUT-BUFFER-FETCH-OCTET (COM.INFORMATIMAGO.CLEXT.TELNET.STREAM:TELNET-STREAM T)>" "#<TELNET-STREAM #x3020028D75CD>" "NIL")
                   #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>)
                  (:locking 
                   (com.informatimago.clext.telnet.stream::%stream-read-char "#<TELNET-STREAM #x3020028D75CD>" "NIL")
                   #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>)) nil

Naming Condition Variables

In addition, ccl condition-variables are not named;
  bordeaux-threads ignores the name parameter.  So we shadow
  make-condition-variable and record the name of the condition
  variables in a weak hash-table, and add a print-object :around
  method to print this name when available.  This is very convenient
  when inspecting threads, to see on what condition variable they’re
  actually waiting.