This is a very rough set of notes taken in 2022 to help engineers understand the logic as well as the nontrivial Haskell concepts (e.g. GHC extensions) embodied in that logic.
It is not an official guide from the Duckling authors, and there are likely at least a few errors.
These notes only cover the core Duckling engine, not any logic specific to a specific Dimension. Most dimensions are relatively straightforward, but some are quite complicated - particularly Time - and these notes do not cover those.
Let's trace a request in debug mode, something like
debug <some_locale> "some text" [Seal <SomeDimension>]
--
-- locale input targets
--
-- ^ these are the arg names, sort of (they change unnecessarily as we
-- go down the stack, I'm planning to put a commit out unifying them where
-- I can soon)
debug
is reallydebugCustom
, and it- converts the Seal Dimension list into a hash set
- calls
analyze sentence context options <that_hash_set>
- calls
debugTokens sentence <output_of_analyze>
debugTokens
isn't critical to understand for core duckling- it maps
formatToken sentence
over the tokens from analyze - and it calls
mapM_ (ptree sentence)
over that (in an IO)ptree
is really just recursivepnode
, an easy skim
- it further returns the formatted tokens to the caller
- it maps
analyze
fromApi.hs
is the real entrypoint- type:
Text -> Context -> Options -> HashSet (Seal Dimension) -> [ResolvedToken]
- type:
- quick note about some other endpoints that call
analyze
:- the
parse
function is similar todebugCustom
; like it it takes the Seal Dimension collection as a list not a set and callsformatToken
to convert theResolvedTokens
intoEntity
s - the webserver from
exe/ExampleMain.hs
callsparse
from its handler
- the
- back to
analyze
, here's what it does:- calls
rulesFor locale targets
to get a[Rule]
list - calls
parseAndResolve <that_rule_list> input
to get a[ResolvedToken]
- filters that list to just appropriate dimensions
- calls
rank
on the results, which performs naive Bayes ranking
- calls
Some of Duckling's logic involves types that aren't vanilla Haskell - even if you're used to Sigma-flavor code they could be nontrivial to understand - so let's walk through them.
Dimension
is a GADT that wraps custom dimensions (which have to implement theCustomDimension
typeclass declaring dependent dimensions, how to findRule
s, etc). It's hashable, although interestingly I think the way this is done may only allow one CustomDimension at a time!- There's also a
GEq
instance, which I think is a generalized equality and uses some interesting type magic... not critical to understand now
- There's also a
Token
binds together aDimension a
and ana
- the
a
here will be a "data" type for a dim, e.g.TimeData
,DurationData
- it has to implement the
Resolve
typeclass inResolve.hs
, which defines- defines a
ResolvedValue a
type (this is usingTypeFamilies
; it's not really a polymorphic type, it's a type-level function, e.g.ResolvedValue AmountOfMondeyData = AmountOfMoneyValue
) resove :: Context -> Options -> a -> Maybe (ResolvedValue a, Bool)
- here the
Bool
is an is_latent flag, see TODO above
- here the
- defines a
- the
ResolvedVal
binds together (in a type which itself is not polymorphic) a(Dimension a)
and a(ResolvedValue a)
... this is a nice example of whereforall
is helpful; theEq
instance is also instructive- Note:
ResolvedValue a
type isn't a vanilla polymorphic type, it's a type family type defined as part of theResolve
typeclass going on here with a language extension I don't know about
- Note:
- a
Node
is aRange
,Token
,children :: [Node]
, andrule :: Maybe Text
- a
ResolvedToken
is aRange
,Node
,ResolvedVal
andisLatent :: bool
- it's basically a
Node
withToken
converted toResolvedVal
, but it still keeps track of the originalNode
- it's basically a
- a
Candidate
is a tripleCandidate ResolvedToken Double Bool
- this is used for ranking, the Double is a score and the Bool indicates whether this is a dimension explicitly requested by the caller
- a
Range Int Int
is ... well, a range - the types that go into rules. Thse are familiar if you've done "user-land" Duckling:
type Production = [Token] -> Maybe Token
type Predicate = [Token] -> Bool
data PatternItem = Regex PCRE.Regex | Predicate Predicate
type Pattern = [PatternItem]
- a
Rule
is a name, Pattern, and Production
- an
Entity
is basically just a serializableResolvedVal
...- the
ResolvedVal
is actually still nested inside it - but all the important fields are extracted and flattened
- and we also convert the range into the actual matching text
- see the
formatToken
function inApi.hs
that we already looked at - I think this is what's actually exposed via json but it's not used by the core engine
- the
- two of the
PatternItem
-producing helper functions live here- the
regex
helper function that returns aRegex
value - the
dimension
helper function makes aPredicate
for whether aToken
matches aDimension
- the
Let's recap the flow concisely, because having this in mind makes the Engine.hs
code much easier to follow:
- a
Token
is the raw output of aProduction
, it has a dimension and a value of the "data" type - a
Node
wraps aToken
in metadata, including potentially children (soNode
s can form a tree) - a
ResolvedToken
comes from converting aNode
to aResolvedVal
- a
Candidate
wraps aResolvedToken
with some classifier-related metadata for ranking - an
Entity
is a nicely-serializable value computed from aResolvedToken
Optional further reading on the not-so-vanilla-Haskell things here:
- For the
GADT
andTypeFamilies
magic going on inDimension
andResolve
:- wikipedia has a quick intro, and the ghc wiki has painful details
- Here's a concise step-by-step walkthroughl of several related GHC extensions, and various ways to use type families
- In Duckling using the simple "type synonym" flavor of type families described here
- Here's a longer guide with more motivation, but I haven't read it fully yet
- NFData, which almost all our basic token/node/etc values either implement or derive,
is all about controlling strictness. You don't need to fully understand this to follow the Duckling core logic, but it's helps to have a vague idea what the
seq
anddeepseq
andforce
functions randomly popping up in our code do.- FP Complete put out a lengthy blog about laziness, strictness, and NFData that aims to be beginner friendly
Unlike the ones discussed above, these are all internal to the Engine
code (they aren't exposed when writing new rules) and they're all pretty vanilla:
Stash
defined in Duckling.Types.Stash.hs
, newtypes a Data.IntMap.Strict.IntMap (HashSet Node)
- each key is a start position
- the nodes form a HashSet because multiple matches can happen from the same position
Document
defined in Duckling.Types.Document.hs
, converts Text to bytes and wraps with metadata
Duckling a
= Identity a
- also
runDuckling ma = runIdentity ma
- you can think of a
Duckling a
as just ana
but with alternative syntax via a dummy monad
type Match = (Rule, Int, [Node])
This is a partially-matched rule
- what is in the
Rule
?- recall that a
Rule
consists of aname, pattern, production
- Inside the engine, what we do is we match one
PatternItem
at a time, and we make a newRule
whosepattern
is one item shorter. The rule only ever produces a match (in which case theproduction
) gets called - thats what the
Rule
in aMatch
is: it's a partially-matchedRule
from the original DSL, whith only the trailing so-far-unmatchedPatternItem
s still left inpattern
- recall that a
- what is the
Int
?- It's the end position of the text matched so far (by the
Node
s that we've already matched from the front ofpattern
) - Remember that the
Stash
indexesNode
s by start position. This means we can line up partial matches against potential next tokens by lining up end position and start position
- It's the end position of the text matched so far (by the
- what is the
[Node]
?- It's the tokens matched by whatever
[PatternItems]
we've already popped off the front ofRule.pattern
. We build it back to front, so it's in reverse order. We'll need this so that we can extractToken
values in order to run theproduction
whenever aRule
is fully-matched.
- It's the tokens matched by whatever
The core of Engine is the parseString1
function that gets called inside a recursive loop to build parse trees. There are three high-level plumbing functions that create the recursive loop to drive it.
parseAndResolve : (rules: [Rule]) (input: Text) context options -> [ResolvedToken]
- calls
Document.fromText input
to get aDocument
. - calls
parseString rules <that_document>
to get aDuckling Stash
- remember
Duckling = Identity
, so really this is just aStash ~= IntMap (HashSet Node)
- remember
- calls
runDuckling
on the resultingStash
, which just extracts theStash
inside - calls
toPosOrderedList
to convert theStash
to a[Node]
, and callsforce
which is anNFData
+ lazy thing that probably matters to the performance but not business logic - calls
mapMaybe (resolveNode context options)
on the resulting datamapMaybe
is becauseresolveNode
returns an option, it can decline to resolve aNode
- this is where the
Node
toResolvedToken
(hence also theToken
toResolvedVal
) happens; it's where theTime
callbacks get called and latent flags get extracted, etc - Note that we don't use
context
andoptions
until this stage:Rule
s are context-free.
- the end result is a
[ResolvedToken]
of nodes that resolved correctly. Recall that this list can have irrelevant values in it, we filter by dimension in theApi.hs
layer
parseString :: [Rule] -> Document -> Duckling Stash
builds up a stash of Nodes
- it begins with a call to
parseString1
to get anew :: Stash
andpartialMatches :: [Match]
- if that's empty it aborts, otherwise it calls
saturateParseString
with theRule
s that start with a Predicate to recursively build a stash of all nested parse trees
saturateParseString :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling Stash
the recursive loop
- this calls
parseString1 rules sentence stash new matches
to get(new', matches')
- Base case: if
new'
is null, we returnstash
- Recursive case: tail call with
stash' = Stash.union stash new'
,new'
,matches'
- ... Some notes on the invariants and why we need each argument:
- In the initial call from
parseAndResolve
we use the initialparseString1
output for bothstash
andnew
, but in general they are actually quite different:stash
is everything so far,new
is just the stuff from the last iteration. - Why do we need both? Because in order to efficiently compute new
matches we shoudl only check where
new
is relevant, not all ofstash
- In the initial call from
I suggest reviewing the discussion of type Match = (Rule, Int, [Node])
above real quick before you read this. Here's the signature:
parseString1 :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling (Stash, [Match])
Here's a quick rundown on the non-obvious arguments:
- the first Stash
is stash
and contains all of the Node
s we've computes so far
- the second Stash
is new
and contains just newly-added Nodes
from last iteration
- the [Match]
list is a monotonically-growing list of partial matches which are waiting
for a new Node
to match the next Predicate (PatternItem)
in the rule
- the output will be a Stash
of new fully-matched Node
s and a possibly-larger
collection matches :: [Match]
of partial matches
How does it work?
- At each stage, what we have is
- a list of
Match
values - partially matched Rules from the previous iteration - a bag of brand-new fully-matched
Node
s (which containToken
s) from the previous iteration
- a list of
- So, there are two ways to get new partial matches:
- we might be able to advance a
Match
from before by one more step. Note that in this case thePatternItem
we match against will be contained within theMatch
. This is whatmatchFirst
does. - or, we might be able to start a new
Match
by matching the very firstPatternItem
in one of the original DSLRules
, due to a newly-availableToken
coming from aNode
that was fully-matched in the previous iteration. This is whatmatchFirstAnywhere
does.
- we might be able to advance a
- Then, there are some efficiency issues:
- we want to only use
new
Nodes when checking for the above - so, once we've found all matches that can advance using the above logic we want to "run them to completion" against the existing stash (and evaluating Regex PatternItems we come across as we go) before we move on to the next cycle, that way we can check only new nodes rather than the full stash in the next cycle
- moreover, anything that hits a
Regex
PatternItem
that fails to match can be discarded - newNode
s can't possibly turn it into a complete match - this is what the
matchAll
function does
- we want to only use
- Once all that is done, we have a final set of partial
Match
values using theNode
s computed thus far. We're ready to compute newNode
s and iterate:- we check for any partial
Match
that is now complete, and runrule.production
on it; if aToken
is returned we convert it to aNode
- all the rest of our matches get added to the pool of potential matches. Note that we
cannot discard the old matches because, even if they didn't get advanced this cycle,
they might in the next cycle due to new
Node
s we just computed.
- we check for any partial
Our classifier is a naive Bayes classifier.
What this means is that
- we precompute a log-likelihood from the corpus for each Rule - that is, a rule that
can produce a single
Token
- to produce a log likelihood for a full
ResolvedToken
, we:- compute "features" from the associated
Node
; there are two potential features- the concatenation of the rule names of all direct children
- the concatenation of all grains of all direct children (for some specific dimension types)
- ... note that the top level
Rule
name is actually not a feature, this surprised me but I think it's because it's actually the mapping from children to parent we want to score
- sum up the log-likelihood for each feature
- sum up the result from doing ^ for all recursive
children
of theNode
- compute "features" from the associated
The final result of this is a Candidate
, and then we take all the top-ranked Candidate
(according to the Ord
instance defined on Candidate
The logic for this lives in
Extraction.hs
, which has the feature extraction codeRank.hs
, which does the actual scoring and the recursion, as well as the final ranking