Notes on Duckling internals

About this guide

This is a very rough set of notes, taken in 2022, to help engineers understand the core Duckling logic as well as the nontrivial Haskell concepts (e.g. GHC extensions) embodied in it.

It is not an official guide from the Duckling authors, and there are likely at least a few errors.

These notes only cover the core Duckling engine, not the logic specific to any particular Dimension. Most dimensions are relatively straightforward, but some - particularly Time - are quite complicated, and these notes do not cover those.

The layers we interact with: debug and analyze

Let's trace a request in debug mode, something like

debug <some_locale> "some text" [Seal <SomeDimension>]
--
--              locale             input           targets
--
-- ^ these are the arg names, sort of (they change unnecessarily as we
-- go down the stack; I'm planning to put out a commit soon unifying them
-- where I can)
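
For concreteness, a call from GHCi might look like the following. This is a hypothetical session, not from the source: I'm assuming debug from Duckling.Debug with makeLocale and EN coming from Duckling.Locale.

debug (makeLocale EN Nothing) "tomorrow at 6pm" [Seal Time]
-- prints a parse tree per match and returns the formatted entities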
  • debug is really debugCustom, and it
    • converts the Seal Dimension list into a hash set
    • calls analyze sentence context options <that_hash_set>
    • calls debugTokens sentence <output_of_analyze>
  • debugTokens isn't critical to understand for core Duckling
    • it maps formatToken sentence over the tokens from analyze
    • and it calls mapM_ (ptree sentence) over that (in an IO)
      • ptree is really just recursive pnode, an easy skim
    • it also returns the formatted tokens to the caller
  • analyze from Api.hs is the real entrypoint
    • type: Text -> Context -> Options -> HashSet (Seal Dimension) -> [ResolvedToken]
  • quick note about some other endpoints that call analyze:
    • the parse function is similar to debugCustom; like it, it takes the Seal Dimension collection as a list rather than a set, and it calls formatToken to convert the ResolvedTokens into Entitys
    • the webserver from exe/ExampleMain.hs calls parse from its handler
  • back to analyze, here's what it does (sketched in code after this list):
    • calls rulesFor locale targets to get a [Rule] list
    • calls parseAndResolve <that_rule_list> input to get a [ResolvedToken]
    • filters that list down to just the requested dimensions
    • calls rank on the results, which performs naive Bayes ranking
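
Putting those steps together, analyze is roughly the following composition. This is a paraphrase of Api.hs, not the exact code: the arguments to rank are my guess, and wantedDimension is a hypothetical stand-in for the real filter predicate.

analyze :: Text -> Context -> Options -> HashSet (Seal Dimension) -> [ResolvedToken]
analyze input context options targets =
  rank (classifiers (locale context)) targets    -- naive Bayes ranking
    . filter (wantedDimension targets)           -- keep only requested dimensions
    $ parseAndResolve (rulesFor (locale context) targets) input context options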

Before looking at Engine.hs, let's look at some types

Some of Duckling's logic involves types that aren't vanilla Haskell - even if you're used to Sigma-flavor code, they can be nontrivial to understand - so let's walk through them.

Core dimension / token / entity types from Types.hs and Resolve.hs

  • Dimension is a GADT that wraps custom dimensions (which have to implement the CustomDimension typeclass declaring dependent dimensions, how to find Rules, etc). It's hashable, although interestingly I think the way this is done may only allow one CustomDimension at a time!
    • There's also a GEq instance, which I think is a generalized equality and uses some interesting type magic... not critical to understand now
  • Token binds together a Dimension a and an a
    • the a here will be a "data" type for a dim, e.g. TimeData, DurationData
    • it has to implement the Resolve typeclass in Resolve.hs (see the sketch after this list), which defines
      • a ResolvedValue a type (this is using TypeFamilies; it's not really a polymorphic type, it's a type-level function, e.g. ResolvedValue AmountOfMoneyData = AmountOfMoneyValue)
      • resolve :: Context -> Options -> a -> Maybe (ResolvedValue a, Bool)
        • here the Bool is an is_latent flag
  • ResolvedVal binds together (in a type which itself is not polymorphic) a (Dimension a) and a (ResolvedValue a)... this is a nice example of where forall is helpful; the Eq instance is also instructive
    • Note: ResolvedValue a isn't a vanilla polymorphic type; it's a type family defined as part of the Resolve typeclass (again via the TypeFamilies language extension)
  • a Node is a Range, Token, children :: [Node], and rule :: Maybe Text
  • a ResolvedToken is a Range, a Node, a ResolvedVal, and isLatent :: Bool
    • it's basically a Node with Token converted to ResolvedVal, but it still keeps track of the original Node
  • a Candidate is a triple Candidate ResolvedToken Double Bool
    • this is used for ranking, the Double is a score and the Bool indicates whether this is a dimension explicitly requested by the caller
  • a Range Int Int is ... well, a range
  • the types that go into rules. These are familiar if you've done "user-land" Duckling:
    • type Production = [Token] -> Maybe Token
    • type Predicate = [Token] -> Bool
    • data PatternItem = Regex PCRE.Regex | Predicate Predicate
    • type Pattern = [PatternItem]
    • a Rule is a name, Pattern, and Production
  • an Entity is basically just a serializable ResolvedVal...
    • the ResolvedVal is actually still nested inside it
    • but all the important fields are extracted and flattened
    • and we also convert the range into the actual matching text
    • see the formatToken function in Api.hs that we already looked at
    • I think this is what's actually exposed via JSON, but it's not used by the core engine
  • two of the PatternItem-producing helper functions live here
    • the regex helper function that returns a Regex value
    • the dimension helper function makes a Predicate for whether a Token matches a Dimension
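
To make the TypeFamilies/forall machinery concrete, here's a hedged sketch of the pieces described above, paraphrased from Types.hs and Resolve.hs; details may differ slightly from the real source.

class Resolve a where
  type ResolvedValue a
  -- ^ TypeFamilies: a type-level function, e.g.
  --   ResolvedValue AmountOfMoneyData = AmountOfMoneyValue
  resolve :: Context -> Options -> a -> Maybe (ResolvedValue a, Bool)
  -- ^ the Bool is the is-latent flag

data Token = forall a . Resolve a => Token (Dimension a) a
-- ^ (the real version has a few more constraints on the a)

data ResolvedVal = forall a . Resolve a => RVal (Dimension a) (ResolvedValue a)
-- ^ the forall hides the a, so ResolvedVal itself is not polymorphic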

Quick recap of the flow

Let's recap the flow concisely, because having this in mind makes the Engine.hs code much easier to follow (a consolidated type sketch follows the list):

  • a Token is the raw output of a Production, it has a dimension and a value of the "data" type
  • a Node wraps a Token in metadata, including potentially children (so Nodes can form a tree)
  • a ResolvedToken comes from converting a Node to a ResolvedVal
  • a Candidate wraps a ResolvedToken with some classifier-related metadata for ranking
  • an Entity is a nicely-serializable value computed from a ResolvedToken
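
In code, the chain above looks roughly like this. It's a paraphrase: the field names are from memory and may not match Types.hs exactly.

data Range = Range Int Int

data Node = Node
  { nodeRange :: Range
  , token     :: Token      -- the raw output of a Production
  , children  :: [Node]     -- Nodes form parse trees
  , rule      :: Maybe Text -- name of the rule that produced this Node
  }

data ResolvedToken = Resolved
  { range    :: Range
  , node     :: Node        -- the original Node is kept around
  , rval     :: ResolvedVal -- the Token, resolved
  , isLatent :: Bool
  }

data Candidate = Candidate ResolvedToken Double Bool
-- the score, and whether the dimension was explicitly requested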

Further Reading

Optional further reading on the not-so-vanilla-Haskell things here:

  • For the GADT and TypeFamilies magic going on in Dimension and Resolve, the GHC users guide sections on GADTs and type families are a good starting point
  • NFData, which almost all our basic token/node/etc values either implement or derive, is all about controlling strictness. You don't need to fully understand this to follow the Duckling core logic, but it helps to have a vague idea of what the seq and deepseq and force functions randomly popping up in our code do. A tiny example of the difference:
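
import Control.DeepSeq (deepseq)

xs :: [Int]
xs = [1, undefined]

okay = xs `seq` length xs
-- fine: seq only evaluates the outermost constructor, and length never
-- looks at the elements

-- boom = xs `deepseq` length xs
-- would throw: deepseq evaluates all the way down, hitting the undefined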

Engine.hs

Let's talk about a few more types used in the Engine

Unlike the ones discussed above, these are all internal to the Engine code (they aren't exposed when writing new rules) and they're all pretty vanilla:

Stash defined in Duckling.Types.Stash.hs, newtypes a Data.IntMap.Strict.IntMap (HashSet Node)

  • each key is a start position
  • the nodes form a HashSet because multiple matches can happen from the same position
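
In code, roughly (paraphrasing Duckling.Types.Stash):

newtype Stash = Stash (IntMap (HashSet Node))
-- key: a start position in the document
-- value: every Node whose match begins at that position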

Document defined in Duckling.Types.Document.hs, converts Text to bytes and wraps with metadata

Duckling a = Identity a

  • also runDuckling ma = runIdentity ma
  • you can think of a Duckling a as just an a but with alternative syntax via a dummy monad

type Match = (Rule, Int, [Node]) This is a partially-matched rule

  • what is in the Rule?
    • recall that a Rule consists of a name, pattern, production
    • Inside the engine, what we do is match one PatternItem at a time, making a new Rule whose pattern is one item shorter. The production only ever gets called once the pattern is fully matched.
    • that's what the Rule in a Match is: a partially-matched Rule from the original DSL, with only the trailing so-far-unmatched PatternItems still left in pattern
  • what is the Int?
    • It's the end position of the text matched so far (by the Nodes that we've already matched from the front of pattern)
    • Remember that the Stash indexes Nodes by start position. This means we can line up partial matches against potential next tokens by lining up end position and start position
  • what is the [Node]?
    • It's the tokens matched by whatever PatternItems we've already popped off the front of Rule.pattern. We build it back to front, so it's in reverse order. We'll need it to extract Token values in order to run the production whenever a Rule is fully matched (see the sketch after this list).
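
Here's a hedged reconstruction of how a fully-matched Match fires its production. Engine.hs has a produce function along these lines, but the real code differs in detail; this version reuses the Node fields sketched earlier.

produce :: Match -> Maybe Node
produce (Rule name [] production, _, nodesRev) = do
  let nodes = reverse nodesRev            -- remember: accumulated back to front
      Range start _ = nodeRange (head nodes)
      Range _ end   = nodeRange (last nodes)
  tok <- production (map token nodes)     -- the production may still decline
  Just (Node (Range start end) tok nodes (Just name))
produce _ = Nothing                       -- pattern not fully consumed yet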

The high-level logic: the entrypoint and recursive loop

The core of Engine is the parseString1 function that gets called inside a recursive loop to build parse trees. There are three high-level plumbing functions that create the recursive loop to drive it.

parseAndResolve : (rules: [Rule]) (input: Text) context options -> [ResolvedToken]

  • calls Document.fromText input to get a Document.
  • calls parseString rules <that_document> to get a Duckling Stash
    • remember Duckling = Identity, so really this is just a Stash ~= IntMap (HashSet Node)
  • calls runDuckling on the resulting Stash, which just extracts the Stash inside
  • calls toPosOrderedList to convert the Stash to a [Node], and calls force, which is an NFData/laziness thing that probably matters for performance but not for business logic
  • calls mapMaybe (resolveNode context options) on the resulting data
    • mapMaybe is because resolveNode returns a Maybe; it can decline to resolve a Node
    • this is where the Node to ResolvedToken conversion (and hence also the Token to ResolvedVal conversion) happens; it's where the Time callbacks get called, latent flags get extracted, etc.
    • Note that we don't use context and options until this stage: Rules are context-free.
  • the end result is a [ResolvedToken] of nodes that resolved correctly. Recall that this list can still contain irrelevant values; we filter by dimension in the Api.hs layer. A sketch of the whole pipeline follows.
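
As code, the pipeline is roughly the following. This paraphrases Engine.hs; the force call and other strictness details are elided.

parseAndResolve :: [Rule] -> Text -> Context -> Options -> [ResolvedToken]
parseAndResolve rules input context options =
  mapMaybe (resolveNode context options)
    . Stash.toPosOrderedList
    . runDuckling
    $ parseString rules (Document.fromText input)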

parseString :: [Rule] -> Document -> Duckling Stash builds up a stash of Nodes

  • it begins with a call to parseString1 to get a new :: Stash and partialMatches :: [Match]
  • if that's empty it aborts; otherwise it calls saturateParseString with the Rules that start with a Predicate, to recursively build a stash of all nested parse trees

saturateParseString :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling Stash the recursive loop

  • this calls parseString1 rules sentence stash new matches to get (new', matches')
  • Base case: if new' is null, we return stash
  • Recursive case: tail call with stash' = Stash.union stash new', new', matches'
  • ... Some notes on the invariants and why we need each argument:
    • In the initial call from parseAndResolve we use the initial parseString1 output for both stash and new, but in general they are quite different: stash is everything so far, new is just the output of the last iteration.
    • Why do we need both? Because to compute new matches efficiently we should only check against new, not all of stash (see the sketch below)
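
In code, the loop is roughly the following paraphrase; the real version is a little more careful about which rules it passes down.

saturateParseString :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling Stash
saturateParseString rules sentence stash new matches = do
  (new', matches') <- parseString1 rules sentence stash new matches
  if Stash.null new'
    then return stash                                -- fixed point reached
    else saturateParseString rules sentence
           (Stash.union stash new') new' matches'    -- fold new' in and loop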

The core logic: what's going on in parseString1 in each loop iteration?

I suggest reviewing the discussion of type Match = (Rule, Int, [Node]) above real quick before you read this. Here's the signature:

parseString1 :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling (Stash, [Match])

Here's a quick rundown on the non-obvious arguments:

  • the first Stash is stash and contains all of the Nodes we've computed so far
  • the second Stash is new and contains just the newly-added Nodes from the last iteration
  • the [Match] list is a monotonically-growing list of partial matches which are waiting for a new Node to match the next Predicate (PatternItem) in the rule
  • the output will be a Stash of new fully-matched Nodes and a possibly-larger collection matches :: [Match] of partial matches

How does it work?

  • At each stage, what we have is
    • a list of Match values - partially matched Rules from the previous iteration
    • a bag of brand-new fully-matched Nodes (which contain Tokens) from the previous iteration
  • So, there are two ways to get new partial matches:
    • we might be able to advance a Match from before by one more step. Note that in this case the PatternItem we match against will be contained within the Match. This is what matchFirst does.
    • or, we might be able to start a new Match by matching the very first PatternItem in one of the original DSL Rules, due to a newly-available Token coming from a Node that was fully-matched in the previous iteration. This is what matchFirstAnywhere does.
  • Then, there are some efficiency issues:
    • we want to only use new Nodes when checking for the above
    • so, once we've found all matches that can advance using the above logic, we want to "run them to completion" against the existing stash (evaluating any Regex PatternItems we come across as we go) before we move on to the next cycle; that way we can check only new nodes, rather than the full stash, in the next cycle
    • moreover, anything that hits a Regex PatternItem that fails to match can be discarded - new Nodes can't possibly turn it into a complete match
    • this is what the matchAll function does
  • Once all that is done, we have a final set of partial Match values using the Nodes computed thus far. We're ready to compute new Nodes and iterate:
    • we check for any partial Match that is now complete, and run rule.production on it; if a Token is returned we convert it to a Node
    • all the rest of our matches get added to the pool of potential matches. Note that we cannot discard the old matches: even if they didn't get advanced this cycle, they might in the next cycle thanks to the new Nodes we just computed. (A loose sketch of a full iteration follows.)
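
Putting it all together, one iteration looks loosely like this. matchFirst, matchFirstAnywhere and matchAll are real Engine.hs functions, but their signatures and all of the glue here are a hypothetical reconstruction (as is Stash.fromPosList); produce is the sketch from earlier.

parseString1 rules sentence stash new matches = do
  let -- (1) advance existing partial matches one step, against new Nodes only
      advanced = concatMap (matchFirst sentence new) matches
      -- (2) start fresh matches where a rule's first PatternItem hits a new Token
      started  = concatMap (matchFirstAnywhere sentence new) rules
      -- (3) run these against the full stash, evaluating Regex items as we go
      --     and discarding anything whose Regex fails
      survivors = concatMap (matchAll sentence stash) (advanced ++ started)
      -- (4) complete matches (empty remaining pattern) mint new Nodes
      newNodes  = mapMaybe produce [m | m@(Rule _ [] _, _, _) <- survivors]
  -- (5) keep every partial match around for future cycles
  return (Stash.fromPosList newNodes, survivors ++ matches)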

How does classifier ranking work?

Our classifier is a naive Bayes classifier.

What this means is that

  • we precompute a log-likelihood from the corpus for each Rule - that is, a rule that can produce a single Token
  • to produce a log likelihood for a full ResolvedToken, we:
    • compute "features" from the associated Node; there are two potential features
      • the concatenation of the rule names of all direct children
      • the concatenation of all grains of all direct children (for some specific dimension types)
      • ... note that the top level Rule name is actually not a feature, this surprised me but I think it's because it's actually the mapping from children to parent we want to score
    • sum up the log-likelihood for each feature
    • sum up the result from doing ^ for all recursive children of the Node

The final result of this is a Candidate; we then take the top-ranked Candidates (according to the Ord instance defined on Candidate). Roughly, the scoring recursion looks like the following sketch (hypothetical helper names; the real code lives in the modules listed below):
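
score :: Classifiers -> Node -> Double
score classifiers node =
  sum (map (featureLogLikelihood classifiers node) (extractFeatures node))
    + sum (map (score classifiers) (children node))
-- featureLogLikelihood and extractFeatures are hypothetical names for the
-- per-feature lookup and the two feature extractors described above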

The logic for this lives in

  • Extraction.hs, which has the feature extraction code
  • Rank.hs, which does the actual scoring and the recursion, as well as the final ranking