Notes on Duckling internals

About this guide

This is a very rough set of notes, taken in 2022, to help engineers understand the core Duckling logic as well as the nontrivial Haskell concepts (e.g. GHC extensions) embodied in it.

It is not an official guide from the Duckling authors, and there are likely at least a few errors.

These notes only cover the core Duckling engine, not the logic specific to any particular Dimension. Most dimensions are relatively straightforward, but some - particularly Time - are quite complicated, and these notes do not cover those.

The layers we interact with: debug and analyze

Let's trace a request in debug mode, something like

debug <some_locale> "some text" [Seal <SomeDimension>]
--
--              locale             input           targets
--
-- ^ these are the arg names, sort of (they change unnecessarily as we
-- go down the stack; I'm planning to put out a commit soon unifying them
-- where I can)
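
For concreteness, a call from GHCi might look like the following. This is a hypothetical session, not from the source: I'm assuming debug from Duckling.Debug with makeLocale and EN coming from Duckling.Locale.

debug (makeLocale EN Nothing) "tomorrow at 6pm" [Seal Time]
-- prints a parse tree per match and returns the formatted entities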
  • debug is really debugCustom, and it
    • converts the Seal Dimension list into a hash set
    • calls analyze sentence context options <that_hash_set>
    • calls debugTokens sentence <output_of_analyze>
  • debugTokens isn't critical to understand for core Duckling
    • it maps formatToken sentence over the tokens from analyze
    • and it calls mapM_ (ptree sentence) over that (in an IO)
      • ptree is really just recursive pnode, an easy skim
    • it also returns the formatted tokens to the caller
  • analyze from Api.hs is the real entrypoint
    • type: Text -> Context -> Options -> HashSet (Seal Dimension) -> [ResolvedToken]
  • quick note about some other endpoints that call analyze:
    • the parse function is similar to debugCustom; like it, it takes the Seal Dimension collection as a list rather than a set, and it calls formatToken to convert the ResolvedTokens into Entitys
    • the webserver from exe/ExampleMain.hs calls parse from its handler
  • back to analyze, here's what it does (sketched in code after this list):
    • calls rulesFor locale targets to get a [Rule] list
    • calls parseAndResolve <that_rule_list> input to get a [ResolvedToken]
    • filters that list down to just the requested dimensions
    • calls rank on the results, which performs naive Bayes ranking
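
Putting those steps together, analyze is roughly the following composition. This is a paraphrase of Api.hs, not the exact code: the arguments to rank are my guess, and wantedDimension is a hypothetical stand-in for the real filter predicate.

analyze :: Text -> Context -> Options -> HashSet (Seal Dimension) -> [ResolvedToken]
analyze input context options targets =
  rank (classifiers (locale context)) targets    -- naive Bayes ranking
    . filter (wantedDimension targets)           -- keep only requested dimensions
    $ parseAndResolve (rulesFor (locale context) targets) input context options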

Before looking at Engine.hs, let's look at some types

Some of Duckling's logic involves types that aren't vanilla Haskell - even if you're used to Sigma-flavor code, they can be nontrivial to understand - so let's walk through them.

Core dimension / token / entity types from Types.hs and Resolve.hs

  • Dimension is a GADT that wraps custom dimensions (which have to implement the CustomDimension typeclass declaring dependent dimensions, how to find Rules, etc). It's hashable, although interestingly I think the way this is done may only allow one CustomDimension at a time!
    • There's also a GEq instance, which I think is a generalized equality and uses some interesting type magic... not critical to understand now
  • Token binds together a Dimension a and an a
    • the a here will be a "data" type for a dim, e.g. TimeData, DurationData
    • it has to implement the Resolve typeclass in Resolve.hs (see the sketch after this list), which defines
      • a ResolvedValue a type (this is using TypeFamilies; it's not really a polymorphic type, it's a type-level function, e.g. ResolvedValue AmountOfMoneyData = AmountOfMoneyValue)
      • resolve :: Context -> Options -> a -> Maybe (ResolvedValue a, Bool)
        • here the Bool is an is_latent flag
  • ResolvedVal binds together (in a type which itself is not polymorphic) a (Dimension a) and a (ResolvedValue a)... this is a nice example of where forall is helpful; the Eq instance is also instructive
    • Note: ResolvedValue a isn't a vanilla polymorphic type; it's a type family defined as part of the Resolve typeclass (again via the TypeFamilies language extension)
  • a Node is a Range, Token, children :: [Node], and rule :: Maybe Text
  • a ResolvedToken is a Range, a Node, a ResolvedVal, and isLatent :: Bool
    • it's basically a Node with Token converted to ResolvedVal, but it still keeps track of the original Node
  • a Candidate is a triple Candidate ResolvedToken Double Bool
    • this is used for ranking, the Double is a score and the Bool indicates whether this is a dimension explicitly requested by the caller
  • a Range Int Int is ... well, a range
  • the types that go into rules. These are familiar if you've done "user-land" Duckling:
    • type Production = [Token] -> Maybe Token
    • type Predicate = [Token] -> Bool
    • data PatternItem = Regex PCRE.Regex | Predicate Predicate
    • type Pattern = [PatternItem]
    • a Rule is a name, Pattern, and Production
  • an Entity is basically just a serializable ResolvedVal...
    • the ResolvedVal is actually still nested inside it
    • but all the important fields are extracted and flattened
    • and we also convert the range into the actual matching text
    • see the formatToken function in Api.hs that we already looked at
    • I think this is what's actually exposed via JSON, but it's not used by the core engine
  • two of the PatternItem-producing helper functions live here
    • the regex helper function that returns a Regex value
    • the dimension helper function makes a Predicate for whether a Token matches a Dimension
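
To make the TypeFamilies/forall machinery concrete, here's a hedged sketch of the pieces described above, paraphrased from Types.hs and Resolve.hs; details may differ slightly from the real source.

class Resolve a where
  type ResolvedValue a
  -- ^ TypeFamilies: a type-level function, e.g.
  --   ResolvedValue AmountOfMoneyData = AmountOfMoneyValue
  resolve :: Context -> Options -> a -> Maybe (ResolvedValue a, Bool)
  -- ^ the Bool is the is-latent flag

data Token = forall a . Resolve a => Token (Dimension a) a
-- ^ (the real version has a few more constraints on the a)

data ResolvedVal = forall a . Resolve a => RVal (Dimension a) (ResolvedValue a)
-- ^ the forall hides the a, so ResolvedVal itself is not polymorphic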

Quick recap of the flow

Let's recap the flow concisely, because having this in mind makes the Engine.hs code much easier to follow (a consolidated type sketch follows the list):

  • a Token is the raw output of a Production, it has a dimension and a value of the "data" type
  • a Node wraps a Token in metadata, including potentially children (so Nodes can form a tree)
  • a ResolvedToken comes from converting a Node to a ResolvedVal
  • a Candidate wraps a ResolvedToken with some classifier-related metadata for ranking
  • an Entity is a nicely-serializable value computed from a ResolvedToken
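
In code, the chain above looks roughly like this. It's a paraphrase: the field names are from memory and may not match Types.hs exactly.

data Range = Range Int Int

data Node = Node
  { nodeRange :: Range
  , token     :: Token      -- the raw output of a Production
  , children  :: [Node]     -- Nodes form parse trees
  , rule      :: Maybe Text -- name of the rule that produced this Node
  }

data ResolvedToken = Resolved
  { range    :: Range
  , node     :: Node        -- the original Node is kept around
  , rval     :: ResolvedVal -- the Token, resolved
  , isLatent :: Bool
  }

data Candidate = Candidate ResolvedToken Double Bool
-- the score, and whether the dimension was explicitly requested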

Further Reading

Optional further reading on the not-so-vanilla-Haskell things here:

  • For the GADT and TypeFamilies magic going on in Dimension and Resolve, the GHC users guide sections on GADTs and type families are a good starting point
  • NFData, which almost all our basic token/node/etc values either implement or derive, is all about controlling strictness. You don't need to fully understand this to follow the Duckling core logic, but it helps to have a vague idea of what the seq and deepseq and force functions randomly popping up in our code do. A tiny example of the difference:
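
import Control.DeepSeq (deepseq)

xs :: [Int]
xs = [1, undefined]

okay = xs `seq` length xs
-- fine: seq only evaluates the outermost constructor, and length never
-- looks at the elements

-- boom = xs `deepseq` length xs
-- would throw: deepseq evaluates all the way down, hitting the undefined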

Engine.hs

Let's talk about a few more types used in the Engine

Unlike the ones discussed above, these are all internal to the Engine code (they aren't exposed when writing new rules) and they're all pretty vanilla:

Stash defined in Duckling.Types.Stash.hs, newtypes a Data.IntMap.Strict.IntMap (HashSet Node)

  • each key is a start position
  • the nodes form a HashSet because multiple matches can happen from the same position
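
In code, roughly (paraphrasing Duckling.Types.Stash):

newtype Stash = Stash (IntMap (HashSet Node))
-- key: a start position in the document
-- value: every Node whose match begins at that position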

Document defined in Duckling.Types.Document.hs, converts Text to bytes and wraps with metadata

Duckling a = Identity a

  • also runDuckling ma = runIdentity ma
  • you can think of a Duckling a as just an a but with alternative syntax via a dummy monad

type Match = (Rule, Int, [Node]) This is a partially-matched rule

  • what is in the Rule?
    • recall that a Rule consists of a name, pattern, production
    • Inside the engine, what we do is match one PatternItem at a time, making a new Rule whose pattern is one item shorter. The production only ever gets called once the pattern is fully matched.
    • that's what the Rule in a Match is: a partially-matched Rule from the original DSL, with only the trailing so-far-unmatched PatternItems still left in pattern
  • what is the Int?
    • It's the end position of the text matched so far (by the Nodes that we've already matched from the front of pattern)
    • Remember that the Stash indexes Nodes by start position. This means we can line up partial matches against potential next tokens by lining up end position and start position
  • what is the [Node]?
    • It's the tokens matched by whatever PatternItems we've already popped off the front of Rule.pattern. We build it back to front, so it's in reverse order. We'll need it to extract Token values in order to run the production whenever a Rule is fully matched (see the sketch after this list).
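
Here's a hedged reconstruction of how a fully-matched Match fires its production. Engine.hs has a produce function along these lines, but the real code differs in detail; this version reuses the Node fields sketched earlier.

produce :: Match -> Maybe Node
produce (Rule name [] production, _, nodesRev) = do
  let nodes = reverse nodesRev            -- remember: accumulated back to front
      Range start _ = nodeRange (head nodes)
      Range _ end   = nodeRange (last nodes)
  tok <- production (map token nodes)     -- the production may still decline
  Just (Node (Range start end) tok nodes (Just name))
produce _ = Nothing                       -- pattern not fully consumed yet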

The high-level logic: the entrypoint and recursive loop

The core of Engine is the parseString1 function that gets called inside a recursive loop to build parse trees. There are three high-level plumbing functions that create the recursive loop to drive it.

parseAndResolve : (rules: [Rule]) (input: Text) context options -> [ResolvedToken]

  • calls Document.fromText input to get a Document.
  • calls parseString rules <that_document> to get a Duckling Stash
    • remember Duckling = Identity, so really this is just a Stash ~= IntMap (HashSet Node)
  • calls runDuckling on the resulting Stash, which just extracts the Stash inside
  • calls toPosOrderedList to convert the Stash to a [Node], and calls force, which is an NFData/laziness thing that probably matters for performance but not for business logic
  • calls mapMaybe (resolveNode context options) on the resulting data
    • mapMaybe is because resolveNode returns a Maybe; it can decline to resolve a Node
    • this is where the Node to ResolvedToken conversion (and hence also the Token to ResolvedVal conversion) happens; it's where the Time callbacks get called, latent flags get extracted, etc.
    • Note that we don't use context and options until this stage: Rules are context-free.
  • the end result is a [ResolvedToken] of nodes that resolved correctly. Recall that this list can still contain irrelevant values; we filter by dimension in the Api.hs layer. A sketch of the whole pipeline follows.
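
As code, the pipeline is roughly the following. This paraphrases Engine.hs; the force call and other strictness details are elided.

parseAndResolve :: [Rule] -> Text -> Context -> Options -> [ResolvedToken]
parseAndResolve rules input context options =
  mapMaybe (resolveNode context options)
    . Stash.toPosOrderedList
    . runDuckling
    $ parseString rules (Document.fromText input)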

parseString :: [Rule] -> Document -> Duckling Stash builds up a stash of Nodes

  • it begins with a call to parseString1 to get a new :: Stash and partialMatches :: [Match]
  • if that's empty it aborts; otherwise it calls saturateParseString with the Rules that start with a Predicate, to recursively build a stash of all nested parse trees

saturateParseString :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling Stash the recursive loop

  • this calls parseString1 rules sentence stash new matches to get (new', matches')
  • Base case: if new' is null, we return stash
  • Recursive case: tail call with stash' = Stash.union stash new', new', matches'
  • ... Some notes on the invariants and why we need each argument:
    • In the initial call from parseAndResolve we use the initial parseString1 output for both stash and new, but in general they are quite different: stash is everything so far, new is just the output of the last iteration.
    • Why do we need both? Because to compute new matches efficiently we should only check against new, not all of stash (see the sketch below)
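
In code, the loop is roughly the following paraphrase; the real version is a little more careful about which rules it passes down.

saturateParseString :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling Stash
saturateParseString rules sentence stash new matches = do
  (new', matches') <- parseString1 rules sentence stash new matches
  if Stash.null new'
    then return stash                                -- fixed point reached
    else saturateParseString rules sentence
           (Stash.union stash new') new' matches'    -- fold new' in and loop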

The core logic: what's going on in parseString1 in each loop iteration?

I suggest reviewing the discussion of type Match = (Rule, Int, [Node]) above real quick before you read this. Here's the signature:

parseString1 :: [Rule] -> Document -> Stash -> Stash -> [Match] -> Duckling (Stash, [Match])

Here's a quick rundown on the non-obvious arguments:

  • the first Stash is stash and contains all of the Nodes we've computed so far
  • the second Stash is new and contains just the newly-added Nodes from the last iteration
  • the [Match] list is a monotonically-growing list of partial matches which are waiting for a new Node to match the next Predicate (PatternItem) in the rule
  • the output will be a Stash of new fully-matched Nodes and a possibly-larger collection matches :: [Match] of partial matches

How does it work?

  • At each stage, what we have is
    • a list of Match values - partially matched Rules from the previous iteration
    • a bag of brand-new fully-matched Nodes (which contain Tokens) from the previous iteration
  • So, there are two ways to get new partial matches:
    • we might be able to advance a Match from before by one more step. Note that in this case the PatternItem we match against will be contained within the Match. This is what matchFirst does.
    • or, we might be able to start a new Match by matching the very first PatternItem in one of the original DSL Rules, due to a newly-available Token coming from a Node that was fully-matched in the previous iteration. This is what matchFirstAnywhere does.
  • Then, there are some efficiency issues:
    • we want to only use new Nodes when checking for the above
    • so, once we've found all matches that can advance using the above logic, we want to "run them to completion" against the existing stash (evaluating any Regex PatternItems we come across as we go) before we move on to the next cycle; that way we can check only new nodes, rather than the full stash, in the next cycle
    • moreover, anything that hits a Regex PatternItem that fails to match can be discarded - new Nodes can't possibly turn it into a complete match
    • this is what the matchAll function does
  • Once all that is done, we have a final set of partial Match values using the Nodes computed thus far. We're ready to compute new Nodes and iterate:
    • we check for any partial Match that is now complete, and run rule.production on it; if a Token is returned we convert it to a Node
    • all the rest of our matches get added to the pool of potential matches. Note that we cannot discard the old matches: even if they didn't get advanced this cycle, they might in the next cycle thanks to the new Nodes we just computed. (A loose sketch of a full iteration follows.)
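
Putting it all together, one iteration looks loosely like this. matchFirst, matchFirstAnywhere and matchAll are real Engine.hs functions, but their signatures and all of the glue here are a hypothetical reconstruction (as is Stash.fromPosList); produce is the sketch from earlier.

parseString1 rules sentence stash new matches = do
  let -- (1) advance existing partial matches one step, against new Nodes only
      advanced = concatMap (matchFirst sentence new) matches
      -- (2) start fresh matches where a rule's first PatternItem hits a new Token
      started  = concatMap (matchFirstAnywhere sentence new) rules
      -- (3) run these against the full stash, evaluating Regex items as we go
      --     and discarding anything whose Regex fails
      survivors = concatMap (matchAll sentence stash) (advanced ++ started)
      -- (4) complete matches (empty remaining pattern) mint new Nodes
      newNodes  = mapMaybe produce [m | m@(Rule _ [] _, _, _) <- survivors]
  -- (5) keep every partial match around for future cycles
  return (Stash.fromPosList newNodes, survivors ++ matches)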

How does classifier ranking work?

Our classifier is a naive Bayes classifier.

What this means is that

  • we precompute a log-likelihood from the corpus for each Rule - that is, a rule that can produce a single Token
  • to produce a log likelihood for a full ResolvedToken, we:
    • compute "features" from the associated Node; there are two potential features
      • the concatenation of the rule names of all direct children
      • the concatenation of all grains of all direct children (for some specific dimension types)
      • ... note that the top level Rule name is actually not a feature, this surprised me but I think it's because it's actually the mapping from children to parent we want to score
    • sum up the log-likelihood for each feature
    • sum up the result from doing ^ for all recursive children of the Node

The final result of this is a Candidate; we then take the top-ranked Candidates (according to the Ord instance defined on Candidate). Roughly, the scoring recursion looks like the following sketch (hypothetical helper names; the real code lives in the modules listed below):
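
score :: Classifiers -> Node -> Double
score classifiers node =
  sum (map (featureLogLikelihood classifiers node) (extractFeatures node))
    + sum (map (score classifiers) (children node))
-- featureLogLikelihood and extractFeatures are hypothetical names for the
-- per-feature lookup and the two feature extractors described above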

The logic for this lives in

  • Extraction.hs, which has the feature extraction code
  • Rank.hs, which does the actual scoring and the recursion, as well as the final ranking