Created October 20, 2017 23:04

Suppose I have a simple words_to_numbers.grm that, given a spelled-out number string, will return multiple possible interpretations for it:

Input String: six twenty two

Output String: 622 <cost: 0.2>
Output String: 6 22 <cost: 0.4>
Output String: 620 2 <cost: 0.4>

What I would like is to be able to map the output tokens to the input tokens. An example would be something like this:

Output String: 622<"six twenty two"> <cost: 0.2>
Output String: 6<"six"> 22<"twenty two"> <cost: 0.4>
Output String: 620<"six twenty"> 2<"two"> <cost: 0.4>

(Character positions for each output token, or anything else that would let me do the mapping at a later stage, would work just as well.)

You can't do this post-rewrite: from the output string alone there is no way to tell which input words produced which output token, e.g. whether the input was segmented as "(six) (twenty two)" or as "(six twenty) (two)".
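To make the request concrete, here is a minimal Python sketch of the mapping I have in mind. It assumes the chosen path is still available as a sequence of (input label, output label) arc pairs, i.e. before it is projected onto the output side. The arc sequences are written by hand for illustration and are not actual Thrax output, and `align` and `render` are hypothetical helper names, not part of Thrax or OpenFst.

```python
EPS = ""  # stands in for the epsilon (empty) label in this sketch


def align(path, sep=" "):
    """Group (input_label, output_label) arc pairs into output tokens.

    Returns a list of (output_token, source_input) pairs. An arc whose
    output label equals `sep` closes the current output token.
    """
    segments = []
    ins, outs = [], []
    for ilabel, olabel in path:
        if olabel == sep:
            # Output-side separator: flush whatever has accumulated so far.
            if ins or outs:
                segments.append(("".join(outs), " ".join(ins)))
            ins, outs = [], []
            # Assumption: separator arcs normally have an epsilon input;
            # if one does not, its input is credited to the next token.
            if ilabel != EPS:
                ins.append(ilabel)
            continue
        if ilabel != EPS:
            ins.append(ilabel)
        if olabel != EPS:
            outs.append(olabel)
    if ins or outs:
        segments.append(("".join(outs), " ".join(ins)))
    return segments


def render(segments):
    """Format segments in the annotated style shown in the examples above."""
    return " ".join(f'{out}<"{src}">' for out, src in segments)


# Hand-written arc sequences for the three readings of "six twenty two".
path_622 = [("six", "6"), ("twenty", "2"), ("two", "2")]
path_6_22 = [("six", "6"), (EPS, " "), ("twenty", "2"), ("two", "2")]
path_620_2 = [("six", "6"), ("twenty", "20"), (EPS, " "), ("two", "2")]

for p in (path_622, path_6_22, path_620_2):
    print(render(align(p)))
```

The point of the sketch is only that the alignment is trivial to recover while input and output labels still travel together on each arc, and lost once only the output string remains.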

I don't believe this is possible with thraxrewrite-tester, or by simply adding the markup in the grammar rules. I've also looked through both the Thrax and OpenFst code to see what it would take to carry the input states forward through the rewrites, but haven't had any success yet.

The grammars I'm working on are much more complicated than this example (400k nodes and millions of arcs for a very sophisticated NLU module), and some sort of mapping between input and output is essential for integrating Thrax into the rest of the application.

Thank you very much for this incredibly useful tool, and any help or hints are greatly appreciated!
