Suppose I have a simple words_to_numbers.grm that, given a spelled-out number string, will return multiple possible interpretations for it:
Input String: six twenty two
Output String: 622 <cost: 0.2>
Output String: 6 22 <cost: 0.4>
Output String: 620 2 <cost: 0.4>
What I would like is to be able to map the output tokens to the input tokens. An example would be something like this:
Output String: 622<"six twenty two"> <cost: 0.2>
Output String: 6<"six"> 22<"twenty two"> <cost: 0.4>
Output String: 620<"six twenty"> 2<"two"> <cost: 0.4>
(Alternatively, the character positions of each output token would do, or anything else that would let me reconstruct the mapping at a later stage.)
You can't recover this after the rewrite: from the output alone, it's impossible to know whether "(six) (twenty two)" was transduced to "6 22", or whether the grouping was "(six twenty) (two)".
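To make the desired output concrete, here is a toy sketch in plain Python. It does not use Thrax or OpenFst at all: the lexicon and the names aligned_readings and render are made up for illustration, and the lexicon is just a stand-in for the real grammar. It only shows the kind of information I would like to keep, namely which input span produced each output token.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy lexicon standing in for the real grammar (assumption:
# each entry maps a contiguous span of input words to one output token).
LEXICON: Dict[str, str] = {
    "six": "6",
    "two": "2",
    "twenty two": "22",
    "six twenty": "620",
    "six twenty two": "622",
}

# One reading = list of (output token, input span it came from).
Alignment = List[Tuple[str, str]]


def aligned_readings(words: List[str]) -> Iterator[Alignment]:
    """Yield every split of `words` into contiguous spans found in LEXICON."""
    if not words:
        yield []
        return
    for end in range(1, len(words) + 1):
        span = " ".join(words[:end])
        if span in LEXICON:
            for rest in aligned_readings(words[end:]):
                yield [(LEXICON[span], span)] + rest


def render(alignment: Alignment) -> str:
    """Format a reading in the annotated style shown above."""
    return " ".join(f'{out}<"{src}">' for out, src in alignment)


if __name__ == "__main__":
    for reading in aligned_readings("six twenty two".split()):
        print(render(reading))
```

In this sketch the alignment exists only while the segmentation is being built. Once the output tokens are joined into a flat string, the input spans are gone, which is the situation I am in with the rewritten output.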
I don't believe this is possible with thraxrewrite-tester, or by simply adding the markup in the grammar rules. I've also looked at both the Thrax and OpenFst code to see what it would take to carry the input states forward through rewrites, but I haven't had any success yet.
The grammars I'm working on are much more complicated than this example (400k nodes and millions of arcs for a very sophisticated NLU module), and some sort of mapping between input and output is essential for integrating Thrax into the rest of the application.
Thank you very much for this incredibly useful tool, and any help or hints are greatly appreciated!