- The paper introduces an architecture for end-to-end Reinforcement Learning (RL) optimization of task-oriented dialogue systems and applies it to a multimodal task: grounding the dialogue in a visual context.
- Encoder-decoder models do not account for the planning problems inherent in dialogue systems, and they do not integrate seamlessly with external contexts or knowledge bases.
- RL models can handle the planning problem but require online learning and a predefined task structure.