Natural Language Understanding
Semantic analysis
The ability of an agent to derive meaning from text (e.g., paragraphs, sentences, phrases). One example includes an agent identifying various parts of speech (e.g., verbs, nouns) in sentences.

Machine learning
The ability of an agent to learn, often by training to find patterns in large amounts of data. One example includes an agent learning to determine the sentiment (e.g., happy, sad) of a sentence by training on many sentences humans have labeled with particular sentiments.

Similarity scores
Values, often between 0 and 1, representing the similarity between two words or phrases. One example includes an agent determining the word "programming" is similar to "coding" (e.g., by computing the cosine similarity of the two word embeddings [5]), and therefore giving the pair a value (similarity score) close to 1.
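
As a minimal sketch of how such a score might be computed, the snippet below calculates cosine similarity with NumPy. The three-dimensional embedding vectors are made up for illustration; real word embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity score for two embedding vectors (1 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, invented for this example.
programming = np.array([0.9, 0.1, 0.3])
coding = np.array([0.8, 0.2, 0.3])
banana = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(programming, coding))  # high, close to 1
print(cosine_similarity(programming, banana))  # much lower
```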

Large language models
Machine learning models that have been trained on enormous amounts of text (e.g., hundreds of gigabytes of text). These models can be used as starting points for conversational agents to understand language, and can be trained further (e.g., using transfer learning) by developers to enable agents to understand specific pre-determined phrases or "intents".

Transfer learning
A machine learning technique allowing developers to do small amounts of additional training on already pre-trained machine learning models (e.g., large language models) to produce models sufficient for their task. For example, agent developers may apply transfer learning to large language models to enable agents to understand specific pre-determined phrases or "intents" without having to train their own models from scratch.
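
A common transfer-learning pattern is to freeze a pre-trained model's weights and train only a small new output layer. The PyTorch sketch below illustrates the idea; the encoder is a stand-in for a real pre-trained language model, and the layer sizes and intent count are invented.

```python
import torch.nn as nn

# Stand-in for a pre-trained language model encoder (hypothetical sizes).
encoder = nn.Sequential(nn.Linear(768, 768), nn.ReLU())

# Freeze the pre-trained weights so additional training leaves them unchanged.
for param in encoder.parameters():
    param.requires_grad = False

# Small new head mapping encoder output to intent classes; only this trains,
# which is far cheaper than training a full language model from scratch.
num_intents = 5
model = nn.Sequential(encoder, nn.Linear(768, num_intents))
```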

Intents
Groups of phrases with similar meanings. For example, the phrases, "What games can I play?", "What type of games do you have?" and "Which games are available?" might be categorized as an Ask About Games intent. Agents can recognize such groups of phrases as particular intents, allowing them to respond the same way to multiple different phrases.
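
A minimal sketch of the idea follows, assuming a hypothetical agent with two intents; production agents use machine learning models that generalize beyond the training phrases, rather than exact matching.

```python
from typing import Optional

# Hypothetical intents, each defined by a group of example phrases.
INTENTS = {
    "AskAboutGames": [
        "what games can i play",
        "what type of games do you have",
        "which games are available",
    ],
    "Greeting": ["hello", "hi there", "good morning"],
}

def classify(utterance: str) -> Optional[str]:
    """Naive exact-match classifier; real agents generalize with ML models."""
    normalized = utterance.lower().strip(" ?!.")
    for intent, phrases in INTENTS.items():
        if normalized in phrases:
            return intent
    return None

print(classify("Which games are available?"))  # -> AskAboutGames
```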

Entities
Groups of words with similar characteristics that agents may identify in users' speech, and use in their responses. For example, an agent might recognize a "date" entity in the phrase, "Set a reminder for March 24th", or a name entity in the phrase, "Send a text to Morgan". Often, agents recognize entities using machine learning models.
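
As a toy illustration of entity recognition, the pattern below picks out a "date" entity; this hand-written regular expression is only a stand-in for the machine learning models agents typically use.

```python
import re

# Matches dates like "March 24th"; a stand-in for a learned entity model.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}(?:st|nd|rd|th)?\b"
)

match = DATE_PATTERN.search("Set a reminder for March 24th")
if match:
    print("date entity:", match.group(0))  # -> date entity: March 24th
```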

Training
The process agents may use to learn. Often, agents train by analyzing large amounts of data (e.g., many sentences), predicting something about the data (e.g., predicting sentences have sad or happy sentiments), and updating their models based on whether their predictions were correct or incorrect.
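
This predict-and-update cycle can be sketched with a tiny perceptron-style sentiment model; the labeled sentences and learning rate below are invented for illustration.

```python
# Human-labeled sentences: +1 for happy sentiment, -1 for sad.
labeled_data = [
    ("what a wonderful day", +1),
    ("i am so sad today", -1),
    ("this is wonderful news", +1),
    ("such a sad story", -1),
]

weights = {}  # one weight per word, all starting at 0

for epoch in range(10):
    for sentence, label in labeled_data:
        words = sentence.split()
        score = sum(weights.get(w, 0.0) for w in words)
        prediction = 1 if score > 0 else -1
        if prediction != label:
            # Incorrect prediction: nudge each word's weight toward the label.
            for w in words:
                weights[w] = weights.get(w, 0.0) + 0.1 * label

print(weights["wonderful"], weights["sad"])  # positive vs. negative weight
```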

Testing
The process developers use to assess their agents and whether the agents sufficiently address their needs. For example, developers may test whether agents' machine learning models categorize phrases into the correct intents.

Constrained vs. unconstrained natural language
A spectrum from "constrained" to "unconstrained" describing how restricted agents are in their understanding of language. For example, constrained natural language agents only attempt to understand a limited number of phrases, whereas unconstrained natural language agents will attempt to understand any phrase. Generally, constrained natural language agents have high accuracy in their understanding, whereas unconstrained natural language agents have lower accuracy.
Conversation Representation
Textual representations
The mode of representing agent conversation visually using characters. Textual representations are common when developing agents. They can also be used to quickly identify any misrecognition of users' speech.

Undirected graphs
A type of visualization that includes nodes and connections. Undirected graphs may be used to represent language; for example, with nodes containing words and connections representing word relatedness. This can allow learners to quickly understand agents' perceptions of language.
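
For instance, the sketch below builds such a word-relatedness graph with the networkx library; the words and relatedness weights are invented for illustration.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Undirected graph: nodes are words, weighted edges are relatedness scores.
G = nx.Graph()
G.add_edge("programming", "coding", weight=0.92)
G.add_edge("programming", "software", weight=0.81)
G.add_edge("coding", "software", weight=0.85)
G.add_edge("banana", "fruit", weight=0.90)

nx.draw(G, with_labels=True)
plt.show()
```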

Histograms
A type of visualization that compares different values using bars. For example, a histogram may represent word similarity scores with different heights of bars. This can allow learners to quickly compare which words agents find most similar to each other.
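
A sketch of such a chart using matplotlib, with made-up similarity scores:

```python
import matplotlib.pyplot as plt

# Hypothetical similarity scores between "programming" and other words.
words = ["coding", "software", "keyboard", "banana"]
scores = [0.92, 0.81, 0.45, 0.12]

plt.bar(words, scores)
plt.ylabel("Similarity to 'programming'")
plt.ylim(0, 1)
plt.show()
```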

Event-driven program representations
A method of visualizing the logic behind events and agents' responses to these events, often through event definition code and event response code. For example, one piece of code might define when an event occurs (e.g., when an agent recognizes a greeting intent), and another piece of code might define how the agent responds (e.g., by exclaiming, "Hello there!"). This type of programming representation can quickly become complex, and may benefit from being transformed into a directed graph representation.
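
The separation between event definition and event response might look like the following sketch; the decorator-based registration scheme here is hypothetical, not any particular agent framework's API.

```python
handlers = {}

def on_intent(name):
    """Event definition: associate an intent name with a response handler."""
    def register(func):
        handlers[name] = func
        return func
    return register

@on_intent("Greeting")
def respond_to_greeting():
    return "Hello there!"

def dispatch(recognized_intent):
    """Event response: run whichever handler matches the recognized intent."""
    handler = handlers.get(recognized_intent)
    return handler() if handler else "Sorry, I didn't catch that."

print(dispatch("Greeting"))  # -> Hello there!
```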

Storyboards
A method of visualizing agent conversation, which includes agent phrases, human responses and data the agent has stored about conversation context. This method is user-centric, and allows developers to envision realistic conversations with contextual information.

Directed graphs
A type of visualization that includes nodes and arrows (i.e., directed connections). This visualization may be used to represent agents' logic and conversation flow. It is developer-centric, and allows developers to observe many paths of potential conversations.
Dialog Management
Turn-taking
The way in which conversation control is passed between the user and agent. For example, initially the user may hold conversation control and say, "What day is it today?"; after this, the agent holds control and may say, "It is July 11th, 2022". In this way, the agent and user take turns holding control.

Events
Occurrences that trigger an agent to act. For example, an event might include someone saying, "Hello, agent!", which might trigger the agent to respond with, "Why hello there, Hal!".

Entity-filling
The process of identifying and storing entities (i.e., groups of words with similar characteristics). For example, an agent may identify a date entity in the phrase, "Create a calendar event for September 25th", and store this for later use. Agents often use machine learning models to identify entities.
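
A minimal sketch of entity-filling, storing a recognized date in a hypothetical slot dictionary; the regular expression is again only a toy stand-in for a learned entity model.

```python
import re

slots = {}  # entities stored for later use in the conversation

def fill_date_slot(utterance: str) -> None:
    """Identify a date entity and store it for later use."""
    match = re.search(r"\b[A-Z][a-z]+ \d{1,2}(?:st|nd|rd|th)?\b", utterance)
    if match:
        slots["date"] = match.group(0)

fill_date_slot("Create a calendar event for September 25th")
print(slots)  # -> {'date': 'September 25th'}
```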

Conditions
Methods of controlling conversation flow, which can be described by if-statements. For example, one condition might be, "If it is the user's birthday, play 'Happy Birthday to You', otherwise, say, 'Good morning!'". Conditions allow agents to perform different functions depending on contextual data.
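
Expressed as code, that example condition might look like this sketch; the function and its inputs are hypothetical.

```python
import datetime

def morning_response(user_birthday: datetime.date) -> str:
    today = datetime.date.today()
    # Condition: branch on contextual data (the user's stored birthday).
    if (today.month, today.day) == (user_birthday.month, user_birthday.day):
        return "Playing 'Happy Birthday to You'!"
    return "Good morning!"
```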

Conversation State
The point an agent has reached in a given conversation, which depends on past conversation events. For instance, an agent may initially be in a "Start" state, and after being asked to open a banana bread recipe, it may transition to a "Banana Bread Recipe: Step 1" state. The agent's response will depend on the conversation state. For example, if it is in the "Banana Bread Recipe: Step 1" state and is asked to say the next step, it may say, "Preheat the oven to 350°F", whereas if it were in a nacho recipe state and asked the same question, it may say, "Preheat the oven to 425°F".

State machines
A method of organizing conversation control, which can be represented by a directed graph. Each node in the state machine is a conversation state. Each connection is a transition. When events occur (e.g., a user says, "Which games are available?"), agents will transition to different states (e.g., transition to a state called "List Available Games").
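
A dictionary-based sketch of such a state machine, with hypothetical states and intents:

```python
# Transition table: (current state, recognized intent) -> next state.
TRANSITIONS = {
    ("Start", "AskAboutGames"): "List Available Games",
    ("List Available Games", "ChooseGame"): "Play Game",
    ("Play Game", "Quit"): "Start",
}

state = "Start"

def handle(intent: str) -> None:
    """Follow the directed edge for this event, if one exists."""
    global state
    state = TRANSITIONS.get((state, intent), state)

handle("AskAboutGames")
print(state)  # -> List Available Games
```
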
Data Access and Conversation Context
Pre-programmed data
Data that an agent has stored prior to engaging with users. For example, a developer may provide agents with pre-programmed data about the titles of games they know how to play.

User-defined data
Data that an agent stores while engaging with users. For example, a user might tell an agent their birthday. The user's birthday is an example of a piece of user-defined data that an agent may store for later use.

Contextual data
Data from a particular conversation, which has been collected by an agent. This data helps the agent respond in a way that makes sense based on the prior conversation. For instance, a piece of contextual data might be that a user asked to order an orange shirt. Based on this contextual data, the agent may send a request to a store's database to determine whether orange shirts are in stock.

Agent modularization
The principle that agents do not share data or functions unless explicitly programmed to do so. For instance, if one agent has stored a user's name, another unrelated agent will not know that user's name unless it has been programmed to obtain data from the first agent (or the user also tells the second agent their name).

Device access
The ability of an agent to communicate with other technology, or the ability of a user to communicate with an agent. For instance, a developer may have created an agent which they can test and have conversations with using their personal computer. Unless this agent is deployed to an accessible location (e.g., the cloud), however, no other users would be able to access it or communicate with it. Another example is how agents often need permission from users to access other devices, like smart lights. If a user has not given the agent permission to control their smart lights, the agent will not be able to do so.

Cloud computing
When agents' computation occurs in remote locations (i.e., the cloud), as opposed to on developers' personal computers. The cloud is a network of many computers, which typically have large computing power and data storage available. Developers often use cloud computing to train machine learning models, since this involves large amounts of computation. Developers often also deploy agents to the cloud so that many others can access their agents.

Webhooks and APIs
Methods agents may use to access external information and systems. For example, an agent may utilize a webhook to receive real-time weather updates, or an API to communicate with a flight-booking website.
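
As a sketch of the API half of this, an agent might fetch a forecast over HTTP; the endpoint URL, parameters, and response fields below are entirely hypothetical.

```python
import requests

# Hypothetical weather API; the URL and JSON fields are made up.
response = requests.get(
    "https://api.example.com/weather",
    params={"city": "Boston"},
    timeout=10,
)
forecast = response.json()
print(f"It is currently {forecast['temperature']} degrees.")
```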

Flow and page modularization
Approaches for structuring agent conversation, which prevent or allow access to intents depending on the conversation state. For example, if a pet store agent had "Dog Inquiry" and "Cat Inquiry" conversation flows, and the conversation was on the "Describe Current Dog" page (i.e., state), the user likely could not access information about cats until they return to a page that allows the conversation to transition to the Cat Inquiry flow. This allows developers to structure the progression of agent conversations.
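
A minimal sketch of page-level gating, with hypothetical page and intent names:

```python
# Each page (state) lists the intents reachable from it.
PAGE_INTENTS = {
    "Pet Store Start": {"Dog Inquiry", "Cat Inquiry"},
    "Describe Current Dog": {"Next Dog", "Back To Start"},
}

def can_handle(page: str, intent: str) -> bool:
    """An intent is only accessible when the current page allows it."""
    return intent in PAGE_INTENTS.get(page, set())

print(can_handle("Describe Current Dog", "Cat Inquiry"))  # -> False
```
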
Human-Agent Interaction
Speech synthesis
The ability of an agent to create speech sounds. For instance, an agent may have trained on many human voice recordings to recognize the speech patterns for particular words. Based on these patterns, it may be able to generate similar patterns to create speech audio (i.e., synthesize speech).

Speech recognition
The ability of an agent to recognize language from human speech. For example, an agent may be able to convert human speech recordings to machine-readable text by training on many examples of human speech.

Recovery
The ability of an agent to continue a conversation regardless of misunderstandings or errors. For example, an agent may misrecognize the number "fifteen" as "fifty" in the phrase, "Send me fifteen pizzas". If the agent is able to recognize that the word "fifty" is unusual in this context, and follow up with the user by asking whether they really want to order fifty pizzas, the agent would be able to recover the conversation.
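
One simple recovery tactic is to ask for confirmation when a recognized value looks implausible; the threshold below is an invented example.

```python
def confirm_pizza_order(recognized_quantity: int) -> str:
    # Recovery: an unusually large value triggers a follow-up question
    # instead of an order the user may not have intended.
    if recognized_quantity > 20:  # hypothetical plausibility threshold
        return f"Did you really mean {recognized_quantity} pizzas?"
    return f"Ordering {recognized_quantity} pizzas."

print(confirm_pizza_order(50))  # -> Did you really mean 50 pizzas?
```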

Societal impact and ethics
How agents can affect human life and culture, as well as the moral considerations that come along with this. For example, if an agent's speech recognition system is trained using only male voices, the agent may not recognize female voices as accurately. This would have serious implications for the female users of the agent, and would not generally be considered fair.

Text-based interaction
When an agent interacts using characters, as opposed to audio, for input and output. Agents that use text-based interaction are often called chatbots. This mode of communication is typically very accurate, but oftentimes less efficient than voice-based interaction.

Voice-based interaction
When an agent interacts using audio, as opposed to characters, for input and output. This mode of communication is typically very efficient, but oftentimes less accurate than text-based interaction.

Multimodal interaction
When an agent interacts using multiple different modalities (e.g., voice, text, haptics). For instance, an agent may say, "So, you'd like to deliver your pizza to 484 Zorilla Street?" and display a map on a screen while it speaks. In this example, the agent is interacting multimodally through visuals and audio.

Task- vs. non-task-oriented
The extent to which an agent acts socially, or alternatively, with the purpose of completing specific goals. For instance, a pet robot agent may interact socially (i.e., with less task-orientation) by asking about the user's day and engaging in small talk. A customer service bot may interact with task-orientation by helping a user return a faulty product.

Deployment
The act of making an agent available to users. For instance, a developer may deploy their agent to a customer service website, or to an app store.

Effective conversation design
Methods to improve agent communication. For example, a developer may improve an agent's conversational flexibility by allowing users to interrupt it at any time. A list of such conversation design principles is shown in Table 2.