
Audio Signifiers for Voice Interaction

by Kathryn Whitenton on September 10, 2017

Summary: Provide explicit, implicit, and nonverbal signifiers to help users understand their options in voice-interaction interfaces.

Good voice interfaces require not only excellent natural language comprehension, but also strategies for helping users understand the universe of actions and commands available in voice interactions. In other words, we need to bridge the Gulf of Execution. This interaction design challenge is present in all systems, but is inherently more difficult for voice interfaces.

The Gulf of Execution

To successfully interact with any system, people must be able to (1) figure out what actions to take in order to achieve a specific goal, and (2) understand the results of those actions. In his seminal book, The Design of Everyday Things, our colleague Don Norman described these needs as the Gulf of Execution and the Gulf of Evaluation, respectively. Both are important, but in this article, we'll focus on the Gulf of Execution, and how voice-interaction systems can help users understand what commands are possible.

In graphical user interfaces (GUIs), designers can help people bridge the Gulf of Execution by providing visible signifiers, like distinctive colors for clickable text. When used appropriately, these techniques enable users to understand at a glance what actions are possible; conversely, research shows that diminished graphical signifiers lead to slower task times and cause click uncertainty in a GUI.

In the absence of visual signals, users must either imagine or try to remember possible commands, both of which increase the cognitive load and difficulty of using the interface. Designers of voice-interaction systems can help minimize these problems by including signifiers for important system commands.

Definition: A voice-interaction signifier is a user-interface cue that the system provides to users in order to help them understand what verbal commands they can make.

Either sound or visual cues can be used to signify possible voice commands. If a screen is available, it's usually best to include a visual signifier, but not all UIs have screens. This article focuses on audio signifiers, which are used by both smartphone-embedded personal assistants such as Siri, Google Now, and Cortana, and by standalone voice-interaction systems such as Amazon's Echo and Google Home.

Types of Sound Signifiers

There are three types of sound-based signifiers or cues which can prompt user actions and inform users about possible commands:

  1. Nonverbal sounds, or earcons (auditory icons), which are distinctive noises generated by the system, usually associated with specific actions or states. For example, Siri emits a 2-tone beep after detecting its activation phrase, to signal that it is now 'listening' for a command.
  2. Explicit verbal signifiers, when the system verbalizes a suggestion or request to let the user know what commands are available. For example, if you tell Google Home to "Set a timer," it responds with "Ok, for how long?"
  3. Implicit verbal cues, when the system hints that an action is possible, without fully articulating the suggestion. For example, when Amazon's Echo detects its wake word while it is speaking, it pauses its own speech to let the user know that it is 'listening' for a new command. This behavior mimics human speech patterns, where people pause briefly to cue conversational partners that they are willing to stop speaking and listen.

Nonverbal Sounds

Earcons, or interface sound cues, are similar to visual icons found in graphical user interfaces because both attempt to communicate with users more efficiently by eliminating words. Just as an icon of a trashcan requires less screen space than a button labeled Delete, playing a beep takes less time than speaking the words "You have a new message."

But the efficiency gains of both icons and earcons depend on users actually understanding the meaning of the signal. Visual icons often have usability problems due to the ambiguous nature of the imagery; auditory icons are even less understandable, because they can't use words or images to convey meaning. This problem quickly becomes obvious when you attempt to generate earcons with the same meaning as common visual icons.

For example, visual icons can be classified as resemblance, reference, or arbitrary, depending on how the symbol relates to the action it represents. Resemblance icons look like the function they perform: a drawing of a trashcan has the same shape as a physical garbage container. A resemblance earcon for the action of deleting could be a "clunk" sound, to resemble the noise made when dropping an object into a can. But a "clunk" also sounds like many other noises, such as two objects colliding, or an object being placed on a shelf.

Earcons become even more obscure when they attempt to reference a concept. For example, the sound of a cow mooing could be a reference earcon for the concept 'milk'. (The "moo" in this case refers to something related to a cow, instead of representing an actual cow.) But requiring users to guess a specific action based solely on a "moo" sound would be more appropriate for a game than for a shopping list.

Because of the ambiguity of nonverbal sounds, most earcons function as arbitrary sounds, even if they are intended to resemble a specific action. Arbitrary sounds can attract attention, but can only convey specific meanings under certain conditions:

  • Within a task context: Immediately after the user has made a command, they may be primed to recognize earcons as related to that command. For example, playing a "clunk" sound can serve as a confirmation after the user has indicated that an object should be deleted.
  • After repeated exposure: Over time users may learn that one arbitrary sound indicates an incoming phone call,while a different sound indicates a new text message.

Even repeated exposure is no guarantee that people will be able to uniquely identify arbitrary sounds. Most of us have heard the tones generated by dialing numbers on a telephone thousands of times, but would be hard-pressed to tell you which tone is associated with a particular digit.

Auditory icons are also a one-way form of communication: the system can play a "clunk" or a "beep" sound to the user, but users cannot accurately produce specific nonverbal sounds as commands.

Due to these limitations, earcons are effective primarily in narrow, repetitive contexts (for example, when serving as confirmations for frequent tasks) or as generic attention getters. But, in most situations, they do not convey enough information and need to be accompanied (or replaced) by verbal signifiers.

Verbal Signifiers: Balancing Explicit and Implicit Cues

Explicit verbal cues, in the form of questions or suggestions, are the most understandable type of sound signifier. But explicitly stating every possible command would be incredibly tedious (just ask screen reader users). Besides being annoying, long verbal lists are ineffective, because people are unlikely to remember them all. (Think of telephone menus which list 10 different departments, and how often you forget the first choice by the time you get to number 5.)

Explicit signifiers are essential for irreversible actions like making a purchase. But in many other instances, the need for strong signifiers can be balanced with the need for quick interaction by using a few specific techniques:

  • Guess and confirm the user's intent
  • Provide implicit cues instead of explicit statements
  • Separate explicit cues into sequential, distinct dialogues
  • Progressively disclose cues so they are stated only at the point of need

Guess and Confirm

Guessing the user's intent leads to the most efficient interactions, but only if the system has enough information to make a good guess. For example, if you tell Echo to play the music-streaming service Pandora, rather than asking which station, the device automatically plays the most recently used Pandora station. Since that station was previously selected by the user, there's a pretty good chance it will be ok.

On the other hand, inaccurate guessing can slow down the interaction, because after a bad guess, users have to figure out how to correct the system's assumptions.
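
The tradeoff between guessing and asking can be sketched in a few lines of code. This is only an illustration, not how any of the systems discussed here work: the `Guess` type, the `resolve_slot` function, the confidence score, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Guess:
    value: str         # the guessed value, e.g. a station name
    confidence: float  # hypothetical score between 0.0 and 1.0


def resolve_slot(slot: str, guess: Optional[Guess], threshold: float = 0.8) -> Tuple[str, str]:
    """Confirm a confident guess; otherwise fall back to an explicit question."""
    if guess is not None and guess.confidence >= threshold:
        # Guess and confirm: state the guess and end with a short question word.
        return ("confirm", f"I'll use {guess.value} for the {slot}, ok?")
    # Not enough information: ask explicitly instead of risking a bad guess.
    return ("ask", f"What {slot} would you like?")


print(resolve_slot("station", Guess("your most recent station", 0.9)))
print(resolve_slot("title", None))
```

A low-confidence guess (for example, the name of a new event) falls through to the explicit question, which avoids forcing the user to discover how to correct the system.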

For example, consider a scenario where the user has asked a voice-interaction system to "Set up a meeting at nine." This seemingly simple task necessitates several different followup questions and commands, such as specifying the name, date, and time of day (morning or evening) for the appointment, as well as editing any of these values, and finally saving the appointment. The table below illustrates how four different voice-interaction systems handle this task, with each turn labeled by speaker:

Voice system responses to the user command "Set up a meeting at nine."

Google

Google: What's the title of the event?

User: [reply not preserved]

Google: Sure, Haircut tomorrow at 9 am. Do you want to save this?

User: [reply not preserved]

Google: All right, I saved your event.

Echo

Echo: Is that nine o'clock in the morning or in the evening?

User: In the morning

Echo: On what day?

User: [reply not preserved]

Echo: And what's the name of the event?

User: [reply not preserved]

Echo: I'll schedule Haircut on Thursday, August 24th at 9 am, ok?

User: [reply not preserved]

Echo: Ok, I've added that.

Cortana

Cortana: What do you want to call it?

User: [reply not preserved]

Cortana: I'll add Haircut to your calendar for today at 9 pm. Sound good?

User: No, change it to 9 am tomorrow

Cortana: Sorry, should I add this to your calendar?

User: [reply not preserved]

Cortana: Did you want to change the title or the time?

User: [reply not preserved]

Cortana: When is your event?

User: Nine am tomorrow

Cortana: All right, I'll add Haircut to your calendar for tomorrow at 9 am. Sound good?

User: [reply not preserved]

Cortana: Sure thing.

Siri

Siri: Ok, I set up your meeting for today at 9 pm. Shall I schedule it?

User: No, change it to 9 am tomorrow

Siri: To continue, you can confirm, cancel, change the time, or change the title.

User: Change the time

Siri: What time is your appointment?

User: 9 am tomorrow

Siri: Ok, I set up your meeting for tomorrow at 9 am. Shall I schedule it?

User: Change the title

Siri: Ok, what's the new name for this meeting?

User: [reply not preserved]

Siri: Ok, I set up your meeting for tomorrow. Shall I schedule it?

User: [reply not preserved]

Siri: It's on your calendar for 9 am tomorrow.

Guessing the name of the event is doomed to failure, and only one system makes the attempt: Siri, which creates an appointment named 'meeting.' All the other systems use an explicit signifier in the form of a question to prompt users for the event name. For the date and time of day, 3 of the 4 systems attempt a guess. Siri and Cortana guess the closest future day and time, while Google guesses the following day. While these are reasonable guesses, they could each easily be incorrect, and therefore require users to figure out how to edit the information. As it happens, Google was right in this case, and so its interaction sequence was the shortest, but it could almost as easily have been wrong, leading to a clunkier dialogue. (Machine learning techniques run across vast masses of meeting requests presumably lead to a higher probability of correct guesses under various circumstances.)

Unfortunately, neither Siri nor Google provides any signifiers about how to edit the event details; instead they skip straight to explicit prompts to save the appointment. This is problematic because "Shall I schedule it?" and "Do you want to save this?" are questions that suggest a yes-or-no answer, rather than the possibility of editing the event information. Both systems allow editing, but it's up to users to realize that this option exists. The editing command could be signaled by asking another question (such as "Do you want to change this event?"), but that would also add an extra step to the task.

Implicit Signifiers

Substituting an implicit signifier for an explicit one can effectively prompt people to edit without making the interaction (much) longer. Implicit signifiers suggest actions by mimicking speech patterns that humans have developed to exchange information more efficiently. One such pattern is adding a question word to the end of a statement. For example, both Cortana and Echo include a brief implicit cue at the end of the confirmation prompt:

Question Words as Implicit Cues

Cortana: I'll add Haircut to your calendar for today at 9 pm. Sound good?

Echo: I'll schedule Haircut on Thursday, August 24th at 9 am, ok?

Implicit interrogatory cues at the end of a system-produced utterance help users understand that they can change the appointment.

These statements ask for permission to proceed with saving the event. But interrogatory confirmation words ("Sound good" and "ok") are placed in the same sentence, immediately after the event details. This placement, coupled with the broad terminology, suggests that the system is asking not only about whether to save, but also about the accuracy of the entire preceding sentence. (Unfortunately, Cortana only accepts yes or no answers to this question. Only after saying "no" and waiting for another explicit editing prompt can users change the information.)
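
An implicit cue only pays off if the system accepts the corrections it invites. The sketch below shows one way a reply to a confirmation prompt could be interpreted as a confirmation, a rejection, or an inline edit. It is a simplified illustration under stated assumptions: the `interpret_reply` function, the word lists, and the regular expression are hypothetical, and a real system would rely on natural language understanding rather than pattern matching.

```python
import re
from typing import Optional, Tuple

# Hypothetical word lists; real systems recognize far more variations.
AFFIRMATIVE = {"yes", "ok", "okay", "sure", "sounds good"}
NEGATIVE = {"no", "cancel"}

# Matches replies such as "No, change it to 9 am tomorrow".
EDIT_PATTERN = re.compile(r"(?:no,?\s*)?change (?:it|the time|the title) to (.+)")


def interpret_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Classify a reply to a prompt such as "... Sound good?"."""
    text = reply.strip().lower().rstrip(".!")
    if text in AFFIRMATIVE:
        return ("confirm", None)
    if text in NEGATIVE:
        return ("reject", None)
    match = EDIT_PATTERN.match(text)
    if match:
        # Return the requested new value (lowercased) so the edit happens in one turn.
        return ("edit", match.group(1))
    return ("unknown", None)


print(interpret_reply("No, change it to 9 am tomorrow"))
```

With this approach, the reply "No, change it to 9 am tomorrow" is treated as an edit in a single turn, instead of requiring a "no" followed by a separate editing prompt.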

Sequential Explicit Signifiers

Theoretically, combining several prompts into a single exchange could save time, for example by asking "What is the name, day, and time of day for this event?" In actual practice, however, this type of prompt increases the users' cognitive load, because they have to remember 3 different questions and answer them all at once. In practical terms, the mistakes and corrections that result from increased cognitive load are likely to equal or exceed the time needed to ask each question separately.

Echo's dialogue in the table above illustrates an alternative sequential approach to explicit signifiers: the name, date, and time of day prompts are 3 separate questions with 3 separate answers. As illustrated in the example, this strategy may end up being more efficient than one which requires editing due to incorrect guesses, especially if there are errors or false starts due to missing editing signifiers.

Also, keep in mind that not every user will be asked all questions. People who provide more detailed commands such as "Set up a meeting on Thursday at 9 AM named Haircut," can complete their task successfully after just one confirmation question on any of the systems:

Detailed Commands = Fewer Steps

User: Alexa, set up a meeting on Thursday at 9 am named Haircut.

Echo: That's Haircut on Thursday, August 31st at 9 am, right?

User: Yes.

Echo: Ok, I've added that.

The Echo event dialogue does not include sequential followup questions if the user includes all necessary information in the initial command.
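
The sequential approach, including the shortcut for detailed commands, can be expressed as a loop that asks only for information that is still missing, one question at a time. The following is a minimal sketch, assuming a hypothetical `PROMPTS` mapping and `next_prompt` function; the slot names and question order are illustrative and loosely modeled on the Echo dialogue, not taken from any real API.

```python
from typing import Dict, Optional

# Hypothetical slots and prompt wording, asked in this order.
PROMPTS = {
    "time_of_day": "Is that in the morning or in the evening?",
    "day": "On what day?",
    "title": "And what's the name of the event?",
}


def next_prompt(filled: Dict[str, str]) -> Optional[str]:
    """Return the question for the first missing slot, or None when all are filled."""
    for slot, question in PROMPTS.items():
        if slot not in filled:
            return question
    # Nothing left to ask: the system can move on to a single confirmation.
    return None


# A vague command leaves every slot empty, so the first question is asked.
print(next_prompt({}))
# A detailed command fills every slot, so no followup question is needed.
print(next_prompt({"time_of_day": "morning", "day": "Thursday", "title": "Haircut"}))
```

Because each question covers a single piece of information, users never have to hold several questions in memory at once.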

Progressive Disclosure of Cues

Progressive disclosure provides infrequently used options only after users' behavior indicates a need for those commands. For example, Siri at first explicitly prompts users to save the event. Only after a user declines to save does Siri suggest several alternative commands, with the statement: "To continue, you can confirm, cancel, change the time, or change the title."

Prioritization is essential for good progressive disclosure, in order to determine which features should be exposed as primary features vs. hidden as secondary features. Prioritization is even more critical for voice-interaction systems than for graphical user interfaces, because exposed voice options require people to listen to them being read aloud, while deferred voice options are even more hidden, due to the lack of visible "see more" indicators. (A universal "See more" earcon would be handy for voice interfaces, but doesn't exist yet.)

Considering that Siri's event-creation dialogue guesses the name, day, and time of day for the event based on very little data, most users would need to edit those guesses for this task. So, in this case, deferring a frequently used command to a secondary choice, without offering even an implicit cue, is likely to cost more time than it saves.

For complex tasks with branching paths, progressive disclosure helps keep the user focused on exactly what is relevant for the current step. Google Home uses this technique to provide step-by-step cooking instructions. During this task, the system asks, "Would you like to prepare the ingredients, or skip to the instructions?" If users ask to prepare the ingredients, they receive another more specific prompt explaining options for hearing the ingredients: "There are 7 in total. Should we go through them in groups of three, or one by one?" However, this offer is presented only to users who have opted in to the ingredients task, as the question is irrelevant for those who want to skip to the instructions.
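
A branching dialogue of this kind can be modeled as a small tree in which each node carries only the prompt relevant to that step. The sketch below is an illustration, not Google Home's implementation: the `DIALOGUE` structure, the `advance` function, the keyword matching, and the wording of the "instructions" prompt are all hypothetical.

```python
# Each node holds one prompt plus the keywords that lead to the next node.
DIALOGUE = {
    "start": {
        "prompt": "Would you like to prepare the ingredients, or skip to the instructions?",
        "next": {"ingredients": "ingredients", "instructions": "instructions"},
    },
    "ingredients": {
        # Disclosed only to users who opted in to the ingredients task.
        "prompt": "There are 7 in total. Should we go through them in groups of three, or one by one?",
        "next": {},
    },
    "instructions": {
        # Placeholder wording, invented for this example.
        "prompt": "Okay, starting the instructions.",
        "next": {},
    },
}


def advance(node: str, reply: str) -> str:
    """Return the next node name, staying on the current node if no keyword matches."""
    for keyword, target in DIALOGUE[node]["next"].items():
        if keyword in reply.lower():
            return target
    return node


node = advance("start", "Prepare the ingredients")
print(DIALOGUE[node]["prompt"])
```

Users who skip to the instructions never hear the question about grouping ingredients, because that prompt exists only on the branch they did not take.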


Voice interfaces must minimize the time required to complete an action in order to be useful, because listening to signifiers takes longer than scanning a menu, and speaking a verbal command takes longer than clicking a button. But skipping important signifiers in order to save time can be counterproductive: it doesn't matter how quickly a task ended if the task failed.

Guessing, implicit signifiers, sequential cues, and progressive disclosure are all valid strategies for expediting voice commands, especially for secondary or easily reversible commands. These methods can sometimes be more efficient and just as effective at bridging the Gulf of Execution. But they must be carefully designed to ensure they actually reduce the interaction time, instead of extending it by introducing errors.