A practical guide to testing voice interfaces
For voice user interfaces (VUIs), user testing is just as essential as for any other system. Many of the tests resemble those we are used to with mobile applications, but if the device allows voice-only control (like Amazon’s Echo), there are significant differences.
Testing is as much a question of cost as development is. Few companies can afford to “spread” dozens of expensive prototypes across the target market for live testing, yet it is worth spending money on effective testing: it costs far more if customers start complaining about basic defects that comprehensive testing could have caught during development.
In this article, we therefore provide a practical guide to cheaper and more expensive VUI testing methods and options to help voice interface developers to implement more effective user tests.
VUI testing basics
Let’s start with the most basic question: how will the user know that they can control the system by speaking? Never take it for granted that the marketing department will take care of it, or that someone will tell them anyway.
Think instead of grandparents who are using a smartphone for the first time in their lives. They’re used to a different kind of phone, and quite rightly they look for buttons. It has been observed that people don’t use voice commands on new devices simply because they don’t know they can talk to the device!
It is therefore worth starting with a user test to see if they can figure out for themselves that they can (also) control the system with a voice command. It is also worth observing, even before implementing the speech recognition part, what and how the testers say, how articulate they are.
A later version of this is to stress test the VUI with different articulation situations – how well the system understands mumbling, speech over background noise, commands given in the middle of a conversation with several people, or even speech from people with speech impairments or disabilities (e.g. stroke survivors, people with hemiplegia).
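Such articulation stress tests can be prepared systematically. As a minimal sketch (our own illustration, not a tool from the article), the snippet below mixes a background-noise recording into a clean speech recording at a chosen signal-to-noise ratio, so the same utterance can be replayed to the recognizer under progressively harder conditions:

```python
import numpy as np

def mix_noise(speech, noise, snr_db):
    """Mix a noise recording into a speech recording at a target SNR (dB).

    Both inputs are float arrays of equal length, e.g. loaded from WAV files.
    The noise is scaled so that speech power / noise power == 10**(snr_db/10).
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: a clean "speech" tone plus white noise at 10 dB and 0 dB SNR
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 200 * np.pi, 16000))
noise = rng.normal(0, 1, 16000)
noisy_10db = mix_noise(speech, noise, 10.0)
noisy_0db = mix_noise(speech, noise, 0.0)
```

Feeding the same test phrases through the recognizer at, say, 20, 10 and 0 dB quickly shows where comprehension starts to break down.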
Basically, it is important to keep track of why and where users get stuck, what causes them frustration, and, in general, how much faster or slower they complete the test tasks than they expected.
We also need to be critically aware that the classic UX testing methodology, the think-aloud protocol, does not work well when testing voice control systems. The user thinks aloud and we ask them questions, but at the same time the voice control system tries to interpret this dialogue, since it cannot know that the user is talking to the moderator rather than to it.
Here are the aspects to focus on when testing VUIs.
Building on knowledge rather than reinventing the wheel
While it can be a source of pride when you make a groundbreaking discovery and your device is touted as the new iPhone or the new PC, it’s worth taking a good look around the market first. It’s almost certain that someone has invented the same thing before – just implemented it differently.
In the case of VUIs, you will typically find an IVR or mobile app that does much the same thing, at least in outline, as the system you are envisioning. Use these during development – especially their test results – because this alone saves a lot of time and unnecessary iterations.
It is worth conducting a preliminary survey of the target audience at the very beginning of development, and building on this in the later stages of testing. At Ergomania, we have learned a lot from the Microsoft case: while developing Cortana, Redmond asked experienced human personal assistants what tools and methods they used – and when it turned out that they ask questions whenever an instruction is not clear to them, Microsoft’s developers built this behaviour into Cortana.
Involve the target audience in the development from the beginning
Following Microsoft’s example, it makes a big difference to development if we can involve the target audience in testing as early and as often as possible. After all, they are the ones for whom the system is being built, so the interviewers can understand what they need most from their day-to-day experience. Consequently, most of the planned features can be assessed after the in-depth interviews to see if they would be really useful for the target audience.
In the case of a VUI, especially if there is no graphical user interface, there can be many use cases that the developers have not thought of, but that prospective users have been missing or have been having problems with for a long time.
Define the test tasks precisely
Whether we are talking about an early-stage test, a prototype test or a full-scale test of a finished product, we should never assume that users will simply do what we want them to do.
It is always best to formulate the test tasks clearly and precisely, step by step, in plain language. Better to come across as fussy than to end up with useless results.
Less is more in field tests
However, this does not mean that “free problem solving” cases should be excluded, i.e. where only an initial state and a goal are given and it is up to the users to decide how to get to the goal.
At Ergomania, we follow the principle of minimizing the influence of user testing when we want to find out how understandable and usable the VUI is by practically anyone (and by “anyone” we mean members of the target audience).
Vary the order of the tasks
Testing can be misleading if users learn a “trick”, a shortcut or “cheat code”, early in a succession of tasks – and tend to use it later. This, in turn, affects results, because they start tasks with “prior knowledge” where they are not supposed to have it.
The solution to this is to give each tester a randomized sequence of tasks, so that as few people as possible can gain “prior knowledge”.
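Randomizing per tester is simple to automate. A minimal sketch (task names and function are our own illustration): seeding the shuffle with the tester ID makes each tester’s order different from the others, yet reproducible when you need to rerun or audit a session.

```python
import random

tasks = ["activate the device by voice", "set a timer", "check the weather",
         "cancel the timer", "ask the system for help"]

def task_order(tester_id, task_list):
    """Return a per-tester shuffled copy of the task list.

    Seeding with the tester ID makes the order deterministic:
    the same tester always gets the same sequence.
    """
    rng = random.Random(tester_id)
    order = task_list[:]
    rng.shuffle(order)
    return order

for tester in ("T1", "T2", "T3"):
    print(tester, task_order(tester, tasks))
```

For small panels, a counterbalanced design (e.g. a Latin square) is an alternative to pure shuffling, guaranteeing that each task appears in each position equally often.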
Choose your testers and questions well
Although it may seem obvious at first glance, in our experience it is still common for developers to select the wrong users for testing. For example, the ideal subjects for a health app are hardly healthy, vibrant students – but rather the patients for whom the app is being developed and their relatives and carers!
For a VUI-managed healthcare device, what matters is whether elderly patients – whose voices may be hoarse or too soft, and who sometimes mumble – and the nursing staff can control the system with voice commands.
How many testers is ideal?
The question of the number of testers requires a different approach. It is hardly necessary to include hundreds of people who represent the entire population in terms of demographic distribution! Much more would be achieved by selecting quality testers instead.
The number of users needed depends on the application or VUI being tested; beyond that number, further tests only waste time and money. Experience over the past decades has shown that a minimum of four and a maximum of eight testers are perfectly sufficient for almost any system or device.
What questions should we ask?
Although observing the way the system is used can reveal a lot of valuable information during testing, asking the right questions can also reveal hidden parts that are crucial to improving or finalizing the system.
It is generally observed that most people avoid conflicts they consider unnecessary (and often even essential ones), and therefore prefer to keep quiet about negative aspects or dismiss them as irrelevant in a direct, personal account.
It is also ill-advised to ask leading questions – people tend to comply to avoid conflict, so they will say what they think we want to hear. For example, instead of asking “did you like this solution because it makes your job easier?”, ask “what do you think of this solution – how effective is it?”.
Sometimes it is useful for testers to tell you what they have experienced
At Ergomania, during testing, we occasionally let testers report back in their own words. We always ask open questions to avoid suggestions. For example, if a feature is not to the user’s liking, we don’t leave them guessing why they don’t like it, but rather encourage them to tell us in their own words.
This is an advantage, because if we only get negative feedback after the product launches, the same thing will happen: users will describe in their own words what caused the problem – only then it will be far more expensive to fix. It’s worth getting ahead of the game and addressing these problems early!
Otherwise, it is much more efficient to guide testers all the way through with targeted questions, occasionally including a free-form section. The more structured the testing, the more effective it will be.
What to look out for when monitoring testers?
Observing testers during a test is just as important as their conscious feedback and the logs generated by the system. It is a cliché, but body language – metacommunication – is said to account for some 80% of all communication and interaction.
Most people are not really aware of this, and therefore their body language at the time of testing is much more believable than their conscious account. At Ergomania, when testing VUIs and other interfaces, we therefore pay particular attention to looking at what people are communicating through metacommunication.
For example, do certain features trigger expected or unexpected emotional reactions? Do they get confused when they have to talk to the system? When do they hesitate or start to ramble? How do they react to the system’s responses?
The benefits of early phase testing
If you have a sound knowledge of the market, a precise psychological profile and a number of real-life use cases, then early phase testing of individual features and modules may not be necessary.
In all other cases, however, it is highly recommended to “run” all concepts before programmers or engineers start to implement them. One of the methods Ergomania’s experts use in the development of VUIs is the live dialogue method.
In essence, this is where someone writes a realistic dialogue between the user and the system, and two people act it out. Very quickly, we find out how lifelike the dialogue is, or whether there are repetitions, parts that are difficult for the target audience to understand, etc.
Since we are talking about VUIs, it is also important to flag words with problematic pronunciation that make it difficult for the system to understand the user.
Wizard of Oz test
The Wizard of Oz (WOz) test is used when the system behind the VUI is not yet ready, so a live human stands in for it. WOz is typically used in the early stages and is more of a creative tool than a system calibration solution.
The test really is like a magic trick in that it requires a “magician” and an assistant: the magician remains invisible the whole time and, ideally, the testers are unaware of their existence. For this reason alone, the magician cannot double as the assistant, who is present throughout, taking notes, helping with the test, and so on.
For example, WOz used to be used for early testing of telephone customer support systems (IVRs), but it works just as well for VUIs. One of its big advantages is that you can test dialog flows without writing a single line of code:
- First step: enter your postcode.
- Next step: enter your phone number.
- Next step: confirm your phone number.
- Opening text: “Please say or type the last nine digits of your telephone number.”
- Error message 1: “I’m sorry, I didn’t understand. Please enter the last nine digits of your phone number.”
- Error message 2: “I’m sorry, I still don’t understand. Please enter the last nine digits of your phone number.”
- Closing error: “I’m sorry for any inconvenience caused. Please wait while I put you through to an operator.”
Of course, the test user does not know that the above texts are being read out by a human – to them, it feels like using a real VUI.
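Once the flow has stabilized in WOz sessions, the wizard’s script can be handed to developers almost verbatim. A hypothetical sketch of the phone-number flow above as a tiny state machine (the function and `recognize` stand-in are our own illustration):

```python
def phone_number_flow(recognize):
    """Run the phone-number dialogue as a simple retry loop.

    `recognize` stands in for the speech recognizer: given a prompt,
    it returns the recognized digits as a string, or None on failure.
    """
    prompts = [
        "Please say or type the last nine digits of your telephone number.",
        "I'm sorry, I didn't understand. Please enter the last nine digits of your phone number.",
        "I'm sorry, I still don't understand. Please enter the last nine digits of your phone number.",
    ]
    for prompt in prompts:
        answer = recognize(prompt)
        if answer is not None and len(answer) == 9 and answer.isdigit():
            return f"Thank you, I got {answer}."
    # Two failed retries: hand over to a human, as in the WOz script.
    return ("I'm sorry for any inconvenience caused. "
            "Please wait while I put you through to an operator.")

# Simulated run: the "user" fails once, then answers correctly.
replies = iter([None, "123456789"])
print(phone_number_flow(lambda prompt: next(replies)))
```

The point of WOz is precisely that this code only needs to be written after the prompts and the retry logic have already been validated with real users.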
Other considerations when testing VUI
At Ergomania, we typically filter out most bugs in the later stages of VUI development. Beyond the immediate benefit of finding and fixing bugs, this is also vital for improving the user experience.
Dialogue-based usability testing is also required for advanced VUI
We have already mentioned the role and importance of dialogue in WOz tests – but the same needs to be done without a “wizard behind the screen” standing in for the machine intelligence. This time the system itself does the interpreting, which quickly shows whether it can manage and steer the conversation properly, or whether it slips up or gets bogged down.
These usability tests usually follow a scripted scenario, partly aimed at practical usability and partly at detecting (still) hidden bugs. Of course, there are different functionalities for different systems – that’s why there are only a few templates for usability tests, all other scenarios are created according to the specific system.
Common test scenarios include the activation process, or the use of background noise, accents and speech errors. For example, a time-setting function could be tested like this:
- User (U): “Set the timer to six!”
- The system should then prompt back to clarify the instruction.
- System (S): “Six in the morning or six in the evening?”
- (U): “Six in the morning.”
- At this point the command is still incomplete, so if the system is working correctly, it will ask for another clarification.
- (S): “Should it be one-off or recurring?”
And so on – during such a test the user always provides just one more piece of information, and the system has to elicit the rest through dialogue. This scenario reveals, for example, whether the system can both interpret the information it already has and build a hierarchy from it, i.e. whether it knows in which order to query the missing parameters.
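The parameter hierarchy in the timer dialogue can be sketched as a simple slot-filling loop – a minimal illustration of the idea, with slot names and questions of our own invention mirroring the example above:

```python
def next_question(slots):
    """Given the slots collected so far, return the next clarifying
    question in priority order, or None when the command is complete."""
    order = [
        ("hour", "What time should the timer be set for?"),
        ("am_pm", "In the morning or in the evening?"),
        ("recurring", "Should it be one-off or recurring?"),
    ]
    for slot, question in order:
        if slot not in slots:
            return question
    return None

slots = {}
slots["hour"] = 6
print(next_question(slots))   # asks about morning/evening
slots["am_pm"] = "am"
print(next_question(slots))   # asks about one-off/recurring
slots["recurring"] = False
print(next_question(slots))   # None: the command is complete
```

In a dialogue test, the interesting question is whether the real system behaves like this loop: filling each slot the user volunteers, and asking only for what is still missing, in a sensible order.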
Find out what frustrates users
For example, it was found that users were frustrated with one system that required them to stop recording manually, rather than the system stopping it itself. People don’t like to deal with devices more than they absolutely have to, and no matter how small the button press or voice command, experience shows that it frustrates them.
If you can’t find twenty errors, choose a new test method
Generally speaking, if a testing method does not surface at least 20 bugs in a prototype, it is not that the system is good – the method was lousy. Of course, the figure of 20 is not meant literally: the point is that the right method is the one that reveals as many bugs as possible.
For a VUI, text comprehension is everything
Since we are testing VUIs, it is vital that the system understands the text accurately. WOz, as useful as it is in the early stages, cannot filter out the critical parts where the software may fail to understand the text.
Usability tests are also primarily used to see how easy the system is to use, or where the process breaks down, or a fatal program error occurs.
For this reason, Ergomania’s experts test the systems’ understanding of the text, by looking specifically at their ability to understand problematic words and their ability to “think” contextually.
Whether you want to test your own system or develop your own VUI, contact Ergomania and we will deliver the best solution for you!