In a vertical prototype, the functions that are included are evolved in enough detail to support realistic user experience evaluation. Often the functionality of a vertical prototype includes a stub for a back-end database, or even an actual working one.
A vertical prototype is ideal for times when you need to represent completely
the details of an isolated part of an individual interaction workflow in order to
understand how those details play out in actual usage. For example, you may wish to study a new design for the checkout part of the workflow for an
e-commerce Website. A vertical prototype would show that one task sequence
and associated user actions, in depth.
11.2.2 "T" Prototypes
A "T" prototype combines the advantages of both horizontal and vertical,
offering a good compromise for system evaluation. Much of the interface is realized at a shallow level (the horizontal top of the T), but a few parts are done in depth (the vertical part of the T). This makes a T prototype essentially a
horizontal prototype, but with the functionality details filled out vertically for some parts of the design.
In the early going, the T prototype provides a nice balance between the two extremes, giving you some advantages of each. Once you have established a system overview in your horizontal prototype, as a practical matter the T prototype is the next step toward achieving some depth. In time, the horizontal foundation supports evolving vertical growth across the whole prototype.
11.2.3 Local Prototypes
We call the small area where horizontal and vertical slices intersect a "local prototype" because the depth and breadth are both limited to a very localized interaction design issue. A local prototype is used to evaluate design alternatives
for particular isolated interaction details, such as the appearance of an icon,
wording of a message, or behavior of an individual function. It is so narrow and shallow that it is about just one isolated design issue and it does not support any depth of task flow.
A local prototype is the solution for those times when your design team encounters an impasse in design discussions where, after a while, there is no agreement and people are starting to repeat themselves. Contextual data are not clear on the question and further arguing is a waste of time. It is time to put the specific design issue on a list for testing, letting the user or customer speak to it in a kind of "feature face-off" to help decide among the alternatives.
For example, your design team might not be able to agree on the details of a "Save" dialogue box and you want to compare two different approaches. So you can mock up the two dialogue box designs and ask for user opinions about how they behave.
Local prototypes are used independently from other prototypes and have very short life spans, useful only briefly when specific details of one or two particular design issues are being worked out. If a bit more depth or breadth becomes
needed in the process, a local prototype can easily grow into a horizontal,
vertical, or T prototype.
11.3 FIDELITY OF PROTOTYPES
The level of fidelity of a prototype is another dimension along which prototype content can be controlled. The fidelity of a prototype reflects how "finished"
it is perceived to be by customers and users, not how authentic or correct
the underlying code is (Tullis, 1990).
11.3.1 Low-Fidelity Prototypes
Low-fidelity prototypes are, as the term implies, prototypes that are not faithful representations of the details of look, feel, and behavior, but rather give high-level, more abstract impressions of the intended design. Low-fidelity prototypes are appropriate when design details have not been decided or are likely to change, and when it would be a waste of effort, and perhaps even misleading, to try to flesh out the details.
Because low-fidelity prototypes are sometimes not taken seriously, the case for low-fidelity prototyping, especially using paper, bears some explaining. In fact, it is perhaps at this lowest end of the fidelity spectrum, paper prototypes, that dwells the highest potential ratio of value in user experience gained per unit of effort expended. A low-fidelity prototype is much less evolved and therefore far less expensive. It can be constructed and iterated in a fraction of the time it takes to produce a good high-fidelity prototype.
But can a low-fidelity prototype, a prototype that does not look like the final system, really work? The experience of many has shown that despite the vast difference between a prototype and the finished product, low-fidelity prototypes can be surprisingly effective.
Virzi, Sokolov, and Karis (1996) found that people, customers, and users do take paper prototypes seriously and that low-fidelity prototypes do reveal many user experience problems, including the more severe problems. You can get your project team to take them seriously, too. Your team may be reluctant about doing a "kindergarten" activity, but they will see that users and customers love them and that they have discovered a powerful tool for their design projects.
But will not the low-fidelity appearance bias users about the perceived user experience? Apparently not, according to Wiklund, Thurrott, and Dumas (1992), who concluded in a study that aesthetic quality (level of finish) did not bias users (positively or negatively) about the prototype's perceived user experience. As long as they understand what you are doing and why, they will go along with it.
Sometimes it takes a successful personal experience to overcome a bias against low fidelity. In one of our UX classes, we had an experienced software developer who did not believe in using low-fidelity prototypes. Because it was a requirement in the project for the course, he did use the technique anyway, and it was an eye-opener for him, as this email he sent us a few months later attests:
After doing some of the tests I have to concede that paper prototypes are useful. Reviewing screenshots with the customer did not catch some pretty obvious usability problems and now it is hard to modify the computer prototype. Another problem is that we did not get as complete a coverage with the screenshots of the system as we thought and had to improvise some functionality pretty quickly. I think someone had told me about that ...
Low-fidelity prototyping has long been a well-known design technique and, as Rettig (1994) says, if your organization or project team has not been using low-fidelity prototypes, you are in for a pleasant surprise; it can be a big breakthrough tool for you.
11.3.2 Medium-Fidelity Prototypes
Sometimes you need a prototype with a level in between low fidelity and high fidelity. Sometimes you have to choose one level of fidelity to stick with because you do not have time or other resources for your prototype to evolve from low fidelity to high fidelity. For teams that want a bit more fidelity in their design representations than they can get with paper and want to step up to computer-based representations, medium-fidelity prototypes can be the answer.
In Chapter 9, for example, this occurs at about the time when you undertake intermediate design and early detailed design. As a mechanism for medium-fidelity prototypes, wireframes (also in Chapter 9) are an effective way to show layout and the breadth of user interface objects and are fast becoming the most popular approach in many development organizations.
11.3.3 High-Fidelity Prototypes
In contrast, high-fidelity prototypes are more detailed representations of
designs, including details of appearance and interaction behavior. High fidelity is required to evaluate design details, and it is how users can see the complete (in the sense of realism) design. High-fidelity prototypes are the vehicle for
refining the design details to get them just right as they go into the final
implementation.
As the term implies, a high-fidelity prototype is faithful to the details, the look, feel, and behavior of an interaction design and possibly even system functionality. A high-fidelity prototype, if and when you can afford the added expense and time to produce it, is still less expensive and faster than programming the final product and will be so much more realistic, more interactive, more responsive, and so much more representative of a real software product than a low-fidelity prototype. High-fidelity prototypes can also be useful as advance sales
demos for marketing and even as demos for raising venture capital for the
company.
An extreme case of a high-fidelity prototype is the fully programmed, whole-system prototype, discussed later in this chapter, including both interaction design and non-user-interface functionality working together. Whole-system prototypes can be as expensive and time-consuming as an implementation of an early version of the system itself and entail a lot of the software engineering management issues of non-prototype system development, including UX and SE collaboration about system functionality and overall design much earlier in the project than usual.
11.4 INTERACTIVITY OF PROTOTYPES
The amount of interactivity allowed by a prototype is not independent of the
level of fidelity. In general, high interactivity requires high fidelity. Here we discuss various ways to accomplish interactivity within a prototype.
11.4.1 Scripted and "Click-Through" Prototypes
The first prototypes to have any "behavior," or ability to respond to user actions, are usually scripted prototypes, meaning programmed with a scripting language. Scripting languages are easy to learn and use and, being high-level languages, can be used to produce some kinds of behavior very rapidly. But they are not effective tools for implementing much functionality. So scripted prototypes will be low or medium fidelity, but they can produce nice live-action storyboards of screens.
A "click-through" prototype is a medium-fidelity prototype with some active links or buttons that allow sequencing through screens by clicking, but usually with no more functionality than that. Wireframes can be used to make click- through prototypes by adding links that respond in simple ways to clicking, such as moving to the next screen.
11.4.2 A Fully Programmed Prototype
Even the prototypes of large systems can themselves be large and complex. On rare occasions and in very special circumstances, where time and resources permit and there is a genuine need, a project team is required to produce a high-fidelity full-system operational prototype of a large system, including at least some back-end functionality.
One such occasion did occur in the early 1990s when the FAA sought proposals from large development organizations for a big 10-year air traffic control system development project. Bidders successful in the first phase of
proposals would be required to design and build a full-function proof-of-concept prototype in a project that itself took nearly 2 years and cost millions of dollars. On the basis of this prototype phase, even larger multiyear contracts would be awarded for construction of the real system.
Such large and fully functional prototypes call for the power of a real programming language. Although the resulting prototype is still not intended to be the final system, a real programming language gives the most flexibility to produce exactly the desired look and feel. And, of course, a real programming language is essential for implementing extensive functionality. The process, of course, will not be as fast or as low in cost, and the resulting prototype will not be as easy to change.
11.4.3 "Wizard of Oz" Prototypes: Pay No Attention to the Man Behind the Curtain
The Wizard of Oz prototyping technique is a deceptively simple approach to the appearance of a high degree of interactivity and highly flexible prototype behavior in complex situations where user inputs are unpredictable. The setup requires two connected computers, each in a different room. The user's computer is connected as a "slave" to the evaluator's computer. The user makes input actions on one computer, which are sent directly to a human team member at the evaluator's computer, hidden in the second room.
The human evaluator sees the user inputs on the hidden computer and sends appropriate simulated output back to the user's computer. This approach has particular advantages, one of which is the apparently high level of interactivity as seen by the user. It is especially effective when flexible and adaptive "computer" behavior is of the essence, as with artificial intelligence and other difficult-to-implement systems. Within the limits of the cleverness of the human evaluator, the "system" should never break down or crash.
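As an illustration of the two-computer setup just described, the following is a minimal Python sketch, our own assumption rather than anything from the chapter, of a Wizard of Oz relay: the participant's terminal sends each action over a socket, and the hidden "wizard" types the simulated system response. The host name and port are placeholders.

```python
import socket

def run_wizard(host: str = "0.0.0.0", port: int = 5050) -> None:
    """Hidden evaluator's side: see each user action, type the simulated response."""
    with socket.create_server((host, port)) as server:
        conn, addr = server.accept()
        with conn, conn.makefile("r") as actions:
            print(f"Participant connected from {addr}")
            for line in actions:                      # each line is one user action
                print(f"USER ACTION: {line.strip()}")
                reply = input("Wizard response> ")    # the human plays the "system"
                conn.sendall((reply + "\n").encode())

def run_user(host: str = "wizard-host.local", port: int = 5050) -> None:
    """Participant's side: send each action, display whatever the 'system' replies."""
    with socket.create_connection((host, port)) as sock, sock.makefile("r") as replies:
        while True:
            action = input("You> ")
            sock.sendall((action + "\n").encode())
            print("System:", replies.readline().strip())

# Run run_wizard() on the evaluator's machine and run_user() on the participant's machine.
```

The point of the sketch is the division of labor: the participant sees only "system" output, while every response actually comes from a person in the other room.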
In one of the earliest uses of the Wizard of Oz technique that we know of, Good and colleagues (1984) empirically designed a command-driven email interface to accommodate natural novice user actions. Users were given no menus, no help, no documentation, and no instruction.
Users were unaware that a hidden operator was intercepting commands when the system itself could not interpret the input. The design was modified iteratively so that it would recognize and respond to inputs that had previously required interception. The design progressed from recognizing only 7% of user commands to recognizing about 76%.
The Wizard of Oz prototyping technique is especially useful when your design ideas are still wide open and you want to see how users behave naturally in the course of simulated interaction. It could work well, for example, with a kiosk.
You would set up the general scope of usage expected and let users at it.
You will see what they want to do. Because you have a human at the other end, you do not have to worry about whether you programmed the application to handle any given situation.
11.4.4 Physical Mockups for Physical Interactivity
If a primary characteristic of a product or system is physicality, such as you have with a handheld device, then an effective prototype will also have to offer physical interactivity. Programming new applications on physical devices with real software means complex and lengthy implementation on a challenging hardware and software platform. Prototypes afford designers and others insight into the product look and feel without complicated specialized device programming.
Some products or devices are "physical" in the sense that they are something like a mobile device that users might hold in their hands. Or a system might be "physical" like a kiosk. A physical prototype for such products goes beyond screen simulation on a computer; the prototype encompasses the whole device. Pering (2002) describes a case study of such an approach for a handheld communicator device that combines the functionality of a PDA and a cellphone.
If the product is to be handheld, make a prototype from cardboard, wood, or metal that can also be handheld. If the product, such as a kiosk, is to sit on the floor, put the prototype in a cardboard box and add physical buttons or a touchscreen.
You can use materials at hand or craft the prototype with realistic hardware. Start off with glued-on shirt buttons and progress to real push-button switches. Scrounge up hardware buttons and other controls that are as close to those in your envisioned design as possible: push buttons, tilt buttons, sliders (for example, from a light dimmer), knobs and dials, a rocker switch, or a joystick from an old Nintendo game.
Even if details are low fidelity, these are higher fidelity in some ways because they are typically 3D, embodied, and tangible. You can hold them in your hands. You can touch them and manipulate them physically. Also, physical prototypes are excellent media for supporting evaluation of emotional impact and other user experience characteristics beyond just usability.
And just because physical product prototyping usually involves a model of physical hardware does not rule out being a low-fidelity prototype. Designers of the original Palm PDA carried around a block of wood as a physical prototype of the envisioned personal digital assistant. They used it to explore the physical
feel and other requirements for such a device and its interaction possibilities (Moggridge, 2007, p. 204).
Physical prototyping is now being used for cellphones, PDAs, consumer electronics, and products beyond interactive electronics, employing found objects, "junk" (paper plates, pipe cleaners, and other playful materials) from the recycle bin, thrift stores, dollar stores, and school supply shops
(N. Frishberg, 2006). Perhaps IDEO (http://www.ideo.com) is the company most famous for its physical prototyping for product ideation; see their shopping cart project video (ABC News Nightline, 1999) for a good example.
Wright (2005) describes the power of a physical mockup that users can see and hold as a real object over just pictures on a screen, however powerful and fancy the graphics. Users get a real feeling that this is the product. The kind of embodied user experience projected by this approach can lead to a product that generates user surprise and delight, product praise in the media, and must-have cachet in the market.
Paper-in-device mockup prototype, especially for mobile applications
The usual paper prototype needs an "executor," a person playing computer to change screens and do all the other actions of the system in response to a user's actions. This role of mediator between user and device will necessarily interfere with the usage experience, especially when a large part of that experience involves holding, feeling, and manipulating the device itself.
Bolchini, Pulido, and Faiola (2009) and others devised a solution by which they placed the paper prototype inside the device, leveraging the advantages of paper prototyping in evaluating mobile device interfaces with the real physical device. They drew the prototype screens on paper, scanned them, and loaded them into the device as a sequence of digital images that the device can display. During evaluation, users can move through this sequential navigation by making touches or gestures that the device already can recognize.
This is an agile and inexpensive technique, and the authors reported that their testing showed that even this limited amount of interactivity generated a lot of useful feedback and discussion with evaluation users. Also, by adding page annotations about user interactions, possible user thoughts, and other behind- the-scenes information, the progression of pages can become like a design storyboard of the usage scenario.
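A rough desktop stand-in for this idea might look like the following Python/Tkinter sketch, which is our own illustration and not the approach of Bolchini, Pulido, and Faiola: scanned paper screens saved as PNG files are shown in sequence, a tap or left-click steps forward, and a right-click steps back. The folder, file names, and input bindings are assumptions, and PNG display needs Tk 8.6 or later.

```python
import glob
import tkinter as tk

# Desktop stand-in for the paper-in-device technique: step through scanned
# paper screens as a sequence of images. File names are assumptions.

class PaperInDevice:
    def __init__(self, root, image_paths):
        self.images = [tk.PhotoImage(file=p) for p in image_paths]  # PNG needs Tk 8.6+
        self.index = 0
        self.label = tk.Label(root, image=self.images[0])
        self.label.pack()
        root.bind("<Button-1>", lambda e: self.step(+1))   # tap/left-click: next screen
        root.bind("<Button-3>", lambda e: self.step(-1))   # right-click: previous screen

    def step(self, delta: int) -> None:
        self.index = max(0, min(len(self.images) - 1, self.index + delta))
        self.label.configure(image=self.images[self.index])

if __name__ == "__main__":
    root = tk.Tk()
    root.title("Paper-in-device mockup")
    PaperInDevice(root, sorted(glob.glob("screens/screen*.png")))
    root.mainloop()
```

On the real device the same sequence of images would be loaded into whatever image viewer or gesture support the device already has, which is what gives the technique its low cost.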
11.4.5 Animated Prototypes
Most prototypes are static in that they depend on user interaction to show what they can do. Video animation can bring a prototype to life for concept demos, to visualize new interaction designs, and to communicate design ideas. While animated prototypes are not interactive, they are at least active.
Löwgren (2004) shows how video animations based on a series of sketches can carry the advantages of low-fidelity prototypes to new dimensions where a static paper prototype cannot tread. Animated sketches are still "rough" enough to invite engagement and design suggestions but, being more like scenarios or storyboards, animations can convey flow and sequencing better in the context of usage.
HCI designers have been using video to bring prototypes to life since at least the 1980s (Vertelney, 1989). A simple approach is to use storyboard frames in a "flip book" style sequence on video or, if you already have a fairly complete low-fidelity prototype, you can film it in motion by making a kind of "claymation" frame-by-frame video of its parts moving within an interaction task.
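For the "flip book" approach, a few lines of code can stitch a folder of scanned storyboard sketches into a video. This sketch assumes the opencv-python package is installed; the frame file names, output path, and pacing are all illustrative assumptions.

```python
import glob
import cv2  # assumes the opencv-python package is installed

def storyboard_to_video(pattern: str = "storyboard/frame*.png",
                        out_path: str = "storyboard.mp4",
                        seconds_per_frame: int = 2,
                        fps: int = 10) -> None:
    """Turn a folder of storyboard sketches into a simple 'flip book' video."""
    frames = sorted(glob.glob(pattern))
    first = cv2.imread(frames[0])
    height, width = first.shape[:2]
    writer = cv2.VideoWriter(out_path,
                             cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for path in frames:
        frame = cv2.resize(cv2.imread(path), (width, height))  # keep a uniform size
        for _ in range(seconds_per_frame * fps):                # hold each sketch on screen
            writer.write(frame)
    writer.release()

if __name__ == "__main__":
    storyboard_to_video()
```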
11.5 CHOOSING THE RIGHT BREADTH, DEPTH, LEVEL OF FIDELITY, AND AMOUNT OF INTERACTIVITY
There are two major factors to consider when choosing the right breadth, depth, level of fidelity, and amount of interactivity of your prototypes: the stage of progress you are in within the overall project and the design perspective in which you are prototyping. These two factors are interrelated, as stages of progress occur within each of the design perspectives.
11.5.1 Using the Right Level of Fidelity for the Current Stage of Progress
Choosing your audience and explaining the prototype
In general, low-fidelity prototypes are a tool to be used within the project team. Low-fidelity prototypes are shown to people outside the team only to get feedback on very specific aspects of the design. If low-fidelity
prototypes are shown casually around to users and customer personnel
without careful explanation, they can be misinterpreted. To someone not familiar with their use, a paper prototype can look like the product of an inexperienced amateur.
Even if they do get beyond the rough appearance, without guidance as to what kind of feedback you want, "sophisticated" users and customers will immediately see missing features and think that you do not know what you are doing, possibly creating a credibility gap. Therefore, low-fidelity prototypes are often considered "private" to the project team and reserved for your own use for early exploration and iterative refinement of the conceptual design and early workflow.
In contrast, when a project is deliverable-oriented and the customer expects to evaluate your progress based on what they see developing as a product, a medium- or high-fidelity prototype can be used as a design demo. Practitioners often construct pixel-perfect representations of envisioned designs for these prototypes to show off designs to customers, users, and other non-team stakeholders. Such realistic-looking demos, however, carry the risk of being interpreted as complete designs, as versions of the final product. If something is wrong or missing, the designers are still blamed. Explaining everything in advance can head off these complications.
A progression of increasing fidelity to match your stage of progress
As a general rule, as you move through stages of progress in your project, you will require increasing levels of fidelity in your prototypes. For example,
the goal in an early stage might be to determine if your design approach is even a good idea or a feasible one.
The goal of a later stage might simply be to show off: "Look at what a cool design we have!" In Table 11-1 we describe the appropriate time and place to use each kind of prototype in terms of various kinds of iteration within design production (Chapter 9). The types of prototypes mentioned in Table 11-1 are described in various places, mostly in this chapter.
11.5.2 Using the Right Level of Fidelity for the Design Perspective Being Addressed
For each design perspective in which you make a prototype, you must decide which kind of prototype is needed, horizontal or vertical and at what fidelity. This requires you to consider which aspects of the design you are worried about and which aspects need to be tested in that perspective. In large part, this means asking about the audience for and the purpose of your prototype in the context of that perspective. What do you hope to accomplish with a prototype in the design perspective being addressed? What questions will the prototype help you answer?
Table 11-1: Summary of the uses for various levels of fidelity and types of prototypes
• Ideation and sketching. Purpose: to support exploring ideas, brainstorming, and discussion (so design details are inappropriate). Prototypes: sketches, fast and disposable mockups, ultralow fidelity.
• Conceptual design. Purpose: to support exploration and creation of the conceptual design, the high-level system structure, and the overall interaction metaphor. Prototypes: evolution from hand-drawn paper, computer-printed paper, low-fidelity wireframes, and high-fidelity wireframes to pixel-perfect interactive mockups (to communicate with the customer).
• Intermediate design. Purpose: to support interaction design for tasks and task threads. Prototypes: evolution from paper to wireframes.
• Detailed design. Purpose: to support decisions about navigation details, screen design and layout, including pixel-perfect visual comps and a complete specification for the look and feel of the "skin." Prototypes: detailed wireframes and/or pixel-perfect interactive mockups.
• Design refinement. Purpose: to support evaluation to refine a chosen design by finding and removing as many UX problems as possible. Prototypes: medium to high fidelity, lots of design detail, possibly a programmed prototype.
Prototyping for the ecological perspective
To support exploration of the high-level system structure, a prototype in the ecological perspective is a kind of concept map of how the different parts of the system will work at the conceptual level and how the system fits in with the rest of the world: other systems, other products, and other users.
As you evaluate the conceptual design, remember that you are looking at the big picture so the prototypes do not need to be high fidelity or detailed. If evaluation with early conceptual prototypes shows that users do not get along well with the basic metaphor, then the designers will not have wasted all the time it takes to work out design details of interaction objects such as screen icons, messages, and so on.
The development of IBM's Olympic Message System (Gould et al., 1987) was an avant garde example of product prototyping with emphasis on the ecological setting. IBM was tasked to provide a communications system for the 1984 Olympics in Los Angeles to keep athletes, officials, families, and friends in immediate contact during the games. For their earliest concept testing they used a "Wizard of Oz" technique whereby participants pressed keys on a computer terminal and spoke outgoing messages. The experimenter read aloud the incoming messages and gave other outputs as the interaction required.
For enhanced ecological validity they used a "hallway methodology" that started with a hollow wooden cylinder set in IBM building hallways, with pictures of screens and controls pasted on. They quickly learned a lot about the best height, location, labeling, and wording for displays. Real interactive displays housed in more finished kiosk prototypes led to even more feedback from visitors and corporate passersby. The resulting system was a big success at the Olympics.
Prototyping for the interaction perspective
For conceptual design, support early exploration with ideation and sketching using rapid and disposable low-fidelity prototypes. As you evaluate the conceptual design, remember that you are looking at the big picture so the fidelity of prototypes can be low. Use many rapid iterations to refine candidate conceptual design ideas.
As you move into intermediate design iteration, start by choosing a few tasks that are the most important and prototype them fairly completely. Mock up a typical task so that a user can follow a representative task thread.
Use medium-fidelity prototypes, such as wireframes, to flesh out behavior, including sequencing and responses to user actions. As we will see in later chapters on formative evaluation, a great deal can be learned from an incomplete design in a prototype.
For detailed design, after you have exhausted possibilities in evaluating the conceptual model and early screen design ideas with your low-fidelity, possibly paper, prototype, you will move on. You might next use a computer-printed paper prototype or a computer-based mockup to test increasing amounts of design detail.
You will flesh out your prototype with more complete task threads, well-designed icons, and carefully worded messages. Representing and evaluating full design details require high-fidelity prototypes, possibly programmed
and possibly connected with some working functionality, such as database functions.
Prototyping for the emotional perspective
A prototype to support evaluation of emotional impact needs certain kinds of details. High fidelity and high interactivity are usually required to support this perspective. Although full details at the interaction level may not always be required, you do need details relating to fun, joy of use, and user satisfaction. Further, the emotional perspective for physical devices more or
less demands physical mockups for a real feeling of holding and manipulating the device.
11.5.3 Managing Risk and Cost within Stages of Progress and within Design Perspectives
There has been much debate over the relative merits of low-fidelity prototypes vs. high-fidelity prototypes, but Buxton (2007a) has put it in a better light: It is not so much about high-fidelity prototypes vs. low-fidelity prototypes as it is about getting the right prototype. But, of course, part of getting it right is in determining the right level of abstraction for the purpose.
One way to look at the horizontal vs. vertical and low-fidelity vs. high-fidelity question is with respect to the three design perspectives (Chapter 7). For each of these perspectives, it is about managing risk, particularly the risk (in terms of cost) of getting the design wrong (with respect to the corresponding perspective) and the cost of having to change it.
A user interaction design can be thought of in two parts:
• the appearance, especially the visual aspects of the user interface objects
• the behavior, including sequencing and responses to user actions
Of these, which has the biggest risk in terms of cost to change late in the schedule? It is the behavior and sequencing. The behavior is the part that corresponds roughly to the design metaphor envisioned to support the user workflow. Therefore, we should try to get the best design we can for the behavior before we worry about details of appearance. That means our earliest and easiest to change prototypes should represent interaction design behavior and that means having a low-fidelity prototype first. This interaction structure and sequencing is very easy to change with paper screens, but becomes increasingly more difficult to modify as it is committed to programming code.
In low-fidelity prototypes it can even be a disadvantage to show too many details that appear refined. As Rettig (1994) points out, if you have a nice slick look and feel, you will naturally get most of your feedback on the look and feel details rather than on high-level issues such as workflow, task flow, overall layout, and the metaphor. Also, some users may be less willing to suggest changes for a prototype that even appears to be high fidelity because of the impression that the design process is completed and that any feedback they provide is probably too late (Rudd, Stern, & Isensee, 1996).
Later, however, increasingly higher fidelity prototypes can be used to establish and refine the exact appearance, the visual and manipulation aspects of interface objects such as colors, fonts, button design, highlighting an object, and so on, and eventually to bring in some depth in terms of functionality, for example, more detail about checking and handling errors in user inputs. As shown in Table 11-2, there is a place for both low-fidelity and high-fidelity prototypes in most design projects.
Table 11-2: Summary of comparison of low-fidelity and high-fidelity prototypes
• Low fidelity (e.g., paper): main advantage is flexibility, making it easy to change sequencing and overall behavior; used early in the project; almost no functionality; low cost.
• High fidelity (e.g., computer): main advantage is fidelity of appearance; used later in the project; intermediate amount of functionality; high cost.
Finally, and just as an aside, prototyping is a technique that can help manage and reduce risks on the software engineering side as well as on the UX side of
a project.
11.5.4 Summary of the Effects of Breadth, Depth, and Fidelity Factors
In the graph in Figure 11-3 we show roughly how scope (vertical vs. horizontal) and fidelity issues play out in the choice of prototyping approaches based on what the designer needs.
11.6 PAPER PROTOTYPES
Soon after you have a conceptual design mapped out, give it life as a low-fidelity prototype and try out the concept. This is the time to start with a horizontal prototype, showing the possible breadth of features without much depth. The
facility of paper prototypes enables you, in a day or two, to create a new design
idea, implement it in a prototype,
evaluate it with users, and modify it.
Low fidelity usually means paper prototypes. You should construct your early paper prototypes as quickly and efficiently as possible. Early versions are just about interaction, not functionality. You do not even have to use "real" widgets.
Sometimes a paper prototype can act as a "coding blocker" to prevent time wasted on coding too early. At this critical juncture, when the design is starting to come together, programmers are likely to suffer from the WISCY syndrome (Why Isn't Sam Coding Yet?). They naturally want to run off and start coding.
Figure 11-3: Depth, breadth, and fidelity considerations in choosing a type of prototype.
You need a way to keep people from writing code until the design has reached the point where it is worth investing in. Once any code gets written, there will be ownership attached and it will get protected and will stay around longer than it should. Even though it is just a prototype, people will begin to resist making changes to "their baby"; they will be too invested in it. And other team members, knowing that it is getting harder to get changes through, will be less willing to suggest changes.
11.6.1 Paper Prototypes for Design Reviews and Demos
Your earliest paper prototypes will have no functionality or interaction, no ability to respond to any user actions. You can demonstrate some predefined sequences of screen sketches as storyboards or "movies" of envisioned interaction. For the earliest design reviews, you just want to show what it looks like and a little of the sequencing behavior. The goal is to see some of the interaction design very quickly, in the time frame of hours, not days or weeks.
11.6.2 Hand-Drawn Paper Prototypes
The next level of paper prototypes will support some simulated "interaction." As the user views screens and pretends to click buttons, a human "executor" plays computer and moves paper pieces in response to those mock user actions.
11.6.3 Computer-Printed Paper Prototypes
Paper prototypes, with user interface objects and text on paper printed via a computer, are essentially the same as hand-drawn paper prototypes, except slightly higher fidelity in appearance. You get fast, easy, and effective prototypes with added realism at very low cost. To make computer-printable screens for low-fidelity prototypes, you can use tools such as OmniGraffle (for the Mac) or Microsoft Visio.
Berger (2006) describes one successful case of using a software tool not intended for prototyping. When used as a prototyping tool, Excel provides grid alignment for objects and text, tabbed pages to contain a library of designs,
a hypertext feature used for interactive links, and a built-in primitive database capability.
Cells can contain graphical images, which can also be copied and pasted, so the concept of templates for dialogue boxes, buttons, and so on can be thought of as native to Excel. Berger claimed fast turnarounds, typically on a daily basis.
11.6.4 Is Not Paper Just a Stopgap Medium?
Is not paper prototyping a technique necessary just because we do not yet have good enough software prototyping tools? Yes and no. There is always hope for a future software prototyping tool that can match the fluency and spontaneity afforded by the paper medium. That would be a welcome tool indeed and perhaps wireframing is heading in that direction but, given the current software technology for programming prototypes even for low-fidelity prototypes, there is no comparison with the ease and speed with which paper prototypes can be modified and refined, even if changes are needed on the fly in the midst of an evaluation session.
Therefore, at least for the foreseeable future, paper prototyping has to be considered as more than just a stopgap measure or a low-tech substitute for that as yet chimerical software tool; it is a legitimate technology on its own.
Paper prototyping is an embodied effort that involves the brain in the creative hand-eye feedback loop. When you use any kind of programming, your brain is diverted from the design to the programming. When you are writing or drawing on the paper with your hands and your eyes and moving sheets
of paper around manually, you are thinking about design. When you are programming, you are thinking about the software tool.
Rettig (1994) says that with paper, "... interface designers spend 95% of their time thinking about the design and only 5% thinking about the mechanisms of the tool. Software-based tools, no matter how well executed, reverse this ratio."
11.6.5 Why Not Just Program a Low-Fidelity Prototype?
At "run-time" (or evaluation time), it is often useful to write on the paper pages, something you cannot do with a programmed prototype. Also, we have found that paper has much broader visual bandwidth, which is a boon when you want to look at and compare multiple screens at once. When it comes time to change the interaction sequencing in a design, it is done faster and visualized more easily by shuffling paper on a table.
Another subtle difference is that a paper prototype is always available for "execution," but a software prototype is only intermittently executable: it can be run only between sessions of programming to make changes. Between versions there is a need for fast turnaround to the next version, but the slightest error in the code will disable the prototype completely. Being software, your prototype is susceptible to a single bug that can bring it crashing down, and you may be caught in a position where you have to debug in front of your demo audience or users.
The result of early programmed prototypes is almost always slow prototyping, not useful for evaluating numerous different alternatives while the trail to interaction design evolution is still hot. Fewer iterations are possible, with more "dead time" in between where users and evaluators can lose interest and have correspondingly less opportunity to participate in the design process. Also, of course, as the prototype grows in size, more and more delay is incurred from programming and keeping it executable.
Because programmed prototypes are not always immediately available for evaluation and design discussion, sometimes the prototyping activity cannot keep up with the need for fast iteration. Berger (2006) relates an anecdote about a project in which the user interface software developer had the job of implementing design sketches and design changes in a Web page production tool. It took about 2 weeks to convert the designs to active Web pages for the prototype and in the interim the design had already changed again and the beautiful prototypes were useless.
11.6.6 How to Make an Effective Paper Prototype
Almost all you ever wanted to know about prototyping, you learned in Kindergarten.
Get out your paper and pencil, some duct tape, and WD-40. Decide who on your team can be trusted with sharp instruments, and we are off on another adventure. There are many possible approaches to building paper prototypes. The following are some general guidelines that have worked for us and that we have refined over many, many iterations.
Start by setting a realistic deadline. This is one kind of activity that can go on forever. Time management is an important part of any prototyping activity.
There is no end to the features, polishing, and tweaking that can be added to a paper prototype. And watch out for design iteration occurring before you even get the first prototype finished. You can go around in circles before you get user inputs and it probably will not add much value to the design. Why polish a feature that might well change within the next day anyway?
Gather a set of paper prototyping materials. As you work with paper prototypes, you will gather a set of construction materials customized to your approach. Here is a starter list to get you going:
• Blank plastic transparency sheets, 8½ x 11; the very inexpensive write-on kind works fine; you do not need the expensive copier-type plastic
• An assortment of different colored, fine-pointed, erasable and permanent marking pens
• A supply of plain copier-type paper (or a pad of plain, unlined, white paper)
• Assorted pencils and pens
• Scissors
• "Scotch" tape (with dispensers)
• A bottle of Wite-Out or other correction fluid
• Rulers or straight edges
• A yellow and/or pink highlighter
• "Sticky" (e.g., Post-it) note pads in a variety of sizes and colors
Keep these in a box so that you have them handy for the next time you need to make a paper prototype.
Work fast and do not color within the lines. If they told you in school to use straight lines and color only within the boxes, here is a chance to revolt, a chance to heal your psyche. Change your personality and live dangerously, breaking the bonds of grade-school tyranny and dogmatism, something you can gloat about at the usual post-prototype cocktail party.
Draw on everything you have worked on so far for the design. Use your conceptual design, design scenarios, ideation, personas, storyboards, and everything else you have created in working up to this exciting moment of putting it into the first real materialization of your design ideas.
Make an easel to register (align) your screen and user interface object sheets of paper and plastic. Use an "easel" to register each interaction sheet with the others. The simple foam-core board easels we make for our short courses are economical and serviceable. On a piece of foam-core board slightly larger than 8½ x 11, add some small pieces of additional foam-core board as "stops" on at least two of the four adjacent sides, as seen in Figures 11-4 and 11-5, against which each interaction sheet can be pushed to ensure proper positioning. When the prototype is being "executed" during UX evaluation, the easel will usually be taped to the tabletop for stability.
Make underlying paper foundation "screens." Start with the simplest possible background for each screen, in pencil or pen on full-size paper (usually 8½ x 11), as a base for all moving parts. Include only parts that never change. For example, in a calendar system prototype, show a monthly "grid," leaving a blank space for the month name. See Figure 11-6.
Figure 11-4: Foam-core board paper prototype easel with "stops" to align the interaction sheets.
Figure 11-5: Another style of "stops" on a foam-core board paper prototype easel.
Figure 11-6: Underlying paper foundation "screen."
Use paper cutouts taped onto full-size plastic "interaction sheets" for all moving parts. Everything else, besides the paper foundation, will be taped to transparent plastic sheets. Draw everything else (e.g., interaction
objects, highlights, dialogue boxes, labels) in pencil, pen, or colored markers on smaller pieces of paper and cut them out. Tape them onto separate full-size 8½ x 11 blank plastic sheets in the appropriate position aligned relative to
objects in the foundation screen and to objects taped to other plastic sheets.
We call this full-size plastic sheet, with paper user interface object(s) taped in position, an "interaction sheet." The appearance of a given screen in your prototype is made up of multiple overlays of these interaction sheets. See Figure 11-7.
When these interaction sheets are aligned against the stops in the easel, they appear to be part of the user interface, as in the case of the pop-up dialogue box in Figure 11-8.
Be creative. Think broadly about how to add useful features to your prototype without too much extra effort. In addition to drawing by hand, you can use simple graphics or paint programs to import images such as buttons, and resize, label, and print them in color. Fasten some objects such as pull-down lists to the top or side of an interaction sheet with transparent tape hinges so that they can "flap down" to overlay the screen when they are selected. See Figure 11-9.
Scrolling can be done by cutting slits in your paper menu, which is taped to a plastic sheet. Then a slightly smaller slip
of paper with the menu choices can be slid through the slits. See Figure 11-10.
Use any creative techniques to demonstrate motion, dynamics, and feedback.
Do not write or mark on plastic interaction sheets. The plastic interaction sheets are almost exclusively for mounting and positioning the paper pieces. The plastic is supposed to be transparent; that is how layering works. Do not write or draw on the plastic. The only exception is for making transparent objects such as highlights or as an input
medium on which users write input values. Later we will discuss completely blank sheets for writing inputs.
Make highlights on plastic with "handles" for holding during prototype execution. Make a highlight to fit each major selectable object. Cut out a plastic square or rectangle with a long handle and color in the highlight (usually just an outline so as not to obscure the object or text being highlighted) with a permanent marking pen. See Figure 11-11.
Figure 11-7: Paper cutouts taped to full-size plastic for moving parts.
Figure 11-8: A "Preferences" dialogue box taped to plastic and aligned in the easel.
Figure 11-9: Pull-down menu on a tape "hinge."
Figure 11-10: Paper sliding through a slit for scrolling.
Make your interaction sheets highly modular by including only a small amount on each one. Instead of customizing a single screen or page, build up each screen or display in layers. The less you put on each layer, the more modular and, therefore, the more reuse you will get. With every feature and every variation of appearance taped to a different sheet of plastic, you have the best chance at being able to show the most variation of appearances and user interface object
configurations you might encounter. Be suspicious of a lot of writing/drawing on one interaction sheet. When a prototype user gives an input, it usually makes a change in the display. Each possible change should go on a separate interaction sheet.
Get modularity by thinking about whatever needs to appear by itself. When you make an interaction sheet, ask yourself: Will every single detail on here always appear together? If there is a chance two items on the same interaction sheet will ever appear separately, it is best to put them on separate interaction sheets. They come back together when you overlay them together, but they can still be used separately, too. See Figure 11-12.
Do lots of sketching and storyboarding before making interaction sheets. This will save time and work.
Use every stratagem for minimizing work and time. Focus on design, not writing and paper cutting.
Reuse at every level. Make it a goal to not draw or write anything twice; use templates for the common parts of similar objects.
Use a copy machine or scanner to reproduce common parts of similar interaction objects and write in only the differences. For example, for a calendar, use copies of a blank month template, filling in the days for each month. The idea is to capture in a template everything that does not have to change from one instance to another.
Cut corners when it does not hurt things. Always trade off accuracy (when it is not needed) for efficiency (that is always needed). As an example, if it is not important to have the days and dates
be exactly right for a given month on a calendar, use the same date numbers for each month in your early prototype.
Then you can put the numbers in your month template and not have to write any in.
Make the prototype support key tasks. Prototype at least all benchmark tasks from your UX target table, as this prototype will be used in the formative evaluation exercise.
Make a "this feature not yet implemented" message. This is the prototype's response to a user action that was not anticipated or that has not yet been included in the design. You will be surprised
Figure 11-11 Selection highlight on
plastic with a long handle.
Figure 11-12
Lots of pieces of dialogue as paper cutouts aligned on plastic sheets.
how often you may use this in user experience evaluation with early prototypes. See Figure 11-13.
Figure 11-13: "Not yet implemented" message.
Figure 11-14: Data entry on clear plastic overlay sheet.
Include "decoy" user interface objects. If you include only user interface objects needed to do your initial benchmark tasks, it may be unrealistically easy for users to do just
those tasks. Doing user experience testing with this kind of initial interaction design does not give a good idea of the ease of use of the design when it is complete and contains many more user interface objects to choose from and many more other choices to make during a task.
Therefore, you should include many other "decoy" buttons, menu choices, etc., even if they do not do anything (so participants see more than just the "happy path" for their benchmark tasks). Your decoy objects should look plausible and should, as much as possible, anticipate other tasks and other paths. Users performing tasks with your prototype will be faced with a more realistic array of user interface objects about which they will have to think as they make choices about what user actions are next. And when they click on a decoy object, that is when you get to use your "not implemented" message. (Later, in the evaluation chapters, we will discuss probing the users on why they clicked on that
object when it is not part of your envisioned task sequence.)
Accommodate data value entry by users. When users need to enter a value (e.g., a credit card number) into a paper prototype, it is usually sufficient to use a clear sheet of plastic (a blank interaction sheet) on top of the layers and let them write the value in with a marking pen; see Figure 11-14. Of course, if your design requires them to enter that number using a touchscreen on an iPad, for example, you have to create a "text input" interaction sheet.
Create a way to manage complex task threads. Before an evaluation session, the prototype "executor" will have all the paper sheets and overlays all lined up and ready to put on the easel in response to user actions. When the number of prototype pieces gets very large, however, it is difficult to know what stack of pieces to use at any point in the interaction, and it is even more difficult to clean it all up after the session to make it ready for the next session.
As an organizing technique that works most of the time, we have taken to attaching colored dots to the pieces, color coding them according to task
threads. Sheets of adhesive-backed colored circles are available at most office supply stores. See Figure 11-15. Numbers written on the circles indicate the approximate expected order of usage in the
corresponding task thread, which is the order to sort them in when cleaning up after a session.
Pilot test thoroughly. Before your prototype is ready to be used in real user experience evaluation sessions, you must give it a good shake-down. Pilot test your prototype to be sure that it will support all your benchmark tasks. You do not want to make the rookie mistake of "burning" a user participant (subject) by getting them started only to discover the prototype "blows up" and prevents benchmark task performance.
Simulate user experience evaluation conditions by having one member of your team "execute" the prototype while another member plays "user" and tries out all benchmark tasks. The user person should go through each task in as many ways as anyone thinks possible to head off the "oh, we never thought they would try that" syndrome later in testing. Do not assume error-free performance by your users; try to have appropriate error messages where user errors might occur. When you think your prototype is ready, get someone from outside your group and have them play the user role in more pilot testing.
Figure 11-15: Adhesive-backed circles for color coding task threads on prototype pieces.
11.7 ADVANTAGES OF AND CAUTIONS ABOUT USING PROTOTYPES
11.7.1 Advantages of Prototyping
In sum, prototypes have these advantages:
• Offer a concrete baseline for communication between users and designers
• Provide a conversational "prop" to support communication of concepts not easily conveyed verbally
• Allow users to "take the design for a spin" (who would buy a car without taking it for a test drive or buy a stereo system without first listening to it?)
• Give project visibility and buy-in within the customer and developer organizations
• Encourage early user participation and involvement
• Give the impression that the design is easy to change because a prototype is obviously not finished
• Afford designers immediate observation of user performance and the consequences of design decisions
• Help sell management an idea for a new product
• Help effect a paradigm shift from an existing system to a new system
11.7.2 Potential Pitfalls of Prototyping
Prototyping, however, is not without potential drawbacks that, with some caution, can be avoided.
Get cooperation, buy-in, and understanding
To ensure your best chances of success with a process based on prototyping, especially if your organization is not experienced with the technique, the UX team must first secure cooperation and buy-in from all parties involved, including management.
Be sure you sell prototyping as the vehicle through which you will apply an iterative process and eventually come to an acceptable level of user experience in the design. Otherwise, managers may view allocation of resources to building a prototype, especially a throw-away one, as wasteful.
In a small contract we had many years ago with a nationally known retail chain, we were asked to help redesign the in-store point-of-sale software. The process was discussed only generally because the UX people did not think others needed to know much about what, after all, they would be doing. So when there seemed to be agreement up front, it looked as though the client had bought into the process. But it turned out not to be a real buy-in.
When we presented the first prototype, one that had never been evaluated or iterated, the client's software people immediately took over the prototype and the interaction designers were powerless to apply their process further to arrive at a better design. Later, when the design proved inferior, the interaction designers were blamed, validating the software people's view
that a UX process does not add value. "We just implemented what you designed."
If at least project management, if not the SE people, had understood the process and the place of the prototype within that process, and if the UX people had been empowered to carry out their process, this unfortunate scenario could have been avoided.
Be honest about limitations and do not overpromise
However, you must present a design prototype to any audience (management, customers, users, other professionals) with the utmost professional honesty. It is your responsibility not to allow your audience, even passively, to assume or believe that it is the real product, especially for design exploration prototypes and prototypes promising features that do not yet exist and, possibly, cannot be delivered.
The ease of making a low-fidelity prototype makes it easy to add bells and whistles that you may not be able to deliver in the final product.
Remember that a prototype can be perceived to make promises about features and functionality. And a slick prototype can cause management to believe that you are further along in development than you really are in the project schedule.
Do not overwork your prototype
The engineering maxim to "make it good enough" applies particularly to prototypes. A programmed prototype can seduce designers into the trap of overdesign or wasting resources on overworking a prototype, only eventually to have it scrapped.
Do not "fall in love" with your prototype and continue to expand and polish it after it has served its usefulness. Establish formative evaluation goals for prototyping and stick to them.
Finally, it may not be you, but your boss or manager who falls in love with your prototype. You might be expected to "baby sit" the prototype and keep it updated to the minute so you can trot it out without notice to demonstrate it to the next visiting dignitary. Our advice is to find a way to be busy on something more important.
11.8 PROTOTYPES IN TRANSITION TO THE PRODUCT
If you have a high-fidelity prototype near the end of the iterative UX lifecycle,
you will be thinking next about making the transition from prototype to real
product. Do you keep the prototype or do you scrap it and start over? The answer often is that you do some of each.
Perhaps the most important consideration is the investment you have made in the iterative refinement of your design. To protect that investment, you should do everything possible to preserve the details of the user interface objects, the "look" or appearance part of the design: the layout and design of all screens and the exact wording of all labels, menus, and text. If a single detail changes in the transition, it could impact the user experience that you bought and paid for during evaluation and iteration.
Similarly, you will want to preserve the feel and sequencing behavior of your well-tested prototype. However, this will require careful recoding from the ground up. Your prototype code for the sequencing always was "spaghetti" code and was never intended to be anything else.
11.8.1 Graduation Day in the Trenches: The Sacred Passing of the Prototype
After all your iteration, the day will arrive in each project where all attention will "graduate" from prototypes to the current version or release of the final software product. There is no longer an independent UX lifecycle, and all the action in this "tail" of the lifecycle will be on the SE side because they own the code.
Some formative evaluation can still be done, but it will be with the real software system and not the prototype. Further changes on either side will be much slower and more expensive. As marketing circles in, and perhaps after a little bubbly for celebration/commiseration, the UX team will either disband or start working on their next great design.
What happens to the prototype code?
Suppose you do, as we suggest you might, use your interaction design prototype as part or all of your interaction design specifications. What are the conditions that determine how you offer it to the SE folks? How much of the prototype software, especially for a high-fidelity prototype, is to be thrown away and how much is reused? In most cases, the technical answer is that it all should be thrown away; the design is what you reuse.
You cannot just keep the prototype
Despite the urge to say "Hey, we have a prototype that works, let us just use that as the first version of the product," in almost all cases the prototype cannot be gussied up into the final product, for a number of reasons. Rarely is the best software platform for fast mockups and flexible cut-and-try changes to prototypes also the best platform for production software. Furthermore, the prototype code is never production quality; if it is, you were doing prototyping wrong.
How do you reuse the interaction design of the prototype? Expectations for the implementation of your prototype in production software are that it will be generally faithful to the intent and behavior of the prototype, including sequencing and connections to functionality, and literally and absolutely faithful to every detail of the look, feel, and design of the user interface objects.
The need for faithfulness in the second expectation just given, literal fidelity to every detail of the user interface objects, is paramount. Details such as font size or location of a dialogue box have been finely honed through significant investment in iterative design and evaluation. These details are to be treated as sacrosanct and taken literally to an extreme; they are too valuable to risk being "lost in translation" at this point in the project.
One way that many practitioners get this result is to use a prototype that is a wireframe plus a visual comp "skin," a graphical appearance created by visual designers and usually constrained by an exacting visual style guide dictating color choices (often with precision down to the RGB increments), branding, and so on.
An important tool for capturing and communicating much of the interaction design detail is the custom project style guide (Chapter 9). Your project style guide, if maintained to capture detailed design decisions, can serve as a reference to enforce consistency in similar other designs and will go a long way toward preventing loss of details in the transition from one platform to another.
The need for UX and SE collaboration and respect
At this point of the overall project, the process requires even more collaboration across the UX-SE aisle. The UX people "own" the interaction design but not the software that implements it. It is a little like owning a home but not the land upon which it sits. The "hand-off" point is a serious nexus in the two lifecycles. As the hand-off occurs across the boundaries of the two domains, there is a tendency on both sides to think that full responsibility has passed from one team to another. But a successful hand-off has to be much more of a collaborative
event. Both sides face the challenge of connecting the interaction design representations to existing software development processes.
UX interests are vested in the design, and SE interests are vested in what will become the implementation code. Mutual preservation of those interests demands careful collaboration, tempered with mutual respect (Chapter 23).
Do not think the UX team is now done
There are no standards governing the translation from prototype to product implementation. Because preserving a quality user experience in the design is not in the purview of the SE people, the UX people have a strong responsibility to ensure that their interaction design comes through with fidelity. If the SE people are the sole interpreters of this process, there is no way to predict how assiduously that translation will be done.
There are important reasons why the UX team cannot just hand it off and walk away, confident that it is now entirely within the SE bailiwick. There are currently no major software development lifecycle concepts that offer adequate support for including the UX lifecycle or its work products, such as interaction design specifications, as a serious part of the overall system development process.
The SE people do not have, and should not be expected to have, the UX skills and knowledge to interpret the interaction design specifications thoroughly and accurately enough to get a perfect translation to software requirements specifications. The SE people did not participate in the interaction requirements and design processes and, therefore, do not know the design's inner details, boundaries, and supporting rationale.
The SE people cannot know all of what is contained in and communicated by the prototype (as part of the interaction design specifications) without guidance from the UX people who created and refined it. Anyway, if your interaction design matriculates from prototype to implementation successfully, congratulations!
11.9 SOFTWARE TOOLS FOR PROTOTYPING
In a previous section we mentioned a hope for the future about tools that would allow us to replace paper prototypes with low-fidelity programmed prototypes, but with the same ease of building, modifying, and executing low-fidelity prototypes as we had in the paper medium. The human-computer interaction (HCI) research community has not been unaware of this need and has tried
to address it over the years (Carey & Mason, 1989; Gutierrez, 1989; Hix & Schulman, 1991; Landay & Myers, 1995). However, the technical challenges of designing and building such a software tool have been steep. We are not there yet, but such a software tool would certainly be valuable.
In the 1990s, user interface management systems (UIMSs), broad tool suites for defining, implementing, and executing interaction designs and prototypes (Myers, 1989, 1993, 1995; Myers, Hudson, & Pausch, 2000), and software prototyping tools were hot topics. Hix and Schulman (1991) also did some work on software tool evaluation methods.
There were different and competing looks, feels, and interaction styles built into many of these tools, such as CUA (IBM), OpenLook (Sun), Toolbook (Asymetrix), Altia design (Altia, Inc.), Delphi (Borland), Visual Basic (Microsoft), and Dreamweaver (from Macromedia, for Web-based interaction). Some were not available commercially but were developed by the organizations that used them (e.g., TAE Plus by NASA). And some tools depended on a variety of different "standards," such as OSF Motif, not to mention Windows vs. Macintosh.
Many of the first tools for prototyping of interactive systems required a great deal of programming. Thus, interaction designers lacking programming skills could not use them directly, and "compiling" a new design iteration into executable form could be lengthy, complex, and fraught with bugs. Because many of the early UIMSs had a strong connection to computer graphics, resulting prototypes could be very realistic and could exhibit rather complex graphical behavior.
Some tools were, and some still are, based on interpretable interface definitions of the design entered declaratively, possibly along with some behavior structuring code. The interpretive approach offered more speed and flexibility in accommodating changes but, because of early hardware limitations, almost always caused the prototype execution to suffer slow performance. As the ability to produce user interface facades advanced in the tools, provision was made to program or at least stub non-interface functionality, moving the technology slowly toward whole system prototypes.
There is still no single prototyping platform capable of facilitating rapid prototype construction while meeting requirements to simulate today's complex interaction paradigms. This has been a persistent problem in HCI, where the prototyping tools always lag a little behind the state of the art of interaction possibilities.
For example, in a study we conducted with eight student teams working on building a real-world software system, we observed a situation where the
interaction designers in the team needed an autocomplete feature in a pull-down menu as a core feature of their prototype. But because they could not get autocompletion in a prototype without a database, the software engineers in the team ended up having to build the database to support this one interaction design feature.
That software investment could not be used in the product itself, however, because it was on the wrong platform. This kind of difficult prototype programming has to be repeated for each of the functional behaviors that are becoming expected in any interactive software system, such as log-in sequences, auto-completion functions, or data entry validation sequences. Nowadays more and more of these complex interaction patterns are being communicated using static or click-through wireframes. We hope the state of the art in prototyping tools will soon evolve to support such patterns more effectively.
11.9.1 Desiderata for Prototyping Tools
As we have said, prototyping tools so far have almost always shared the same downside: it takes too long to make changes, as even the smallest amount of programming distracts from the purpose of a low-fidelity prototype and, as the prototype grows in size, it becomes less amenable to changes. Therefore, we (especially the user interface software community) continue on a quest for the perfect software prototyping tool, among the desired features of which are:
- Fast and effortless changes
  - Ease on the order of that of paper prototypes: as natural as changing a paper prototype
  - Tool transparency: needs so little focus on the software that it does not distract from the design and prototype building
  - Fast turnaround to executability so there is almost no wait before it can be executed again
- Non-programmer ease of prototype definition and use
  - Non-programmers must be able to define and modify design features
- Built-in common behaviors and access to large varieties of other behaviors via a library of plug-ins
- Easily include realism of features and behavior commensurate with expectations for modern interaction styles
- Supports a wide variety of interaction styles and devices, including various pointing and selecting devices, touchscreens, speech/audio, tactile/haptic, and gesture
- Ease of creating and modifying links to various points within the interaction design (e.g., buttons, icons, and menu choices to particular screens) to simulate user navigational behavior
- Communication with external procedures and programs (e.g., calls, call-backs, data transfer) to include some functionality and additional application behavior
- Capability to import text, graphics, and other media from other sources
- Capability to export look and feel components for eventual transition to final product code
12 UX Evaluation Introduction
Objectives
After reading this chapter, you will:
1. Understand the difference between formative and summative evaluation and the strengths and limitations of each
2. Understand the difference between analytic and empirical methods
3. Understand the difference between rigorous and rapid methods
4. Know the strengths and weaknesses of various data collection techniques, such as critical incident identification, thinking aloud, and questionnaires
5. Distinguish evaluation techniques oriented toward usability and emotional impact
6. Understand the concept of the evaluator effect and its impact on evaluation results
12.1 INTRODUCTION
12.1.1 You Are Here
We begin each process chapter with a "you are here" picture of the chapter topic in the context of the overall Wheel lifecycle template; see Figure 12-1. This chapter is an introduction that will lead us into the types and parts of UX evaluation of the following chapters.
12.1.2 Evaluate with a Prototype on Your Own Terms
Users will evaluate the interaction design sooner or later, so why not have them do it sooner, working with your team, using the proper techniques, and under the appropriate conditions? Otherwise you wait until the design is out in the field, where you cannot control the outcome: visualize bad rumors about your product and huge costs to fix the problems because you have already committed the design to software.
12.1.3 Measurability of User Experience
Figure 12-1: You are here, at the evaluation activity in the context of the overall Wheel lifecycle template.
But can you evaluate usability or user experience? This may come as a surprise, but neither usability nor user experience is directly measurable. In fact, most interesting phenomena, such as teaching and learning, share the same difficulty. So we resort to measuring things we can measure and use those measurements as indicators of our more abstract and less measurable notions.
For example, we can understand usability effects such as productivity or ease of use by measuring observable user performance-based indicators such as time to
task completion and error counts. You can design a feature so that the performance of a certain task in a usability lab will yield a desirable objective measurement of, say, time on task. In almost any work context this translates to good user performance.
Questionnaires also provide indicators of user satisfaction from their answers
to questions we think are closely related to satisfaction. Similarly, emotional impact factors such as user satisfaction and joy of use also cannot be measured directly but only through indirect indicators.
12.1.4 User Testing? No!
Before we get into the different types of evaluation, let us first calibrate our perspective on what we are testing here. Ask yourself honestly: Do you use the term "user testing?" If you do, you are not alone: the term appears in many books
and papers on human-computer interaction (HCI) as it does in a large volume of online discussions and practitioner conversations.
You know what it means and we know what it means, but no user will like the idea of being tested and, thereby, possibly made to look ridiculous. No, we are not testing users, so let us not use those words. We might be testing or evaluating the design for usability or the full user experience it engenders in users, but we are not testing the user. We call it UX evaluation or even UX testing,
but not "user testing!"
It might seem like a trivial PC issue, but it goes beyond being polite or "correct." When you are working with users, projecting the right attitude and making them comfortable in their role can make a big difference in how well they help you with evaluation. UX evaluation must be an ego-free process; you
are improving designs, not judging users, designers, or developers.
We know of a case where users at a customer location were forced to play the user role for evaluation, but were so worried that it was a ruse to find candidates for layoffs and staff reductions that they were of no real value for the evaluation activities. If your user participants are employees of the customer organization, it is especially important to be sure they know you are not testing them. Your user participants should be made to feel they are part of a design process partnership.
12.2 FORMATIVE VS. SUMMATIVE EVALUATION
In simplest terms, formative evaluation helps you form the design and summative evaluation helps you sum up the design. A cute, but apropos, way to look at the difference: "When the cook tastes the soup, that's formative; when
the guests taste the soup, that's summative" (Stake, 2004, p. 17).
The earliest reference to the terms formative evaluation and summative evaluation we know of stems from their use by Scriven (1967) in education and curriculum evaluation. Perhaps more well known is the follow-up usage by Dick and Carey (1978) in the area of instructional design. Williges (1984) and Carroll, Rosson, and Singley (1992) were among the first to use the terms in an HCI context.
Formative evaluation is primarily diagnostic; it is about collecting qualitative
data to identify and fix UX problems and their causes in the design. Summative
evaluation is about collecting quantitative data for assessing a level of quality due
to a design, especially for assessing improvement in the user experience due to
formative evaluation.
Formal summative evaluation is typified by an empirical competitive benchmark study based on formal, rigorous experimental design aimed at comparing design hypothesis factors. Formal summative evaluation is a kind of controlled
hypothesis testing with an m by n factorial design with y independent variables,
the results of which are subjected to statistical tests for significance. Formal summative evaluation is an important HCI skill, but we do not cover it in this book.
Informal summative evaluation is used, as a partner of formative evaluation, for
quantitatively summing up or assessing UX levels using metrics for user
performance (such as the time on task), for example, as indicators of progress in
UX improvement, usually in comparison with pre-established UX target levels (Chapter 10).
However, informal summative evaluation is done without experimental controls, with smaller numbers of user participants, and with only
summary descriptive statistics (such as average values). We include informal
summative evaluation in this book as a companion activity to formative
evaluation.
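As a concrete, intentionally simplified illustration of this kind of informal summative arithmetic, the following Python sketch uses hypothetical timing data and a hypothetical target level; it is not a prescribed tool, just an example of summary descriptive statistics compared against a pre-established UX target.

```python
# Minimal sketch (hypothetical data): informal summative evaluation arithmetic.
# We summarize observed time-on-task with simple descriptive statistics and
# compare the average against a pre-established UX target level (Chapter 10).

from statistics import mean, stdev

# Observed time-on-task (seconds) for one benchmark task, small participant count
times_on_task = [142.0, 131.5, 158.2, 149.9, 137.4]

ux_target_seconds = 150.0  # hypothetical target level from the UX target table

avg = mean(times_on_task)
spread = stdev(times_on_task)

print(f"Average time on task: {avg:.1f} s (sd {spread:.1f})")
print("UX target met" if avg <= ux_target_seconds else "UX target not met")
```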
12.2.1 Engineering Evaluation of UX: Formative Plus Informal Summative
Life is one big, long formative evaluation.
- Anonymous
Try as you might in the design phase, the first version of your interaction design is unlikely to be the best it can be in meeting your business goals of pleasing customers and your UX goals of pleasing users. Thus the reason for iteration and refinement cycles, to which evaluation is central.
You do not expect your first design to stand for long. Our friend and colleague George Casaday calls it "Waffle Wisdom" or "Pancake Philosophy": like the first waffle or pancake, you expect from the start to throw away the first design, and maybe the next few. Formative evaluation is how you find out how to make the next ones better and better.
In UX engineering, formative UX evaluation includes any method that meets the definition of helping to form the design. Most, if not all, rapid UX evaluation methods (Chapter 13) have only a formative UX evaluation component and do not have a summative component. In lab-based UX testing sessions we also often use only formative evaluation, especially in early cycles of iteration when we are defining and refining the design and are not yet interested in performance numbers.
In rigorous UX evaluation we often add an informal summative evaluation component to formative evaluation, the combination being used to improve an interaction design and to assess how well it has been improved. We call this combination "UX engineering evaluation" or just "UX evaluation," as shown in Figure 12-2.
At the end of each iteration for a product version, the informal summative evaluation is used as a kind of acceptance test to compare with our UX targets and ensure that we meet our UX and business goals with the product design.
12.2.2 Engineering vs. Science
It's all very well in practice but it will never work in theory.
- French management saying
Sometimes empirical lab-based UX testing that includes quantitative metrics is the source of controversy with respect to "validity." Sometimes we hear "If you do not include formal summative evaluation, are you not missing an opportunity to add some science?" "Since your informal summative evaluation was not controlled testing, why should we not dismiss your results as too 'soft'?" "Your informal studies just are not good science. You cannot draw any conclusions."
These questions ignore the fundamental difference between formal and informal summative evaluation and the fact that they have completely different goals and methods. This may be due, in part, to the fact that the fields of HCI and UX were formed as a melting pot of people from widely varying backgrounds. From their own far-flung cultures in psychology, human factors engineering, systems engineering, software engineering, marketing, and management they arrived at the docks of HCI with their baggage containing their own perspectives and mind-sets.
Figure 12-2: UX evaluation is a combination of formative and informal summative evaluation.
Thus, it is well known that formal summative evaluations are judged on a number of rigorous criteria, such as validity, and that formal summative evaluation contributes to our science base. It may be less well known that informal summative evaluation is an important engineering tool in the HCI bag and that the only criterion for judging this kind of summative evaluation method is whether it works as part of an engineering process.
12.2.3 What Happens in Engineering Stays in Engineering
Because informal summative evaluation is engineering, it comes with some very strict limitations, particularly on sharing informal summative results.
Informal summative evaluation results are only for internal use as engineering tools to do an engineering job by the project team and cannot be shared outside the team. Because of the lack of statistical rigor, these results especially cannot be used to make any claims inside or outside the team. To make claims about UX levels achieved, for example, from informal summative results, would be a violation of professional ethics.
We read of a case where a CEO of a company got a UX report from a project team, but discounted the results because they were not statistically significant. This problem could have been avoided by following our simple rules and not distributing formative evaluation reports outside the team or by writing the report with careful caveats.
But what if you are required to produce a formative evaluation report for consumption beyond the team or what if you need results to convince the team to fix the problems you find in a formative evaluation? We address those questions and more in Chapter 17, evaluation reporting.
12.3 TYPES OF FORMATIVE AND INFORMAL SUMMATIVE EVALUATION METHODS
12.3.1 Dimensions for Classifying Formative UX Evaluation Methods
In practice, there are two orthogonal dimensions for classifying types of
formative UX evaluation methods:
- empirical method vs. analytic method
- rigorous method vs. rapid method
12.3.2 Rigorous Method vs. Rapid Method
Formative UX evaluation methods can be either rigorous or rapid. We define rigorous UX evaluation methods to be those that maximize effectiveness and minimize the risk of errors regardless of speed or cost, which means refraining from shortcuts or abridgements.
Rigorous empirical UX evaluation methods entail a full process of
preparation, data collection, data analysis, and reporting (Chapters 12 and 14 through 18). In practical terms, this kind of rigorous evaluation is usually conducted in the UX lab. Similarly, the same kind of evaluation can be conducted at the customer's location in the field.
Rigorous empirical methods such as lab-based evaluation, while certainly not perfect, are the yardstick by which other evaluation methods are compared.
Rigorous and rapid methods exist mainly as quality vs. cost trade-offs.
- Choose a rigorous empirical method such as lab-based testing when you need effectiveness and thoroughness, but expect it to be more expensive and time-consuming.
- Choose the lab-based method to assess quantitative UX measures and metrics, such as time-on-task and error rates, as indications of how well the user does in a performance-oriented context.
- Choose lab-based testing if you need a controlled environment to limit distractions.
- Choose empirical testing in the field if you need more realistic usage conditions for ecological validity than you can establish in a lab.
However, UX evaluation methods can be faster and less expensive.
- Choose a rapid evaluation method for speed and cost savings, but expect it to be (possibly acceptably) less effective.
- Choose a rapid UX evaluation method for early stages of progress, when things are changing a lot, anyway, and investing in detailed evaluation is not warranted.
- Choose a rapid method, such as a design walkthrough, an informal demonstration of design concepts, as a platform for getting initial reactions and early feedback from the rest of the design team, customers, and potential users.
12.3.3 Analytic Method vs. Empirical Method
On a dimension orthogonal to rapid vs. rigorous, formative UX evaluation methods can be either empirical or analytic (Hix & Hartson, 1993; Hartson, Andre, & Williges, 2003). Empirical methods employ data observed in the performance of real user participants, usually data collected in lab-based testing.
Analytical methods are based on looking at inherent attributes of the design rather than seeing the design in use. Many of the rapid UX evaluation methods (Chapter 13), such as design walkthroughs and UX inspection methods, are analytic methods.
Some methods in practice are a mix of analytical and empirical. For example, expert UX inspection can involve "simulated empirical" aspects in which the expert plays the role of the users, simultaneously performing tasks and "observing" UX problems.
Empirical methods are sometimes called "payoff methods" (Carroll, Singley, & Rosson, 1992; Scriven, 1967) because they are based on how a design or design change pays off in terms of real observable usage. Examples of the kind
of data collected in empirical methods include quantitative user performance
data, such as time on task and error rates, and qualitative user data derived from
usage observation, such as UX problem data stemming from critical incident
identification and think-aloud remarks by user participants. Analytical methods are sometimes called "intrinsic methods" because they are based on analyzing intrinsic characteristics of the design rather than seeing the design in use.
In describing the distinction between payoff and intrinsic approaches to evaluation, Scriven wrote an oft-quoted (Carroll, Singley, & Rosson, 1992; Gray & Salzman, 1998, p. 215) analogy featuring an axe (Scriven, 1967, p. 53): "If you want to evaluate a tool, say an axe, you might study the design of the bit,
the weight distribution, the steel alloy used, the grade of hickory in the handle,
etc., or you might just study the kind and speed of the cuts it makes in the hands
of a good axeman," speaking of intrinsic and payoff evaluation, respectively. In Hartson, Andre, and Williges (2003) we added our own embellishments, which we paraphrase here.
Although this example served Scriven's purpose well, it also offers us a chance to make a point about the need to identify UX goals carefully before establishing evaluation criteria. Giving a UX perspective to the axe example, we note that user performance observation in payoff evaluation does not necessarily require an expert axeman (or axeperson). Expert usage might be one component of the vision in axe design, but it is not an exclusive requirement in payoff evaluation. UX goals depend on expected user classes of key work roles and the expected kind of usage.
For example, an axe design that gives optimum performance in the hands of an expert might be too dangerous for a novice user. For the weekend wood whacker, safety might be a UX goal that transcends firewood production, calling for a safer design that might necessarily sacrifice some efficiency. One hesitates to contemplate the metric for this case, possibly counting the number of 911
calls from a cellphone in the woods near Newport, Virginia, or the number of visits to the ER. Analogously, UX goals for a novice user of a software accounting system (e.g., TurboTax), for example, might place ease of use and data integrity (error avoidance) above sheer expert productivity.
Emotional impact factors can also be evaluated analytically. For example, a new axe in the hands of an expert might elicit an emotional response. Perhaps the axe head is made of repurposed steel from the World Trade Center-what patriotic and emotional impact that could afford! A beautiful, gleaming polished steel head, a gorgeously finished hickory wood handle, and a fine leather scabbard could elicit a strong admiration of the craftsmanship and aesthetics, as well as great pride of ownership.
Emotional impact factors can also be evaluated empirically. One need only observe the joy of use of a finely made, exquisitely sharpened axe. In a kind of think-aloud technique, the user exclaims with pleasure about the perfect balance as he or she hits the "sweet spot" with every fast-cutting stroke.
12.3.4 Where the Dimensions Intersect
Some example UX evaluation methods are shown in Figure 12-3 at the various intersections between the two dimensions empirical vs. analytic and rigorous vs. rapid.
We usually associate the rigorous empirical category with lab-based evaluation (Chapters 14 through 17), but empirical UX evaluation in a conference room or field setting can also be rigorous. The rapid evaluation methods (Chapter 13)
are mostly analytic methods but at least one rapid empirical method (RITE)
exists, designed to pick the low-hanging fruit at relatively low cost.
In addition, most practitioners in the field have their own versions of the lab-based method that might qualify as rapid because of severe abridgements but also still qualify as empirical because they involve data
collection using participants. Rigorous analytic methods are beyond the scope of this book.
12.4 TYPES OF EVALUATION DATA
Fundamentally, UX data can be objective or subjective and it can be quantitative or qualitative. Because the two dimensions are orthogonal, you can see all four combinations, objective and quantitative, subjective and quantitative, and so forth. When your rigorous evaluation is driven by benchmark tasks, the kinds of data collected in the process will mirror what is specified in UX targets and metrics.
Figure 12-3: Sample UX evaluation methods at intersections between the dimensions of UX evaluation method types.
12.4.1 Objective Data vs. Subjective Data
Objective UX data are data observed directly by either the evaluator or the
participant. Subjective UX data represent opinions, judgments, and other subjective feedback usually from the user, concerning the user experience and satisfaction with the interaction design.
12.4.2 Quantitative Data vs. Qualitative Data
Quantitative data are numeric data, such as data obtained by user performance metrics or opinion ratings. Quantitative data are the basis of an informal
summative evaluation component and help the team assess UX achievements
and monitor convergence toward UX targets, usually in comparison with the specified levels set in the UX targets (Chapter 10). The two main kinds of quantitative data collected most often in formative evaluation are objective user performance data measured using benchmark tasks (Chapter 10) and subjective user-opinion data measured using questionnaires (coming up later).
Qualitative data are non-numeric and descriptive data, usually describing a UX
problem or issue observed or experienced during usage. Qualitative data are usually collected via critical incident (also coming up later) and/or the think-
aloud technique (see later) and are the key to identifying UX problems and their causes. Both objective and subjective data can be either qualitative or quantitative.
12.5 SOME DATA COLLECTION TECHNIQUES
12.5.1 Critical Incident Identification
The key objective of formative evaluation is to identify defects in the interaction design so that we can fix them. But during an evaluation session, you cannot always see the interaction design flaws directly. What we can observe directly or indirectly are the effects of those design flaws on the users. We refer to such effects on the users during interaction as critical incidents. Much of the attention of evaluators in evaluation sessions observing usage is spent looking for and identifying critical incidents.
Critical incidents
Despite numerous variations in procedures for gathering and analyzing critical incidents, researchers and practitioners agree about the definition of a critical incident. A critical incident is an event observed within task performance
that is a significant indicator of some factor defining the objective of the study (Andersson & Nilsson, 1964).
In the UX literature (Castillo & Hartson, 2000; del Galdo, et al., 1986), critical incidents are indicators of "something notable" about usability or the user experience. Sometimes that notable indication is about something good in the user experience, but the way we usually use it is as an indicator of things that go wrong in the stream of interaction details, indicators of UX problems or features that should be considered for redesign.
The best kind of critical incident data are detailed, observed during usage, and associated closely with specific task performance. The biggest reason why lab-based UX testing is effective is that it captures exactly that kind of detailed usage data as it occurs.
Critical incidents are observed directly by the facilitator or other observers and are sometimes expressed by the user participant. Some evaluators wait for an obvious user error or task breakdown to record as a critical incident. But an experienced facilitator can observe a user hesitation, a participant comment in passing, a head shaking, a slight shrugging of the shoulders, or drumming of fingers on the table. A timely facilitator request for clarification might help determine if any of these subtle observations should be considered a symptom of a UX problem.
Critical incident data about a UX problem should contain as much detail as possible, including contextual information (see the sketch after this list), such as:
- the user's general activity or task
- objects or artifacts involved
- the specific user intention and action that led immediately to the critical incident
- expectations of the user about what the system was supposed to do when the critical incident occurred
- what happened instead
- as much as possible about the mental and emotional state of the user
- indication of whether the user could recover from the critical incident and, if so, a description of how the user did so
- additional comments or suggested solutions to the problem
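As promised above, here is a minimal sketch of one possible structured record for capturing these contextual details as a critical incident is observed. The field names and the sample incident are illustrative assumptions on our part, not a standard schema from the UX literature.

```python
# Sketch of a structured critical incident record (field names are illustrative,
# not a standard schema); capturing these details immediately keeps them from
# being lost to memory abstraction.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CriticalIncident:
    task: str                        # the user's general activity or task
    artifacts: List[str]             # objects or artifacts involved
    user_intention: str              # intention and action leading to the incident
    user_expectation: str            # what the user expected the system to do
    what_happened: str               # what happened instead
    user_state: str                  # mental/emotional state, as observed or expressed
    recovered: bool                  # could the user recover?
    recovery_description: Optional[str] = None
    comments: List[str] = field(default_factory=list)  # suggestions, possible solutions

# Hypothetical example of a recorded incident
incident = CriticalIncident(
    task="Renaming a saved report",
    artifacts=["report list screen", "rename dialogue box"],
    user_intention="Rename the report before sharing it",
    user_expectation="New name accepted and shown in the report list",
    what_happened="Error message with no explanation; old name kept",
    user_state="Confused, re-read the message twice",
    recovered=True,
    recovery_description="Retyped the name without a trailing space",
)
```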
Relevance of critical incident data
Critical incident identification is arguably the single most important source of qualitative data in formative evaluation. These detailed data, perishable if
not captured immediately and precisely as they arise during usage, are essential for isolating specific UX problems within the user interaction design.
History of critical incident data
The origins of the critical incident technique can be traced back at least to studies performed in the Aviation Psychology Program of the U.S. Army Air Forces in World War II to analyze and classify pilot error experiences in reading and interpreting aircraft instruments. The technique was first formally codified by the work of Fitts and Jones (1947). Flanagan (1954) synthesized the landmark critical incident technique.
Mostly used as a variation
When Flanagan designed the critical incident technique in 1954, he did not see it as a single rigid procedure. He was in favor of modifying the technique to meet different needs as long as the original criteria were met. The variation occurring over the years, however, may have been more than Flanagan anticipated.
Forty years after the introduction of Flanagan's critical incident technique, Shattuck and Woods (1994) reported a study revealing that the technique had rarely been used as originally published. In fact, numerous variations of the method were found, each suited to a particular field of interest. In HCI, we have continued this tradition of adaptation by using our own version of the critical incident technique as a primary UX evaluation technique to identify UX problems and their causes.
Critical incident reporting tools
Human factors and human-computer interaction researchers have developed software tools to assist in identifying and recording critical incident information. del Galdo et al. (1986) investigated the use of critical incidents as a mechanism to collect end-user reactions for simultaneous design and evaluation of both online and hard-copy documentation. As part of this work, del Galdo et al. (1986) designed a software tool to collect critical incidents from user subjects.
Who identifies critical incidents?
One factor in the variability of the critical incident technique is the issue of who makes the critical incident identification. In the original work by Fitts and Jones (1947), the user (an airplane pilot) reported the critical incidents after task performance was completed. Flanagan (1954) used trained observers to collect critical incident information while observing users performing tasks.
del Galdo et al. (1986) involved users in identifying their own critical incidents, reporting during task performance. The technique was also used as a self-reporting mechanism by Hartson and Castillo (1998, 2000) as the basis for
remote system or product usability evaluation. Further, Dzida, Wiethoff, and Arnold (1993) and Koenemann-Belliveau et al. (1994) adopted the stance that identifying critical incidents during task performance can be an individual process by either the user or an evaluator or a mutual process between the user and an evaluator.
Timing of critical incident data capture: The evaluator's awareness zone
While users are known to report major UX problems in alpha and beta testing (sending software out for comments on how well it worked), one reason these methods cannot be relied upon for thorough identification of UX problems to fix is the retrospective nature of that kind of data collection. Lab-based UX evaluation has the advantage of having the precious and volatile details right in front of you as they happen. The key to this kind of UX data is in the details, and details of these data are perishable; they must be captured immediately as they arise during usage.
As a result, do not lose this advantage; capture and document the details while they are fresh (and not just by letting the video recorder run). If you capture them as they happen, we call it concurrent data capture. If you capture data immediately after the task, we call it contemporaneous data capture. If you try to capture data after the task is well over, through someone trying to remember the details in an interview or survey after the session, this is retrospective data capture and many of the once-fresh details can be lost.
It is not as easy, however, as just capturing critical incident data immediately upon its occurrence. A critical incident is often not immediately recognized for what it is. In Figure 12-4, the evaluator's recognition of a critical incident will necessarily occur sometime after it begins to occur. And following the point of initial awareness, after confirming that it is a critical incident, the evaluator requires some time and thought in a kind of "awareness zone" to develop an understanding of the problem, possibly through discussion with the participant.
Figure 12-4: Critical incident description detail vs. time after critical incident.
The optimum time to report the problem, the time when the potential for a quality problem report is highest, is at the peak of this problem understanding, as seen in Figure 12-4. Before that point, the evaluator has not yet established a full understanding of the problem. After that optimum point, natural abstraction due to human memory limitations sets in and details drop off rapidly with time, accelerated by proactive interference from any intervening tasks.
12.5.2 The Think-Aloud Technique
Also called "think-aloud protocol" or "verbal protocol," the think-aloud technique is a qualitative data collection technique in which user participants, as the name implies, express verbally their thoughts about their interaction experience, including their motives, rationale, and perceptions of UX problems. By this method, participants let us in on their thinking, giving us access to a precious understanding of their perspective of the task and the interaction design, their expectations, strategies, biases, likes, and dislikes.
Why use the think-aloud technique?
General observational data are important during an evaluation session with a participant attempting to perform a task. You can see what parts of a task the user is having trouble with, you can see hesitations about using certain widgets, and so on. But the bulk of real UX problem data is hidden from observation, in the mind of the participant. What is really causing a hesitation and why does this participant perceive it as a problem or barrier? To get the best qualitative data, you have to tap into this hidden data, buried in the participant's mind, which is the goal of the think-aloud technique.
The think-aloud technique is simple to use, for both analyst and participant. It is useful when a participant walks through a prototype or helps you with a UX inspection. Nielsen (1993, p. 195) says "thinking aloud may be the single most valuable usability engineering method." It is effective in accessing user intentions, what they are doing or are trying to do, and their motivations, the reasons why they are doing any particular actions. The think-aloud technique is also effective in assessing emotional impact because emotional impact is felt internally and the internal thoughts and feelings of the user are exactly what the think-aloud technique accesses for you.
The think-aloud technique can be used in both rigorous empirical methods (lab-based) and rapid empirical methods (quasi-empirical and RITE), that is, any UX evaluation method that involves a user "participant." Variations of this simple technique were rooted in psychological and human factors experimentation well before it was used in usability engineering (Lewis, 1982).
What kind of participant works best?
Some participants can talk while working; get them if you can. The usual participant for think-aloud techniques is someone who matches the work role and user class definitions associated with the tasks you will use to drive the evaluation. This kind of participant will not be trained as a UX practitioner, but that usually will not deter them from offering opinions and theories about UX problems and causes in your design, which is what you want.
You must remember, however, that it is your job to accept their comments as inputs to your process and it is still up to you to filter and interpret all think-aloud data in the context of your design. Participants and think-aloud techniques are not a substitute for you doing your job.
So, if think-aloud participants are not typically trained UX practitioners, what about using participants who are? You can use trained UX practitioners as participants, if they are not stakeholders in your project. They will give a different perspective on your design, often reflecting a better and deeper understanding and analysis. But their analysis may not be accurate from a work- domain or task perspective, so you are still responsible for filtering and interpreting their comments.
Is thinking aloud natural for participants?
It depends on the participant. Some people find it natural to talk while they work. Some people are able and only too willing to share their thoughts while working or while doing anything. Others are naturally quiet or contemplative and must be prompted to verbalize their thoughts as they work.
Some people, in the cannot-walk-and-chew-gum category, have difficulty expressing themselves while performing a physical task-activities that require help from different parts of the brain. For these people, you should slow things down and let them "rest" and talk only between actions. Even the most loquacious of participants at times will need prompting, "what are you thinking about?" or "what do you think you should do next?"
Also, sometimes participants ask questions, such as "what if I click on this?," to which your reply should encourage them to think it through themselves, "what do you think will happen?" When an interaction sequence leads a participant to
surprise, confusion, or bewilderment-even fleetingly-ask them, "was that what you expected?" or "what did you expect?"
How to manage the think-aloud protocol?
The think-aloud technique is intended to draw out cognitive activity, not to confirm observable behavior. Therefore, your instructions to participants should emphasize telling you what they are thinking, not describing what they are doing. You want them to tell you why they are taking the actions that you can observe.
Once your participants get into thinking aloud, they may tend to keep the content at a chatty conversational level. You may need to encourage them to get past the "it is fine" stage and get down into real engagement and introspection. And sometimes you have to make an effort to keep the think-aloud comments flowing; some participants will not naturally maintain thinking aloud while they work and will have to be prodded gently.
Seeing a participant sit there silently struggling with an unknown problem tells us nothing about the problem or the design. Because we are trying to extract as much qualitative data as possible from each participant, elicitation of participant thoughts is a valuable facilitator skill. You might even consider a brief practice lesson on thinking aloud with each participant before you start the session itself.
Retrospective think-aloud techniques
If, as facilitator, you perceive that the think-aloud technique, when used concurrently with task performance, is interfering in some way with task performance or task measurements, you can wait until after task completion (hence the name "retrospective") and review a video recording of the session with the participant, asking for more complete "thinking aloud" during this review of his or her own task performance. In this kind of retrospective think-aloud technique, the participant is acting less as a task performer and more as an observer and outside "commentator," but with detailed inside information. The audio of this verbal review can be recorded on a second track in synchronism with the video, for further analysis later, if necessary.
This approach has the advantage of capturing the maximum amount of think-aloud data but has the obvious downside of requiring a total of at least twice as much time as just the session itself. It also suffers from the time lag after the actual events. While better than a retrospective review even later, some details will already be fading from the participant's memory.
Co-discovery think-aloud techniques
Using a single participant is the dominant paradigm in usability and UX testing literature. Single users often represent typical usage and you want to be sure that single users can work their way through the tasks. However, you may also wish to try the increasingly common practice of using two or more participants in a team approach, a technique that originated with O'Malley, Draper, and Riley (1984). Kennedy (1989) named it "co-discovery" and that name has stuck.
While it can seem unnatural and inhibiting for a lone participant to be thinking aloud, essentially talking to oneself, there is more ease in talking in a natural conversation with another person (Wildman, 1995). A single individual participant can have trouble remembering to verbalize, but it is just natural with a partner. When the other person is also verbalizing in a problem-solving context, it amounts to a real and natural conversation, making this approach increasingly popular with practitioners and organizations.
Hackman and Biers (1992) found that multiple participants, while slightly more expensive, resulted in more time spent in verbalizing and, more importantly, participant teams spent more time verbalizing statements that had high value as feedback for designers.
Co-discovery is an especially good method for early low-fidelity prototypes; it gets you more viewpoints. And there is less need for the facilitator to interact, intervene, or give hints. The participants ask each other the questions and give themselves the guidance and prodding. When one participant gets stuck, the other can suggest things to try. When two people act as a real team, they are more willing to try new things and less intimidated than if they had been working solo.
Co-discovery pays off best when the participants' thinking aloud becomes an interactive conversation, but this can produce qualitative data at a prodigious rate, at times more than twice as fast as from one participant. Two co-participants can bounce thoughts and comments back and forth. You may have to switch from a zone defense, where each data collector does their best to catch what comes their way, to a person-to-person arrangement where each data collector is assigned to focus on comments by just one of the participants.
This is where video capture makes for a good back-up to review selectively if you think you might have missed something.
There are many ways for a co-discovery session to play out. The scenarios often depend on participant personalities and who takes the lead. You should let them take turns at driving; both still have to pay attention. In cases where one participant has a dominant personality to the point where that person wants
to run things and perhaps thinks he or she knows it all, try to get the other participant to drive as much as possible, to give them some control. If one person seems to drift off and lose attention or interest, you may have to use the usual techniques for getting school children re-engaged in an activity, "Johnny, what do you think about that problem?"
Finally, as a very practical bonus from planning a co-discovery session, if one participant does not show up, you can still do the session. You avoid the cost of an idle slot in the lab and having to reschedule.
Does thinking aloud affect quantitative task performance metrics in lab-based evaluation?
It depends on the participant. Some participants can naturally chat about what they are doing as they work. For these participants, the concurrent think-aloud technique usually does not affect task performance when used with measured benchmark tasks.
This is especially true if the participant is just thinking aloud and not engaged with questions and answers by the facilitator. But for some participants, the think-aloud protocol does affect task performance. This is especially true for non-native speakers because their verbalizations just take longer.
12.5.3 Questionnaires
A questionnaire is the primary instrument for collecting subjective data from participants in all types of evaluations. It can be used to supplement objective (directly observable) data from lab-based or other data collection methods or as an evaluation method on its own. A questionnaire can contain probing questions about the total user experience. Although post-session questionnaires have been used primarily to assess user satisfaction, they can also contain effective questions oriented specifically toward evaluating broader emotional impact and usefulness of the design.
Questionnaires are a self-reporting data collection technique and, as Shih and Liu (2007) say, semantic differential questionnaires (see next section) are used most commonly because they are a product-independent method that can yield reliable quantitative subjective data. This kind of questionnaire is inexpensive to administer but requires skill to produce so that data are valid and reliable.
In the days of traditional usability, questionnaires were used mostly to assess self-reported user satisfaction. And they were "seen as a poor cousin to [measures of] efficiency" (Winograd & Flores, 1986), but Lund (2001, 2004) points out that subjective metrics, such as the kind one gets from questionnaire results, are often effective at getting at the core of the user experience and can
access "aspects of the user experience that are most closely tied to user behavior and purchase decisions."
Semantic differential scales
A semantic differential scale, or Likert scale (Likert, 1932), is a range of values describing an attribute. Each value on the scale represents a different level of that attribute. The most extreme value in each direction on the scale is called an anchor. The scale is then divided, usually in equal divisions, with points between the anchors that divide up the difference between the meanings of the two anchors.
The number of discrete points we have on the scale between and including the anchors is the granularity of the scale, or the number of choices we allow users in expressing their own levels of the attribute. The most typical labeling of a point on a scale is a verbal label with an associated numeric value but it can also be pictorial.
For example, consider the following statement for which we wish to get an assessment of agreement by the user: "The checkout process on this Website was easy to use." A corresponding semantic differential scale for the "agreement" attribute to assess the user's level of agreement might have these anchors: strongly agree and strongly disagree. If the scale has five values, including the anchors, there are three points on the scale between the anchors. For example, the agreement scale might include strongly agree, agree, neutral, disagree, and strongly disagree
with the associated values, respectively, of +2, +1, 0, -1, and -2.
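To make the arithmetic of such a scale concrete, here is a minimal Python sketch of the five-point agreement scale just described; the participant responses are hypothetical, and the mapping simply mirrors the labels and values given in the text.

```python
# Sketch of a five-point semantic differential (Likert-type) agreement scale:
# anchors "strongly agree" / "strongly disagree", granularity of 5, with the
# numeric values +2 .. -2 from the text. Responses are hypothetical.

agreement_scale = {
    "strongly agree":     2,
    "agree":              1,
    "neutral":            0,
    "disagree":          -1,
    "strongly disagree": -2,
}

# Hypothetical participant responses to the statement
# "The checkout process on this Website was easy to use."
responses = ["agree", "strongly agree", "neutral", "agree", "disagree"]

scores = [agreement_scale[r] for r in responses]
print(f"Mean rating: {sum(scores) / len(scores):+.2f}")  # prints +0.60
```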
The Questionnaire for User Interface Satisfaction (QUIS)
The QUIS, developed at the University of Maryland (Chin, Diehl, & Norman, 1988), is one of the earliest available user satisfaction questionnaires for use with usability evaluation. It was the most extensive and most thoroughly validated questionnaire at the time of its development for determining subjective interaction design usability.
The QUIS is organized around such general categories as screen, terminology and system information, learning, and system capabilities. Within each of these general categories are sets of questions about detailed features, with Likert scales from which a participant chooses a rating. It also elicits some demographic information, as well as general user comments about the interaction design being evaluated. Many practitioners supplement the QUIS with some of their own questions, specific to the interaction design being evaluated.
The original QUIS had 27 questions (Tullis & Stetson, 2004), but there have been many extensions and variations. Although developed originally for screen- based designs, the QUIS is resilient and can be extended easily, for example, by replacing the term "system" with "Website" and "screen" with "Web page."
Practitioners are free to use the results of a QUIS questionnaire in any reasonable way. In much of our use of this instrument, we calculated the average scores, averaged over all the participants and all the questions in a specified subset of the questionnaire. Each such subset was selected to correspond to the goal of a UX target, and the numeric value of this score averaged over the subset of questions was compared to the target performance values stated in the UX target table.
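A minimal sketch of that averaging calculation, using hypothetical ratings, a hypothetical question subset, and a hypothetical target level, might look like the following; it only illustrates the arithmetic, not any official QUIS scoring procedure.

```python
# Sketch (hypothetical data): averaging QUIS-style ratings over all participants
# and over a subset of questions tied to one UX target, then comparing the
# result with the target level from the UX target table.

ratings = {
    # participant -> {question number: rating on the 0..10 QUIS scale}
    "P1": {1: 7, 2: 8, 3: 6, 4: 9},
    "P2": {1: 8, 2: 7, 3: 7, 4: 8},
    "P3": {1: 6, 2: 9, 3: 8, 4: 7},
}

target_questions = [2, 3]  # subset corresponding to one UX target's goal
target_level = 7.0         # hypothetical value from the UX target table

subset_scores = [r[q] for r in ratings.values() for q in target_questions]
average = sum(subset_scores) / len(subset_scores)

print(f"Average over subset: {average:.2f}")
print("Meets UX target" if average >= target_level else "Below UX target")
```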
Although the QUIS is quite thorough, it can be administered in a relatively short time. For many years, a subset of the QUIS was our own choice as the questionnaire to use in both teaching and consulting.
The QUIS is still being updated and maintained and can be licensed for a modest fee from the University of Maryland Office of Technology Liaison (http://lap.umd.edu/quis/). In Table 12-1 we show a sample excerpted and adapted with permission from the QUIS with fairly general applicability, at least to desktop applications.
Table 12-1: An excerpt from QUIS, with permission
For each question, please circle the number that most appropriately reflects your impressions about this topic with respect to using this computer system or product.
1. Terminology relates to task domain: [distantly] 0 1 2 3 4 5 6 7 8 9 10 [closely]  NA
2. Instructions describing tasks: [confusing] 0 1 2 3 4 5 6 7 8 9 10 [clear]  NA
3. Instructions are consistent: [never] 0 1 2 3 4 5 6 7 8 9 10 [always]  NA
4. Operations relate to tasks: [distantly] 0 1 2 3 4 5 6 7 8 9 10 [closely]  NA
5. Informative feedback: [never] 0 1 2 3 4 5 6 7 8 9 10 [always]  NA
6. Display layouts simplify tasks: [never] 0 1 2 3 4 5 6 7 8 9 10 [always]  NA
7. Sequence of displays: [confusing] 0 1 2 3 4 5 6 7 8 9 10 [clear]  NA
8. Error messages are helpful: [never] 0 1 2 3 4 5 6 7 8 9 10 [always]  NA
9. Error correction: [confusing] 0 1 2 3 4 5 6 7 8 9 10 [clear]  NA
10. Learning the operation: [difficult] 0 1 2 3 4 5 6 7 8 9 10 [easy]  NA
11. Human memory limitations: [overwhelmed] 0 1 2 3 4 5 6 7 8 9 10 [are respected]  NA
12. Exploration of features: [discouraged] 0 1 2 3 4 5 6 7 8 9 10 [encouraged]  NA
13. Overall reactions:
    [terrible] 0 1 2 3 4 5 6 7 8 9 10 [wonderful]  NA
    [frustrating] 0 1 2 3 4 5 6 7 8 9 10 [satisfying]  NA
    [uninteresting] 0 1 2 3 4 5 6 7 8 9 10 [interesting]  NA
    [dull] 0 1 2 3 4 5 6 7 8 9 10 [stimulating]  NA
    [difficult] 0 1 2 3 4 5 6 7 8 9 10 [easy]  NA
The System Usability Scale (SUS)
The SUS was developed by John Brooke while at Digital Equipment Corporation (Brooke, 1996) in the United Kingdom. The SUS questionnaire contains 10 questions. As an interesting variation from the usual questionnaire, the SUS alternates positively worded questions with negatively worded questions to prevent quick answers without the responder really considering the questions.
The questions are presented as simple declarative statements, each with a five-point Likert scale anchored with "strongly disagree" and "strongly agree" and with values of 1 through 5. These 10 statements are (used with permission):
- I think that I would like to use this system frequently
- I found the system unnecessarily complex
- I thought the system was easy to use
- I think that I would need the support of a technical person to be able to use this system
- I found the various functions in this system were well integrated
- I thought there was too much inconsistency in this system
- I would imagine that most people would learn to use this system very quickly
- I found the system very cumbersome to use
- I felt very confident using the system
- I needed to learn a lot of things before I could get going with this system
The 10 items in the SUS were selected from a list of 50 possibilities, chosen for their perceived discriminating power.
The bottom line for the SUS is that it is robust, extensively used, widely adapted, and in the public domain. It has been a very popular questionnaire for complementing a usability testing session because it can be applied at any stage in the UX lifecycle and is intended for practical use in an industry context. The SUS is technology independent; can be used across a broad range of kinds of systems, products, and interaction styles; and is fast and easy for both analyst and participant. The single numeric score (see later) is easy to understand by everyone. Per Usability Net (2006), it is the most highly recommended of all the publicly available questionnaires.
There is theoretical debate in the literature about the dimensionality of SUS scoring methods (Bangor, Kortum, & Miller, 2008; Borsci, Federici, & Lauriola, 2009; J. Lewis & Sauro, 2009). However, the practical bottom line for the SUS, regardless of these formal conclusions, is that the unidimensional approach to scoring of SUS (see later) has been working well for many
practitioners over the years and is seen as a distinct advantage. The single score that this questionnaire yields is understood easily by almost everyone.
The analysis of SUS scores begins with calculating the single numeric score for the instance of the questionnaire marked up by a participant. First, for any unanswered items, assign a middle rating value of 3 so that it does not affect the outcome on either side.
Next we calculate the adjusted score for positively worded items. Because we want the range to begin with 0 (so that the overall score can range from 0), we shift the scores for positively worded items down by subtracting 1, giving us a new range of 0 to 4.
To calculate the adjusted score for negatively worded items, we must compensate for the fact that these scales go in the opposite direction of positively worded scales. We do this by giving the negatively worded items an adjusted score of 5 minus the rating value given, also giving us a range of 0 to 4.
Next, add up the adjusted item scores for all 10 questions, giving a range of 0 to 40. Finally, multiply by 2.5 to get a final SUS score in the range of 0 to 100.
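Expressed as code, the scoring procedure just described can be captured in a few lines. The following is a minimal Python sketch; the alternation of positively worded (odd-numbered) and negatively worded (even-numbered) items follows the standard SUS ordering given earlier, and the sample ratings are invented:

    def sus_score(ratings):
        """Compute the SUS score (0-100) from the ten item ratings (1-5).

        Odd-numbered items are positively worded; even-numbered items are
        negatively worded. Use None for an unanswered item; it is assigned
        the neutral middle rating of 3.
        """
        assert len(ratings) == 10
        total = 0
        for i, r in enumerate(ratings, start=1):
            if r is None:       # unanswered item gets the middle value
                r = 3
            if i % 2 == 1:      # positively worded: shift 1-5 down to 0-4
                total += r - 1
            else:               # negatively worded: reverse to 0-4
                total += 5 - r
        return total * 2.5      # 0-40 scaled up to 0-100

    print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, None]))  # 80.0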
What about interpreting the SUS score? Often a numerical score yielded by any evaluation instrument is difficult to interpret by anyone outside the evaluation team, including project managers and the rest of your project team. Given a single number out of context, it is difficult to know what it means about the user experience. The score provided by the SUS questionnaire, however, has the distinct advantage of being in the range of zero to 100.
By using an analogy with numerical grading schemes in schools based on a range of 0 to 100, Bangor, Kortum, and Miller (2008) found it practical and feasible to extend the school grading interpretation of numeric scores into letter grades, with the usual 90 to 100 being an "A," and so on (using whatever numeric-to-letter-grade mapping you wish). Although this translation has no theoretical or empirical basis, this simple notion does seem to be an effective way to communicate the results, and using a one-number score normalized to a base of 100 even allows you to compare systems that are dissimilar.
Clearly, an evaluation grade of "A" means the design did well, and an evaluation of "D" or lower means that some improvement is indicated. At the end of the day, each project team will have to decide what the SUS scores mean to them.
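If you do adopt a school-style interpretation, a simple mapping such as the following Python sketch is enough; the 90/80/70/60 cutoffs are just one common convention, not part of the SUS itself:

    def sus_letter_grade(score):
        """Map a 0-100 SUS score to a letter grade using common school cutoffs."""
        if score >= 90:
            return "A"
        if score >= 80:
            return "B"
        if score >= 70:
            return "C"
        if score >= 60:
            return "D"
        return "F"

    print(sus_letter_grade(80.0))  # B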
The Usefulness, Satisfaction, and Ease of Use (USE) Questionnaire
With the goal of measuring the most important dimensions of usability for users across many different domains, Lund (2001, 2004) developed USE, a questionnaire for evaluating the user experience on three dimensions:
usefulness, satisfaction, and ease of use. USE is based on a seven-point Likert scale.
Through a process of factor analysis and partial correlation, the questions in Table 12-2 were chosen for inclusion in USE per Lund. As the questionnaire is still under development, this set of questions is a bit of a moving target.
The bottom line for USE is that it is widely applicable, for example, to systems, products, and Websites, and has been used successfully. It is available in the public domain and has good face validity for both users and practitioners, that is, it looks right intuitively and people agree that it should work.
Other questionnaires
Here are some other questionnaires that are beyond our scope but might be of interest to some readers.
Table 12-2: Questions in USE questionnaire

Usefulness
- It helps me be more effective.
- It helps me be more productive.
- It is useful.
- It gives me more control over the activities in my life.
- It makes the things I want to accomplish easier to get done.
- It saves me time when I use it.
- It meets my needs.
- It does everything I would expect it to do.

Ease of use
- It is easy to use.
- It is simple to use.
- It is user-friendly.
- It requires the fewest steps possible to accomplish what I want to do with it.
- It is flexible.
- Using it is effortless.
- I can use it without written instructions.
- I do not notice any inconsistencies as I use it.
- Both occasional and regular users would like it.
- I can recover from mistakes quickly and easily.
- I can use it successfully every time.

Ease of learning
- I learned to use it quickly.
- I easily remember how to use it.
- It is easy to learn to use it.
- I quickly became skillful with it.

Satisfaction
- I am satisfied with it.
- I would recommend it to a friend.
- It is fun to use.
- It works the way I want it to work.
- It is wonderful.
- I feel I need to have it.
- It is pleasant to use.
General-purpose usability questionnaires:
- Computer System Usability Questionnaire (CSUQ), developed by Jim Lewis (1995, 2002) at IBM, is well-regarded and available in the public domain.
- Software Usability Measurement Inventory (SUMI) is "a rigorously tested and proven method of measuring software quality from the end user's point of view" (Human Factor Research Group, 1990).2 According to Usability Net,3 SUMI is "a mature questionnaire whose standardization base and manual have been regularly updated." It is applicable to a range of application types from desktop applications to large domain-complex applications.
- After Scenario Questionnaire (ASQ), developed by IBM, is available in the public domain (Bangor, Kortum, & Miller, 2008, p. 575).
- Post-Study System Usability Questionnaire (PSSUQ), developed by IBM, is available in the public domain (Bangor, Kortum, & Miller, 2008, p. 575).
Web evaluation questionnaires:
- Website Analysis and MeasureMent Inventory (WAMMI) is "a short but very reliable questionnaire that tells you what your visitors think about your web site" (Human Factor Research Group, 1996b).
Multimedia system evaluation questionnaires:
- Measuring the Usability of Multi-Media Systems (MUMMS) is a questionnaire "designed for evaluating quality of use of multimedia software products" (Human Factor Research Group, 1996a).
Hedonic quality evaluation questionnaires:
- The Lavie and Tractinsky (2004) questionnaire
- The Kim and Moon (1998) questionnaire with differential emotions scale
Modifying questionnaires for your evaluation
As an example of adapting a data collection technique, you can make up a questionnaire of your own or you can modify an existing questionnaire for your own use by:
- choosing a subset of the questions
- changing the wording in some of the questions
- adding questions of your own to address specific areas of concern
- using different scale values
2Human Factors Research Group (http://www.ucc.ie/hfrg/) questionnaires are available commercially as a service, on a per report basis or for purchase, including scoring and report-generating software.
3http://www.usabilitynet.org/tools/r_questionnaire.htm
On any questionnaire that does not already have its scale values centered on zero, you might consider making the scale something such as -2, -1, 0, 1, 2 to center it on the neutral value of zero. If the existing scale has an odd number of rating points, you can change it to an even number to force respondents to choose one side or the other of a middle value, but that is not essential here.
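As a small illustration of that recentering arithmetic (a Python sketch assuming a 1-to-5 scale; the function name is ours):

    def recenter(rating, low=1, high=5):
        """Shift a rating from a low..high scale to one centered on zero.

        For example, on a 1-to-5 scale: 1 -> -2.0, 3 -> 0.0, 5 -> +2.0.
        """
        midpoint = (low + high) / 2
        return rating - midpoint

    print([recenter(r) for r in [1, 2, 3, 4, 5]])  # [-2.0, -1.0, 0.0, 1.0, 2.0]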
Finally, one of the downsides of any questionnaire based only on semantic differential scales is that it does not allow the participant to give indications of why any rating is given, which is important for understanding what design features work and which ones do not, and how to improve designs. Therefore, we recommend you consider supplementing key questions (or do it once at the end) with a free-form question, such as "If notable, please describe why you gave that rating."
Modifying the Questionnaire for User Interface Satisfaction. We have found an adaptation of the QUIS to work well. In this adaptation, we reduce the granularity of the scale from 12 choices (0 through 10 and NA) to 6 (-2, -1, 0, 1, 2, and NA) for each question, reducing the number of choices the participant faces. We felt a midscale value of zero was an appropriately neutral value, while negative scale values corresponded to negative user opinions and positive scale values corresponded to positive user opinions. Some argue for an even number of numeric ratings to force users to make positive or negative choices; that is also an easy adaptation to the scale.
Modifying the System Usability Scale. In the course of their study of SUS, Bangor,
Kortum, and Miller (2008) provided an additional useful item for the questionnaire that you can use as an overall quality question, based on an adjective description. Getting away from the "strongly disagree" and "strongly agree" anchors, this adjective rating statement is: "Overall, I would rate the user-friendliness of this product as worst imaginable, awful, poor, ok, good, excellent, and best imaginable."
Not caring for the term "user-friendliness," we would add the recommendation to change that phrase to something else that works well for you. In studies by Bangor, Kortum, and Miller (2008), ratings assigned to this one additional item correlated well with scores of the original 10 items in the questionnaire. So, for the ultimate in inexpensive evaluation, this one questionnaire item could be used as a soft estimator of SUS scores.
In application, most users of the SUS recommend a couple of minor modifications. The first is to substitute the term "awkward" for the term "cumbersome" in item 8. Apparently, in practice, there has been uncertainty, especially among participants who were not native English speakers, about the meaning of "cumbersome" in this context. The second modification is to
substitute the term "product" for the term "system" in each item, if the questionnaire is being used to evaluate a commercial product.
Along these same lines, you should substitute "Website" for "system" when using the SUS to evaluate a Website. However, use caution when choosing the SUS as a measuring instrument for evaluating Websites. According to Kirakowski and Murphy (2009), the SUS is inappropriate for evaluating Websites because it tends to yield erroneously high ratings. They recommend using the WAMMI instead (mentioned previously). As one final caveat about using the SUS, Bangor, Kortum, and Miller (2008) warn that in an empirical study they found that SUS scores did not always correlate with their observations of success in task performance.
Warning: Modifying a questionnaire can damage its validity. At this point, the purist may be worried about validity. Ready-made questionnaires are usually created and tested carefully for statistical validity. A number of already developed and validated questionnaires are available for assessing usability, usefulness, and emotional impact.
For most things in this book, we encourage you to improvise and adapt and that includes questionnaires. However, you must do so armed with the knowledge that any modification, especially by one not expert in making questionnaires, carries the risk of undoing the questionnaire validity. The more modifications, the more the risk. The methods for, and issues concerning, questionnaire validation are beyond the scope of this book.
Because of this risk to validity, homemade questionnaires and unvalidated modifications to questionnaires are not allowed in summative evaluation but are often used in formative evaluation. This is not an invitation to be slipshod; we are just allowing ourselves to not have to go through validation for sensible modifications made responsibly. Damage resulting from unvalidated modifications is less consequential in formative evaluation. UX practitioners modify and adapt existing questionnaires to their own formative needs, usually without much risk of damaging validity.
12.5.4 Data Collection Techniques Especially for Evaluating Emotional Impact
Shih and Liu (2007), citing Dormann (2003), describe emotional impact in terms of its indicators: "Emotion is a multifaceted phenomenon which people deliver through feeling states, verbal and non-verbal languages, facial expressions, behaviors, and so on." Therefore, these are the things to "measure" or at least observe or ask about. Tullis and Albert devote an entire chapter (2008, Chapter 7) to the subject. For a "Usability Test Observation Form," a
comprehensive list of verbal and non-verbal behaviors to be noted during observation, see Tullis and Albert (2008, p. 170).
Indicators of emotional impact are usually either self-reported via verbal techniques, such as questionnaires, or physiological responses observed and measured in participants with non-verbal techniques.
Self-reported indicators of emotional impact
While extreme reactions to a bad user experience can be easy to observe and understand, we suspect that the majority of emotional impact involving aesthetics, emotional values, and simple joy of use may be perceived and felt by the user but not necessarily by the evaluator or other practitioner. To access these emotional reactions, we must tap into the user's subjective feelings; one effective way to do that is to have the user or participant do the reporting. Thus, verbal participant self-reporting techniques are a primary way that we collect emotional impact indicators.
Participants can report on emotional impact within their usage experience during usage via their direct commentary collected with the think-aloud technique. The think-aloud technique is especially effective in accessing the emotional impact within user experience because users can describe their own feelings and emotional reactions and can explain their causes in the usage experience.
Questionnaires, primarily those using semantic differential scales, are also an effective and frequently used technique for collecting self-reported retrospective emotional impact data by surveying user opinions about
specific predefined aspects of user experience, especially emotional impact.
Other self-reporting techniques include written diaries or logs describing emotional impact encounters within usage experience. As a perhaps more spontaneous alternative to written reports, participants can report these encounters via voice recorders or phone messages.
Being subjective, quantitative, and product independent, questionnaires as a self-reporting technique have the advantages of being easy to use for both practitioners and users, inexpensive, applicable from earliest design sketches and mockups to fully operational systems, and high in face validity, which means that intuitively they seem as though they should work (Westerman, Gardner, & Sutherland, 2006).
However, self-reporting can be subject to bias because human users cannot always access the relevant parts of their own emotions. Obviously, self-reporting techniques depend on the participant's ability to be consciously aware of subjective emotional states and to articulate them in a report.
Questionnaires as a verbal self-reporting technique for collecting emotional impact data (AttrakDiff and others)
Questionnaires about emotional impact allow you to pose to participants probing questions based on any of the emotional impact factors, such as joy of use, fun, and aesthetics, offering a way for users to express their feelings about this part of the user experience.
AttrakDiff, developed by Hassenzahl, Burmester, and Koller (2003), is an example of a questionnaire especially developed for getting at user perceptions of emotional impact. AttrakDiff (now AttrakDiff2), based on Likert (semantic differential) scales, is aimed at evaluating both pragmatic (usability plus usefulness) and hedonic4 (emotional impact) quality in a product or system.
Reasons for using the AttrakDiff questionnaire for UX data collection include the following:
- AttrakDiff is freely available.
- AttrakDiff is short and easy to administer, and the verbal scale is easy to understand (Hassenzahl, Beu, & Burmester, 2001; Hassenzahl et al., 2000).
- AttrakDiff is backed with research and statistical validation. Although only the German-language version of AttrakDiff was validated, there is no reason to believe that the English version will not also be effective.
- AttrakDiff has a track record of successful application.
With permission, we show it in full in Table 12-3 as it appears in Hassenzahl, Schöbel, and Trautman (2008, Table 1).
Across the many versions of AttrakDiff that have been used and studied, there are broad variations in the number of questionnaire items, the questions used, and the language for expressing the questions (Hassenzahl et al., 2000).
Table 12-4 contains a variation of AttrakDiff developed by Schrepp, Held, and Laugwitz (2006), shown here with permission.
For a description of using AttrakDiff in an affective evaluation of a music television channel, see Chorianopoulos and Spinellis (2004).
Once an AttrakDiff questionnaire has been administered to participants, it is time to calculate the average scores. Begin by adding up all the values given by the participant, excluding all unanswered questions. If you used a numeric scale of 1 to 7 between the anchors for each question, the total will range from 1 times to 7 times the number of questions the participant answered.
4"Hedonic" is a term used mainly in the European literature that means about the same as emotional impact.
Table 12-3: AttrakDiff emotional impact questionnaire as listed by Hassenzahl, Schöbel, and Trautman (2008), with permission. Each scale item is listed with its two semantic anchors.

Pragmatic Quality 1: Comprehensible - Incomprehensible
Pragmatic Quality 2: Supporting - Obstructing
Pragmatic Quality 3: Simple - Complex
Pragmatic Quality 4: Predictable - Unpredictable
Pragmatic Quality 5: Clear - Confusing
Pragmatic Quality 6: Trustworthy - Shady
Pragmatic Quality 7: Controllable - Uncontrollable
Hedonic Quality 1: Interesting - Boring
Hedonic Quality 2: Costly - Cheap
Hedonic Quality 3: Exciting - Dull
Hedonic Quality 4: Exclusive - Standard
Hedonic Quality 5: Impressive - Nondescript
Hedonic Quality 6: Original - Ordinary
Hedonic Quality 7: Innovative - Conservative
Appeal 1: Pleasant - Unpleasant
Appeal 2: Good - Bad
Appeal 3: Aesthetic - Unaesthetic
Appeal 4: Inviting - Rejecting
Appeal 5: Attractive - Unattractive
Appeal 6: Sympathetic - Unsympathetic
Appeal 7: Motivating - Discouraging
Appeal 8: Desirable - Undesirable
For example, because there are 22 questions in the sample in Table 12-3, the total summed-up score will be in the range of 22 to 154 if all questions were answered. If you used a scale from -3 to +3 centered on zero, the range for the sum of 22 question scores would be -66 to +66. The final result for the questionnaire is the average score per question.
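A minimal Python sketch of that averaging, under the assumption that unanswered items are simply excluded, might look like the following; the sample scores are invented:

    def attrakdiff_average(item_scores):
        """Average score per answered item; unanswered items (None) are excluded.

        Works the same way for a 1-to-7 scale or a -3-to-+3 scale; the result
        is expressed on whichever scale the individual items use.
        """
        answered = [s for s in item_scores if s is not None]
        return sum(answered) / len(answered)

    # 22 hypothetical items on a -3 to +3 scale, two of them unanswered.
    scores = [2, 3, 1, 0, -1, 2, 2, None, 3, 1, 0,
              2, 1, -2, 3, 2, None, 1, 0, 2, 3, 1]
    print(attrakdiff_average(scores))  # 1.3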
Modifying AttrakDiff. In applying the AttrakDiff questionnaire in your own project, you can first make a choice among the different existing versions of AttrakDiff. You can then choose how many of those questions or items, and which ones, you wish to have in your version.
Table 12-4: A variation of the AttrakDiff emotional impact questionnaire, as listed in Appendix A1 of Schrepp, Held, and Laugwitz (2006), reordered to group related items together, with permission. Each item is listed with its two English anchors.
Pragmatic quality
PQ1: People centric - Technical
PQ2: Simple - Complex
PQ3: Practical - Impractical
PQ4: Cumbersome - Facile
PQ5: Predictable - Unpredictable
PQ6: Confusing - Clear
PQ7: Unmanageable - Manageable

Hedonic quality (identity)
HQI1: Isolates - Connects
HQI2: Professional - Unprofessional
HQI3: Stylish - Lacking style
HQI4: Poor quality - High quality
HQI5: Excludes - Draws you in
HQI6: Brings me closer to people - Separates me from people
HQI7: Not presentable - Presentable

Hedonic quality (stimulation)
HQS1: Original - Conventional
HQS2: Unimaginative - Creative
HQS3: Bold - Cautious
HQS4: Innovative - Conservative
HQS5: Dull - Absorbing
HQS6: Harmless - Challenging
HQS7: Novel - Conventional

Attractiveness
ATT1: Pleasant - Unpleasant
ATT2: Ugly - Pretty
ATT3: Appealing - Unappealing
ATT4: Rejecting - Inviting
ATT5: Good - Bad
ATT6: Repulsive - Pleasing
ATT7: Motivating - Discouraging
You then need to review the word choices and terminology used for each of the anchors and decide on the words that you think will be understood most easily and universally. For example, you might find "Pretty - Ugly" of the Schrepp et al. (2006) version a better set of anchors than "Aesthetic -
Unaesthetic" of the Hassenzahl version or you may wish to add "Interesting - Boring" to "Exciting - Dull" as suggested in Hassenzahl, Beu, and Burmester (2001).
Note also that the questions in AttrakDiff (or any questionnaire) represent strictly operational definitions of pragmatic and hedonic quality, and because you may have missed some aspects of these measures that are important to you, you can add your own questions to address issues you think are missing.
Alternatives to AttrakDiff. As an alternative to the AttrakDiff questionnaire, Hassenzahl, Beu, and Burmester (2001) have created a simple questionnaire of their own for evaluating emotional impact, also based on semantic differential scales. Their scales have the following easy-to-apply anchors (from their Figure 1):
- outstanding vs. second rate
- exclusive vs. standard
- impressive vs. nondescript
- unique vs. ordinary
- innovative vs. conservative
- exciting vs. dull
- interesting vs. boring
Like AttrakDiff, each scale in this questionnaire has seven possible ratings, including these end points, and the words were originally in German.
Verbal emotion measurement instruments, such as questionnaires, can assess mixed emotions because questions and scales in a questionnaire or images in pictorial tools can be made up to represent sets of emotions (Desmet, 2003). PrEmo, developed by Desmet, uses seven animated pictorial representations of pleasant emotions and seven unpleasant ones. Desmet concludes that "PrEmo is a satisfactory, reliable emotion measurement instrument in terms of applying it across cultures."
There is a limitation, however. Verbal instruments tend to be language dependent and, sometimes, culture dependent. For example, the vocabulary for different dimensions of a questionnaire and their end points are difficult to translate precisely. Pictorial tools can be the exception, as the language of pictures is more universal. Pictograms of facial expressions can sometimes express emotions elicited more effectively than verbal expression, but the question of how to draw the various pictograms most effectively is still an unresolved research challenge.
An example of another emotional impact measuring instrument is the Self- Assessment Manikin (SAM) (Bradley & Lang, 1994). SAM contains nine symbols
indicating positive emotions and nine indicating negative emotions. Often used for Websites and print advertisements, the SAM is administered during or immediately after user interaction. One problem with application after usage is that emotions can be fleeting and perishable.
Observing physiological responses as indicators of emotional impact
In contrast to self-reporting techniques, UX practitioners can obtain emotional impact indicator data through monitoring of participant physiological responses to emotional impact encounters as usage occurs. Usage can be teeming with user behaviors, including facial expressions, such as ephemeral grimaces or smiles, and body language, such as tapping of fingers, fidgeting, or scratching one's head, that indicate emotional impact.
Physiological responses can be "captured" either by direct behavioral observation or by instrumented measurements. Behavioral observations include those of facial expressions, gestural behavior, and body posture.
The emotional "tells" of facial and bodily expressions can be fleeting and subliminal, easily missed in real-time observation. Therefore, to capture facial expressions data and other similar observational data reliably, practitioners usually need to make video recordings of participant usage behavior and do frame-by-frame analysis. Methods for interpreting facial expressions have been developed, including one called the Facial Action Coding System (Ekman & Friesen, 1975).
Kim et al. (2008) remind us that while we can measure physiological effects, it is difficult to connect the measurements with specific emotions and with causes within the interaction. Their solution is to supplement with traditional synchronized video-recording techniques to correlate measurements with usage occurrences and behavioral events. But this kind of video review has disadvantages: the reviewing process is usually tedious and time-consuming, you may need an analyst trained in identifying and interpreting these expressions often within a frame-by-frame analysis, and even a trained analyst cannot always make the right call.
Fortunately, software-assisted recognition of facial expressions and gestures in video images is beginning to be feasible for practical applications. Software tools are now becoming available to automate real-time recognition and interpretation of facial expressions. A system called "faceAPI"5 from Seeing Machines is advertised to both track and understand faces. It comes as a software
5http://www.seeingmachines.com/product/faceapi/
module that you embed in your own product or application. An ordinary Webcam, focused on the user's face, feeds both faceAPI and any digital video- recording program with software-accessible time stamps and/or frame numbers.
Facial expressions do seem to be mostly culture independent, and you can capture expressions without interruption of the usage. However, there are limitations that generally preclude their use. The main limitation is that they are useful for only a limited set of basic emotions such as anger or happiness, but not mixed emotions. Dormann (2003) says it is, therefore, difficult to be precise about what kind of emotion is being observed.
In order to identify facial expressions, faceAPI must track the user's face during head movement that occurs in 3D with usage. The head-tracking feature outputs X, Y, Z position and head orientation coordinates for every video frame. The facial feature detection component of faceAPI tracks three points on each eyebrow and eight points around the lips.
The detection algorithm is "robust to occlusions, fast movements, large head rotations, lighting, facial deformation, skin color, beards, and glasses." This part of faceAPI outputs a real-time stream of facial feature data, time coordinated with the recorded video, that can be understood and interpreted via a suite of image-processing modules. The faceAPI system is a commercial product, but a free version is available to qualified users for non-commercial use.
Bio-metrics to detect physiological responses to emotional impact
The use of instrumented measurement of physiological responses in participants is called biometrics. Biometrics are about detection and measurement of autonomic or involuntary bodily changes triggered by nervous system responses to emotional impact within interaction events. Examples include changes in heart rate, respiration, perspiration, and eye pupil dilation. Changes in perspiration are measured by galvanic skin response measurements to detect changes in electrical conductivity.
Such nervous system changes can be correlated with emotional responses to interaction events. Pupillary dilation is an autonomous indication especially of interest, engagement, and excitement and is known to correlate with a number of emotional states (Tullis & Albert, 2008).
The downside of biometrics is the need for specialized monitoring equipment. If you can get some good measuring instruments and are trained to use them to get good measures, it does not get more "embodied" than this. But most equipment for measuring physiological changes is out of reach for the average UX practitioner.
It is possible to adapt a polygraph or lie detector, for example, to detect changes in pulse, respiration, and skin conductivity that could be correlated with emotional responses to interaction events. However, the operation of most of this equipment requires skills and experience in medical technology, and interpretation of raw data can require specialized training in psychology, all beyond our scope. Finally, the extent of culture independence of facial expressions and other physiological responses is not entirely known.
12.5.5 Data Collection Techniques to Evaluate Phenomenological Aspects of Interaction
Long-term studies required for phenomenological evaluation
Phenomenological aspects of interaction involve emotional impact, but emotional impact over time, not emotional impact in snapshots of usage as you might be used to observing in other kinds of UX evaluation. The new perspective that the phenomenological view brings to user experience requires a new kind of evaluation (Thomas & Macredie, 2002).
Phenomenological usage is a longitudinal effect in which users invite the product into their lives, giving it a presence in daily activities. As an example of a product with presence in someone's life, we know someone who carries a digital voice recorder in his pocket everywhere he goes. He uses it to capture thoughts, notes, and reminders for just about everything. He keeps it at his bedside while sleeping and always has it in his car when driving. It is an essential part of his lifestyle.
Thus, phenomenological usage is not about tasks but about human activities. Systems and products with phenomenological impact are understood through usage over time as users assimilate them into their lifestyles (Thomas & Macredie, 2002). Users build perceptions and judgment through exploration and learning as usage expands and emerges.
The timeline defining the user experience for this kind of usage starts even before first meeting the product, perhaps with the desire to own or use the product, researching the product and comparing similar products, visiting a store (physical or online), shopping for it, and beholding the packaging and product presentation. By the time long-term phenomenological studies are done, they really end up being case studies. The length of these studies does not necessarily mean large amounts of person-hours, but it can mean significant calendar time. Therefore, the technique will not fit with an agile method or any other approach based on a very short turnaround time.
It is clear that methods for studying and evaluating phenomenological aspects of interaction must be situated in the real activities of users to encounter a broad range of user experience occurring "in the wild." This means that you
cannot just schedule a session, bring in user participants, have them "perform," and take your data. Rather, this deeper importance of context usually means collecting data in the field rather than in the lab.
The best raw phenomenological data would come from constant attention to the user and usage, but it is seldom, if ever, possible to live with a participant 24/7 and be in all the places that a busy life takes a participant. Even if you could be with the participant all the time, you would find that most of the time you will observe just dead time when nothing interesting or useful is happening or when the participants are not even using the product. When events of interest do happen, they tend to be episodic in bursts, requiring special techniques to capture phenomenological data.
But, in fact, the only ones who can be there at all the times and places where usage occurs are the participants. Therefore, most of the phenomenological data collection techniques are self-reporting techniques or at least have self-reporting components: the participants themselves report on their own activities, thoughts, problems, and kinds of usage. Self-reporting techniques are not as objective as direct observation, but they do offer practical solutions to the problem of accessing data that occur in your absence.
These longer term user experience studies are, in some ways, similar to contextual inquiry and even approach traditional ethnography in that they require "living with the users." The Petersen, Madsen, and Kjaer (2002) study of two families' usage of TV sets over 4 to 6 months is a good example of a phenomenological study in the context of HCI and UX.
The iPod is an example of a device that illustrates how usage can expand over time. At first it might be mostly a novelty to play with and to show friends. Then the user will add some applications, let us say the iBird Explorer: An Interactive Field Guide to Birds of North America.6 Suddenly usage is extended out to the deck and perhaps eventually into the woods. Then the user wants to consolidate devices by exporting contact information (address book) from an old PDA.
Finally, of course, the user will start loading it up with all kinds of music and books on audio. This latter usage activity, which might come along after several months of product ownership, could become the most fun and the most enjoyable part of the whole usage experience.
Goals of phenomenological data collection techniques Regardless of which technique is used for phenomenological data collection, the objective is to look for occurrences within long-term usage that are indicators of:
6http://www.ibird.com/
- ways people tend to use the product
- high points of joy in use, revealing what it is in the design that yields joy of use and opportunities to make it even better
- problems and difficulties people have in usage that interfere with a high-quality user experience
- usage people want but is not supported by the product
- how the basic mode of usage changes, evolves, or emerges over time
- how usage is adapted; new and unusual kinds of usage people come up with on their own
The idea is to be able to tell stories of usage and emotional impact over time.
Diaries in situated longitudinal studies
In one kind of self-reporting technique, each participant maintains a "diary," documenting problems, experiences, and phenomenological occurrences within long-term usage. There are many ways to facilitate this kind of data capture within self-reporting, including:
- paper and pencil notes
- online reporting, such as in a blog
- cellphone voice-mail messages
- pocket digital voice recorder
We believe that the use of voice-mail diaries for self-reporting on usage has importance that goes well beyond mobile phone studies. In another study (Petersen, Madsen, & Kjaer, 2002), phone reporting proved more successful than paper diaries because it could occur in the moment and had a much lower incremental effort for the participant. The key to this success is readiness at hand.
A mobile phone is, well, very mobile and can be kept ready to use at all times. Participants do not need to carry paper forms and a pen or pencil and can make the calls any time day or night and under conditions not conducive to writing reports by hand. Cellphones keep users in control during reporting; they can control the amount of time they devote to each report.
As Palen and Salzman (2002) learned, the mobile phone voice-mail method of data collection over time is also low in cost for analysts. Unlike paper reports, recorded voice reports are available immediately after their creation and
systematic transcription is fairly easy. They found that unstructured verbal data supplemented their other data very well and helped explain some of the observations or measurements they made.
Users often expressed subjective feelings, bolstering the phenomenological aspects of the study and relating phone usage to other aspects of their daily lives. These verbal reports, made at the crucial time following an incident, often mentioned issues that users forgot to bring up in later interviews, making
voice-mail reports a rich source of issues to follow up on in subsequent in-person interviews.
If a mobile phone is not an option for self-reporting, a compact and portable handheld digital voice recorder is a viable alternative. If you can train the participants to carry it essentially at all times, a dedicated personal digital recorder is an effective and low-cost tool for self-reporting usage phenomena in a long-term study.
Evaluator-triggered reporting for more representative data
Regardless of the reporting medium, there is still the question of when the self-reporting is to be done during long-term phenomenological evaluation. If you allow the participant to decide when to report, it could bias reporting toward times when it is convenient or times when things are going well with the product, or the participant might forget and you will lose opportunities to collect data.
To make the reporting more randomly timed and at a frequency of your choosing, and thereby more likely to capture representative phenomenological activity, you can be proactive in requesting reports.
Buchanan and Suri (2000) suggest that the participant be given a dedicated pager to carry at all times. You can then use the pager to signal randomly timed "events" to the participant "in the wild." As soon as possible after receiving the pager signal, the participant is to report on current or most recent product usage, including specific real-world usage context and any emotional impact being felt.
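If you want to script such randomly timed signals yourself, a sketch along the following lines would do; the waking-hours window and the number of prompts per day are assumptions for illustration, not part of the Buchanan and Suri technique:

    import random
    from datetime import datetime, timedelta

    def prompt_times(day, n_prompts=3, start_hour=9, end_hour=21, seed=None):
        """Pick randomly timed moments within waking hours to signal a participant."""
        rng = random.Random(seed)
        window_minutes = (end_hour - start_hour) * 60
        offsets = sorted(rng.sample(range(window_minutes), n_prompts))
        start = datetime(day.year, day.month, day.day, start_hour)
        return [start + timedelta(minutes=m) for m in offsets]

    for t in prompt_times(datetime(2011, 5, 2), seed=7):
        print(t.strftime("%H:%M"))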
Periodic questionnaires over time
Periodically administered questionnaires are another self-reporting technique for collecting phenomenological data. Questionnaires can be used efficiently with a large number of participants and can yield both quantitative and qualitative data. This is a less costly method that can get answers to predefined questions, but it cannot be used easily to give you a window into usage in context to reveal growth and emergence of use over time. As a last resort, you can use a
series of questionnaires spaced over time and designed to elicit understanding of changes in usage over those time periods.
Direct observation and interviews in simulated real usage situations
The aforementioned techniques of self-reporting, triggered reporting, and periodic questionnaires are ways of sampling long-term phenomenological usage activity. As another alternative, the analyst team can simulate real long-term usage within a series of direct observations and interviews. The idea is to meet with participant(s) periodically, each time setting up conditions to encourage episodes of phenomenological activity to occur during these observational periods. The primary techniques for data collection during these simulated real usage sessions are direct observation and interviews.
We described an example of using this technique in Chapter 15. Petersen, Madsen, and Kjaer (2002) conducted a longitudinal study of the use of a TV and video recorder by two families in their own homes. During the time of usage, analysts scheduled periodic interviews within which they posed numerous usage scenarios and had the participants do their best to enact the usage, while giving their feedback, especially about emotional impact. The idea is to set up conditions so you can capture the essence of real usage and reflect real usage in a tractable time-frame.
12.6 VARIATIONS IN FORMATIVE EVALUATION RESULTS
Before we conclude this chapter and move on to rapid and rigorous evaluation methods, we have to be sure that you do not entertain unrealistically high expectations for the reliability of formative evaluation results. The reality of
formative evaluation is that, if you repeat an evaluation of a design, prototype, or
system applying the same method but different evaluators, different
participants, different tasks, or different conditions, you will get different
results. Even if you use the same tasks, or the same evaluators, or the same participants, you will get different results. And you certainly get even more variation in results if you apply different methods. It is just not a repeatable process. When the variation is due to using different evaluators, it is called the "evaluator effect" (Hertzum & Jacobsen, 2003; Vermeeren, van Kesteren, & Bekker, 2003).
Reasons given by Hertzum and Jacobsen (2003) for the wide variation in results of "discount" and other inspection methods include:
- vague goals (varying evaluation focus)
- vague evaluation procedures (the methods do not pin down the procedures, so each application is a variation and an adaptation)
- vague problem criteria (it is not clear how to decide when an issue represents a real problem)
The most important reason for this effect is individual differences among people. Different people see usage and problems differently. Different people have different detection rates. They naturally see different UX problems in the same design. Also, in most of the methods, issues found are not questioned for validity. This results in numerous false positives, and there is no approach for scrutinizing and weeding them out. Further, because of the vagueness of the methods, intra-evaluator variability can contribute as much as inter-evaluator variability. The same person can get different results in two successive evaluations of the same system.
As said earlier, much of this explanation of limited effectiveness applies equally well to lab-based testing, too. That is because many of the phenomena and principles are the same and the working concepts are not that different.
In our humble opinion, the biggest reason for the limitations of our current methods is that the problem of evaluating UX in large system designs is very difficult. The challenge is enormous: picking away at a massive application such as MS Word or a huge Website with our Lilliputian UX tweezers. And this is true regardless of the UX method, including lab-based testing. Of course, for these massive and complex systems, everything else is also more difficult and more costly.
How can you ever hope to find your way through it all, let alone do a thorough job of UX evaluation? There are just so many issues and difficulties, so many places for UX problems to hide. It brings to mind the image of a person with a metal detector, searching over a large beach. There is no chance of finding all the detectable items, not even close, but often the person does find many things of value.
No one has the resources to look everywhere and test every possible feature on every possible screen or Web page in every possible task. You are just not going to find all the UX problems in all those places. One evaluator might find a problem in a place that other evaluators did not even look. Why are we surprised that each evaluator does not come up with the same comprehensive problem list? It would take a miracle.
13 Rapid Evaluation Methods
Objectives
After reading this chapter, you will:
1. Understand design walkthroughs, demonstrations, and reviews as early rapid evaluation methods
2. Understand and apply inspection techniques for user experience, such as heuristic evaluation
3. Understand and apply rapid lab-based UX evaluation methods, such as RITE and quasi-empirical evaluation
4. Know how to use questionnaires as a rapid UX evaluation method
5. Appreciate the trade-offs involved with "discount" formative UX evaluation methods
13.1 INTRODUCTION
13.1.1 You Are Here
We begin each process chapter with a "you are here" picture of the chapter topic in the context of the overall Wheel lifecycle template; see Figure 13-1. This chapter, about rapid UX evaluation methods, is a very important side excursion along the way to the rest of the fully rigorous evaluation chapters.
Some projects, especially large domain-complex system projects, require the rigorous lab-based UX evaluation process (Chapters 14 through 17). However, many smaller fast-track projects, including those for developing commercial products, often demand techniques that are faster and less costly than lab-based evaluation in the hope of achieving much of the effectiveness but at a lower cost. We call these techniques "rapid" because they are about being fast, which means saving cost.
Here are some of the general characteristics of rapid evaluation methods:
- Rapid evaluation techniques are aimed almost exclusively at finding qualitative data, that is, UX problems that are cost-effective to fix.
- Seldom, if ever, is attention given to quantitative measurements.
Figure 13-1
You are here, the chapter on rapid evaluation, within the evaluation activity in the context of the overall Wheel lifecycle template.
- There is a heavy dependency on practical techniques, such as the "think-aloud" technique.
- Everything is less formal, with less protocol and fewer rules.
- There is much more variability in the process, with almost every evaluation "session" being different, tailored to the prevailing conditions.
- This freedom to adapt to conditions creates more room for spontaneous ingenuity, something experienced practitioners do best.
In early stages of a project you will have only your conceptual design, scenarios, storyboards, and maybe some screen sketches or wireframes, usually not enough for interacting with customers or users. Still, you can use an informal rapid evaluation method to get your design on track. You can use interaction design demonstrations, focus groups, or walkthroughs where you do the driving.
Beyond these early approaches, when you have an interactive prototype (either a low-fidelity paper prototype or a medium-fidelity or high-fidelity prototype), most rapid evaluation techniques are abridged variations of what have generally been known as inspection techniques or of the lab-based approach. If you employ participants, you engage in a give and take of questions and answers, comments, and feedback. In addition, you can be proactive with prescripted interview questions, which you can ask in a kind of structured think-aloud data-gathering technique during or after a walkthrough.
Very few practitioners or teams today use any one "pure" rapid evaluation method; they adapt and combine to suit their own processes, schedules, and resource limitations. We highlight some of the most popular techniques, the suitability of which for your project depends on your design and evaluation context. Inspection is probably the primary rapid evaluation technique, but quasi-empirical methods, abridged versions of lab-based evaluation, are also very popular and effective.
13.2 DESIGN WALKTHROUGHS AND REVIEWS
A design walkthrough is an easy and quick evaluation method that can be used with almost any stage of progress but which is especially effective for early interaction design evaluation before a prototype exists (Bias, 1991). Memmel, Gundelsweiler, and Reiterer (2007, Table 8) declare that user and expert reviews are less time-consuming and more cost-effective than participant-based testing and that their flexibility and scalability mean the effort can be adjusted to match the needs of the situation. Even early lab-based tests can include walkthroughs (Bias, 1991). Sometimes the term is used to refer to a more comprehensive team evaluation, more like a team-based UX inspection.
Who is involved? Design walkthroughs usually entail a group working collaboratively under the guidance of a leader. The group can include the design team, UX analysts, subject-matter experts, customer representatives, and potential users.
The goal of a design walkthrough is to explore a design on behalf of users to
simulate the user's view of moving through the design, but to see it with an
expert's eye. The team is trying to anticipate problems that users might have if they were the ones using the design.
What materials do you need upfront? You should prepare for a design walkthrough by gathering at least these items:
- Design representation(s), including storyboards, screen sketches, illustrated scenarios (scenario text interspersed with storyboard frames and/or screen sketches), paper prototypes, and/or higher fidelity prototypes
- Descriptions of relevant users, work roles, and user classes
- Usage or design scenarios to drive the walkthrough
Here is how it works. It is usually more realistic and engaging to explore the design through the lens of usage or design scenarios. The leader walks the group through key workflow patterns that the system is intended to support. A characteristic that distinguishes design walkthroughs from various kinds of user-based testing is that the practitioner in charge does the "driving" instead of the customer or users.
As the team follows the scenarios, looking systematically at parts of the design and discussing the merits and potential problems, the leader tells stories about users and usage, user intentions and actions, and expected outcomes. The leader explains what the user will be doing, what the user might be thinking, and how the task fits in the work practice, workflow, and context. As potential UX problems arise, someone records them on a list for further consideration.
Walkthroughs may also include considerations of compliance with design guidelines and style guides as well as questions about emotional impact, including aesthetics and fun. Beyond just the details of UX and other design problems that might emerge, it is a good way to communicate about the design and keep on the same page within the project.
13.3 UX INSPECTION
When we use the term "UX inspection," we are aware that you cannot inspect UX but must inspect a design for user experience issues. However, because it is awkward to spell it out that way every time, we use "UX inspection" as a shorthand for the longer phrase. As an analogy, if you hire someone to do a safety inspection of your house, you want them to "inspect the house for safety issues" just as we want a UX inspection to be an inspection of the design for user experience issues. This is consistent with our explanation of how we use the term "UX" in a broader denotation than that of the term "user experience."
13.3.1 What Is UX Inspection?
A UX inspection is an "analytical" evaluation method in that it involves evaluating by looking at and trying out the design yourself as a UX expert instead of having participants exercise it while you observe. Here we generalize the original concept of usability inspection to include inspection of both usability characteristics and emotional impact factors and we call it UX inspection.
The evaluator is both participant surrogate and observer. Inspectors ask themselves questions about what would cause users problems. So, the essence of these methods is the inspector giving an expert opinion predicting UX problems.
Because the process depends on the evaluator's judgment, it requires an expert, a UX practitioner or consultant, which is why this kind of evaluation method is also sometimes called "expert evaluation" or "expert inspection." These evaluation methods are also sometimes called "heuristic evaluation (HE)" but that term technically applies only to one particular version, "the heuristic evaluation method" (Nielsen, 1994b), in which "heuristics" or generalized design guidelines are used to drive an inspection (see later).
13.3.2 Inspection Is a Valuable Tool in the UX Toolbox
Not all human-computer interaction (HCI) literature is supportive of inspection as an evaluation tool, but practitioners in the field have been using it for years with great success. In our own practice, we definitely find value in inspection methods and highly recommend their use in cases (for example):
- Where they are applied in early stages and early design iterations. It is an excellent way to begin UX evaluation and pick the low-hanging fruit and clear out the mass of obvious problems.
- Where you should save the more expensive and more powerful tools, such as lab-based testing, for later to dig out the more subtle and difficult problems. Starting with lab-based testing on an immature and quickly evolving design can be like using a precision shovel on a large snow drift.
- Where you have not yet done any other kind of evaluation. It is especially appropriate when you are brought in to evaluate an existing system that has not undergone previous UX evaluation and iterative redesign.
- Where you cannot afford or cannot do lab-based testing for some reason but still want to do some evaluation. UX inspection can still do a good job for you when you do not have the time or other resources to do a more thorough job.
13.3.3 How Many Inspectors Are Needed?
In lab-based UX testing, you can improve evaluation effectiveness by adding more participants until you get diminishing returns. Similarly, in UX inspection, to improve effectiveness you can add more inspectors. But does it help? Yes, for inspections, a team approach is beneficial, maybe even necessary, because low individual detection rates preclude finding enough problems by one person.
Experience has shown that different experts find different problems. But this diversity of opinion is valuable because the union of problems found over a group of inspectors is much larger than the set of problems found by any individual. Most heuristic inspections are done by a team of two or more usability inspectors, typically two or three inspectors.
But what is the optimal number? It depends on conditions and a great deal on the system you are inspecting. Nielsen and Landauer (1993) found that, under some conditions, a small set of experts, in the range of 3 to 5, is optimal before diminishing returns. See the end of Chapter 14 for further discussion about the "3 to 5 users" rule and its limitations. As with almost any kind of evaluation, some is better than none and, for early project stages, we often are satisfied with a single inspection by one or two inspectors working together.
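The diminishing-returns argument behind the "3 to 5" range can be made concrete with the problem-discovery model from Nielsen and Landauer (1993), in which the number of problems found by i evaluators is N(1 - (1 - lambda)^i). The sketch below is our own illustration, not part of the method; the total problem count N and the per-evaluator detection rate lambda are assumed example values (a value around 0.31 is a commonly cited ballpark from that line of work).

```python
# Sketch (our illustration): the Nielsen and Landauer (1993) problem-discovery
# model, found(i) = N * (1 - (1 - lam)**i). N and lam are assumed example values.

N = 100      # assumed total number of findable UX problems
lam = 0.31   # assumed per-evaluator detection rate (commonly cited ballpark)

previous = 0.0
for i in range(1, 11):
    found = N * (1 - (1 - lam) ** i)
    print(f"{i} inspectors: ~{found:.0f} problems found (+{found - previous:.1f})")
    previous = found
```

Running the loop shows the marginal gain per added inspector shrinking quickly after the first few, which is the intuition behind stopping at a small team for early iterations.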
13.3.4 What Kind of Inspectors Are Needed?
Not surprisingly, Nielsen (1992a) found that UX experts (practitioners or consultants) make the best inspection evaluators. Sometimes it is best to get a fresh view by using an expert evaluator who is not on the project team. If those UX experts also have knowledge in the subject-matter domain of the interface being evaluated, all the better. Such people are called dual experts and can evaluate through both a design-guidelines perspective and a work activity, workflow, and task perspective. The equivalent of having a dual expert can be approximated by a team approach: pairing a UX expert with a work domain expert.
13.4 HEURISTIC EVALUATION, A UX INSPECTION METHOD
13.4.1 Introduction to Heuristic Evaluation
For most development projects in the 1990s, the "default usability person," the unqualified software developer pressed into usability service, was the rule. Few trained UX specialists actually worked in design projects. Now the default practitioner is slowly becoming the exception. As more people specifically prepared for the UX practitioner role have become available, the definition of "novice evaluator" has shifted from the default practitioner, who perhaps had an SE day job, to a trained practitioner who simply has less experience than an expert.
But in reality there still is, and will be for some time, a shortage of good UX practitioners, and the heuristic method is intended to help these novices perform acceptably good usability inspections. It has been described as a method that novices can grab onto and use without a great deal of training. The effectiveness of a rule-based method used by a novice, of course, cannot be expected to be on a par with that of a more sophisticated approach or of any approach used by an expert practitioner.
As Nielsen (1992a; Nielsen & Molich, 1990) states, the heuristic evaluation (HE) method has the advantages of being inexpensive, intuitive, and easy to motivate practitioners to do, and it is effective for use early in the UX process. Therefore, it is no surprise that of all the inspection methods, the HE method is the best known and the most popular. Another important point about the heuristics is that they teach the designers about criteria to keep in mind while doing their own designs so they will not violate these guidelines.
A word of caution, however: Although the HE method is popular and successful, there will always be some UX problems that show up in real live user-based interaction that you will not see in a heuristic, or any other, inspection or design review.
13.4.2 How-to-Do-It: Heuristic Evaluation
Heuristics
Following publication of the original heuristics, Nielsen (1994a) enhanced the heuristics with a study based on factor analysis of a large number of real usability problems. The resulting new heuristics (Nielsen, 1994b) are given in Table 13-1.
Table 13-1
Nielsen's refined heuristics, quoted with permission from www.useit.com

• Visibility of system status: The system should always keep users informed about what is going on through appropriate feedback within reasonable time.
• Match between system and the real world: The system should speak the users' language, with words, phrases, and concepts familiar to the user rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.
• User control and freedom: Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.
• Consistency and standards: Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.
• Error prevention: Even better than good error messages is a careful design that prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action.
• Recognition rather than recall: Minimize the user's memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.
• Flexibility and efficiency of use: Accelerators, unseen by the novice user, may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.
• Aesthetic and minimalist design: Dialogues should not contain information that is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.
• Help users recognize, diagnose, and recover from errors: Error messages should be expressed in plain language (no codes), indicate the problem precisely, and suggest a solution constructively.
• Help and documentation: Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large.
The procedure
Despite the large number of variations in practice, we endeavor to describe what roughly represents the "plain" or "standard" version. These inspection sessions can take from a couple of hours for small systems to several days for larger systems. Here is how to do it:
• The project team or manager selects a set of evaluators, typically three to five.
• The team selects a small, tractable set, about 10, of "heuristics": generalized and simplified design guidelines in the form of inspection questions, for example, "Does the interaction design use the natural language that is familiar to the target user?" The set of heuristics given in the previous section is a good start.
• Each inspector individually browses through each part of the interaction design, asking the heuristic questions about that part, and:
   • assesses the compliance of each part of the design
   • notes places where a heuristic is violated as candidate usability problems
   • notes places where heuristics are supported (things done well)
   • identifies the context of each instance noted previously, usually by capturing an image of the screen or part of the screen where the problem or good design feature occurs
• All the inspectors get together and, as a team:
   • merge their problem lists
   • select the most important ones to fix
   • brainstorm suggested solutions
   • decide on recommendations for the designers based on the most frequently visited screens, screens with the most usability problems, guidelines violated most often, and resources available to make changes
   • issue a group report
A heuristic evaluation report should:
• start with an overview of the system being evaluated
• give an overview explanation of the inspection process
• list the inspection questions based on the heuristics used
• report on potential usability problems revealed by the inspection, either:
   • by heuristic: for each heuristic, give examples of design violations and of ways the design supports the heuristic
   • by part of the design: for each part, give specific examples of heuristics violated and/or supported
• include as many illustrative screen images or other visual examples as possible.
The team then puts forward the recommendations they agreed on for design modifications, using language that will motivate others to want to make these changes. They highlight a realistic list of the "Top 3" (or 4 or 5) suggestions for modifications and prioritize suggestions to give the biggest improvement in usability for the least cost (perhaps using the cost-importance analysis of Chapter 16).
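For teams that track inspection results in a spreadsheet or a small script, the merge step above can be tabulated mechanically. The following is a minimal sketch under our own assumptions (inspector names, screens, and problem descriptions are made up); it merges per-inspector problem lists and counts how many inspectors flagged each screen-and-heuristic pair, one simple way to decide what to discuss first.

```python
from collections import defaultdict

# Minimal sketch of the "merge their problem lists" step: each inspector's notes
# are (screen, heuristic, description) tuples; we count how many inspectors
# flagged each (screen, heuristic) pair. All data below are made-up examples.

inspector_notes = {
    "alice": [("Checkout", "Consistency and standards", "Button placement differs between pages"),
              ("Search", "Visibility of system status", "No progress indicator on slow queries")],
    "bob":   [("Checkout", "Consistency and standards", "Add to Cart button moves around"),
              ("Login", "Help users recognize, diagnose, and recover from errors", "Error shows raw code E1043")],
}

merged = defaultdict(list)
for inspector, notes in inspector_notes.items():
    for screen, heuristic, description in notes:
        merged[(screen, heuristic)].append((inspector, description))

# Candidate problems reported by more inspectors float to the top of the agenda.
for (screen, heuristic), reports in sorted(merged.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"{screen} / {heuristic}: reported by {len(reports)} inspector(s)")
    for inspector, description in reports:
        print(f"   - {inspector}: {description}")
```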
Reporting

We have found it best to keep HE reporting simple. Long forms with lots of fields can capture more information but tend to be tedious for practitioners who have to report large numbers of problems. Table 13-2 is a simple HE reporting form that we have adapted, with permission, from one developed by Brad Myers.
Table 13-2
Simple HE reporting form, adapted from Brad Myers
Heuristic Evaluation Report
Dated: MM/DD/YYYY
Prepared By:
NAME:
SIGNATURE:
Evaluation Of:
Name of system being evaluated: XYZ Website
Other information about the system being evaluated:
Problem #: 1
Prototype screen, page, location of problem:
Name of heuristic: Consistency
Reason for reporting as negative or positive: Inconsistent placement of "Add to Cart" buttons: The "Add to Cart" button is below the item in CDW but above in CDW-G.
Scope of problem: Every product page
Severity of problem (high/medium/low): Low-minor, cosmetic problem
Justification for severity rating: Unlikely that users will have trouble with finding or recognizing the button
Suggestions to fix: Move the button on one of the sites to be in the same place as on the other site.
Possible trade-offs (why fix might not work): This may result in an inconsistency with something else, but unknown what that might be.
You can make up a Word document or spreadsheet form that puts these headings in columns as an efficient way to report multiple problems, but they do not fit that way in the format of our book.
Description of columns in Table 13-2 is as follows:
Prototype screen, page, location of problem: On which screen, and/or at which location on a screen of the user interface, was the critical incident or problem located?
Name of heuristic: Which of the 10 heuristics is being referenced? Enter the full name of the heuristic.
Reason for reporting as negative or positive: Explain reasons why the interface violates or upholds this heuristic. Be sure to be clear about where in the screen you are referencing.
Scope of problem: Describe the scope of the feedback or the problem; include whether the scope of the issue is throughout the product or within a specific screen or screens. If the problems are specific to a page, include the appropriate page numbers.
Severity of problem (high/medium/low): Your assessment as to whether the implication of the feedback is high, medium, or low severity.
Justification of severity rating: The reason why you gave it that rating.
Suggestions to fix: Suggestion for the modifications that might be made to the interaction design to address the issue or issues.
Possible trade-offs (why fix might not work): Mentioning trade-offs adds to your credibility.
Be specific and insightful; include subtlety and depth. Saying "the system does not have good color choices because it does not use color" is pretty trivial and is not helpful. Also, if you evaluated a prototype, saying that functions are not implemented is obvious and unhelpful.
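If you keep the report in the spreadsheet form suggested above, the Table 13-2 fields map naturally onto one record per problem. Here is a minimal sketch; the field names, file name, and example row are our own illustration, not part of the published form.

```python
import csv
from dataclasses import dataclass, asdict, fields

# Minimal sketch of a spreadsheet-style HE report: one row per problem, with
# columns mirroring Table 13-2. All example values below are made up.

@dataclass
class HEReportEntry:
    problem_number: int
    location: str        # prototype screen, page, location of problem
    heuristic: str       # full name of the heuristic referenced
    reason: str          # why the design violates or upholds the heuristic
    scope: str           # specific screen(s) or product-wide
    severity: str        # high / medium / low
    justification: str   # reason for that severity rating
    suggested_fix: str
    trade_offs: str      # why the fix might not work

entries = [
    HEReportEntry(1, "Product page", "Consistency and standards",
                  "Add to Cart button placed differently on sibling sites",
                  "Every product page", "low",
                  "Users are unlikely to miss the button",
                  "Align button placement across sites",
                  "May conflict with another layout constraint"),
]

with open("he_report.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(HEReportEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
```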
Variations abound
The one "constant" about the HE method and most other related rapid and inspection methods is the variation with which they are used in practice. These methods are adapted and customized by almost every team that ever uses them usually in undocumented and unpublished ways.
Task-based or heuristic-based expert UX inspections can be conducted with just one evaluator or with two or more evaluators, each acting independently or all working together. Other expert UX inspections can be scenario based, persona based, checklist based, or as a kind of "can you break it?" test.
As an example of a variation that was described in the literature, participatory heuristic evaluation extends the HE method with additional heuristics to address broader issues of task and workflow, beyond just the design of user
interface artifacts to "consider how the system can contribute to human goals and human experience" (Muller et al., 1998, p. 16). The definitive difference in participatory HE is the addition of users, work domain experts, to the inspection team.
Sears (1997) extended the HE method with what he calls heuristic walkthroughs. Several lists are prepared and given to each practitioner doing the inspection: user tasks, inspection heuristics, and "thought-focusing questions." Each inspector performs two inspections, one using the tasks as a guide and supported by the thought-focusing questions. The second inspection is the more traditional kind, using the heuristics. Their studies showed that "heuristic walkthroughs resulted in finding more problems than cognitive walkthroughs and fewer false positives than heuristic evaluations."
Perspective-based usability inspection (Zhang, Basili, & Shneiderman, 1999) is another published variation on the HE method. Because a large system can present a scope too broad for any given inspection session, the perspective-based approach allows inspectors to focus on a subset of usability issues in each inspection. The resulting focus of attention afforded a higher problem detection rate within that narrower perspective.
Examples of perspectives that can be used to guide usability inspections are novice use, expert use, and error handling. In their study, Zhang et al. (1999) found that their perspective-based approach did lead to significant improvement in detection of usability problems in a Web-based application. Persona-based UX inspection is a variation on the perspective-based inspection in that it includes consideration of context of use via the needs of personas (Wilson, 2011).
As our final example, Cockton et al. (2003) developed an extended problem-reporting format that improves heuristic inspection methods by finding and eliminating many of the false positives typical of the usability inspection approach. Traditional heuristic methods poorly support problem discovery and analysis. Their Discovery and Analysis Resource (DARe) model allows analysts to bring distinct discovery and analysis resources to bear to isolate and analyze false negatives as well as false positives.
Limitations
While heuristics can be a helpful guide for inexperienced practitioners, we find that they usually get in the way of experts. To be fair, the heuristic method was intended as a kind of "scaffolding" to help novice practitioners do usability inspections, so it should not really be compared with expert usability inspection methods anyway.
It was perhaps self-confirming when we read that others found the actual heuristics to be similarly unhelpful (Cockton, Lavery, & Woolrych, 2003; Cockton & Woolrych, 2001). In their studies, Cockton et al. (2003) found that it is experts who find problems with inspection, not experts using heuristics.
Cockton and Woolrych (2002, p. 15) also claim that the "inspection methods do not encourage analysts to take a rich or comprehensive view of interaction." While this may be true for heuristic methods, it does not have to be true for all inspection methods.
A major drawback with any inspection method, including the HE method, is the danger that novice practitioners will get too comfortable with it and
think the heuristics are enough for any evaluation situation. There are few indications in its usage that let the novice practitioner know when it is not working well and when a different method should be tried.
Also, like all UX inspection methods, the HE method can generate a lot of false positives, situations in which inspectors identify "problems" that turn out not to be real problems or not to be very important UX problems. Finally, like most other rapid UX evaluation methods, the HE method is not particularly effective in finding usability problems below the surface, such as problems about sequencing and workflow.
13.5 OUR PRACTICAL APPROACH TO UX INSPECTION
We have synthesized existing UX inspection methods into a relatively simple and straightforward method that, unlike the heuristic method, is definitely for
UX experts and not for novices. Sometimes we have novices sit in and observe the process as a kind of apprentice training, but they do not perform these inspections on their own.
13.5.1 The Knock on Your Door
It is the boss. You, the practitioner, are being called in and asked to do a quick UX assessment of a prototype, an early product, or an existing product being considered for revision. You have 1 or 2 days to check it out and give feedback. You feel that if you can give some valuable feedback on UX flaws, you will gain some credibility and maybe get a bigger role in the project next time.
What method should you use? No time to go to the lab, and even the "standard" inspection techniques will take too long, with too much overhead. What you need is a practical, fast, and efficient approach to UX inspection. As a solution, we offer an approach that evolved over time in our own practice. You
can apply this approach at almost any stage of progress, but it usually works better in the early stages. We believe that most real-world UX inspections are more like our approach than like the somewhat more elaborate approaches to inspection described in the literature.
13.5.2 Driven by Experience, Not Heuristics or Guidelines
We should say upfront that we do not explicitly use design guidelines or even "heuristics" to drive or guide this kind of UX inspection. In our own industry and consulting experience, we have just not found specific heuristics as useful as we would like.
To be clear, we are saying that we do not employ design guidelines to drive the inspection process. The driving perspective is usage. We focus on tasks, work activities, and work context. We do insist, however, that a working, practical, expert knowledge of design guidelines is essential to support the rapid analysis used to decide what issues are real problems and to understand the underlying nature of the problems and potential solutions. For this analysis, intertwined with inspection activities, we depend on our knowledge of design guidelines and their interpretation within the design.
We like a usage-based approach because it allows the practitioner to take on the role of user better, taking the process from purely analytic to include at least a little "empirical" flavor. Using this approach, and our UX intuition honed over the years, we can see, and even anticipate, UX problems, many of which might not have been revealed under the heuristic spotlight.
13.5.3 Use a Co-Discovery or Team Approach in UX Inspection

Expert UX practitioners as inspectors are in the role of "UX detectives." To aid the detective work, it can help to use two practitioners, working together as mutual sounding boards in a give-and-take interplay, potentiating each other's efforts to keep the juices flowing, to promote a constant flow of think-aloud comments from the inspectors, and to maintain a barrage of problem notes flying.
It is also often useful to have a non-UX person with you to look at the design from a global point of view. Teaming up with customers, users, designers, and other people familiar with the overall system can help make up for any lack of system knowledge you may have, especially if you have not been with the team during the entire project. Teaming up with users or work domain experts (which you might already have on your team) can reinforce your user-surrogate role and bring in more work-domain expertise (Muller et al., 1998).
13.5.4 Explore Systematically with a Rich and Comprehensive Usage-Oriented View
As an inspector, you should not just look for individual little problems associated with individual tasks or functions. Use all your experience and knowledge to see the big picture. Keep an expert eye on the high-level view of workflow,
the overall integration of functionality, and emotional impact factors that go
beyond usability.
For example, how are the three design perspectives covered in Chapter 8 accounted for by the system? Does the system ecology make sense? Is the conceptual design for interaction appropriate for envisioned workflows? What about the conceptual design for emotional impact?
Representative user tasks help us put ourselves in the users' shoes. By exploring the tasks ourselves and taking our own think-aloud data, we can imagine what real users might encounter in their usage. This aspect of our inspections is driven as systematically as possible by two things: the task structure and the interaction design itself. A hierarchical task inventory (Chapter 6) is helpful in attaining a good understanding of the task structure and to ensure broad coverage of the range of tasks.
If the system is highly specialized and complex and you are not a work domain expert, you might not be able to comprehend it in a short time, so get help from a subject-matter expert. Usage scenarios and design scenarios (Chapter 6) are fruitful places to look to focus on key user work roles and key user tasks that must be supported in the design.
Driving the inspection with the interaction design itself means trying all possible actions on all the user interface artifacts, trying out all user interface objects such as buttons, icons, and menus. It also means being opportunistic in following leads and hunches triggered by parts of the design.
The time and effort required for a good inspection are more or less proportional to the size of the system (i.e., the number of user tasks, choices, and system functions). System complexity can have an even bigger impact on inspection time and effort.
The main skill you need for finding UX problems as you inspect the design is your detective's "eagle eye" for curious or suspicious incidents or phenomena. The knowledge requirement centers on design guidelines and principles and your mental inventory of typical interaction design flaws you have seen before. You really have to know the design guidelines cold, and the storehouse of problem examples helps you anticipate and rapidly spot new occurrences of the same types of problems.
Soon you will find the inspection process blossoming into a fast-moving narration of critical incidents, UX problems, and guidelines. By following various threads of UX clues, you can even uncover problems that you do not encounter directly within the tasks.
13.5.5 Emotional Impact Inspection
In the past, inspections for evaluating interaction designs have been almost exclusively usability inspections. But this kind of evaluation can easily be extended to a more complete UX inspection by addressing issues of emotional impact, too. The process is essentially the same, but you need to look beyond a task view to the overall usage experience. Ask additional questions.
Among the emotional impact questions to have in mind in a UX inspection are:
• Is usage fun?
• Is the visual design attractive (e.g., colors, shapes, layout) and creative?
• Will the design delight the user visually, aurally, or tactilely?
• If the target is a product:
   • Are the packaging and product presentation aesthetic?
   • Is the out-of-the-box experience exciting?
   • Does the product feel robust and good to hold?
   • Can the product add to the user's self-esteem?
   • Does the product embody environmental and sustainable practices?
   • Does the product convey the branding of the organization?
   • Does the brand stand for progressive, social, and civic values?
• Are there opportunities to improve emotional impact in any of the aforementioned areas?
Most of the questions in a questionnaire for assessing emotional impact are also applicable as inspection questions here. As an example, using attributes from AttrakDiff:
• Is the system or product interesting?
• Is it exciting?
• Is it innovative?
• Is it engaging?
• Is it motivating?
• Is it desirable?
13.5.6 Use All Your Personalities
Roses are red;
Violets are blue.
I'm schizophrenic, ...
And I am, too.
You need to be at least a dual personality, with a slightly schizophrenic melding of the UX expert perspective and a user orientation. As a surrogate for users, you must think like a user and act like a user. But you must simultaneously think and act like an expert, observing and analyzing yourself in the user role.
Your UX expert credentials have never been in doubt, but demands of the user surrogate role can take you outside your comfort zone. You have to shed the designer and analyst mind-sets in favor of design-informing views of the world. You must immerse yourself in the usage-oriented perspective; become the user and live usage!
If you have doubts about your ability to play the user-surrogate role, as we said in a previous section, you should recruit an experienced user (hopefully one who is familiar with UX principles) or user representative to sit with you and help direct the interaction, informing the process with their work domain, user goal, and task knowledge.
13.5.7 Take Good Notes
As you do your inspection and play user, good note taking is essential for capturing precious critical incidents and UX problem data in the moment that they occur. Just as prompt capture of critical incidents is essential in
lab-based testing to capture the perishable details while they are still fresh, you need to do the same during your inspections. You cannot rely on your memory and write it all down at the end. Once you get going, things can happen fast, just as they do in a lab-based evaluation session.
We often take our notes orally, dictating them on a handheld digital recorder. Because we can talk much faster than we can write or type, we can record our thoughts with minimal interruption of the flow or of our intense cognitive focus. Try to include as much analysis and diagnosis as you can, stating causes in the design in terms of design guidelines violated. As with most skill-based activities, you get better with practice.
You may not be able to suggest immediate solutions for more complex problems (e.g., reorganizing workflow) that require significant thought and
discussion. However, you can usually suggest a cause and a fix for most problems. Given enough detail in the problem description, the solutions are often self- suggesting. For example, if a label has low color contrast between the text and the background, the solution is to increase the color contrast.
Dumas and Redish (1999) suggest that you should be more specific in suggesting this kind of solution, including what particular colors to use. It is a good idea to capture these design solution ideas, but treat them only as points of departure. Those decisions still need to be thought out carefully by someone with the requisite training in the use of colors and with knowledge of organizational style standards concerning color, branding, and so on. If you give an example of some colors that might work, you need to ensure that the designers do not take those colors as the exact solution without thinking about it further.
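One way to make a color-contrast recommendation concrete without dictating exact colors is to check candidate pairs against the WCAG 2.x contrast-ratio formula; the WCAG check is our suggestion here, not something the authors prescribe. A minimal sketch:

```python
# Sketch (our addition): checking a suggested label/background color pair with
# the WCAG 2.x contrast-ratio formula, one way to make a vague "increase the
# color contrast" recommendation concrete. Example colors are made up.

def relative_luminance(rgb):
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Light gray text on white falls short of the WCAG AA 4.5:1 target for body text;
# dark gray text clears it comfortably.
print(round(contrast_ratio((170, 170, 170), (255, 255, 255)), 2))  # ~2.3
print(round(contrast_ratio((85, 85, 85), (255, 255, 255)), 2))     # ~7.5
```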
13.5.8 Analyze Your Notes
Sort out your inspection notes and organize them by problem type or design feature. If necessary, you can use a fast affinity diagram approach (see Chapter 4) on the top of a large work table. Print all notes on small pieces of paper and organize them by topic. Prioritize your recommendations for fixing, maybe with cost-importance analysis (Chapter 16).
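The full cost-importance analysis is covered in Chapter 16; as a rough stand-in for the prioritization step, the sketch below ranks findings by a simple importance-to-cost ratio. The ratio and all the numbers are our own simplification, not the book's procedure.

```python
# Rough sketch for prioritizing inspection findings: rank by importance relative
# to estimated fixing cost. This simple ratio is a stand-in for the fuller
# cost-importance analysis of Chapter 16. All values are made-up examples.

problems = [
    {"note": "Low-contrast field labels",     "importance": 3, "cost_hours": 1},
    {"note": "Confusing checkout workflow",   "importance": 9, "cost_hours": 40},
    {"note": "Inconsistent button placement", "importance": 4, "cost_hours": 2},
]

for p in sorted(problems, key=lambda p: p["importance"] / p["cost_hours"], reverse=True):
    ratio = p["importance"] / p["cost_hours"]
    print(f'{p["note"]}: importance {p["importance"]}, ~{p["cost_hours"]}h to fix, ratio {ratio:.2f}')
```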
13.5.9 Report Your Results
Your inspection report (Chapter 17) will be a lot like the one we described for the heuristic method earlier in this chapter, only you will not refer to heuristics. Tell about how you did the process in enough detail for your audience to understand the evaluation context.
Sometimes UX inspection, like any evaluation method, raises questions. In your report, you should include recommendations for further evaluation, with specific points to look for and specific questions to answer.
13.6 DO UX EVALUATION RITE
13.6.1 Introduction to the Rapid Iterative Testing and Evaluation (RITE) UX Evaluation Method
There are many variations of rapid UX evaluation techniques. Most are some variation of inspection methods, but one in particular that stands out is not based on inspection: the approach that Medlock, Wixon, and colleagues (Medlock et al., 2002, 2005; Wixon, 2003) call RITE, for "rapid iterative testing and evaluation," is an empirical rapid evaluation method and is one of the best.
RITE employs user-based UX testing in a fast, collaborative (team members and participants) test-and-fix cycle designed to pick the low-hanging fruit at relatively low cost. In other methods, the rest of the team is usually not present to see the process, so problems found by UX evaluators using those methods are sometimes not believed. This is solved by the collaborative evaluation process in RITE; the whole team is involved in arriving at the results.
The key feature of RITE is the fast turnaround. UX problems are analyzed right after the product is evaluated with a number of user participants and the whole project team decides on which changes to make. Changes are then implemented immediately. If warranted, another iteration of testing and fixing might ensue.
Because changes are included in all testing that occurs after that point, further testing can determine the effectiveness of the changes-whether the problem is, in fact, fixed and whether the fix introduces any new problems. Fixing a problem immediately also gives access to any aspects of the product that could not be tested earlier because they were blocked by that problem.
In his inimitable Wixonian wisdom, our friend Dennis reminds us that, "In practice, the goal is to produce, in the quickest time, a successful product that meets specifications with the fewest resources, while minimizing risk" (Wixon, 2003).
13.6.2 How-to-Do-It: The RITE UX Evaluation Method

This description of the RITE UX evaluation method is based mainly on Medlock et al. (2002).
The project team starts by selecting a UX practitioner, whom we call the facilitator, to direct the testing session. The UX facilitator and the team prepare by:
• identifying the characteristics needed in participants
• deciding on which tasks they will have the participants perform
• agreeing on critical tasks, the set of tasks that every user must be able to perform
• constructing a test script based on those tasks
• deciding how to collect qualitative user behavior data
• recruiting participants (Chapter 14) and scheduling them to come into the lab
The UX facilitator and the team conduct the evaluation session for one to three participants, one at a time:
• gathering the entire project team and any other relevant project stakeholders, either in the observation room of a UX lab or around a table in a conference room
• bringing in the participant playing the role of user
• introducing everyone and setting the stage, explaining the process and expected outcomes
• making sure that everyone knows the participant is helping evaluate the system and the team is not in any way evaluating the participant
• having the participant perform a small number of selected tasks, while all project stakeholders observe silently
• having the participants think aloud as they work
• working together with the participants to find UX problems and ways the design should be improved
• taking thorough notes on problem indicators, such as task blocking and user errors
• focusing session notes on finding usability problems and noting their severity
The UX facilitator and other UX practitioners:
• identify from session notes the UX problems observed and their causes in the design
• give everyone on the team the list of UX problems and causes
The UX practitioner and the team address problems:
• identifying problems with obvious causes and obvious solutions, such as those involving wording or labeling, to be fixed first
• determining which other problems can also reasonably be fixed
• determining which problems need more discussion
• determining which problems require more data (from more participants) to be sure they are real problems
• sorting out which problems they cannot afford to fix right now
• deciding on feasible solutions for the problems to be addressed
• implementing fixes for problems with obvious causes and obvious solutions
• starting to implement other fixes and bringing them into the current prototype as soon as feasible
The UX practitioner and the team immediately conduct follow-up evaluation:
• bringing in new participants
• having them perform the tasks associated with the fixed problems, using the modified design
• working with the participants to see if the fixes worked and to be sure the fixes did not introduce any new UX problems
The entire process just described is repeated until you run out of resources or the team decides it is done (all major problems found and addressed).
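For readers who think in code, the RITE cycle just described can be summarized as a control-flow sketch. The helper functions (recruit_participant, run_session, classify, fix) are placeholders for team activities, not a real API, and the iteration limits are assumed example values.

```python
# Control-flow sketch (our illustration) of the RITE cycle described above.
# The callables passed in stand for team activities, not real library functions.

def rite_cycle(prototype, recruit_participant, run_session, classify, fix,
               max_iterations=10, participants_per_round=3):
    for _ in range(max_iterations):
        problems = []
        for _ in range(participants_per_round):        # one to three participants per round
            participant = recruit_participant()
            problems += run_session(prototype, participant)   # observed UX problems

        # Team triage: obvious fixes first, others discussed or deferred for more data.
        obvious, needs_discussion, needs_more_data = classify(problems)

        for problem in obvious:                         # e.g., wording or labeling issues
            prototype = fix(prototype, problem)         # fixed before the next round

        if not (obvious or needs_discussion or needs_more_data):
            break                                       # team decides it is done
    return prototype
```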
13.6.3 Variations in RITE Data Collection
Although RITE is unusual as a rapid evaluation method that employs UX testing with user participants, what really distinguishes RITE is the fast turnaround and tight coupling of testing and fixing. As a result, it is possible to consider alternative data collection techniques within the RITE method. For example, instead of testing with user participants, the team could employ a UX inspection method, heuristic evaluation or otherwise, for data collection while retaining the fast analysis and fixing parts of the cycle.
13.7 QUASI-EMPIRICAL UX EVALUATION
13.7.1 Introduction to Quasi-Empirical UX Evaluation

Quasi-empirical UX evaluation methods are empirical because they involve taking some kind of data using volunteer participants. Beyond that, their similarities to other empirical methods fade rapidly. Most
empirical methods are characterized by formal protocols and procedures; rapid methods are anything but formal or protocol bound. Thus, the qualifier "quasi."
Most empirical methods have at least some focus on quantitative data; quasi-empirical approaches have none. The single paramount mission is to gather qualitative data that identify UX problems that can be fixed efficiently.
Although formal empirical evaluations often take place in a UX lab or similar setting, quasi-empirical testing can occur almost anywhere, including UX lab space, a conference room, an office, a cafeteria, or the field. As with other rapid evaluation methods, practitioners using quasi-empirical techniques thrive on going with what works. While most empirical methods require controlled conditions for user performance, here it is not only acceptable but recommended to interrupt and intervene at opportune moments to elicit more thinking aloud and to ask for explanations and specifics.
Quasi-empirical methods are defined by the freedom given to the practitioner to innovate, to make it up as they go. Quasi-empirical evaluation sessions mean being flexible about goals and approaches. When conducted by the best practitioners, quasi-empirical evaluation is punctuated with impromptu changes of pace, changes of direction, and changes of focus-jumping on issues as they arise and milking them to get the most information about problems, their effects on users, and potential solutions.
This innovation in real time is where experience counts. Because of the ingenuity required and the need to adapt to each situation, experienced practitioners are usually more effective at quasi-empirical techniques, as they are
with all rapid evaluation techniques. Each quasi-empirical session is different and can be tailored to the project conditions. Each session participant is different-some are more knowledgeable whereas some are more helpful. You must find ways to improvise, go with the flow, and learn the most you can about the UX problems.
Unlike other empirical methods, there are no formal predefined "benchmark tasks," but a session can be task driven, drawing on usage scenarios, essential use cases, step-by-step task interaction models, or other task data or task models you collected and built up in contextual inquiry and analysis and modeling. Quasi-empirical sessions can also be driven by exploration of features, screens, widgets, or whatever suits.
13.7.2 How-to-Do-It: Quasi-Empirical UX Evaluation
Prepare
Begin by ensuring that you have a set of representative, frequently used, and mission-critical tasks for your participants to explore. Draw on your contextual data and task models (structure models and interaction models). Have some exploratory questions ready (see next section).
Assign your UX evaluation team roles effectively, including evaluator, facilitator, and data collectors. If necessary, use two evaluators for co-discovery. Further prepare for your quasi-empirical session the same way you would for a full empirical session, only less formally and less thoroughly, to match the more rapid and more opportunistic nature of the quasi-empirical approach.
Thus preparation includes lightweight selection and recruiting of participants, preparation of materials such as the informed consent form,
and establishment of protocols and procedures for the sessions. You should also do pilot testing to shake down the prototype and the procedures, but getting the prototype bug-free is a little less important for quasi-empirical evaluation, as you can be very flexible during the session.
Conduct session and collect data
As you, the facilitator, sit with each participant:
• Cultivate a partnership; you get the best results from working closely in collaboration.
• Make extensive use of the think-aloud data collection technique. Encourage the participant by prompting occasionally: "Remember to tell us what you are thinking as you go."
• Make sure that the participant understands the role as that of helping you evaluate the UX.
• Although recording audio or video is sometimes helpful in rigorous evaluation methods, to retain rapidness in this method it is best not to record audio or video; just take notes. Keep it simple and lightweight.
• Encourage the participant to explore the system for a few minutes to get familiar with it. This type of free play is important because it is representative of what happens when users first interact with a system (except in cases where walk-up-and-use is an issue).
• Use some of the tasks that you have at hand, from the preparation step given earlier, more or less as props to support the action and the conversation. You are not interested in user performance times or other quantitative data.
• Work together with the participant to find UX problems and ways the design should be improved. Take thorough notes; they are the sole raw data from the process.
• Let the user choose some tasks to do.
• Be ready to follow threads that arise rather than just following prescripted activities.
• Listen as much as you can to the participant; most of the time it is your job to listen, not talk.
• It is also your job to lead the session, which means saying the right thing at the right time to keep it on track and to switch tracks when useful.
At any time during the session, you can interact with the participant with questions such as:
• Ask participants to describe their initial reactions as they interact with the system.
• Ask questions such as "How would you describe this system to someone who has never seen it before?", "What is the underlying 'model' for this system?", "Is that model appropriate?", "Where does it deviate?", and "Does it meet your expectations? Why and how?" These questions get to the root of determining the user's mental model of the system.
• Ask what parts of the design are not clear and why.
• Inquire about how the system compares with others they have used in the past.
• Ask if they have any suggestions for changing the designs.
• To place them in the context of their own work, ask them how they would use this system in their daily work. In other words, ask them to walk you through some tasks they would perform using this system in a typical workday.
Analyze and report results
Because the UX data analysis procedure (Chapter 16) applies pretty much regardless of how you got the data, use the parts of that chapter about analyzing qualitative data.
13.8 QUESTIONNAIRES
A questionnaire, discussed at length in Chapter 12, is a fast and easy way to collect subjective UX data, either as a supplement to any other rapid UX evaluation method or as a method on its own.
Questionnaires with good track records, such as the Questionnaire for User Interface Satisfaction (QUIS), the System Usability Scale (SUS), or Usefulness, Satisfaction, and Ease of Use (USE), are all easy and inexpensive to use and can yield varying degrees of UX data. Perhaps the AttrakDiff questionnaire might be the best choice for a rapid stand-alone method, as it is designed to address both pragmatic (usability and usefulness) and hedonic (emotional impact) issues.
For a general discussion of modifying questionnaires for your particular evaluation session, see Chapter 12 about modifying the AttrakDiff questionnaire.
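As an example of how lightweight questionnaire scoring can be, here is a sketch of the standard SUS scoring rule: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a 0-100 score. The example responses are made up.

```python
# Minimal sketch of standard SUS (System Usability Scale) scoring.
# The response data below are made-up example values.

def sus_score(responses):
    """responses: ten ratings, each 1 (strongly disagree) to 5 (strongly agree)."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # 0-based even index = odd-numbered item
                for i, r in enumerate(responses))
    return total * 2.5

participant_responses = [4, 2, 5, 1, 4, 2, 5, 2, 4, 1]
print(f"SUS score: {sus_score(participant_responses)}")   # 85.0 for this example
```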
13.9 SPECIALIZED RAPID UX EVALUATION METHODS
13.9.1 Alpha and Beta Testing and Field Surveys
Alpha and beta testing are useful post-deployment evaluation methods. After almost all development is complete, manufacturers of software applications sometimes send out alpha and beta (pre-release) versions of the application software to select users, experts, customers, and professional reviewers as a preview. In exchange for the early preview, users try it out and give feedback on the experience. Often little or no guidance is given for the review process beyond just "tell us what you think is good and bad and what needs fixing, what additional features would you like to see, etc."
An alpha version of a product is an earlier, less polished version, usually with a smaller and more trusted "audience." Beta is as close to the final product as they can make it and is sent out to a larger community. Most companies develop a beta trial mailing list of a community of early adopters and expert users, mostly known to be friendly to the company and its products and helpful in their comments.
Alpha and beta testing are easy and inexpensive ways to get feedback. But you do not get the kind of detailed UX problem data observed during usage and associated closely with user actions and their consequences in the context of specific interaction design features: the kind of data essential for isolating specific UX problems within the formative evaluation process.
Alpha and beta testing are very much individualized to a given development organization and environment. Full descriptions of how to do alpha and beta testing are beyond our scope. Like alpha and beta testing, user field survey information is retrospective and, while it can be good for getting at user satisfaction, it does not capture the details of use within the usage experience. Anything is better than nothing, but let us hope this is not the only formative evaluation used within the product lifecycle in a given organization.
13.9.2 Remote UX Evaluation
Remote UX evaluation methods (Dray & Siegel, 2004; Hartson & Castillo, 1998) are good for evaluating systems after they have been deployed in the field.
Methods include:
• simulating lab-based UX testing using the Internet as a long extension cord to the user (e.g., UserVue by TechSmith)
• online surveys for getting after-the-fact feedback
• software instrumentation of click-stream and usage event information
• software plug-ins to capture user self-reporting of UX issues
The Hartson and Castillo (1998) approach uses self-reporting of UX problems by users as the problems occur during their normal usage, allowing you to get at the perishable details of the usage experience, especially in real-life daily work usage. As always, the best feedback for design improvement is feedback deriving from Carter's (2007) "inquiry within experience," or formative data given concurrent with usage rather than retrospective recollection. A full description of how to do remote UX testing is highly dependent on the type of technology used to mediate the evaluation, and therefore not possible to describe in detail here.
13.9.3 Local UX Evaluation
Local UX evaluation is UX evaluation using a local prototype. A local prototype is very limited in both depth and breadth, restricted to a single interaction design issue involving particular isolated interaction details, such as the appearance of an icon, wording of a message, or behavior of an individual function. If your design team cannot agree on the details of a single feature, such as a particular dialogue box, you can mockup local prototypes of the alternatives and take them to users to compare in local UX evaluation.
13.9.4 Automatic UX Evaluation
Lab-based and UX inspection methods are labor-intensive and, therefore, limited in scope (small number of users exercising small portions of large systems). But large and complex systems with large numbers of users offer the potential for a vast volume of usage data. Think of "observing" a hundred thousand users using Microsoft Word. Automatic methods have been devised to take advantage of this boundless pool of data, collecting and analyzing usage data without need for UX specialists to deal with each individual action.
The result is a massive amount of data about keystrokes, click-streams, and pause/idle times. But all data are at the low level of user actions, without any information about tasks, user intentions, cognitive processes, etc. There are no direct indications of when the user is having a UX problem somewhere in the midst of that torrent of user action data. Basing redesign on click counts and low-level user navigation within a large software application could well lead
to low-level optimization of a system with a bad high-level design. A full description of how to do automatic usability evaluation is beyond our scope.
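To make the limitation concrete, here is a sketch of the kind of low-level analysis automatic methods yield: click counts per target and long pauses from a timestamped event log. Nothing in it reveals what the user was trying to do, which is exactly the missing task context noted above. The log format, threshold, and data are our own example.

```python
from collections import Counter

# Sketch of low-level usage-log analysis: counting clicks per UI element and
# flagging long pauses. Note there is no task or intention information here,
# which is the limitation discussed above. The log is made-up example data.

event_log = [  # (timestamp in seconds, event type, target)
    (0.0,  "click", "File menu"),
    (1.2,  "click", "Save As"),
    (9.8,  "keypress", "filename field"),   # 8.6 s pause: a problem, or just thinking?
    (10.4, "click", "OK"),
]

click_counts = Counter(target for _, kind, target in event_log if kind == "click")
pauses = [(b[0] - a[0], a[2], b[2])
          for a, b in zip(event_log, event_log[1:]) if b[0] - a[0] > 5.0]

print("Click counts:", dict(click_counts))
print("Pauses over 5 s:", pauses)
```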
13.10 MORE ABOUT "DISCOUNT" UX ENGINEERING METHODS
13.10.1 Nielsen and Molich's Original Heuristics
The first set of heuristics that Nielsen and Molich developed for usability inspection (Molich & Nielsen, 1990; Nielsen & Molich, 1990) was 10 "general principles" for interaction design. They called them heuristics because they are not strict design guidelines. Table 13-3 lists these original heuristics from Nielsen's Usability Engineering book (Nielsen, 1993, Chapter 5).
13.10.2 "Discount" Formative UX Evaluation Methods Although the concepts have been challenged, mainly by academics, as inferior and scientifically unsound, we use the term "discount method" in a positive sense. UX evaluation is the center of the iterative process and, despite its highly varied effectiveness, somehow in practice it still works. Here we wholeheartedly affirm the value of discount UX methods among your UX engineering tools!
What is a "discount" evaluation method?
Because UX inspection techniques are less costly, they have been called "discount" evaluation techniques (Nielsen, 1989). Although the term was intended to reflect the advantage of lower costs, it soon was used pejoratively to connote inferior bargain-basement goods (Cockton & Woolrych, 2002; Gray & Salzman, 1998) because of the reduced effectiveness and susceptibility to errors in identifying UX problems.
Table 13-3
Original Nielsen and Molich heuristics

• Simple and natural dialogue
   • Good graphic design and use of color
   • Screen layout by gestalt rules of human perception
   • Less is more; avoid extraneous information
• Speak the users' language
   • User-centered terminology, not system or technology centered
   • Use words with standard meanings
   • Vocabulary and meaning from work domain
   • Use mappings and metaphors to support learning
• Minimize user memory load
   • Clear labeling
• Consistency
   • Help avoid errors, especially by novices
• Feedback
   • Make it clear when an error has occurred
   • Show user progress
• Clearly marked exits
   • Provide escape from all dialogue boxes
• Shortcuts
   • Help expert users without penalizing novices
• Good error messages
   • Clear language, not obscure codes
   • Be precise rather than vague or general
   • Be constructive to help solve problem
   • Be polite and not intimidating
• Prevent errors
   • Many potential error situations can be avoided in design
   • Select from lists, where possible, instead of requiring user to type in
   • Avoid modes
• Help and documentation
   • When users want to read the manual, they are usually desperate
   • Be specific with online help
Inspection methods have been criticized as "damaged merchandise" (Gray & Salzman, 1998) or "discount goods" (Cockton & Woolrych, 2002); however, we feel that, as in most things, the value of these methods depends on the context of their use. Although the controversy could be viewed by those outside of HCI research as academic fun and games, it could be important to you because it is about very practical aspects of your choices of UX evaluation methods and the bounds of their use.
Do "discount" methods work?
It depends on what you mean by "work." Much of the literature by researchers studying the effectiveness of UX evaluation methods decries the shortcomings of inspection methods when measured by a science-oriented yardstick.
Studies have established that even with a large number of evaluators, some evaluations reveal only a fraction of the existing problems. We know that there is broad variability of results across methods and across people using the same method. Different evaluators even report very different problems when observing the same evaluation session. Different UX teams interpret the same evaluation report in different ways.
However, in an engineering context, "working" means being effective and being cost-effective, and in this context discount UX engineering methods have a well-documented record of success. From a practical perspective, it is difficult to avoid the conclusion that using these methods is still better than not doing anything about evaluating UX.
Yes, you might miss many real user experience problems, but you will get some good ones, too. That is the trade-off you must be willing to accept if you use "discount" methods. You might even get some false positives, things that look like problems but really are not. It is hoped that you can sort those out. In any case, the idea is that you will be able to achieve a good engineering result much faster and with far less cost than a full empirical treatment that some authors demand.
Finally, although lab-based evaluation is often held up as the "gold standard" or yardstick against which other evaluation methods are compared, lab
testing is not perfect, either, and does not escape criticism for limitations in effectiveness (Molich et al., 1998, 1999; Newman, 1998; Spool & Schroeder, 2001). The lab-based approach to UX testing suffers from many of the same kinds of flaws as do discount and other inspection methods.
Pros and cons as engineering tools

Of course, with any discount approach, there are trade-offs. The upside is that, in the hands of an experienced evaluator, inspection methods can be very effective: you can get a lot of UX problems dealt with and out of the way at a low cost. Another advantage is that UX inspection methods can be very fast and thus more responsive than lab testing, for example, to fast iteration. Under the right conditions, you can do a UX inspection and its analysis, fix the major problems, and update the prototype design in a day!
The major downsides are that because inspection methods do not employ real users, they can be error-prone, can tend to find a higher proportion of lower severity problems, and can suffer from validity problems. This means they will yield some false positives (UX issues that turn out not to be real problems) and will miss some UX problems because of false negatives. Having to deal with
low-severity problems and false positives can be distracting to UX practitioners and can be wasteful of resources.
Another potential drawback is that the UX experts doing the inspection may not know the subject-matter domain or the system in depth. This can lead to a less effective inspection but can be offset somewhat by including a subject-matter expert on the inspection team.
Evaluating UX evaluation methods
Some of the value of current methods for assessing and improving UX in interactive software systems is somewhat offset by a general lack of understanding of the capabilities and limitations of each. Practitioners need to know which methods are more effective and in what ways and for what purposes. Thus emerged the need to evaluate and compare usability evaluation methods. The literature has a number of limited studies and commentaries on the effectiveness of usability evaluation methods, each report with its own different goals, results, and inferences.
However, there are no standard criteria for usability evaluation method comparison from study to study. And researchers planning full formal summative studies of competing methods in a real-world commercial development project environment are faced with virtually prohibitive difficulty and expense. It is hard enough to develop a system once, let alone redeveloping it over and over with multiple different approaches.
So we have a few imperfect but still enlightening studies to go by, mostly studies emerging as a by-product or afterthought attached to some other primary effort. In sum, usability evaluation methods have not been evaluated and compared reliably.
In the literature, usability inspection methods are often compared with lab-based testing, but we do not see inspection as a one-or-the-other alternative or a substitute for lab-based testing. Each method is one of the available UX engineering tools, each appropriate under its own conditions.
Andrew Sears made some of the most important early contributions about usability metrics (e.g., thoroughness, validity, and reliability) in usability evaluation methods (Sears, 1997; Sears & Hess, 1999). Hartson, Andre, and Williges (2003) introduced usability evaluation method comparison criteria and extended the measures of Sears to include effectiveness, an overall metric taking into account both thoroughness and validity. Their weightings of thoroughness and validity have the potential to enhance usability evaluation method performance measures in comparison studies.
Hartson, Andre, and Williges (2003) include a modest review of 18 comparative usability evaluation method studies.
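As a rough illustration of how thoroughness, validity, and effectiveness relate (our paraphrase of the way these metrics are usually defined in that literature, with made-up problem sets):

```python
# Sketch of the usability evaluation method metrics mentioned above, roughly:
# with R the set of real problems that exist and P the set of problems reported,
#   thoroughness  = |P intersect R| / |R|
#   validity      = |P intersect R| / |P|
#   effectiveness = thoroughness * validity
# The sets below are made-up examples.

real_problems = {"p1", "p2", "p3", "p4", "p5", "p6"}
reported      = {"p1", "p2", "p3", "x1", "x2"}       # x1, x2 are false positives

hits = reported & real_problems
thoroughness = len(hits) / len(real_problems)        # 3/6 = 0.50
validity     = len(hits) / len(reported)             # 3/5 = 0.60
effectiveness = thoroughness * validity              # 0.30

print(f"thoroughness={thoroughness:.2f}, validity={validity:.2f}, effectiveness={effectiveness:.2f}")
```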
Gray and Salzman (1998) spoke to the weaknesses of most usability evaluation method comparison studies conducted to that date. Their critical review of usability evaluation method studies concluded that flaws in experimental design and execution "call into serious question what we thought we knew regarding the efficacy of various usability evaluation methods." Using specific critiques of well-known usability evaluation method studies to illustrate, they argued the case that experimental validity (of various kinds) and other shortcomings of statistical analyses posed a danger in using the "conclusions" to recommend usability evaluation methods to practitioners.
To say that this paper was controversial is an understatement. Perhaps it was in part the somewhat cynical title ("Damaged Merchandise") or the overly severe indictment of the research arm of HCI, but the comments, rebuttals, and rejoinders that followed added more than a little fun and excitement into the discipline.
Also, we have noticed a trend since this paper to transfer the blame from the studies of discount usability evaluation methods to the methods themselves, a subtle attempt to paint the methods with the same "damaged merchandise" brush. The CHI'95 panel called "Discount or Disservice?" (Gray et al., 1995) is an example. In Gray and Salzman (1998), the term "damaged merchandise" was, at least ostensibly, aimed at flawed studies comparing usability evaluation methods. But many have taken the term to refer to the usability evaluation methods themselves and this panel title does nothing to disabuse us of this semantic sleight of hand.
More recently, in a comprehensive meta study of usability methods and practice, Hornbaek (2006) looked at 180 (!) studies of usability evaluation methods. Hornbaek proposed more meaningful usability measures, both objective and subjective, and contributed a useful in-depth discussion of what it means to measure usability, an elusive but fundamental concept in this context.
Finally, one of the practical problems with evaluation methods and their evaluation is the question of "Now that you have found the usability problems, what is next?" John and Marks (1997) consider downstream (after usability data gathering) utility, the usefulness of usability evaluation method outputs (problem reports) in convincing team members of the need to fix a problem and the usefulness in helping to effect the fixes. Only a few others also consider this issue, including Medlock et al. (2005), Gunn (1995), and Sawyer, Flanders, and Wixon (1996).
The Comparative Usability Evaluation (CUE) series
In a series of usability evaluation method evaluation studies that became known as the Comparative Usability Evaluation series (seven studies that we know of as of this writing), a number of usability evaluation methods were tested under a variety of conditions. A major observation emerged under the name of the evaluator effect (Hertzum & Jacobsen, 2003): the variation among results from different evaluators using the same method on the same target application can be so large as to call into question the effectiveness of the methods. Some have called these the studies that discredit usability engineering, but we think they just bring some important issues about reality to light.
In CUE-1 (Molich et al., 1998), four professional usability labs performed usability tests of a Windows calendar management application. Of the 141 usability problems reported overall, 90% were reported by only one lab and only one problem was reported by as many as four labs.
In CUE-2 (Molich et al., 2004), nine organizations evaluated a Website, focusing on a prescribed task set. Seventy-five percent of the 310 overall problems were reported by just one team, while only 2 problems were reported by as many as six groups.
CUE-3 (Hertzum, Jacobsen, & Molich, 2002) again evaluated a Website using a specific task set. This time the experimenters began with 11 individuals. The subject evaluators then met in four groups to combine their individual results. Following group discussions, the individuals "felt that they were largely in agreement," despite only a 9% overlap in reported problems between any two evaluators. This perception of agreement in the face of data apparently to the contrary seemed to be based on the feeling that the different problem reports were actually about similar underlying problems but coming at them from different directions.
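The low pairwise overlap figures quoted in these studies are typically computed as an "any-two agreement" style measure: for each pair of evaluators, the intersection of their reported problem sets divided by the union, averaged over all pairs. The sketch below is our illustration of that computation (not the CUE-3 analysis code), with made-up problem sets.

# Our illustration of an "any-two agreement" style overlap measure:
# for every pair of evaluators, divide the size of the intersection of their
# problem sets by the size of the union, then average over all pairs.
from itertools import combinations

def any_two_agreement(problem_sets):
    """problem_sets: list of sets of problem identifiers, one set per evaluator."""
    pairs = list(combinations(problem_sets, 2))
    overlaps = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    return sum(overlaps) / len(overlaps)

# Hypothetical problem reports from three evaluators:
reports = [{"P1", "P2", "P3"}, {"P3", "P4"}, {"P5", "P6", "P7"}]
print(round(any_two_agreement(reports), 2))  # 0.08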
In a secondary part of CUE-4, the authors concluded that a large proportion of recommendations given following a usability evaluation were neither useful nor usable for making design changes to improve product usability (Molich, Jeffries, & Dumas, 2007). The authors concluded that designers have difficulty acting on most problem descriptions because problem reports are often poorly written, unconvincing, and ineffective at guiding a design solution. To make it worse, in many cases the entire outcome rides on the report itself as there is no opportunity to explain the problems or argue the case afterward.
As of this writing, the latest in the CUE series was CUE-9 (Molich, 2011).
13.10.3 Yet Somehow Things Work
Press on
So what is all the fuss in the literature about damaged merchandise, discount methods, heuristic methods, and so on? Scientifically, there are valid issues with these methods. However, the evaluator effect applies to virtually all kinds of UX evaluation, including our venerable yardstick of performance, the lab-based formative evaluation. Formative evaluation, in general, just is not very reliable or repeatable.
There is no one evaluation method that will reveal all the UX problems. So what is a UX newbie to do? Give it up? No, this is engineering and we just have to get over it, be practical, and do our best. And sometimes it works really well.
While researchers continue to pursue and validate better methods, we make things work and we use tools and methods that are far less than perfect. We always seem to get a better design by evaluating and iterating and that is the goal. One application of our methods may not find all UX problems, but we usually get some good ones. If we fix those, maybe we will get the others the next time.
We will find the important ones in other ways; we are not doing the whole process with our eyes closed. In the meantime, experienced practitioners read about how these evaluation methods do not work and smile as they head off to the lab or to an inspection session.
Among the reasons we have to be optimistic in the long run about our UX evaluation methods are:
■ Goals are engineering goals, not scientific goals
■ Iteration helps close the gap
■ Disagreement in the studies was subject to interpretation
■ Evaluation methods can be backed up with UX expertise
Practical engineering goals
Approaching perfection is expensive. In engineering, the goal is to make it good enough. Wixon (2003) speaks up for the practitioner in the discussion of usability evaluation methods. From an applied perspective, he points out that a focus on validity and proper statistical analysis in these studies is not serving the needs of practitioners in finding the most suitable usability evaluation method and best practices for their work context in the business world.
Wixon (2003) would like to see more emphasis on usability evaluation method comparison criteria that take into account factors that determine their success in real product development organizations. As an example, the value of a method might be less about absolute numbers of problems detected and more
about how well a usability evaluation method fits into the work style and development process of the organization. And to this we can only say amen.
Managing risk by mitigating evaluation errors
Cockton and Woolrych (2002) cast errors made with usability evaluation methods (discount inspection methods and "lite" lab-based usability testing methods) in terms of risks, asking "But are discount methods really too risky to justify the 'low' cost?" What kinds of risks are there? There is the risk of not fixing usability issues missed by the method and the risk of "fixing" false alarms.
To be sure, however, the risks associated with errors are real and are part of any engineering activity involving evaluation and iterative improvement. But we have to ask, how serious are the risks? Rarely can an error of these types make or break the success of the system. As Cockton and Woolrych (2002,
pp. 17-18) point out, "errors ... may be more costly in some contexts than others."
Where human lives are at risk, we are compelled to spend more resources to be more thorough in all our process activities and will surely not allow an evaluation error to weaken the design in the end. In most applications, though, each error simply means that we missed an opportunity to improve usability in one detail of the design.
Managing the risk of false negatives with iteration
Many of the comparisons of usability inspection methods point out the susceptibility of these methods to making errors in identifying usability problems, with disparaging conclusions. However, these conclusions are usually made in the context of science, where evaluation errors can count heavily against the method. In contrast, others (Manning, 2002) question the working assumptions of such problem validity arguments when examined in the light of real-world development projects.
One kind of problem identification error, a false negative (failure to detect a real usability problem), can lead to missing out on needed fixes. The risk here, of course, is no greater than it would be if no evaluation were done. So every problem you do find is one for the good. What, then, are the alternatives?
Discount methods are being used presumably because of budget constraints, so the more expensive lab-based testing is not going to be the answer.
One important factor that some studies of evaluation methods neglect is iteration. To temper the consequences of missing some problems the first time around, you always have other iterations and other evaluation methods that might well catch them by the end of the day. If we look at the results over a few
iterations instead of each attempt in isolation, we are likely to see net occurrences of false negatives reduced greatly.
If we find a set of bona fide problems and fix them, we remove them from contention in the next cycle of evaluation, which helps us focus on the remaining problems. If we combine different results (i.e., different problems uncovered) from different evaluators or different iterations, the overall process can still converge.
Managing the risk of false positives with UX expertise
Another kind of problem identification error is a false positive: identifying something as a problem when it is not. The risk associated with this kind of error, that of trying to fix a problem that is not real, could exact a penalty. For example, false positives in problem identification can lead to unneeded and possibly damaging "fixes." But, as we said, this risk occurs with any kind of evaluation method.
The important point to remember here is that UX inspection is only an engineering tool. You, the practitioner, must maintain your engineering judgment and not let evaluation results be interpreted as the final word. You are still in charge.
Also, as many point out, this is just the initial finding of candidate problems. To abate the effects of false positives, think of the method as an engineering tool not giving you absolute indicators of problems but suggesting possible problems, possibilities that you, the expert practitioner, must still investigate, analyze, and decide upon. Then, if there are still false positives, you can blame yourself and not the inspection method. In the discount methods controversy, we lay a lot of responsibility on the methods for finding problems without considering that the UX evaluation methods are backed up by UX specialists.
One important way that a UX specialist can augment the limited power of a UX evaluation method is by learning from problems that are found and keeping alert for similar issues elsewhere in the design. An interaction design is a web of features and relationships. If you detect one instance of a UX problem in a particular design feature, you are likely to encounter similar problems in similar situations elsewhere as you go about the fixes and redesign.
Suppose there are 10 instances of a UX problem of a certain general type in your application, but your UX evaluation method finds only 1. There is still a good chance that analysis and redesign for that problem will lead a dedicated and observant UX specialist eventually to find and fix some other similar or related problems, giving you a UX evaluation method/practitioner team with a higher net problem detection rate.
Look at the bright side of studies
In CUE-3 (Hertzum, Jacobsen, & Molich, 2002), the dissimilarity among results across individual evaluators was not viewed as disagreements by the evaluators but as different expressions of the same underlying problems. Although statistically the evaluators had a relatively small overlap in problems reported, after a group discussion the evaluators all felt they "were largely in agreement." The evaluators perceived their disparate observations as multiple sources of evidence in support of the same issues, not as disagreements.
We have also experienced this in our consulting work when different problem reports at first seemed not comparable, but further discussion revealed that they were saying different things about essentially the same underlying problems, and we felt that even if we had not detected this in the analysis stage, the different views would have converged in the process of fixing the problems.
This seems to say that the evaluation methods were not as bad at problem detection as data initially implied. However, it also seems to shift the spotlight to difficulties in how we analyze and report problems in those methods. There is a large variation in the diagnoses and expressions used to describe problems.
Finally, in situations where thoroughness is low, low reliability across evaluators can actually be an asset. As long as each evaluator is not finding most or all of the problems, differences in detection across evaluators mean that, by adding more evaluators, you can find more problems in their combined reports through a diversity in detection abilities.
In sum, although criticized as false economy in some HCI literature, especially the academic literature, these so-called "discount" methods are practiced heavily and successfully in the field.
Rigorous Empirical Evaluation: Preparation
Be prepared; that's the Boy Scouts' marching song
Don't be nervous, don't be flustered, don't be scared
Be prepared!
- Tom Lehrer
14.1 INTRODUCTION
14.1.1 You Are Here
We begin each process chapter with a "you are here" picture of the chapter topic in the context of the overall Wheel lifecycle template; see Figure 14-1. This chapter, about preparing for evaluation, begins a series of four chapters about rigorous empirical UX evaluation methods, of which lab-based testing is the archetype example. Some of what is in these chapters applies to either lab-based or in-the-field empirical evaluation, but parts are specific to just lab-based testing. Field-based rigorous empirical UX evaluation is essentially the same as lab-based evaluation except that the work is done on location in the field instead of in a lab.
Figure 14-1
You are here, at preparing for evaluation, within the evaluation activity in the context of the overall Wheel lifecycle template.
Although we do include quantitative UX data collection and analysis, we emphasize it less than earlier usability engineering books did, reflecting the shift in practice away from quantitative user performance measures and toward evaluation that reveals UX problems to be fixed.
14.2 PLAN FOR RIGOROUS EMPIRICAL UX EVALUATION
Planning your empirical UX evaluation means making cost-effective decisions and trade-offs. As Dray and Siegel (1999) warn, "Beware of expediency as a basis for decision making." In other words, do not let small short-term savings undercut your larger investment in evaluation and in the whole process lifecycle.
14.2.1 A Rigorous UX Evaluation Plan
The purpose of your plan for rigorous UX evaluation, whether lab based or in the field, is to describe evaluation goals, methods, activities, conditions, constraints, and expectations. Especially if the plan will be read by people outside your immediate project group, you might want an upfront "boilerplate" introduction with some topics like these, described very concisely:
■ Goals and purpose of this UX evaluation
■ Overview of plan
■ Overview of product or parts of product being evaluated (for people outside the group)
■ Goals of the product user interface (i.e., what will make for a successful user experience)
■ Description of the intended user population
■ Overview of approach to informed consent
■ Overview of how this evaluation fits into the overall iterative UX process lifecycle
■ Overview of the UX evaluation process in general (e.g., preparation, data collection, analysis, reporting, iteration)
■ General evaluation methods and activities planned for this session
■ Estimated schedule
■ Responsible personnel
The body of the plan can include topics such as:
■ Description of resources and constraints (e.g., time needed/available, state of prototype, lab facilities and equipment)
■ Pilot testing plan
■ Approach to evaluation, choices of data collection techniques
■ Mechanics of the evaluation (e.g., materials used, informed consent, location of testing, UX goals and metrics involved, tasks to be explored, including applicable benchmark tasks)
■ All instruments to be used (e.g., benchmark task descriptions, questionnaires)
■ Approaches to data analysis
■ Specifics of your approach to evaluate emotional impact and, if appropriate, phenomenological aspects of interaction
14.2.2 Goals for Rigorous UX Evaluation Session
One of the first things to do in an evaluation plan is to set, prioritize, and document your evaluation goals. Identify the most important design issues and user tasks to investigate. Decide which parts of the system or functionality
you simply will not have time to look at.
Your evaluation goals, against which you should weigh all your evaluation choices and activities, can include:
■ Application scope (parts of the system to be covered by this evaluation)
■ Types of data to collect (see Chapter 12)
■ UX goals, targets, and metrics, if any, to be addressed
■ Matching this evaluation to the current stage of product design evolution
14.3 TEAM ROLES FOR RIGOROUS EVALUATION
Select your team members for evaluation activities. Encourage your whole project team to participate in at least some evaluation. The greater the extent that the whole team is involved from the start, in both the planning and the execution of the studies, the better chance you have at addressing everyone's concerns. Broad participation begets buy-in and ownership, necessary for your results to be taken as a serious mandate to fix problems.
However, your evaluation team will, as a practical matter, be limited to practitioners with active roles, plus perhaps a few observers from the rest of your project team or organization. So everyone on your evaluation team is an "evaluator," but you also need to establish who will play more specific roles, including the facilitator, the prototype "executor," and all observers and data collectors. Whether your prototype is low or high fidelity, you will need to select appropriate team roles for conducting evaluation.
14.3.1 Facilitator
Select your facilitator, the leader of the evaluation team. The facilitator is the orchestrator, the one who makes sure it all works right. The facilitator has the primary responsibility for planning and executing the testing sessions, and the final responsibility to make sure the laboratory is set up properly. Because the facilitator will be the principal contact for participants during a session and responsible for putting the participant at ease, you should select someone with good "people skills."
14.3.2 Prototype Executor
If you are using a low-fidelity (e.g., paper) prototype, you need to select a prototype executor, a person to "execute" the prototype as though it were being run on a computer. The person in this role is the computer.
The prototype executor must have a thorough technical knowledge of how the design works. So that the prototype executor responds only to participant actions, he or she must have a steady Vulcan sense of logic. The executor must also have the discipline to maintain a poker face and not speak a single word throughout the entire session.
14.3.3 Quantitative Data Collectors
If you intend to collect quantitative data, you will need quantitative data collectors. Depending on your UX metrics and quantitative data collection instruments, people in these roles may be walking around with stopwatches and counters (mechanical, electronic, or paper and pencil). Whatever quantitative data are called for by the UX metrics, these people must be ready to take and record those data. Quantitative data collectors must be attentive and not let data slip by without notice. They must have the ability to focus and not let their minds wander during a session. If you can afford it, it is best to let someone specialize in only taking quantitative data. Other duties and distractions often lead to forgetting to turn on or off timers or forgetting to count errors.
14.3.4 Qualitative Data Collectors
Although facilitators are usually experienced in data collection, they often do not have time to take data or they need help in gathering all qualitative data. When things are happening fast, the need for facilitation can trump data taking for the facilitator.
Select one or more others as data collectors and recorders. No evaluation team member should ever be idle during a session. Thoroughness will improve with more people doing the job. Everyone should be ready to collect qualitative data, especially critical incident data; the more
data collectors, the better.
14.3.5 Supporting Actors
Sometimes you need someone to interact with the participant as part of the task setting or to manage the props needed in the evaluation. For example, for task realism you may need someone to call the participant on a telephone in the participant room or, if your user participant is an "agent" of some kind, you may need a "client" to walk in with a specific need involving an agent
task using the system. Select team members to play supporting roles and handle props.
14.4 PREPARE AN EFFECTIVE RANGE OF TASKS
If evaluation is to be task based, including lab-based testing and task-driven UX inspection methods, select appropriate tasks to support evaluation. Select different kinds of tasks for different evaluation purposes.
14.4.1 Benchmark Tasks to Generate Quantitative Measures
If you plan to use UX goals and targets to drive your UX evaluation, everyone in an evaluator role should have already participated with other members of the project team in identifying benchmark tasks and UX target attributes and metrics (Chapter 10). These attributes and metrics must be ready and waiting as a comparison point with actual results to be observed in the informal summative component of the evaluation sessions with participants.
Be sure that descriptions of all benchmark tasks associated with your UX targets and metrics are in final form, printed off, and ready to use by participants to generate data to be measured. Benchmark tasks portray representative, frequent, and critical tasks that apply to the key work role and user class represented by each participant (Chapter 10). Make sure each task description says only what to do, with no hints about how to do it. Also, do not use any language that telegraphs any part of the design (e.g., names of user interface objects or user actions, or words from labels or menus).
14.4.2 Unmeasured Tasks
Like benchmark tasks, additional unmeasured tasks, used especially in early cycles of evaluation, should be ones that users are expected to perform often. Unmeasured tasks are tasks for which participant performance will not be measured quantitatively but which will be used to add breadth to qualitative evaluation. Evaluators can use these representative tasks to address aspects of the design not covered in some way by the benchmark tasks.
In early stages, you might employ only unmeasured tasks, the sole goal of which is to observe critical incidents and identify initial UX problems to root out, to fix at least the most obvious and most severe problems before any measured user performance data can be very useful.
Just as for benchmark tasks created for testing UX attributes, you should write up representative unmeasured task descriptions, which should be just as specific as the benchmark task descriptions, and give them to the participant to perform in the evaluation sessions.
14.4.3 Exploratory Free Use
In addition to strictly specified benchmark and unmeasured tasks, the evaluator may also find it useful to observe the participant in informal free use of the interface, a free-play period without the constraints of predefined tasks. This does not necessarily mean that they are even doing tasks, just exploring.
Be prepared to ask your participants to explore and experiment with the interaction design, beyond task performance. To engage a participant in free use, the evaluator might simply say "play around with the interface for awhile, doing anything you would like to, and talk aloud while you are playing." Free use is valuable for revealing participant expectations and system behavior in situations not anticipated by designers, often situations that can break a poor design.
14.4.4 User-Defined Tasks
Sometimes tasks that users come up with will address unexpected aspects of your design (Cordes, 2001). You can include user-defined tasks by giving your participants a system description in advance of the evaluation sessions and asking them to write down some tasks they think are appropriate to try, or you can wait until the session is under way and ask each participant extemporaneously to come up with tasks to try.
If you want a more uniform task set over your participants but still wish to include user-defined tasks, you can ask a different set of potential users to come up with a number of candidate task descriptions before starting any evaluation session. This is a good application for a focus group. You can vet, edit, and merge these into a set of user-defined tasks to be given to each participant as part
of each evaluation session.
14.5 SELECT AND ADAPT EVALUATION METHOD AND DATA COLLECTION TECHNIQUES
14.5.1 Select Evaluation Method and Data Collection Techniques
Using the descriptions of the evaluation methods and data collection techniques in Chapter 12, including the descriptions of the kinds of evaluation each is used for, select your evaluation method and data collection techniques to fit your evaluation plan and goals and the particular evaluation needs of your system or product.
For example, at a high level, you should choose first between rigorous or rapid evaluation methods (see Chapter 12). If you choose rigorous, you might choose between a lab-based or in-the-field method. If you choose rapid methods, your next choice should be from among the many such evaluation methods given in Chapter 13.
Your approach to choosing evaluation methods and techniques should be goal driven. For example, when you wish to evaluate usefulness (the coverage, completeness, and appropriateness of the functionality and of its support in the user interface), consider doing it:
■ Objectively, by cross-checks of functionality implied by your hierarchical task inventory, design scenarios, conceptual design, and task intention descriptions.
■ Subjectively, by user questionnaires (Chapter 12).
■ Longitudinally, by following up with real users in the field after a product or system is released. Use downstream questionnaires directed at usefulness issues to guide functional design thinking for subsequent versions.
Your choices of specific data collection techniques should also be goal driven. If you are using participants, as you will in rigorous evaluation, you should strongly consider using the critical incident identification, think-aloud, and co-discovery techniques (Chapter 12). If you are doing a task-driven expert UX inspection (Chapter 13), you can collect data about your own critical incidents.
Questionnaires (Chapter 12) are a good choice if you want to supplement your objective UX evaluation data with subjective data directly from the user. Questionnaires are simple to use, for both analyst and participant, and can be used with or without a lab. Questionnaires can yield quantitative data as well as qualitative user opinions.
For example, a questionnaire can have numerical choices that participants must choose from to provide quantitative data or it can have open-ended questions to elicit qualitative free-form feedback. Questionnaires are good for evaluating specific predefined aspects of the user experience, including perceived usability and usefulness.
If you want to collect data to evaluate emotional impact, questionnaires are probably the least expensive choice and the easiest to administer. More advanced data collection techniques for evaluating emotional impact include biometrics and other ways to identify or measure physiological responses in users (Chapter 12).
If you choose to use a questionnaire in your evaluation, your next step is to use the information on questionnaires in Chapter 12 to decide which questionnaire to use. For example, you might choose our old standby, the Questionnaire for User Interface Satisfaction (QUIS), for subjective evaluation of, and user satisfaction about, traditional user performance and usability issues in screen-based interaction designs; the System Usability Scale (SUS) for a versatile and broadly applicable general-purpose subjective user experience evaluation instrument; or the Usefulness, Satisfaction, and Ease of Use (USE) questionnaire for a general-purpose subjective user experience evaluation instrument.
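If you go with SUS, for example, its published scoring rule is simple to apply: each of the ten items is answered on a 1-5 scale, odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0-100 score. The sketch below illustrates this; the responses are hypothetical.

# A small sketch of the published SUS scoring rule, shown here only as an
# illustration of turning questionnaire responses into a quantitative
# measure; the response values below are made up.

def sus_score(responses):
    """responses: list of 10 integers in 1..5, in questionnaire order."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)  # odd vs. even items
    return total * 2.5  # scales the result to 0-100

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # hypothetical participant: 85.0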
Part of choosing a questionnaire will involve deciding the timing of administration, for example, after each task or at the end of each session.
14.5.2 Adapt Your Choice of Evaluation Method and Data Collection Techniques
For UX evaluation, as perhaps for most UX work, our motto echoes that old military imperative: Improvise, adapt, and overcome! Be flexible and customize your methods and techniques, creating variations to fit your evaluation goals and needs. This includes adapting any method by leaving out steps, adding new steps, and changing the details of a step.
14.6 SELECT PARTICIPANTS
The selection and recruitment of participants are about finding representative users outside your team and often outside your project organization to help with evaluation. This section is mainly focused on participants for lab-based UX evaluation, but also applies to other situations where participants are needed, such as some non-lab-based methods for evaluating emotional impact and phenomenological aspects.
In formal summative evaluation, this part of the process is referred to as "sampling," but that term is not appropriate here because what we are doing has nothing to do with the implied statistical relationships and constraints.
14.6.1 Establish Need and Budget for Recruiting User Participants Upfront
Finding and recruiting evaluation participants might be part of the process where you are tempted to cut corners and save a little on the budget or might be something you think to do at the last minute.
In participant recruiting, to protect the larger investment already made in the UX lifecycle process and in setting up formative evaluation so far, you need to secure a reasonable amount of resources, both budget money and schedule time to recruit and remunerate the full range and number of evaluation participants you will need. If you do this kind of evaluation infrequently, you can engage the services of a UX evaluation consulting group or a professional recruiter to do your participant recruiting.
14.6.2 Determine the Right Participants
Look for participants who are "representative users," that is, participants who match your target work activity role's user class descriptions and who are knowledgeable of the general target system domain. If you have multiple work roles and corresponding multiple user classes, you must recruit participants representing each category. Prepare a short written demographic survey to administer to participants to confirm that each one meets the requirements of your intended work activity role's user class characteristics.
Participants must also match the user class attributes in any UX targets they will help evaluate. For example, if initial usage is specified, you need participants unfamiliar with your design. So, for example, even though a user may be a perfect match to a given key work role's user class characteristics, if the UX target involved specifies "initial performance" as the UX attribute and this participant has already seen and used the interaction design, maybe in a previous iteration, this person is not the right participant for this part of the evaluation.
"Expert" participants
Recruit an expert user, someone who knows the system domain and knows your particular system, if you have a session calling for experienced usage. Expert users are good at generating qualitative data. These expert users will understand the tasks and can tell you what they do not like about the design. But you cannot necessarily depend on them to tell you how to make the design better.
Recruit a UX expert if you need a participant with broad UX knowledge and who can speak to design flaws in terms of design guidelines. As participants, these experts may not know the system domain as well and the tasks might not make as much sense to them, but they can analyze user experience, find subtle problems (e.g., small inconsistencies, poor use of color, confusing navigation), and offer suggestions for solutions.
Consider recruiting a so-called double expert, a UX expert who also knows your system very well, perhaps the most valuable kind of participant. But the question of what constitutes being an expert of value to your evaluation is not
always clear-cut. Also, the distinction between expert and novice user is not a simple dichotomy. Not all experts make good evaluation participants and not all novices will perform poorly. And being an expert is relative: an expert in one thing can very well be a novice at something else. And even the same person can be an expert at one thing today and less of an expert in a month due to lack of practice and retroactive interference (intervening activities of another type).
14.6.3 Determine the Right Number of Participants
The question of how many participants you need is entirely dependent on the kind of evaluation you are doing and the conditions under which you are doing it. There are some rules of thumb, such as the famous "three to five participants is enough" maxim, which is quoted so often out of context as to be almost meaningless. However, real answers are more difficult. See the end of this chapter for further discussion about the "three to five users" rule and its limitations.
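For what it is worth, the "three to five users" maxim is usually traced to a simple problem-discovery model in which the proportion of problems found by n participants is 1 - (1 - p)^n, where p is the average probability that a single participant reveals a given problem. The sketch below shows how sensitive that curve is to p, which is exactly why the maxim cannot be quoted out of context; the p values are illustrative only.

# A sketch of the problem-discovery model often used to justify the
# "three to five users" maxim: proportion found = 1 - (1 - p)^n, where p is
# the average probability that a single participant reveals a given problem.
# The p values here are illustrative; the real p depends heavily on your
# product, tasks, participants, and evaluation method.

def proportion_found(p: float, n_participants: int) -> float:
    return 1 - (1 - p) ** n_participants

for p in (0.1, 0.3, 0.5):
    row = [round(proportion_found(p, n), 2) for n in (1, 3, 5, 10)]
    print(f"p={p}: n=1,3,5,10 ->", row)
# p=0.1: n=1,3,5,10 -> [0.1, 0.27, 0.41, 0.65]
# p=0.3: n=1,3,5,10 -> [0.3, 0.66, 0.83, 0.97]
# p=0.5: n=1,3,5,10 -> [0.5, 0.88, 0.97, 1.0]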
The good news is that your experience and intuition will be good touchstones for knowing when you have gotten the most of an iteration of UX evaluation and when to move on. One telltale sign is the lack of many new critical incidents or UX problems being discovered with additional participants.
You have to decide for yourself, every time you do UX testing, how many participants you can or want to afford. Sometimes it is just about achieving your UX targets, regardless of how many participants and iterations it takes. More often it is about getting in, getting some insight, and getting out.
14.7 RECRUIT PARTICIPANTS
Now the question arises as to where to find participants. Inform your customer early on about how your evaluation process will proceed so you will have the best chance of getting representative users from the customer organization at appropriate times.
14.7.1 Recruiting Methods and Screening
Here are some hints for successful participant recruiting.
■ Try to get the people around you (co-workers, colleagues elsewhere in your organization, spouses, children, and so on) to volunteer their time to act as participants, but be sure their characteristics fit your key work role and the corresponding user class needs.
■ Newspaper ads and emailings can work to recruit participants, but these methods are usually inefficient.
■ If the average person off the street fits your participant profile (e.g., for a consumer software application), hand out leaflets in shopping malls and parking lots or post notices in grocery stores or in other public places (e.g., libraries).
■ Use announcements at meetings of user groups and professional organizations if the cross section of the groups matches your user class needs.
■ Recruit students at universities, community colleges, or even K-12, if appropriate.
■ Consider temporary employment agencies as another source for finding participants.
A possible pitfall with temporary employment agencies is that they usually know nothing about UX evaluation, nor do they understand why it is so important to choose appropriate people as participants. The agency goal, after all, is to keep their pool of temporary workers employed, so screen their candidates with your user classes.
14.7.2 Participant Recruiting Database
No matter how you get contact information for your potential participants (advertising campaign, references from marketing, previously used participants), if you are going to be doing evaluation often, you should maintain a participant recruiting database. Because all the participants you have used in the past should be in this database, you can draw on the good ones for repeat performances.
You can also sometimes use your own customer base or your customer's contact lists as a participant recruiting source. Perhaps your marketing department has its own contact database.
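As one illustration of how lightweight such a participant recruiting database can be (this is our sketch, not a prescribed format), a small SQLite table keyed by work role and user class is often enough; all field names and values below are examples.

# One lightweight way (an illustration, not a prescribed format) to keep a
# participant recruiting database, using Python's built-in sqlite3 module.
# All field names here are examples; adapt them to your work roles and
# user class characteristics.
import sqlite3

conn = sqlite3.connect("participants.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS participants (
        id            INTEGER PRIMARY KEY,
        name          TEXT,
        contact       TEXT,
        work_role     TEXT,   -- key work role the person represents
        user_class    TEXT,   -- user class characteristics matched
        last_session  TEXT,   -- date of most recent evaluation session
        notes         TEXT    -- e.g., "good think-aloud participant"
    )
""")
conn.execute(
    "INSERT INTO participants (name, contact, work_role, user_class, last_session, notes) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Pat Example", "pat@example.com", "service agent", "novice", "2012-03-14", "repeat candidate"),
)
conn.commit()

# Draw on previously used participants who match a given work role:
for row in conn.execute("SELECT name, contact FROM participants WHERE work_role = ?", ("service agent",)):
    print(row)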
14.7.3 Incentives and Remuneration
Generally, you should not ask your participants to work for free, so you will usually have to advertise some kind of remuneration. Try to determine the going rate for evaluation participants in your local area.
You will usually pay a modest hourly fee (e.g., about a dollar above minimum wage for an off-the-street volunteer). Expert participants cost more, depending on your specialized requirements. Do not try to get by too cheaply; you might get what you pay for.
Instead of or in addition to money, you can offer various kinds of premium gifts, such as coffee mugs with your company logo, gift certificates for local restaurants and shops, T-shirts proclaiming they survived your UX tests, free pizza, or even chocolate chip cookies! Sometimes just having a chance to learn about a new product before it is released or to help shape the design of some new technology is motivation enough.
14.7.4 Difficult-to-Find User Participants
Be creative in arranging for hard-to-find participant types. Sometimes the customer, for whatever reasons, simply will not let the developer organization have access to representative users. The Navy, for example, can be rightfully hesitant about calling in its ships and shipboard personnel from the high seas to evaluate a system being developed to go on board.
Specialized roles (such as an ER physician) have time constraints that make it difficult, or impossible, to schedule them in advance. Sometimes you can have an "on call" agreement through which they call you if they have some free time and you do your best to work them in.
Sometimes when you cannot get a representative user, you can find a user representative, someone who is not exactly in the same role but who knows the role from some other angle. A domain expert is not necessarily the same as a user, but might serve as a participant, especially in an early evaluation cycle. We once were looking for a particular kind of agent of an organization who worked with the public, but had to settle, at least at the beginning, for supervisors of those agents.
14.7.5 Recruiting for Co-Discovery
Consider recruiting pairs of participants specifically for co-discovery evaluation. Your goal is to find people who will work well together during evaluation and, as a practical matter, who are available at the same time. We have found it best not to use two people who are close friends or who work together on a daily basis; such close relationships can lead to too much
wise-cracking and acting out.
In extreme cases, you might find two participants who are friends or work together who exemplify a kind of "married couple" phenomenon. They finish each other's sentences and much of their communication is implicit because they think alike. This is likely to yield less useful think-aloud data for you.
Look for people whose skills, work styles, and personality traits complement each other. Sometimes this is a good place to give them the Myers-Briggs test (Myers et al., 1998) for collaborative personality types.
14.7.6 Manage Participants as Any Other Valuable Resource
Once you have gone through the trouble and expense to recruit participants, do not let the process fail because a participant forgot to show up. Devise a mechanism to manage participant contact to keep in touch, remind in advance of appointments, and to follow up, if useful, afterward.
You need a standard procedure, and a fool-proof way to remind you to follow it, for calling your participants in advance to remind them of their appointments, just as they do in doctors' offices. No-show participants cost money in unused lab facilities, frustration in evaluators, wasted time in rescheduling, and delays in the evaluation schedule.
14.7.7 Select Participants for Subsequent Iterations
A question that commonly arises is whether you should use the same participants for more than one cycle of formative evaluation. Of course you would not use a repeat participant for tasks addressing an "initial use" UX attribute.
But sometimes reusing a participant (maybe one out of three to five) can make sense. This way, you can get a reaction to design changes from the previous cycle, in addition to a new set of data on the modified design from the other, new participants. Calling on a previously used participant tells them you value their help and gives them a kind of empowerment, a feeling that they are helping to make a difference in your design.
14.8 PREPARE FOR PARTICIPANTS
14.8.1 Lab and Equipment
If you are planning lab-based evaluation, the most obvious aspect of preparation is to have the lab available and configured for your needs. If you plan to use specialized equipment, such as for physiological measurement, you also need to have that set up and an expert scheduled to operate it.
If you plan to collect quantitative UX data, prepare by having the right kind of timers on hand, from simple stopwatches to instrumented software for automatically extracting timing data. You can also get high-precision timing data from video recordings of the session (Vermeeren et al., 2002).
Using video to compute timing originated with the British data collection system called DRUM (Macleod & Rengger, 1993). DRUM was the tool support for the larger usability evaluation methodology called MUSiC (Macleod et al., 1997). Today, most software available to control and
analyze digital video streams (e.g., TechSmith's Morae) can do this routinely.
As part of your post-session processing, you just tag the start and end
of task performance in the video stream and the elapsed time is computed directly.
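The computation behind that last step is nothing more than subtracting the tagged start time from the tagged end time. A minimal sketch, assuming timestamps exported as hh:mm:ss strings (real tools use their own export formats):

# A sketch of the post-session timing computation: once start and end of task
# performance have been tagged in the video stream, elapsed time is just the
# difference between the two timestamps. The hh:mm:ss format here is an
# assumption; real tools export their own formats.
from datetime import timedelta

def to_seconds(timestamp: str) -> int:
    h, m, s = (int(part) for part in timestamp.split(":"))
    return int(timedelta(hours=h, minutes=m, seconds=s).total_seconds())

def task_time_seconds(start_tag: str, end_tag: str) -> int:
    return to_seconds(end_tag) - to_seconds(start_tag)

# Hypothetical tags for one benchmark task:
print(task_time_seconds("00:12:05", "00:14:47"))  # 162 seconds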
A Modern UX Lab at Bloomberg LP
Bloomberg LP, a leader in financial informatics, unveiled a modern UX evaluation lab in 2011. We describe some of the main features of the lab in this sidebar.
The lab has two areas-a participant room and an observation room-separated by a one-way mirror. Each has an independent entrance. The participant room has a multi-monitor workstation on which Bloomberg's desktop applications are evaluated. The following photos depict a formative evaluation session in progress at this station.
On the other side of this participant room, another station is designed for evaluations with paper prototypes or mobile devices. In the following photo we show a formative evaluation session using paper prototypes where the facilitator (left) is responding to the actions of the participant (center) as the note taker (right) observes.
In the photos that follow, we show the same station being used to evaluate a mobile prototype (left). The photo on the right shows a close-up of the mobile device holder with a mounted camera. This setup allows the participant to hold and move the mobile device as she interacts while allowing a full capture of the user interface and her actions using the mounted camera.
The following photos are views of the observation room. This room is kept dark to prevent people in the participant room from seeing through. The lab is set up to pipe up to five selections of the seven video sources and four screen capture sources from the participant room to the large screens seen at the top in the observation room.
In the photo on the left you can see the participant room showing through the one-way mirror. In this photo we see stakeholders observing and tagging the video stream of the ongoing evaluation at four different stations.
In the photo on the right, you can see a view of the evaluation using the mobile prototype. Note on the left-hand screen above a close-up view of the evaluation from the overhead camera. The feed from the camera mounted on the mobile device holder is not shown in this photo.
This UX lab has been instrumental in defining the interaction designs of Bloomberg's flagship desktop and mobile applications. Special thanks to Shawn Edwards, the CTO; Pam Snook; and Vera Newhouse at Bloomberg L.P. for providing us these lab photos.
14.8.2 Session Parameters
Evaluators must determine protocol and procedures for conducting the testing: exactly what will happen and for how long during an evaluation session with a participant.
Task and session lengths
The typical length of an evaluation session for one participant is anywhere from 30 minutes to 4 hours. Most of the time, however, you should plan on an average session length of 2 hours or less.
That said, some real-world UX evaluation sessions can become a day-long experience for a participant. The idea is to get as much as possible from each user without burning him or her out.
If you require sessions longer than a couple of hours, it will be more difficult for participants. In such cases, you should:
■ Prepare participants for possible fatigue in long sessions by warning them in advance.
■ Mitigate fatigue by scheduling breaks between tasks, where participants can get up and walk around, leave the participant room, get some coffee or other refreshment, and even run screaming back home.
■ Have some granola bars and/or fruit available in case hunger becomes an issue.
■ Always have water, and possibly other beverages, on hand to assuage thirst from the hard work you are putting them through.
Number of full lifecycle iterations
Just as a loose rule of thumb from our experience, the typical number of full UX engineering cycle iterations per version or release is about three, but resource constraints often limit it to fewer iterations. In many projects you can expect only one iteration. Of course, any iterations are better than none.
14.8.3 Informed Consent
As practitioners collecting empirical data involving human subjects, we have certain legal and ethical responsibilities. There are studies, of course, in which harm could come to a human participant, but the kinds of data collection performed during formative evaluation of an interaction design are virtually never of this kind.
Nonetheless, we have our professional obligations, which center on the informed consent form, a document to establish explicitly the rights of your participants and which also serves as legal protection for you and your
organization. Therefore, you should always have all participants, anyone from whom you collect data of any kind, sign an informed consent form regardless of whether data are collected in the lab, in the field, or anywhere else.
Informed consent permission application
Your preparation for informed consent begins with an application to your institutional review board (IRB), an official group within your organization responsible for the legal and ethical aspects of informed consent (see later). The evaluator or project manager should prepare an IRB application that typically will include:
■ summary of the evaluation plan
■ statement of complete evaluation protocol
■ statement of exactly how human subjects will be involved
■ your written subject/participant instructions
■ a copy of your informed consent form
■ any other standard IRB forms for your organization
Because most UX evaluation does not put participants at risk, the applications are usually approved without question. The details of the approval process vary by organization, but it can take several weeks and can require changes in the documents. The approval process is based on a review of the ethical and legal issues, not the quality of the proposed evaluation plan.
Informed consent form
The informed consent form, an important part of your IRB application and an important part of your lab-based UX evaluation, is a requirement; it is not optional. The informed consent form is to be read and signed by each participant and states that the participant is volunteering to participate in your evaluation, that you are taking data that the participant helped generate, and that the participant gives permission to use the data (usually with the provision that the participant's name or identity will not be associated with the data), that the participant understands the evaluation is in no way harmful, and that the participant may discontinue the session at any time. The consent form may also include non-disclosure requirements.
This form must spell out participant rights and what you expect
the participants to do, even if there is overlap with the general instructions sheet. The form they sign must be self-standing and must tell the
whole story.
Be sure that your informed consent form contains:
■ a statement that the participant can withdraw anytime, for any reason, or for no reason at all
■ a statement of any foreseeable risks or discomforts
■ a statement of any benefits (e.g., educational benefit or just the satisfaction of helping make a good design) or compensation to participants (if there is payment, state exactly how much; if not, say so explicitly)
■ a statement of confidentiality of data (that neither the name of the participant nor any other kind of identification will be associated with data after it has been collected)
■ all project/evaluator contact information
■ a statement about any kind of recording (e.g., video, audio, photographic, or holodeck) involving the participant you plan to make and how you intend to use it, who will view it (and not), and by what date it will be erased or otherwise destroyed
■ a statement that, if you want to use a video clip (for example) from the recording for any other purpose, you will get their additional approval in writing
■ clear writing in understandable language
An example of a simple informed consent form is shown in Figure 14-2.
Informed consent may or may not also be required in the case where your participants are also organization employees. In any case you should have two copies of the consent form ready for reading and signing by participants when they arrive. One copy is for the participant to keep.
14.8.4 Other Paperwork
General instructions
In conjunction with developing evaluation procedures, you, as the evaluator, should write introductory instructional remarks that will be read uniformly by each participant at the beginning of the session. All participants thereby start with the same level of knowledge about the system and the tasks they are to perform. This uniform instruction for each participant will help ensure consistency across the test sessions.
These introductory instructions should explain briefly the purpose of the evaluation, tell a little bit about the system the participant will be using, describe what the participant will be expected to do, and the procedure to be followed by the participant. For example, instructions might state that a
participant will be asked to perform some benchmark tasks that will be given by the evaluator, will be allowed to use the system freely for awhile, then will be
Informed Consent for Participant of Development Project
<Name of your development organization>
<Date or version number of form>
Title of Project: <Project title>
Project team member(s) directly involved: <Team member names>
Project manager: <Project manager name>
I. THE PURPOSE OF YOUR PARTICIPATION IN THIS PROJECT
As part of the <project title> project, you are invited to participate in evaluating and improving various designs of <name of system or product>, <description of system or product>.
II. PROCEDURES
You will be asked to perform a set of tasks using the <name of system or product>. These tasks consist of <description of range of tasks>. Your role in these tests is to help us evaluate the designs. We are not evaluating you or your performance in any way. As you perform various tasks with the system, your actions and comments will be noted and you will be asked to describe verbally your learning process. You may be asked questions during and after the evaluation, in order to clarify our understanding of your evaluation. You may also be asked to fill out a questionnaire relating to your usage of the system.
The evaluation session will last no more than four hours, with the typical session being about two hours. The tasks are not very tiring, but you are welcome to take rest breaks as needed. If you prefer, the session may be divided into two shorter sessions.
III. RISKS
There are no known risks to the participants of this study.
IV. BENEFITS OF THIS PROJECT
Your participation in this project will provide information that may be used to improve our designs for <name of system or product>. No guarantee of further benefits has been made to encourage you to participate. (Change this, if a benefit such as payment or a gift is offered.) You are requested to refrain from discussing the evaluation with other people who might be in the candidate pool from which other participants might be drawn.
V. EXTENT OF ANONYMITY AND CONFIDENTIALITY
The results of this study will be kept strictly confidential. Your written consent is required for the researchers to release any data identified with you as an individual to anyone other than personnel working on the project. The information you provide will have your name removed and only a subject number will identify you during analyses and any written reports of the research.
The experiment may be videotaped. If it is taped, the tapes will be stored securely, viewed only by the experimenters and erased after 3 months. If the experimenters wish to use a portion of your videotape for any other purpose, they will get your written permission before
using it. Your signature on this form does not give them permission to show your videotape to anyone else.
VI. COMPENSATION
Your participation is voluntary and unpaid. (Change this, if a benefit such as payment or a gift is offered.)
VII. FREEDOM TO WITHDRAW
You are free to withdraw from this study at any time for any reason.
VIII. APPROVAL OF RESEARCH
This research has been approved, as required, by the Institutional Review Board < or the name of your review committee> for projects involving human subjects at <your organization>.
IX. PARTICIPANT RESPONSIBILITIES AND PERMISSION
I voluntarily agree to participate in this study, and I know of no reason I cannot participate. I have read and understand the informed consent and conditions of this project. I have had all my questions answered. I hereby acknowledge the above and give my voluntary consent for participation in this project. If I participate, I may withdraw at any time without penalty. I agree to abide by the rules of this project.
Signature Date
Name (please print) Contact: phone or address or email
Figure 14-2
Sample informed consent form for participants.
given some more benchmark tasks, and finally will be asked to complete an exit questionnaire.
In your general instructions to participants, make it clear that the purpose of the session is to evaluate the system, not to evaluate them. You should say explicitly "You are helping us evaluate the system-we are not evaluating you!" Some participants may be fearful that if somehow their performance is not up to "expectations," participation in this kind of test session could reflect poorly on them or even be used in their employment performance evaluations (if, for example, they work for the same organization that is designing the interface they are helping evaluate). They should be reassured that this is not the case. This is where it is important for you to reiterate your guarantee of confidentiality with respect to individual information and anonymity of data.
The instructions may inform participants that you want them to think aloud while working or, for example, may indicate that they can ask the evaluator questions at any time. The expected length of time for the evaluation session, if known (the evaluator should have some idea of how long a session will take after performing pilot testing), should also be included. Finally, you should always say, clearly and explicitly, that the participant is free to leave at any time.
Print out and copy the general instructions so that you can give one to each participant.
Non-disclosure Agreements (NDAs)
Sometimes an NDA is required by the developer or customer organizations to protect the intellectual property contained in the design. If you have an NDA, print out copies for reading, signing, and sharing with the participant.
Questionnaires and surveys
If your evaluation plan includes administration of one or more participant questionnaires, make sure that you have a good supply available. It is best to keep blank questionnaires in the control room or away from where a newly arriving participant could read them in advance.
Data collection forms
Make up a simple data collection form in advance. Your data collection form(s) should contain fields suitable for all the types of quantitative data you collect, plus probably separate forms for recording critical incidents and UX problems observed during the sessions. The latter should include spaces for the kind of supplementary data you like to keep, including the associated task, the effect on the user (e.g., minor or task-blocking), guidelines involved, the potential cause of the problem in the design, relevant designer knowledge (e.g., how it was supposed to work), etc. Keep your data collection forms simple and easy to use on the fly. Consider a spreadsheet form on a laptop.
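If you keep your forms on a laptop, a spreadsheet template is easy to set up in advance. The following is a minimal sketch in Python of one way to generate such a template as a CSV file; the column names are only illustrative assumptions and should be adapted to the supplementary data you actually keep.

# Minimal sketch of a critical-incident / UX-problem logging form as a
# CSV spreadsheet template. All column names are illustrative assumptions,
# not a prescribed format.
import csv

COLUMNS = [
    "session_id",          # which evaluation session
    "participant_id",      # anonymized participant code
    "task",                # associated benchmark or unmeasured task
    "timestamp",           # when the critical incident occurred
    "critical_incident",   # brief description of what happened
    "effect_on_user",      # e.g., "minor" or "task-blocking"
    "guidelines_involved", # any design guidelines implicated
    "potential_cause",     # suspected cause in the design
    "designer_intent",     # how it was supposed to work
    "notes",
]

def make_blank_form(path="ux_problem_log.csv"):
    """Write an empty CSV with only the header row so observers can log on a laptop."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(COLUMNS)

if __name__ == "__main__":
    make_blank_form()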
14.8.5 Training Materials
Use training materials for participants only if you anticipate that a user's manual, quick reference cards, or any sort of training material will be available to and needed by users of the final system. If you do use training materials in evaluation, make the use of these materials explicit in the task descriptions.
If extensive participant training is required, say for an experienced participant role, it should have been administered in advance of evaluation. In general, training the user how to use a system during the evaluation session must be avoided unless you are evaluating the training. If the materials are used more as reference materials than training materials, participants might be given time to read any training material at the beginning of the session or might be given the material and told they can refer to it, reading as necessary to find information as needed during tasks. The number of times participants refer to the training material, and the amount of assistance they are able to obtain from the material, for example, can also be important data about overall UX of the system.
14.8.6 Planning Room Usage
As part of the evaluation plan for each major set of evaluation sessions, you need to document the configurations of rooms, equipment connections, and evaluator roles plus the people in these roles. Post diagrams of room and equipment setups so you do not have to figure this out at the last minute, when participants are due to arrive.
14.8.7 Ecological Validity in Your Simulated Work Context
Thomas and Kellogg (1989) were among the first to warn us of the need for realistic contextual conditions in usability testing. If an element of work or usage context could not be addressed in the usability lab, they advised us to leave the lab and seek other ways to assess these ecological concerns. The challenge is to ensure that usability evaluation conditions reflect real-world usage conditions well enough to evoke the corresponding kinds of user performance and behavior. Your response to this challenge is especially important if you are addressing issues beyond the usual UX concepts to the full user experience and the out-of-the-box experience.
How do you know what you need for ecological validity? Usage or design scenarios are a good source of information about the props and roles needed for tasks. Does your service agent user talk with a person through a hole in a glass panel or while sitting across a desk? Do they talk with patients, clients, or customers on the telephone? How does holding a telephone affect simultaneous computer task performance? Have your props and task aids ready at hand when the sessions begin.
One interesting "far-out" example of a prop for ecological validity is the "third age suit" developed at Loughborough University and used by architects, automobile designers, and others whose users include older people. The suit is like an exoskeleton of Velcro and stiff material, limiting mobility and simulating stiffness, osteoarthritis, and other confining and restricting conditions. New designs can be evaluated with this prop to appreciate their usability by older populations.
The early A330 Airbus-An example of the need for ecological validity in testing
We experienced a real-world example of a product that could have benefited enormously from better ecological validity in its testing. We traveled in an A330 Airbus airplane when that model first came out; our plane was 1 week old. (Advice: for many reasons, do not be the first to travel in a new airplane design.) We were told that a human-centered approach was taken in the A330 Airbus design, including UX testing of buttons, layout, and so on of the passenger controls for the entertainment system. Apparently, though, they did not do enough in situ testing. Each passenger had a personal viewing screen for movies and other entertainment, considered an advantage over the screens hanging from the ceiling. The controls for each seat were more or less like a TV remote control, only tethered with a "pull-out" cord. When not in use, the remote control snapped into a recessed space on the seat arm rest. Cool, eh?
The LCD screen had nice color and brightness but a low acceptable viewing angle. Get far off the axis (away from perpendicular to the screen) and you lose all brightness and, just before it disappears altogether, you see the picture as a color negative image. But the screen is right in front of you, so no problem, right? Right, until in a real flight the person in front of you tilts back the seat. Then we could barely see it. We could tell it was affecting others, too, because we could see many people leaning their heads down into an awkward position just to see the screen. After a period of fatigue, many people gave up, turned it off, and leaned back for comfort. If the display screen was used in UX testing, and we have to assume it was, the question of tilting the seat never entered the
discussion, probably because the screen was propped up on a stand in front of each participant in the UX lab. Designers and evaluators just did not think about passengers in front of screen users tilting back their seats. Testing in a more realistic setting, better emulating the ecological conditions of real flight, would have revealed this major flaw.
It does not end there. Once the movie got going, most people stowed the remote control away in the arm rest. But, of course, what do you also rest on an arm rest? Your arm. And in so doing, it was easy to bump a button on the control and make some change in the "programming." The design of this clever feature almost always made the movie disappear at a crucial point in the plot. And because we were syncing our movie viewing, the other one of us had to pause the movie while the first one had to go back through far too many button pushes to get the movie back and fast-forwarded to the current place.
It still does not end here. After the movie was over (or for some viewers, after they gave up) and we wanted to sleep, a bump of the arm on the remote caused the screen to light up brightly, instantly waking us to the wonderful world of entertainment. The flight attendant in just 1 week with this design had already come up with a creative workaround. She showed us how to pull the remote out on its cord and dangle it down out of the way of the arm rest. Soon, and this is the UX-gospel truth, almost everyone in the plane had a dangling remote control swinging gracefully in the aisle like so many synchronized reeds in the breeze as the plane moved about on its course. All very reminiscent of a wonderful Gary Larson cartoon showing a passenger sitting in flight. Among the entertainment controls on his arm rest is one switch, labeled "Wings stay on" and "Wings fall off." The caption reads, "Fumbling for his recline button, Ted unwittingly instigates a disaster."
The Social Security Administration (SSA) Model District Office (MDO)-An extreme and successful example
In the mid-1990s we worked extensively with the SSA in Baltimore, mainly in UX lifecycle training. A system we worked with there is used by a public service agent who serves clients, people who walk in off the street or call on the phone. The agent is the user, but the clients are essential to usage ecology; client needs provide the impetus for the user to act, the need for a system task. For evaluation then, they need people to act as clients, perhaps using scripts that detail the services needed, which then drive the computer-based tasks of the agent. And they need telephones and/or "offices" into which clients can come for service.
We worked with a small group pioneering the introduction of usability engineering techniques into an "old school," waterfall-oriented, mainframe
software development environment. Large Social Security systems were migrating slowly from mainframes (in Baltimore) plus terminals (by the thousands over the country) to client-server applications, some starting to run on PCs, and they wanted UX to be a high priority. Sean Wheeler was the group spark plug and UX champion, strongly supported by Annette Bryce and Pat Stoos.
What impressed us the most about this organization was their Model District Office. A decade earlier, as part of a large Claims Modernization Project, a program of system design and training to "revolutionize the way that SSA serves the public," they had built a complete and detailed model of a Social Security Administration district office from middle America right in the middle of
the SSA headquarters building in Baltimore. The MDO, with its carpeting, office furniture, and computer terminals, right down to the office lamps and pictures on the wall, was indistinguishable from a typical agency office in
a typical town. They brought in real SSA employees from field offices from all over the United States to sit in the MDO to pilot and test new systems and procedures.
When SSA was ready to focus on UX, the MDO provided a perfect evaluation environment; simply put, it was an extreme and successful example of leveraging ecological validity for application development and testing, as well as for user training. In the end, the group created a UX success story upstream against the inertia and enormous weight of the rest of the organization and ended up winning a federal award for the quality of their design!
As a testament to their seriousness about ecological validity and UX, the SSA was spending about $1 million a year to bring employees in to stay and work at the MDO, sometimes for a few months at a time. Their cost-justification calculations showed that the activity was saving many times that amount.
14.8.8 The UX Evaluation Session Work Package
To summarize, as you do the evaluation preparation and planning described in this chapter, you need to gather your evaluation session work package, all the materials you will need in the evaluation session. Bring this evaluation session work package to each evaluation session.
Examples of package contents include:
• The evaluation configuration plan, including diagrams of rooms, equipment, and people in evaluation roles
• General instruction sheets
• Informed consent forms, with participant names and date entered
• Any non-disclosure agreements
• All questionnaires and surveys, including any demographic survey
• All printed benchmark task descriptions, one task per sheet of paper
• All printed unmeasured task descriptions (these can be listed several to a page)
• For each evaluator, a printout (or laptop version) of the UX targets associated with the day's sessions
• All data collection forms, on paper or on laptops
• Any props needed to support tasks
• Any training materials to be used as part of the evaluation
• Any compensation to be given out (e.g., money, gift cards, T-shirts, coffee mugs, used cars)
• Any special instructions to watch out for particular parts of the design, evaluation scripts, things to do before each participant session (e.g., resetting browser caches so that no auto-complete entries from the previous participant's session interfere with the current session), etc.
Why should benchmark tasks be printed just one per sheet of paper? What about the trees? We want our participants to focus on just the task at hand. If you give them descriptions of additional tasks, they will read them prematurely and distract themselves by thinking about those, too. It is just human nature. You need to control their mental focus.
Also, focusing on the participant, it is possible that not all participants will complete all tasks. There is no need for anyone to see that they have not accomplished them all. If they see only one at a time, they will never know and never feel bad.
14.9 DO FINAL PILOT TESTING: FIX YOUR WOBBLY WHEELS
If your UX evaluation plan involves using a prototype, low or high fidelity, make sure it is robust before you do anything more to prepare for your UX evaluation, regardless of whether your evaluation is lab based. If the evaluation team has not yet performed thorough pilot testing of the product or prototype, now is the time to give it a final shakedown. Exercise the prototype thoroughly. Pilot testing is essential to remove any major weaknesses in the prototype and any "show stopper" problems.
You need to be confident that the prototype will not "blow up" unceremoniously the first time it is brought into the proximity of real user
participants. It is embarrassing to have to apologize and dismiss a participant because the hardware or software wheels came off during an evaluation session. And, because good representative participants may be hard to find, you do not want to add to your time and expense by "burning" user participants unnecessarily.
While pilot testing of the prototype may be obvious to prepare for lab-based testing, it is similarly important prior to critical reviews and UX inspections
by outside human-computer interaction (HCI) experts. These experts do not work for free, and you will not want things going amiss during a session, causing delays while a hefty hourly fee is being paid for expert advice.
In addition to shaking down your prototype, think of your pilot testing as a dress rehearsal to be sure of your lab equipment, benchmark tasks, procedures, and personnel roles:
• Make sure all necessary equipment is available, installed, and working properly, whether it be in the laboratory or in the field.
• Have someone other than the person(s) who created the task descriptions run through the evaluation tasks completely at least once, using the intended hardware and software (i.e., the interface prototype).
• Make sure the prototype supports all the necessary user actions.
• Make sure the participant instructions and benchmark task descriptions are worded clearly and unambiguously.
• Make sure all session materials, such as any instruction sheets, the informed consent form, and so on, are sufficient.
• Make sure that the metrics the benchmark tasks are intended to produce are practically measurable. Counting the number of tasks completed in either 5 seconds or 5 hours, for example, is not reasonable.
• Be sure that everyone on the evaluation team understands his or her role.
• Be sure that all the roles work together in the fast-paced events associated with user interaction.
14.10 MORE ABOUT DETERMINING THE RIGHT NUMBER OF PARTICIPANTS
One of your activities in preparing for formative evaluation is finding appropriate users for the evaluation sessions. In formal summative evaluation, this part of the process is referred to as "sampling," but that term is not appropriate here because what we are doing has nothing to do with the implied statistical relationships and constraints.
14.10.1 How Many Are Needed? A Difficult Question
How many participants are enough? This is one of those issues that some novice UX practitioners take so seriously and yet it is a question to which there is no definitive answer. Indeed, there cannot be one answer. It depends so much on the specific context and parameters of your individual situation that you have to answer this question for yourself each time you do formative evaluation.
There are studies that lead UX gurus to proclaim various rules of thumb, such as "three to five users are enough to find 80% of your UX problems," but when you see how many different assumptions are used to arrive at those "rules" and how few of those assumptions are valid within your project, you realize that this is one place in the process where it is most important for you to use your own head and not follow vague generalizations.
And, of course, cost is often a limiting factor. Sometimes you just get one or two participants in each of one or two iterations and you have to be satisfied with that because it is all you can afford. The good news is that you can do a lot with only a few good participants. There is no statistical requirement for large numbers of "subjects" as there is for formal summative evaluation; rather, the goal is to focus on extracting as much information as possible from every participant.
14.10.2 Rules of Thumb Abound
There are bona fide studies that predict the optimum number of participants needed for UX testing under various conditions. Most "rules of thumb" are based empirically but, because they are quoted and applied so broadly without regard to the constraints and conditions under which the results were obtained, these rules have become among the most folklorish of folklore out there.
Nielsen and Molich (1990) published an early paper about the number of users/participants needed to find enough UX problems; they found that 80% of their known UX problems could be detected with four to five participants and that the most severe problems were usually found with the first few participants. Virzi (1990, 1992) more or less confirmed Nielsen and Molich's study.
Nielsen and Landauer (1993) found that detection of problems as a function of the number of participants is well modeled as a Poisson process, supporting the ability to use early results to estimate the number of problems left to be found and the number of additional participants needed to find a certain percentage.
Depending on the circumstances, though, some say that even five participants is nowhere near enough (Hudson, 2001; Spool & Schroeder, 2001), especially for complex applications or large Websites. In practice, each of these numbers has
proven to be right for some set of conditions, but the question is whether they will work for you in your evaluation.
14.10.3 An Analytic Basis for the Three to Five Users Rule
The underlying probability function
In Figure 14-3 you can see graphs, related to the binomial probability distribution, of cumulative percentages of problems likely to be found for a given number of participants used and at various detection rates, adapted from Lewis (1994).
Y-axis values in these curves are for "discovery likelihood," expressed as a cumulative percentage of problems likely to be found, as a function of the number of participants or evaluators used. These curves are based on the probability formula:
discovery likelihood (cumulative percentage of problems likely to be found) = 1 - (1 - p)^n, where n is the number of participants used (X-axis values) and p is what we call the "detection rate" of a certain category of participants.
As an example, this formula tells us that a sample size of five participant evaluators (n) with an individual detection rate (p) of at least 0.30 is sufficient to find approximately 80% of the UX problems in a system.
Figure 14-3
Graphs of cumulative percentages of problems likely to be found for a given number of participants used and at various detection rates [adapted from Lewis (1994)].
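For readers who want to experiment with these numbers, here is a minimal Python sketch of the formula above; the values used reproduce the five-participant, p = 0.30 example from the text and are otherwise only illustrative.

# Minimal sketch of the discovery-likelihood formula discussed above:
# cumulative fraction of problems likely to be found = 1 - (1 - p)^n.
def discovery_likelihood(p: float, n: int) -> float:
    """Cumulative fraction of problems likely to be found by n participants,
    each with individual detection rate p (assumes equal detectability)."""
    return 1 - (1 - p) ** n

if __name__ == "__main__":
    # Reproduces the example in the text: five participants at p = 0.30
    # find roughly 80% of the problems.
    print(f"{discovery_likelihood(0.30, 5):.2f}")  # ~0.83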
The old balls-in-an-urn analogy
Let us think of an interaction design containing flaws that cause UX problems as analogous to the old probability setting of an urn containing various colored balls. Among an unknown number of balls of all colors, suppose there are a number of red balls, each representing a different UX problem.
Suppose now that a participant or evaluator reaches in and grabs a big handful of balls from the urn. This is analogous to an evaluation session using a single expert evaluator, if it is a UX inspection evaluation, or a single participant, if it is a lab-based empirical session. The number of red balls in that handful is the number of UX problems identified in the session.
In a UX inspection, it is the expert evaluator, or inspector, who finds the UX problems. In an empirical UX test, participants are a catalyst for UX problem detection-not necessarily detecting problems themselves but encountering critical incidents while performing tasks, enabling evaluators to identify the corresponding UX problems. Because the effect is essentially the same, for simplicity in this discussion we will use the term "participant" for both the inspector and the testing participant and "find problems" for whatever way the problems are found in a session.
Participant detection rates
The detection rate, p, of an individual participant is the percentage of existing problems that this participant can find in one session. This corresponds to the number of red balls a participant gets in one handful of balls. This is a function of the individual participant. For example, in the case of the balls in the urn, it might be related to the size of the participant's hand. In the UX domain, it is perhaps related to the participant's evaluation skills.
In any case, in this analysis, if a participant has a detection rate of p = 0.20, it means that this participant will find 20% of the UX problems existing in the design. The number of participants with that same individual detection rate who, in turn, reach into the urn is the value on the X axis. The curve shown with a green line is for a detection rate of p = 0.20. The other curves are for different detection rates, from p = 0.15 up to p = 0.45.
Most of the time we do not even know the detection rates of our participants. To calculate the detection rate for a participant, we would have to know how many total UX problems exist in a design. But that is just what we are trying to find out with evaluation. You could, we guess, run a testing session with the participant against a design with a certain number of known flaws. But that
would tell you that participant's detection rate for that day, in that context, and for that system. Unfortunately, a given participant's detection rate is not constant.
Cumulative percentage of problems to be found
The Y axis represents values of the cumulative percentage of problems to be found. Let us look at this first for just one participant. The curve for p = 0.20, for example, has a Y axis value of 20% for n = 1 (where the curve intersects the Y axis). This is consistent with our expectation that one participant with p = 0.20 will find 20% of the problems, or get 20% of the red balls, in the first session.
Now what about the "cumulative" aspect? What happens when the second participant reaches into the urn depends on whether you replaced the balls from the first participant. This analysis is for the case where each participant returns all the balls to the urn after each "session"; that is, none of the UX problems are fixed between participants.
After the first participant has found some problems, there are fewer new problems left to find by the second participant. If you look at the results with the two participants independently, they each help you find a somewhat different 20% of the problems, but there is likely to be overlap, which reduces the cumulative effect (the union of the sets of problems) of the two.
This is what we see in the curves of Figure 14-3 as the percentage of problems likely to be found drops off with each new participant (moving to the right on the X axis) because the marginal number of new problems found is decreasing. That accounts for the leveling off of the curves until, at some high number of participants, essentially no new problems are being found and the curve is asymptotically flat.
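To see this overlap effect concretely, here is a minimal Monte Carlo sketch of the urn analogy in Python. The pool of 100 equally detectable problems, the detection rate of 0.20, and the fixed random seed are illustrative assumptions, not data from any study; the point is only that the cumulative fraction of distinct problems found levels off as overlap grows, matching the flattening curves in Figure 14-3.

# Monte Carlo sketch of the balls-in-an-urn analogy, under the assumption
# that every problem is equally detectable and every participant finds each
# problem independently with probability p.
import random

def simulate_cumulative_discovery(num_problems=100, p=0.20,
                                  num_participants=10, seed=42):
    random.seed(seed)
    found = set()
    cumulative = []
    for _ in range(num_participants):
        # Each participant "grabs a handful": every problem is found
        # independently with probability p, whether or not someone
        # else already found it (balls are returned to the urn).
        session_finds = {i for i in range(num_problems) if random.random() < p}
        found |= session_finds
        cumulative.append(len(found) / num_problems)
    return cumulative

if __name__ == "__main__":
    for n, frac in enumerate(simulate_cumulative_discovery(), start=1):
        print(f"{n} participants: {frac:.0%} of problems found")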
Marginal added detection and cost-benefit
One thing we do notice in the curves of Figure 14-3 is that, despite the drop-off of effective detection rates, as you continue to add more participants you will continue to uncover more problems. At least for a while. Eventually, high detection rates coupled with high numbers of participants will yield results that asymptotically approach 100% in the upper right-hand part of the figure, and virtually no new problems will be found with subsequent participants.
But what happens along the way? Each new participant helps you find fewer new problems, but because the cost to run each participant is about the same, with each successive participant the process becomes less efficient (fewer new problems found for the same cost).
As a pretty good approximation of the cost to run a UX testing session with n participants, you have a fixed cost, a, to set up the session plus a variable cost, b, per participant: cost = a + bn. The benefit of running a UX testing session with n participants is the discovery likelihood. So the cost-benefit is the ratio benefit/cost, each as a function of n, or benefit/cost = (1 - (1 - p)^n) / (a + bn).
If you graph this function (with some specific values of a and b) against n = 1, 2, ..., you will see a curve that climbs for the first few values of n and then starts dropping off. The values of n around the peak of cost-benefit are the optimum (from a cost-benefit perspective) number of participants to use. The range of n for which the peak occurs depends on the parameters a, b, and p of your setup; your mileage can vary.
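As a concrete illustration, the following Python sketch evaluates this ratio for a range of n and reports where it peaks. The fixed cost a, per-participant cost b, and detection rate p are assumed values chosen only for illustration, not parameters from Nielsen and Landauer's study.

# Sketch of the cost-benefit ratio benefit/cost = (1 - (1 - p)^n) / (a + b*n),
# evaluated for n = 1..15 with purely illustrative parameter values.
def benefit_cost_ratio(n: int, p: float, a: float, b: float) -> float:
    return (1 - (1 - p) ** n) / (a + b * n)

if __name__ == "__main__":
    a, b, p = 500.0, 100.0, 0.30   # assumed, not from any study
    ratios = {n: benefit_cost_ratio(n, p, a, b) for n in range(1, 16)}
    best_n = max(ratios, key=ratios.get)
    for n, r in ratios.items():
        print(f"n={n:2d}  benefit/cost={r:.5f}")
    print(f"Peak at n = {best_n} for these parameters")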
Nielsen and Landauer (1993) showed that real data for both UX inspections and lab-based testing with participants did match this mathematical cost-benefit model. Their results showed that, for their parameters, the peak occurred for values of n around 3 to 5. Thus, the "three to five users" rule of thumb.
Assumptions do not always apply in the real world
This three-to-five users rule, with its tidy mathematical underpinning, can and does apply to many situations similar to the conditions Nielsen and Landauer (1993) used, and we believe their analysis brings insight into the discussion. However, we know there are many cases where it just does not apply.
For starters, all of this analysis, including the analogy with the balls-in-an-urn setting, depends on two assumptions:
• Each participant has a constant detection rate, p
• Each UX problem is equally likely to be found in testing
If UX problems were balls in an urn, our lives would be simpler. But neither of these assumptions is true and the UX life is not simple.
Assumptions about detection rates. Each curve in Figure 14-3 is for a fixed detection rate, and the cost-benefit calculation given earlier was based on a fixed detection rate, p. But the "evaluator effect" tells us that not only will different evaluators find different problems, but even the detection rate can vary widely over participants (Hertzum & Jacobsen, 2003).
In fact, a given individual does not even have a fixed "individual detection rate"; it can be influenced from day to day or even from moment to moment by how rested the participant is, blood caffeine and ethanol levels, attitude, the system, how the evaluators conduct the evaluation, what tasks are used, the evaluator's skills, and so on.
Also, what does it really mean for a testing participant to have a detection rate of p = 0.20? How long does it take in a session for that participant to achieve that 20% discovery? How many tasks? What kinds of tasks? What if that participant continues to perform more tasks? Will no more critical incidents be encountered after 20% detection is achieved?
Assumptions about problem detectability. The curves in Figure 14-3 are also based on an assumption that all problems are equally detectable (like all red balls in the urn are equally likely to be drawn out). But, of course, we know that some problems are almost obvious on the surface and other problems can be orders of magnitude more difficult to ferret out. So detectability, or likelihood of being found, can vary dramatically across various UX problems.
Task selection. One reason for the overlap in problems detected from one
participant to another, causing the cumulative detection likelihood to fall off with additional participants, as it does in Figure 14-3, is the use of prescribed tasks. Participants performing essentially the same sets of tasks are looking in the same places for problems and are, therefore, more likely to uncover many of the same problems.
However, if you employ user-directed tasks (Spool & Schroeder, 2001), participants will be looking in different places and the overlap of problems found could be much less. This keeps the benefit part of the curves growing linearly for more participants, causing your optimum number of participants to be larger.
Application system effects. Another factor that can torpedo the three-to-five users rule is the application system being evaluated. Some systems are very much larger than others. For example, an enormous Website or a large and complex word processor will harbor many more possibilities for UX problems than, say, a simple inter-office scheduling system. If each participant can explore only a small portion of such an application, the overlap of problems among participants may be insignificant. In such cases the cost-benefit function