Usability testing: 5 users is NOT enough
I have seen a lot of usability test results in my time, and I have also carried out a whole truckload of these tests myself. I find that a lot of people carrying out these tests are unaware of the statistical analysis needed to interpret the results. When I ask:
- "Why did you only use 5 people?"
The answer is often:
- "Because that's what Nielsen says to use"
The problem is that the Nielsen 5 user rule is correct only in very, very specific circumstances, and even then it is pushing the boat out a fair bit. The Nielsen rule says that the number of usability problems found in a test with n users is:
N(1 - (1 - L)^n)
where N is the total number of usability problems in the design, L is the proportion of usability problems discovered while testing a single user, and n is the number of users tested. The typical value of L is 31%, averaged across a large number of projects that Nielsen and Landauer studied.
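To see where the "5 users" figure comes from, the formula can be evaluated directly. This is a minimal sketch: the function name problems_found and the default N of 1.0 (so the result reads as a proportion) are my own choices for illustration, and L = 0.31 is the typical value quoted above.

```python
def problems_found(n_users, hit_rate=0.31, total_problems=1.0):
    """Expected problems found with n users: N * (1 - (1 - L)^n)."""
    return total_problems * (1 - (1 - hit_rate) ** n_users)


# With N = 1.0 the result is the proportion of all problems found.
for n in range(1, 11):
    print(f"{n:2d} users: {problems_found(n):.1%} of problems found")
```

With L = 0.31, five users come out at roughly 84% of problems found, which is the basis of the rule. Everything hinges on L really being 0.31 for your product and your users.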
Landauer and Nielsen conclude:
"As you add more and more users, you learn less and less because you will keep seeing the same things again and again. There is no real need to keep observing the same thing multiple times, and you will be very motivated to go back to the drawing board and redesign the site to eliminate the usability problems. After the fifth user, you are wasting your time by observing the same findings repeatedly but not learning much new."
5 users tend to show you what you already expect to see, but concluding that more users will not teach you anything new is just incorrect. A lot of people completely ignore the following caveat that Nielsen and Landauer give:
"This formula only holds for comparable users who will be using the site/product in fairly similar ways"
Now that really narrows things down, and means that in many cases, the rule is not applicable.
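One way to see how much this caveat matters is to turn the formula around and ask how many users are needed to find a given share of the problems. The sketch below assumes the same formula as above; the function name users_needed, the 80% target, and the lower values of L (0.10 and 0.05) are hypothetical, chosen only to show what happens when any single user reveals fewer of the problems, as you might expect with a more varied user group.

```python
import math


def users_needed(hit_rate, target=0.80):
    """Smallest n for which 1 - (1 - L)^n reaches the target proportion."""
    return math.ceil(math.log(1 - target) / math.log(1 - hit_rate))


# 0.31 is the typical L quoted above; 0.10 and 0.05 are hypothetical.
for hit_rate in (0.31, 0.10, 0.05):
    print(f"L = {hit_rate:.2f}: {users_needed(hit_rate)} users to find 80% of problems")
```

Under these assumptions, 5 users are enough at L = 0.31, but it takes 16 at L = 0.10 and 32 at L = 0.05.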
How can you tell that you have tested with enough users?
It all depends on:
- The results you are observing
- The range of target user groups
- The range of tasks that you want to observe
- What the results are going to be used for
- How many rounds of testing you intend on doing
- Whether your results are statistically significant
- The mission criticality of what you're testing
- And a whole lot more...
Laura Faulkner conducted a good in-depth study of the 5 user rule and shared the results in her paper "Beyond the 5 user assumption: benefits of increased sample sizes in usability testing". She found that confidence intervals in samples of five users were around 18.6%...ouch! She concludes:
"Although practitioners like simple directive answers such as the 5-user assumption, the only clear answer to valid usability testing is that the test users must be representative of the target population. The important and often complex issue, then, becomes defining the target population. There are strategies that a practitioner can employ to attain a higher accuracy rate in usability testing. One would be to focus testing on users with goals and abilities representative of the expected user population. When fielding a product to a general population, one should run as many users of varying experience levels and abilities as possible. Designing for a diverse user population and testing usability are complex tasks. It is advisable to run the maximum number of participants that schedules, budgets, and availability allow. The mathematical benefits of adding test users should be cited. More test users means greater confidence that the problems that need to be fixed will be found."
Look at Figure 1 in her paper, and commit it to memory for keeps. Do your own tests, and measure things accurately. What do you find? Recently, in a test of my own, I used 42 users and couldn't get a statistically significant result. Although system B appeared to be preferred over system A, it was too close a call: it could just as easily have gone the other way. I needed more users.
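As an illustration of what "too close a call" can look like, here is a minimal sketch of an exact two-sided binomial (sign) test on a preference split. The splits used (25 vs 17 and 30 vs 12) are hypothetical and are not the figures from my test. The function name is my own, and the sketch assumes every user expressed a preference for one system or the other.

```python
from math import comb


def two_sided_binomial_p(prefer_b, n_users):
    """Exact two-sided p-value for H0: each user is equally likely to prefer A or B."""
    # The distribution is symmetric under H0, so double the tail at or beyond
    # the larger of the two counts and cap the result at 1.
    extreme = max(prefer_b, n_users - prefer_b)
    tail = sum(comb(n_users, k) for k in range(extreme, n_users + 1)) / 2 ** n_users
    return min(1.0, 2 * tail)


# Hypothetical splits of 42 users.
for prefer_b in (25, 30):
    p = two_sided_binomial_p(prefer_b, 42)
    print(f"{prefer_b} of 42 prefer B: p = {p:.3f}")
```

A 25 vs 17 split looks like a win for B, but the p-value is well above 0.05, so a split like that is entirely plausible even if users have no real preference. A 30 vs 12 split would be a different story.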
Also, it helps to have a hypothesis...
It's crucial to test with a clear hypothesis in mind. If there is none, you will find patterns that really are there in the data but are meaningless. This statistical phenomenon is called "Clustering". In large random datasets, you'll find clusters of the same type of information, because there are a lot more clustered patterns out there than non-clustered ones. It doesn't necessarily mean that you have found anything meaningful. For example, if you ask 100 people what their star sign is and a whole bunch of other questions, you might find that all Leos have red shoes.
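A closely related way to put numbers on this is to ask how likely it is that at least one of many unplanned comparisons comes out "significant" purely by chance. The sketch below is not a simulation of clustering itself; it uses the standard multiple-comparisons calculation, and it assumes the comparisons are independent and each is tested at the same significance level. The function name and the choice of 20 comparisons are hypothetical.

```python
def chance_of_spurious_finding(n_comparisons, alpha=0.05):
    """P(at least one false positive) across independent comparisons at level alpha."""
    return 1 - (1 - alpha) ** n_comparisons


# Going fishing through more questions makes a chance "finding" more likely.
for n in (1, 5, 20, 100):
    print(f"{n:3d} comparisons: {chance_of_spurious_finding(n):.0%} chance of a spurious result")
```

With 20 comparisons at the 5% level, the chance of at least one spurious "finding" is already about 64%, which is why a hypothesis decided in advance matters.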
Where to start with stats
It's good to understand stats if you're going to be measuring pretty much anything. If you don't, you may be misled by what you are seeing. A sound place to start:
