Any UX-research study aims to answer general questions about our design or about our users. What percentage of our user population will be able to subscribe to our newsletter? What major usability issues will people encounter on our site? Is design A more usable than design B for our target audience? But any time we set up a UX-research study, whether quantitative or qualitative, there is a danger that it will not reflect the reality we want to capture because the study is poorly designed.
There are two big types of study-design errors:
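- Internal-validity errors, which bias participants toward a certain response or behavior
- External-validity errors, where the participants or the study setup do not represent the real-life situation in which the design is used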
We will discuss each of these separately. But before we do, let’s note that validity is separate from reliability. Reliability of a study simply means that you will get the same result if you repeat the study. In other words, findings are not random. There are plenty of statistical methods to calculate the degree of study reliability, and the main way to increase reliability is to test more participants. But reliability without validity is no good: a study with high reliability and low validity is one where you get a really good measurement of the wrong thing.
Internal Validity for UX Studies
Suppose every participant in a study tests site A first and then site B, and site B comes out ahead. Does that mean site B is more usable? Not necessarily. This study setup favors design B because, by the time they get to it, participants will already be used to the testing situation and familiar with the task domain. If they’re testing car-rental sites, they will already know what an LDW (loss-damage waiver) is when they get to site B, and they may have certain expectations regarding the steps of the rental process. They will also know what you expect them to do and how they’re supposed to perform the task. Therefore, this study lacks internal validity. (The usual fix for this problem is to alternate which site goes first, and have half of the users try site B first.)
Definition: A study has internal validity if it does not favor or encourage any particular participant response or behavior.
Internal validity is an issue in both qualitative and quantitative studies. With moderated qualitative studies, the facilitator may inadvertently bias participants or elicit a certain response from them. For example, even a simple question such as “Did you find the checkout difficult?” may invalidate the study findings, because participants are primed to think of difficulties, so they may identify more than normal (like with Richard Nixon’s “I am not a crook” statement).
With quantitative studies, lack of internal validity may produce results that skew in one direction but do not reflect reality. You may, for instance, discover in a benchmarking study that your time on task is better on a redesigned version of the site than on the original, and you may infer that you did a good job with the redesign, when in fact the difference was due to different study protocols: the original test used the think-aloud protocol, but the test of the redesign did not. (And thinking aloud does take some extra time, so it may lead to longer task times.)
In this example, the protocol is a confounding variable: a hidden variable that can affect the results of your study, but that you didn’t take into account when you designed the study.
External validity is about how naturalistic your study is.
If you’re designing a site for seniors and recruit study participants from the general population, will that study be valid? Will it tell you something relevant about your real audience? Possibly not, because younger participants are likely to behave differently than older ones. Or, if you’re testing a mobile design on a desktop, will your findings generalize to the use of the design in the wild? Maybe yes, maybe no — it’s impossible to know for sure (unless you do another study). In both these situations, the studies are missing external validity.
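Definition: A study has external validity if the participants and the study setup are representative of the real-life situation in which the design is used.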
The concept of external validity also applies to both qualitative and quantitative studies — for obvious reasons.
Recommendations for Study Design
Randomization is essential for ensuring internal validity.
- Use random ordering of tasks.
Task order can bias task responses. At the beginning of a study, people are usually new to both the study environment and to the system that they’re testing. It’s normal for them to take longer to perform the first tasks in a session and perhaps make more errors than normal. On the other hand, tasks that are shown at the end of the session might see the effect of participant fatigue.
That is why we strongly recommend that in any test, whether qualitative or quantitative, you randomize the order of the tasks as much as possible. (Sometimes, however, following this recommendation may not be entirely feasible. For example, if the tasks are Log in and Deposit check, it may not be possible for Deposit check to come before Log in.)
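As a rough illustration of randomizing task order while still respecting tasks that must come in a fixed sequence, here is a minimal Python sketch. The function name `randomized_task_order`, the `must_precede` parameter, and the extra task names are hypothetical and chosen only for this example; the approach (shuffle, then retry until the constraints hold) is one simple option, not a prescribed method.

```python
import random


def randomized_task_order(tasks, must_precede=(), rng=None, max_tries=10_000):
    """Return a random task order that respects (earlier, later) constraints.

    tasks        -- list of task names (every constrained task must be in it)
    must_precede -- pairs such as ("Log in", "Deposit check")
    rng          -- optional random.Random instance, useful for reproducibility
    """
    rng = rng or random.Random()
    order = list(tasks)
    for _ in range(max_tries):
        rng.shuffle(order)
        # Accept the shuffle only if every "earlier" task precedes its "later" task.
        if all(order.index(a) < order.index(b) for a, b in must_precede):
            return order
    raise ValueError("no valid order found; constraints may be contradictory")


# Hypothetical task list; only "Log in" and "Deposit check" come from the text.
tasks = ["Log in", "Deposit check", "Find a branch", "Read fee schedule"]
print(randomized_task_order(tasks, [("Log in", "Deposit check")]))
```

Generating a fresh order for each participant keeps first-task unfamiliarity and end-of-session fatigue from always falling on the same tasks.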
- If your study contrasts two or more conditions (e.g., you want to compare your site with a competitor site) and each participant will be exposed to all conditions (i.e., a within-subject design), you should counterbalance or randomize the order in which each participant is exposed to those conditions (for instance, the order in which they see your site and your competitor’s).
This recommendation is related to the previous one about randomizing the task order. However, if you’re testing, say, 2 ecommerce sites, it may be unrealistic or infeasible to ask the participant to shop on site 1, then add an item to a wishlist on site 2, then go back to site 1 and subscribe to the newsletter, then shop on site 2. Such a setup would be disorienting for participants, and it would be especially problematic if you want, for instance, to collect post-test questionnaires such as SUS and NPS for the two designs at the end of the session.
In such cases, we recommend that you group all the tasks for design 1 together and all the tasks for design 2 together. However, you should randomize the order in which participants see the two designs, so that some participants see design 1 first and others see design 2 first. And, within each design, the order of the tasks should be randomized.
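The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_session_plans`, the participant IDs, and the task names are hypothetical, and cycling through all possible design orders is one common way to counterbalance (it is fully balanced only when the number of participants is a multiple of the number of orders).

```python
import itertools
import random


def build_session_plans(participants, designs, tasks, seed=None):
    """Counterbalance design order across participants and shuffle tasks per design.

    Returns {participant: [(design, [task, ...]), ...]}.
    """
    rng = random.Random(seed)
    # All possible design orders; for 2 designs this is [1, 2] and [2, 1].
    orders = list(itertools.permutations(designs))
    people = list(participants)
    rng.shuffle(people)  # randomly decide who gets which design order
    plans = {}
    for i, person in enumerate(people):
        design_order = orders[i % len(orders)]  # cycle so orders are used equally
        plans[person] = [
            (design, rng.sample(tasks, k=len(tasks)))  # fresh task order per design
            for design in design_order
        ]
    return plans


# Hypothetical participants and tasks, for illustration only.
plans = build_session_plans(
    participants=[f"P{n}" for n in range(1, 9)],
    designs=["Design 1", "Design 2"],
    tasks=["Shop", "Add to wishlist", "Subscribe to newsletter"],
    seed=7,
)
for person, plan in sorted(plans.items()):
    print(person, plan)
```

Because each participant finishes one design before starting the other, post-test questionnaires can be tied cleanly to a single design.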
- Control the study setup from one session to the next and look for confounding variables: hidden factors that may affect your results.
For example, assume a researcher is interested in comparing two sites and uses a between-subject design. She decides to study site A with the participants in the morning sessions and site B with those participants coming for afternoon sessions. If she ends up finding that participants perform better on, say, site A, it could be because site A is better, or it could be because people are less tired in the morning.
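Similarly, if a colleague helps you facilitate the study and you split the sites between you (you take the sessions with site A and she takes the sessions with site B), the facilitator is a hidden variable. It could be that one facilitator’s style is more biasing than the other’s, or that one facilitator is naturally a more pleasant person and participants feel more talkative and relaxed with her.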
When you put together a benchmarking program for your organization, planning carefully for internal validity is essential. You have to carefully document your study conditions (task wording, study protocol, whether think-aloud was used, and so on) so that they can be replicated in the later studies that you will run to measure design improvements over time. Otherwise, a difference between the current version of a system and a prior version may simply be due to the study setup rather than to usability improvements.
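One way to keep known factors such as time of day or facilitator from lining up with a particular site in a between-subject study is to split each combination of those factors evenly across the sites. The Python sketch below illustrates the idea; the function `assign_sites`, the session fields (`id`, `slot`, `facilitator`), and the sample schedule are all hypothetical.

```python
import random
from collections import defaultdict


def assign_sites(sessions, sites=("A", "B"), seed=None):
    """Assign a site to each session, balanced within each slot/facilitator pair.

    sessions -- list of dicts with keys "id", "slot", and "facilitator"
    Returns {session_id: site}.
    """
    rng = random.Random(seed)
    # Group sessions that share the same potential confounds.
    strata = defaultdict(list)
    for s in sessions:
        strata[(s["slot"], s["facilitator"])].append(s["id"])

    assignment = {}
    for ids in strata.values():
        rng.shuffle(ids)
        # Random starting site so leftover sessions do not always go to the first site.
        offset = rng.randrange(len(sites))
        for i, session_id in enumerate(ids):
            assignment[session_id] = sites[(i + offset) % len(sites)]
    return assignment


# Hypothetical schedule: 2 slots x 2 facilitators x 2 sessions each.
sessions = [
    {"id": f"S{n}", "slot": slot, "facilitator": fac}
    for n, (slot, fac) in enumerate(
        [(s, f) for s in ("morning", "afternoon") for f in ("F1", "F2")] * 2, start=1
    )
]
print(assign_sites(sessions, seed=3))
```

This only handles the confounds you have thought of and recorded; it does not replace documenting the study conditions from one round of testing to the next.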
- Recruit participants who are representative of your target audience in terms of both demographics and user goals.
In general, researchers are very careful to create screeners that match the exact demographics of their population, yet that may not be enough to ensure external validity. It could be that your participants are in the right demographics but have very different goals than your users (or they’re simply not motivated enough). Always strive to find participants who are likely to have the same goals as your users.
- Replicate, to the best of your abilities, the natural situation in which participants will use the UI that they test.
Are your participants supposed to use your car-repair mobile application in their garage? Then don’t have them test it in a conference room. Environmental factors (lighting, dirty hands, where the phone is positioned, time available, tools available) are all likely to play a role in how usable this app is.
Is External Validity Always Possible?
In some sense, any study will lack external validity: we rarely use interfaces while sitting at a desk or in a lab with a stranger watching over our shoulder. (To some extent, one could even argue that some remote studies are more externally valid than in-person ones because at least the participants may be in their natural environments.) We also know that participants tend to behave slightly differently in a usability-testing situation than by themselves: they are more compliant and more persistent.
Also, sometimes it may be cost-prohibitive to test a design in the natural environment. For example, we are great advocates of paper prototyping, but these types of tests will always lack external validity. So, what should we do?
Another common situation that lacks external validity is mobile testing: most participants will not use mobile designs uninterrupted, sitting at a desk, and connected to wifi. It can, however, be acceptable to test in that setup to identify those issues that will be encountered even in the best-case scenario of a great connection and no interruptions. Those are likely the first issues many mobile sites will need to address: if the site has problems even under ideal conditions, then the design needs to be fixed. Once you’ve ironed out those issues, you will still need to retest under more realistic conditions.
Similarly, some quantitative-study professionals recommend including only expert participants in certain quantitative studies in order to reduce variability (lower variability translates into a lower margin of error for the study results and may allow the researchers to reduce the number of participants). The expert users will give you a best-case scenario, and you should be fine as long as you don’t assume that the results will generalize to all your users.
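To see why lower variability can reduce the number of participants, consider a rough sample-size estimate for a mean such as time on task. The sketch below uses the standard normal-approximation formula n = (z · s / E)², where s is the standard deviation and E is the desired margin of error. The function name and the numbers are hypothetical, and the approximation is optimistic for small samples, where a t-based interval would be wider.

```python
import math


def participants_needed(std_dev, margin_of_error, z=1.96):
    """Approximate sample size for a mean, using a normal approximation.

    z=1.96 corresponds to a 95% confidence level.
    """
    return math.ceil((z * std_dev / margin_of_error) ** 2)


# Hypothetical time-on-task figures, in seconds.
mixed_users = participants_needed(std_dev=40, margin_of_error=10)
experts_only = participants_needed(std_dev=20, margin_of_error=10)
print(mixed_users)   # 62
print(experts_only)  # 16
```

In this made-up case, halving the standard deviation cuts the required sample to roughly a quarter, which is the tradeoff the expert-only recommendation relies on.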
In general, if you find yourself forced to sacrifice some external validity, it’s crucial that you always interpret your findings in context and realize that they may not hold true if the study were replicated in realistic conditions.
Poorly planned research will translate into results that are invalid. You may have wasted time and money on running a study that doesn’t tell you anything about your product or your audience. Pay attention to your study’s internal and external validity: strive to recruit participants who are representative of your target audience, and make sure that the study setup replicates how your users will use the system in real life and that it does not encourage any one behavior or response.