
Understanding the Effect of Accuracy on Trust in Machine Learning Models

Ming Yin, Purdue University, mingyin@purdue.edu

Jennifer Wortman Vaughan, Microsoft Research, jenn@microsoft.com

Hanna Wallach, Microsoft Research, hanna@dirichlet.net

ABSTRACT

We address a relatively under-explored aspect of human–computer interaction: people’s abilities to understand the relationship between a machine learning model’s stated performance on held-out data and its expected performance post deployment. We conduct large-scale, randomized human-subject experiments to examine whether laypeople’s trust in a model, measured in terms of both the frequency with which they revise their predictions to match those of the model and their self-reported levels of trust in the model, varies depending on the model’s stated accuracy on held-out data and on its observed accuracy in practice. We find that people’s trust in a model is affected by both its stated accuracy and its observed accuracy, and that the effect of stated accuracy can change depending on the observed accuracy. Our work relates to recent research on interpretable machine learning, but moves beyond the typical focus on model internals, exploring a different component of the machine learning pipeline.

CCS CONCEPTS

• Human-centered computing → Empirical studies in HCI; • Computing methodologies → Machine learning.

KEYWORDS

Machine learning, trust, human-subject experiments

ACM Reference Format:
Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), May 4–9, 2019, Glasgow, Scotland UK. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3290605.3300509

1 INTRODUCTION

Machine learning (ML) is becoming increasingly ubiquitous as a tool to aid human decision-making in diverse domains ranging from medicine to public policy and law. For example, researchers have trained deep neural networks to help dermatologists identify skin cancer [8], while political strategists regularly use ML-based forecasts to determine their next move [21]. Police departments have used ML systems to predict the location of human trafficking hotspots [28], while child welfare workers have used predictive modeling to strategically target services to the children most at risk [3].

This widespread applicability of ML has led to a movement to “democratize machine learning” [12] by developing off-the-shelf models and toolkits that make it possible for anyone to incorporate ML into their own system or decision-making pipeline, without the need for any formal training. While this movement opens up endless possibilities for ML to have real-world impact, it also creates new challenges. Decision-makers may not be used to reasoning about the explicit forms of uncertainty that are baked into ML predictions [27], or, because they do not need to understand the inner workings of an ML model in order to use it, they may misunderstand or mistrust its predictions [6, 16, 25]. Prompted by these challenges, as well as growing concerns that ML systems may inadvertently reinforce or amplify societal biases [1, 2], researchers have turned their attention to the ways that humans interact with ML, typically focusing on people’s abilities and willingness to use, understand, and trust ML systems. This body of work often falls under the broad umbrella of interpretable machine learning [6, 16, 25].

To date, most work on interpretability has focused explicitly on ML models, asking questions about people’s abilities to understand model internals or the ways that particular models map inputs to outputs [20, 24], as well as questions about the relationship between these abilities and people’s willingness to trust a model. However, the model is just one component of the ML pipeline, which spans data collection, model selection, training algorithms and procedures, model evaluation, and ultimately, deployment. It is therefore important to study people’s interactions with each of these components, not just those that relate to model internals.


One particularly under-explored aspect of the evaluation and deployment components of the pipeline is the interpretability of performance metrics, such as accuracy, precision, or recall. The democratization of ML means that it is increasingly common for a decision-maker to be presented with a “black-box” model along with some measure of its performance (most often accuracy) on held-out data. However, a model’s stated performance may not accurately reflect its performance post deployment because the data on which the model was trained and evaluated may look very different from real-world use cases [15]. In deciding how much to trust the model, the decision-maker has little to go on besides this stated performance, her own limited observations of the model’s predictions in practice, and her domain knowledge.

This scenario raises a number of questions. To what extent do laypeople, who are increasingly often the end users of systems built using ML models, understand the relationship between a model’s stated performance on held-out data and its expected performance post deployment? How does their understanding influence their willingness to trust the model? For example, do people trust a model more if they are told that its accuracy on held-out data is 90% as compared with 70%? If so, will the model’s stated accuracy continue to influence their trust in the model even after they are given the opportunity to observe and interact with the model in practice?

In this paper, we describe the results of a sequence of large-scale, randomized, pre-registered human-subject experiments (all approved by the Microsoft Research IRB) designed to investigate whether an ML model’s accuracy affects laypeople’s willingness to trust the model. Specifically, we focus on the following three main questions:

• Does a model’s stated accuracy on held-out data affect people’s trust in the model?

• If so, does it continue to do so after people have observed the model’s accuracy in practice?

• How does a model’s observed accuracy in practice affect people’s trust in the model?

In each of our experiments, subjects recruited on Amazon Mechanical Turk were asked to make predictions about the outcomes of speed dating events with the help of an ML model. Subjects were first shown information about a speed dating participant and his or her date, and then asked to predict whether or not the participant would want to see his or her date again. Finally, they were shown the model’s prediction and given the option of revising their own prediction.

In our first experiment, we focus on the first two questions above, investigating whether a model’s stated accuracy on held-out data affects laypeople’s trust in the model and, if so, whether it continues to do so after they have observed the model’s accuracy in practice. Subjects were randomized into one of ten treatments, which differed along two dimensions: stated accuracy on held-out data and amount at stake. Some subjects were given no information about the model’s accuracy on held-out data, while others were told that its accuracy was 60%, 70%, 90%, or 95%. Halfway through the experiment, each subject was given feedback on both their own accuracy and the model’s accuracy on the first half of the prediction tasks, which was 80% regardless of the treatment. Subjects in all treatments saw exactly the same speed dating events and exactly the same model predictions. This experimental design allows us to isolate the effect of stated accuracy on people’s trust, both before and after they observe the model’s accuracy in practice. As a robustness check, some subjects received a monetary bonus for each correct prediction, while others did not, allowing us to test whether the effect of stated accuracy on trust varies when people have more “skin in the game.”

We find that stated accuracy does have a significant effect on people’s trust in a model, measured in terms of both the frequency with which subjects adjust their predictions to match those of the model and their self-reported levels of trust in the model. We also find that the effect size is smaller after people observe the model’s accuracy in practice. We do not find that the amount at stake has a significant effect.

In our second experiment, we test whether these results are robust to different levels of observed accuracy by running two additional variations of our first experiment: one in which the observed accuracy of the model was low and one in which the observed accuracy of the model was high. We find that a model’s stated accuracy still has a significant effect on people’s trust even after observing a high accuracy (100%) in practice. However, if a model’s observed accuracy is low (55%), then after observing this accuracy, the stated accuracy has at most a very small effect on people’s trust in the model.

In our third experiment, we investigate the final question above: how does a model’s observed accuracy in practice affect people’s trust in the model? The experimental design used in our first two experiments does not enable us to directly compare people’s trust between treatments with different levels of observed accuracy because the prediction tasks (i.e., speed dating events) and the model predictions differed between these treatments. Our third experiment was therefore carefully designed to enable us to make such comparisons. We find that after observing a model’s accuracy in practice, people’s trust in the model is significantly affected by its observed accuracy regardless of its stated accuracy.

Finally, via an exploratory analysis, we dig more deeply into the question of how people update their trust after receiving feedback on their own accuracy and the model’s accuracy in practice. We analyze differences in individual subjects’ trust in the model before and after receiving such feedback. Our experimental data support the conjecture that people compare their own accuracy to the model’s observed accuracy, increasing their trust in the model if the model’s observed accuracy is higher than their own accuracy, except in the case where the model’s observed accuracy is substantially lower than its stated accuracy on held-out data.

Taken together, our results show that laypeople’s trust in an ML model is affected by both the model’s stated accuracy on held-out data and its observed accuracy in practice. These results highlight the need for designers of ML systems to clearly and responsibly communicate their expectations about model performance, as this information shapes the extent to which people trust a model, both before and after they are able to observe and interact with it in practice. Our results also reveal the importance of properly communicating the uncertainty that is baked into every ML prediction. Of course, proper caution should be used when generalizing our results to other settings. For example, although we do not find that the amount at stake has a significant effect, it is possible that there would be an effect when stakes are sufficiently high (e.g., doctors making life-or-death decisions).

Related Work

Our research contributes to a growing body of experimental work on trust in algorithmic systems. As a few examples, Dzindolet et al. [7] and Dietvorst et al. [4] found that people stop trusting an algorithm after witnessing it make a mistake, even when the algorithm outperforms human predictions, a phenomenon known as algorithm aversion. Dietvorst et al. [5] found that people are more willing to rely on an algorithm’s predictions when they are given the ability to make minor adjustments to the predictions rather than accepting them as is. Yeomans et al. [30] found that people distrust automated recommender systems compared with human recommendations in the context of predicting which jokes people will find funny, a highly subjective domain, even when the recommender system outperforms human predictions. In contrast, Logg et al. [17] found that people trust predictions more when they believe that the predictions come from an algorithm as opposed to a human expert when predicting music popularity, romantic matches, and other outcomes. This effect is diluted when people are given the choice between using an algorithm’s prediction and using their own prediction (as opposed to a prediction from another human expert).

The relationship between interpretability and trust has been discussed in several recent papers [16, 22, 25]. Most related to our work, and an inspiration for our experimental design, Poursabzi-Sangdeh et al. [24] ran a sequence of randomized human-subject experiments and found no evidence that either the number of features used in an ML model or the model’s level of transparency (clear or black box) has a significant impact on people’s willingness to trust the model’s predictions, although these factors do affect people’s abilities to detect when the model has made a mistake.

Kennedy et al. [14] touched on the relationship between stated accuracy and trust in the context of criminal recidivism prediction. They ran a conjoint experiment in which they presented subjects with randomly generated pairs of models and asked each subject which model they preferred. The models varied in terms of their stated accuracy, the size of the (fictitious) training data set, the number of features, and several other properties. The authors estimated the effect of each property by fitting a hierarchical linear model and found that people generally focus most on the size of the training data set, the source of the algorithm, and the stated accuracy, while less often taking into account the model’s level of transparency or the relevance of the training data.

Finally, a few studies from the human–computer interaction community have examined the relationship between system performance and users’ trust in automated systems [31, 32], ubiquitous computing systems [13], recommender systems [23], and robots [26]. For example, in a simulated experimental environment in which users interacted with an automated quality monitoring system to identify faulty items in a fictional factory production line, Yu et al. [31, 32] explored how users’ trust in the system varies with its accuracy. Unlike in our work, system accuracy was not explicitly communicated to users. Instead, users “perceived” the accuracy by receiving feedback after interacting with the system. Yu et al. found that users are able to correctly perceive the accuracy and stabilize their trust at a level correlated with the accuracy [31], though system failures have a stronger impact on trust than system successes [32]. In addition, Kay et al. [13] developed a survey tool through which they revealed that, for classifiers used in four hypothetical applications (e.g., electricity monitoring and location tracking), users tend to put more weight on the classifiers’ recall rather than their precision when deciding whether the classifiers’ performance is acceptable, with the weight varying across applications.

2 EXPERIMENT 1: DOES A MODEL’S STATED ACCURACY AFFECT LAYPEOPLE’S TRUST?

Our first experiment was designed to answer our first two main questions: does a model’s stated accuracy on held-out data affect laypeople’s trust in the model, and if so, does it continue to do so after they have observed the model’s accuracy in practice? In our experiment, each subject observed the model’s accuracy in practice via a feedback screen that was presented halfway through the experiment with information about the subject’s own accuracy and the model’s accuracy thus far, as described below. Before running the experiment, we posited and pre-registered two hypotheses derived from our questions, which we state informally here (the pre-registration document is available at https://aspredicted.org/uq3hi.pdf):


• [H1] The stated accuracy of a model has a significant effect on people’s trust in the model before seeing the feedback screen.

• [H2] The stated accuracy of a model has a significant effect on people’s trust in the model after seeing the feedback screen.

As a robustness check to guard against the potential criticism that any null results might be due to a lack of performance incentives, we randomly selected some subjects to receive a monetary bonus for each correct prediction. We also posited and pre-registered two additional hypotheses:

• [H3] The amount at stake has a significant effect on people’s trust in a model before seeing the feedback screen.

• [H4] The amount at stake has a significant effect on people’s trust in a model after seeing the feedback screen.

Prediction Tasks

We asked subjects to make predictions about the outcomes of forty speed dating events. The data came from real speed dating participants and their dates via the experimental study of Fisman et al. [9]. Each speed dating participant indicated whether or not he or she wanted to see his or her date again, thereby giving us ground truth from which to compute accuracy. We chose this application for two reasons: First, predicting romantic interest does not require specialized domain expertise. Second, this setting is plausibly one in which ML might be used given that many dating websites already rely on ML models to predict potential romantic partners [18, 29].

For each prediction task (i.e., speed dating event), each subject was first shown a screen of information about the speed dating participant and his or her date, including:

• The participant’s basic information: the gender, age, field of study, race, etc. of the participant.

• The date’s basic information: the gender, age, and race of the participant’s date.

• The participant’s preferences: the participant’s reported distribution of 100 points among six attributes (attractiveness, sincerity, intelligence, fun, ambition, and shared interests), indicating how much he or she values each attribute in a romantic partner.

• The participant’s impression of the date: the participant’s rating of his or her date on the same six attributes using a scale of one to ten, as well as scores (also using a scale of one to ten) indicating how happy the participant expected to be with his or her date and how much the participant liked his or her date. (One way this task information might be represented in code is sketched after this list.)
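To make the structure of one prediction task concrete, the following is a minimal sketch of how this information could be represented. The class and field names are our own illustration; they are not the authors' materials or the Fisman et al. data schema.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical representation of one speed dating prediction task.
# Field names are illustrative, not the original data schema.
@dataclass
class SpeedDatingTask:
    participant_info: Dict[str, str]    # e.g., gender, age, field of study, race (as displayed)
    date_info: Dict[str, str]           # e.g., gender, age, race of the participant's date
    preference_points: Dict[str, int]   # 100 points split across the six attributes
    impression_ratings: Dict[str, int]  # 1-10 ratings of the date on the same six attributes
    expected_happiness: int             # 1-10
    liking: int                         # 1-10
    wants_to_see_date_again: bool       # ground truth used to compute accuracy

    def __post_init__(self) -> None:
        # The participant's preference points should sum to 100.
        assert sum(self.preference_points.values()) == 100
```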

The subject was then asked to follow a three-step procedure: First, they were asked to carefully review the information about the participant and his or her date and predict whether or not the participant would want to see his or her date again. Next, they were shown the model’s (binary) prediction. Finally, they were given the option of revising their own prediction. A screenshot of the interface is shown in Figure 1.

Figure 1: Screenshot of the prediction task interface.

Experimental Treatments

We randomized subjects into one of ten treatments arranged in a 5×2 design. The treatments differed along two dimensions: stated accuracy on held-out data and amount at stake.

Subjects were randomly assigned to one of five accuracy levels: none (the baseline), 60%, 70%, 90%, or 95%. Subjects assigned to an accuracy level of none were initially given no information about the model’s accuracy on held-out data. Subjects assigned to one of the other accuracy levels saw the following sentence in the instructions: “We previously evaluated this model on a large data set of speed dating participants and its accuracy was x%, i.e., the model’s predictions were correct on x% of the speed dating participants in this data set.” Throughout the experiment, we also reminded these subjects of the model’s stated accuracy on held-out data each time they were shown one of the model’s predictions.
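To summarize the 5×2 design in code, the sketch below enumerates the ten treatments and randomly assigns a subject to one of them. This is an illustration with names of our own choosing, not the authors' actual assignment mechanism.

```python
import random

# The 5 x 2 design: five stated-accuracy levels crossed with two stake levels.
STATED_ACCURACY_LEVELS = [None, 60, 70, 90, 95]  # None = no stated accuracy shown (baseline)
STAKES_LEVELS = ["low", "high"]                  # "high" adds a $0.10 bonus per correct final prediction

TREATMENTS = [(accuracy, stakes)
              for accuracy in STATED_ACCURACY_LEVELS
              for stakes in STAKES_LEVELS]        # 10 treatments in total

def assign_treatment(rng: random.Random):
    """Randomly assign an arriving subject to one of the ten treatments."""
    return rng.choice(TREATMENTS)

if __name__ == "__main__":
    print(assign_treatment(random.Random()))  # e.g., (90, 'low')
```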

We note that our sentence about accuracy was not a deception. We developed four ML models (a rule-based classifier, a support vector machine, a three-hidden-layer neural network, and a random forest) and evaluated them on a held-out data set of 500 speed dating participants, obtaining accuracies of 60%, 70%, 90%, and 95%. To keep the treatments as similar as possible, the models made exactly the same predictions for the forty speed dating events that were shown to subjects.

Subjects were randomly assigned to either low or high stakes. Subjects assigned to low stakes were paid a flat rate of $1.50 for completing the experiment. Subjects assigned to high stakes also received a monetary bonus of $0.10 for each correct (final) prediction in addition to the flat rate of $1.50. (The highest possible bonus was 40 × $0.10 = $4, i.e., substantially more than the flat rate of $1.50, thereby making the bonus salient [11].)
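As a rough illustration of what a “stated accuracy on held-out data” corresponds to operationally, the sketch below evaluates one candidate model on a held-out set of 500 records with scikit-learn. The features, labels, and model choice are placeholders; this is not the authors’ pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder featurized speed dating records and ground-truth labels
# ("wants to see his or her date again"); real features would come from
# the Fisman et al. data.
rng = np.random.default_rng(0)
X = rng.random((2500, 20))
y = rng.integers(0, 2, 2500)

# Hold out 500 records for evaluation, mirroring the held-out set size in the paper.
X_train, X_held_out, y_train, y_held_out = train_test_split(
    X, y, test_size=500, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
stated_accuracy = accuracy_score(y_held_out, model.predict(X_held_out))
print(f"Stated accuracy on held-out data: {stated_accuracy:.0%}")
```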

Experimental Design

We posted our experiment as a human intelligence task (HIT) on Amazon Mechanical Turk. The experiment was only open to workers in the U.S., and each worker could participate only once. In total, 1,994 subjects completed the experiment.

Upon accepting the HIT, each subject was randomized into one of the ten treatments described above. Each HIT consisted of exactly the same forty prediction tasks, grouped into two sets A and B of twenty tasks each. As described above, subjects in all ten treatments saw exactly the same model prediction for each task. The experiment was divided into two phases. To minimize differences between the phases, subjects were randomly assigned to see either the tasks in set A during Phase 1 and the tasks in set B during Phase 2, or vice versa; the order of the tasks was randomized within each phase. We chose the tasks in sets A and B so that the observed accuracy on the first twenty tasks would be 80% regardless of the ordering of sets A and B. This experimental design minimizes differences between treatments and allows us to draw causal conclusions about the effect of stated accuracy on people’s trust without worrying about confounding factors.

Each subject was asked to make initial and final predictions for each task, following the three-step procedure described above. The subjects were given no feedback on their own prediction or the model’s prediction for any individual task; however, after Phase 1, each subject was shown a feedback screen with information about their own accuracy and the model’s accuracy (80% by design) on the tasks in Phase 1. A screenshot of the feedback screen is shown in Figure 2.
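The requirement that observed accuracy on the first twenty tasks be 80% regardless of whether a subject sees set A or set B first amounts to choosing each set so that the model’s predictions are correct on exactly 16 of its 20 tasks. A minimal check along those lines, with hypothetical function and variable names, is sketched below.

```python
from typing import Sequence

def observed_accuracy(model_predictions: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Fraction of tasks on which the model's prediction matches the ground truth."""
    correct = sum(p == g for p, g in zip(model_predictions, ground_truth))
    return correct / len(ground_truth)

def phase1_accuracy_is_balanced(preds_a, truth_a, preds_b, truth_b, target: float = 0.80) -> bool:
    """True if Phase 1 observed accuracy equals `target` whichever set is shown first,
    i.e., the model is correct on exactly 16 of the 20 tasks in each set."""
    return (observed_accuracy(preds_a, truth_a) == target
            and observed_accuracy(preds_b, truth_b) == target)
```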

At the end of the HIT, each subject completed an exit survey in which they were asked to report their level of trust in the model during each phase using a scale of one (“I didn’t trust it at all”) to ten (“I fully trust it”). Specifically, we asked subjects the following question: “How much did you trust our machine learning algorithm’s predictions on the first [last] twenty speed dating participants (that is, before [after] you saw any feedback on your performance and the algorithm’s performance)?” We also collected basic demographic information (such as age and gender) about each subject.

Figure 2: Screenshot of the feedback screen shown between Phase 1 and Phase 2 (i.e., after the first twenty tasks).

To quantify a subject’s trust in a model, we defined two metrics, calculated separately for each phase, that capture how often the subject “followed” the model’s predictions (a short computational sketch follows this list):

• Agreement fraction: the number of tasks for which the subject’s final prediction agreed with the model’s prediction, divided by the total number of tasks.

• Switch fraction: the number of tasks for which the subject’s initial prediction disagreed with the model’s prediction and the subject’s final prediction agreed with the model’s prediction, divided by the total number of tasks for which the subject’s initial prediction disagreed with the model’s prediction.
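A minimal sketch of how these two metrics might be computed for a single subject in one phase follows; the function and variable names are ours, and predictions are encoded as 0/1.

```python
from typing import List, Optional

def agreement_fraction(final_preds: List[int], model_preds: List[int]) -> float:
    """Fraction of tasks on which the subject's final prediction agrees with the model's."""
    agree = sum(f == m for f, m in zip(final_preds, model_preds))
    return agree / len(model_preds)

def switch_fraction(initial_preds: List[int], final_preds: List[int],
                    model_preds: List[int]) -> Optional[float]:
    """Among tasks where the subject's initial prediction disagreed with the model's,
    the fraction on which the subject's final prediction agrees with the model's.
    Returns None if the subject never initially disagreed with the model."""
    disagreed = [(f, m) for i, f, m in zip(initial_preds, final_preds, model_preds) if i != m]
    if not disagreed:
        return None
    switched = sum(f == m for f, m in disagreed)
    return switched / len(disagreed)
```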

We used these two metrics when formally stating all of our pre-registered hypotheses, while additionally pre-registering our intent to analyze subjects’ self-reported trust levels.

Analysis of Trust in Phase 1 (H1 and H3)

We start by analyzing data from Phase 1 to see if subjects’ trust in a model is affected by the model’s stated accuracy and the amount at stake before they see the feedback screen. Figures 3a and 3b show subjects’ average agreement fraction and average switch fraction, respectively, in Phase 1, by treatment. Visually, stated accuracy appears to have a substantial effect on how often subjects follow the model’s predictions. Subjects’ final predictions agree with the model’s predictions more often when the model has a high stated accuracy. However, the effect of the amount at stake is less apparent. To formally compare the treatments, we conduct a two-way ANOVA on subjects’ agreement fractions and, respectively, switch fractions in Phase 1. The results suggest a statistically significant main effect of stated accuracy on how often subjects follow the model’s predictions (effect size η² = 0.036, p = 4.72 × 10⁻¹⁵ for agreement fraction, and η² = 0.061, p = 5.62 × 10⁻²⁶ for switch fraction), while the main effect of the amount at stake is insignificant (p = 0.30 and p = 0.11 for agreement fraction and switch fraction, respectively). We do not detect a significant interaction between the two factors.
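For readers who want to reproduce this style of analysis, the sketch below runs a two-way ANOVA on per-subject Phase 1 agreement fractions with statsmodels and computes η² as each factor’s sum of squares divided by the total sum of squares. The data frame is a placeholder with hypothetical column names, and the paper’s exact analysis choices (e.g., type of sums of squares) may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Placeholder per-subject data: one row per subject, with the two treatment
# factors and the Phase 1 agreement fraction (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "stated_accuracy": np.repeat(["none", "60", "70", "90", "95"], 40),
    "stakes": np.tile(["low", "high"], 100),
    "agreement_fraction": rng.random(200),
})

# Two-way ANOVA with an interaction term between the two factors.
fit = smf.ols("agreement_fraction ~ C(stated_accuracy) * C(stakes)", data=df).fit()
table = anova_lm(fit, typ=2)

# Eta-squared for each effect: its sum of squares divided by the total sum of squares.
table["eta_sq"] = table["sum_sq"] / table["sum_sq"].sum()
print(table[["sum_sq", "F", "PR(>F)", "eta_sq"]])
```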
