Notes

Introduction: Bias and Noise, Two Kinds of Error in Human Judgment

The shooting range is only a metaphor: In 1778, the Swiss mathematician Daniel Bernoulli used the same analogy, with a bow and arrows, in an essay on problems of estimation. Bernoulli, “The Most Probable Choice Between Several Discrepant Observations and the Formation Therefrom of the Most Likely Induction,” Biometrika 48, no. 1–2 (June 1961): 3–18.

Child custody decisions: Joseph J. Doyle Jr., “Child Protection and Child Outcomes: Measuring the Effects of Foster Care,” American Economic Review 97, no. 5 (December 2007): 1583–1610.

Noise in forecasts: Stein Grimstad, Magne Jørgensen, “Inconsistency of Expert Judgment-Based Estimates of Software Development Effort,” Journal of Systems and Software 80, no. 11 (2007): 1770–1777.

Asylum decisions: Andrew I. Schoenholtz, Jaya Ramji-Nogales, Philip G. Schrag, “Refugee Roulette: Disparities in Asylum Adjudication,” Stanford Law Review 60, no. 2 (2007).

Patent-granting decisions: Mark A. Lemley, Bhaven Sampat, “Examiner Characteristics and Patent Office Outcomes,” Review of Economics and Statistics 94, no. 3 (2012): 817–827; Iain Cockburn, Samuel Kortum, Scott Stern, “Are All Patent Examiners Equal? The Impact of Examiner Characteristics,” working paper 8980, June 2002; Michael D. Frakes, Melissa F. Wasserman, “Is the Time Allocated to Review Patent Applications Inducing Examiners to Grant Invalid Patents? Evidence from Microlevel Application Data,” Review of Economics and Statistics 99, no. 3 (July 2017): 550–563.

Part I: Finding Noise

Chapter 1: Crime and Noisy Punishment

The original motivation for helping to found the organization: Marvin Frankel, Criminal Sentences: Law Without Order, 25 Inst. for Sci. Info. Current Contents / Soc. & Behavioral Scis.: This Week’s Citation Classic 14, 2A-6 (June 23, 1986).

Almost wholly unchecked powers: Marvin Frankel, Criminal Sentences: Law Without Order (New York: Hill and Wang, 1973), 5.

Acts of arbitrary cruelty take place every day: Frankel, Criminal Sentences, 103.

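A society governed by laws rather than by men: Frankel, Criminal Sentences, 5.

A product of arbitrary fiat: Frankel, Criminal Sentences, 11.

Some form of numerical or other objective grading: Frankel, Criminal Sentences, 114.

Using computers as an aid to orderly thought in sentencing: Frankel, Criminal Sentences, 115.

Establishing a sentencing commission: Frankel, Criminal Sentences, 119.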

Absence of consensus was the norm: Anthony Partridge, William B. Eldridge, The Second Circuit Sentence Study: A Report to the Judges of the Second Circuit August 1974 (Washington, DC: Federal Judicial Center, August 1974), 9.

Disparities among sentences were “astounding”: US Senate, “Comprehensive Crime Control Act of 1983: Report of the Committee on the Judiciary, United States Senate, on S. 1762, Together with Additional and Minority Views” (Washington, DC: US Government Printing Office, 1983). Report No. 98–225.

A heroin dealer: Partridge, Eldridge, Second Circuit Sentence Study, A-11.

A bank robber: Partridge, Eldridge, Second Circuit Sentence Study, A-9.

An extortion case: Partridge, Eldridge, Second Circuit Sentence Study, A-5–A-7.

A survey of 47 judges: William Austin, Thomas A. Williams III, “A Survey of Judges’ Responses to Simulated Legal Cases: Research Note on Sentencing Disparity,” Journal of Criminal Law & Criminology 68 (1977): 306.

A larger study: John Bartolomeo et al., “Sentence Decisionmaking: The Logic of Sentence Decisions and the Extent and Sources of Sentence Disparity,” Journal of Criminal Law and Criminology 72, no. 2 (1981). For a full discussion, see chapter 6; see also Senate Report, 44.

If judges are hungry: Shai Danziger, Jonathan Levav, Liora Avnaim-Pesso, “Extraneous Factors in Judicial Decisions,” Proceedings of the National Academy of Sciences of the United States of America 108, no. 17 (2011): 6889–6892.

Juvenile court decisions: Ozkan Eren, Naci Mocan, “Emotional Judges and Unlucky Juveniles,” American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

When the local football team loses a game on the weekend, judges hand down harsher decisions on the following Monday: Daniel L. Chen, Markus Loecher, “Mood and the Malleability of Moral Reasoning: The Impact of Irrelevant Factors on Judicial Decisions,” SSRN Electronic Journal (September 21, 2019): 1–70.

Judges may be more lenient on a defendant's birthday: Daniel L. Chen, Arnaud Philippe, “Clash of Norms: Judicial Leniency on Defendant Birthdays” (2020).

Even something as irrelevant as the outside temperature can influence judges' decisions: Anthony Heyes, Soodeh Saberian, “Temperature and Decisions: Evidence from 207,000 Court Cases,” American Economic Journal: Applied Economics 11, no. 2 (2019): 238–265.

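Unfettered discretion: Senate Report, 38.

“Unjustifiably wide” sentencing disparities: Senate Report, 38.

US Supreme Court Justice Stephen Breyer defended the reliance on past practice by pointing to the intractable disagreements within the commission: The quotation is from Jeffrey Rosen, “Breyer Restraint,” New Republic, July 11, 1994, at 19, 25.

The departure must be justified to the court: United States Sentencing Commission, Guidelines Manual (2018).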

It reduced the net variation in sentences attributable to the happenstance of the identity of the sentencing judge: James M. Anderson, Jeffrey R. Kling, Kate Stith, “Measuring Interjudge Sentencing Disparity: Before and After the Federal Sentencing Guidelines,” Journal of Law and Economics 42, no. S1 (April 1999): 271–308.

The US Sentencing Commission conducted an exhaustive study of the effects of the guidelines: US Sentencing Commission, The Federal Sentencing Guidelines: A Report on the Operation of the Guidelines System and Short-Term Impacts on Disparity in Sentencing, Use of Incarceration, and Prosecutorial Discretion and Plea Bargaining, vols. 1 & 2 (Washington, DC: US Sentencing Commission, 1991).

Another study found that the expected difference in sentence length between judges was 4.9 months in 1986–1987 and fell to 3.9 months in 1988–1993: Anderson, Kling, Stith, “Interjudge Sentencing Disparity.”

An independent study covering different periods: Paul J. Hofer, Kevin R. Blackwell, R. Barry Ruback, “The Effect of the Federal Sentencing Guidelines on Inter-Judge Sentencing Disparity,” Journal of Criminal Law and Criminology 90 (1999): 239, 241.

We should not shut our eyes to the particulars of a case; what is needed is insight: Kate Stith, José Cabranes, Fear of Judging: Sentencing Guidelines in the Federal Courts (Chicago: University of Chicago Press, 1998), 79.

It was not until 2005 that the US Supreme Court struck down the guidelines: United States v. Booker, 543 U.S. 220 (2005).

Seventy-five percent of judges preferred the advisory regime: US Sentencing Commission, “Results of Survey of United States District Judges, January 2010 through March 2010” (June 2010) (question 19, table 19).

Her central finding is that disparities between judges increased markedly after 2005: Crystal Yang, “Have Interjudge Sentencing Disparities Increased in an Advisory Guidelines Regime? Evidence from Booker,” New York University Law Review 89 (2014): 1268–1342; pp. 1278, 1334.

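Chapter 2: System Noise, the Illusion of Agreement

We call this experiment a noise audit: The company's executives carefully constructed descriptions of several representative cases, similar to the risks and claims that employees handle every day. They gave six cases to claims adjusters in the property and casualty division and four others to underwriters specializing in financial risk. Employees were given half a day during a normal workday to evaluate two or three of the cases. To examine the variability of their judgments, the researchers did not tell the employees the purpose of the study in advance, and everyone worked independently. In all, we obtained 86 judgments from 48 underwriters and 113 judgments from 68 claims adjusters.

Naive realism: Dale W. Griffin, Lee Ross, “Subjective Construal, Social Inference, and Human Misunderstanding,” Advances in Experimental Social Psychology 24 (1991): 319–359; Robert J. Robinson, Dacher Keltner, Andrew Ward, Lee Ross, “Actual Versus Assumed Differences in Construal: ‘Naive Realism’ in Intergroup Perception and Conflict,” Journal of Personality and Social Psychology 68, no. 3 (1995): 404; Lee Ross, Andrew Ward, “Naive Realism in Everyday Life: Implications for Social Conflict and Misunderstanding,” Values and Knowledge (1997).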

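Part II: Your Mind Is a Measuring Instrument

The standard deviation is the most common measure of variability: The standard deviation of a set of numbers is derived from another statistic, the variance. To compute the variance, first obtain the deviation of each number from the mean, then square these deviations. The variance is the mean of the squared deviations, and the standard deviation is the square root of the variance.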
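The two steps in this note can be sketched in a few lines of Python. This is only an illustration: the function names and the sample of sentences are hypothetical, and the variance is the population form (dividing by the number of values), which is what the note describes.

```python
from math import sqrt


def variance(values):
    """Mean of the squared deviations from the mean (population variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))


# Hypothetical prison sentences, in years (not data from the book).
sentences = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(sentences), standard_deviation(sentences))
```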

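Chapter 4: What Is Judgment

In wine competitions, judges can disagree widely about which wines should win medals: R. T. Hodgson, “An Examination of Judge Reliability at a Major U.S. Wine Competition,” Journal of Wine Economics 3, no. 2 (2008): 105–113.

This trade-off is resolved by evaluative judgments: Scholars of decision making define a decision as a choice among options and treat a quantitative judgment as a special case of decision, one in which there is a continuum of possible options. From that perspective, judgment is a special case of decision. Our view here is different: we regard a choice among options as resting on an evaluative judgment of each option. In other words, we treat decisions as a special case of judgment.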

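Chapter 5: Measuring Error, Noise Costs as Much as Bias

The method of least squares was invented by Gauss in 1795: The method of least squares was first published by Adrien-Marie Legendre in 1805. Gauss claimed that he had first used it ten years earlier, and he went on to build on it a theory of error and the normal error curve that bears his name. The question of priority has been much debated, and historians are inclined to believe Gauss's claim [Stephen M. Stigler, “Gauss and the Invention of Least Squares,” Annals of Statistics 9 (1981): 465–474; Stephen M. Stigler, The History of Statistics: The Measurement of Uncertainty Before 1900 (Cambridge, MA: Belknap Press of Harvard University Press, 1986)].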

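Expressed as a simple formula: We define noise as the standard deviation of errors, so noise squared is the variance of errors. Variance is defined as “the mean of the squares minus the square of the mean.” Here the mean of the squares is the mean squared error (MSE), and since bias is the mean error, the “square of the mean” is bias squared. Therefore: $\text{Noise}^2 = \text{MSE} - \text{Bias}^2$.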
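A minimal numerical check of this identity, assuming a small hypothetical list of errors (the data and names below are illustrative and do not come from the book):

```python
def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical errors (judgment minus true value).
errors = [-1.0, 0.0, 2.0, 3.0]

bias = mean(errors)                                # mean error
mse = mean([e ** 2 for e in errors])               # mean of the squared errors
noise_sq = mean([(e - bias) ** 2 for e in errors]) # variance of the errors

# Noise^2 = MSE - Bias^2
print(noise_sq, mse - bias ** 2)
```

Both printed values coincide, which is the error equation rearranged as MSE = Bias² + Noise².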

People's intuitions on this point are exactly the opposite: Berkeley J. Dietvorst, Soaham Bharti, “People Reject Algorithms in Uncertain Decision Domains Because They Have Diminishing Sensitivity to Forecasting Error,” Psychological Science 31, no. 10 (2020): 1302–1314.

Chapter 6: The Analysis of Noise: Three Kinds of Noise in Every Judgment

A very detailed noise audit: Kevin Clancy, John Bartolomeo, David Richardson, Charles Wellford, “Sentence Decisionmaking: The Logic of Sentence Decisions and the Extent and Sources of Sentence Disparity,” Journal of Criminal Law and Criminology 72, no. 2 (1981): 524–554; INSLAW, Inc. et al., “Federal Sentencing: Towards a More Explicit Policy of Criminal Sanctions III-4” (1981).

The researchers presented detailed files on 16 cases to these judges and asked them to set a sentence: A sentence could include any combination of prison time, supervised time, and fines. For simplicity, we focus here on the main component of sentences, prison time, and leave aside the other two components.

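This variance is often called “bias”: With multiple cases and multiple judges, the extended version of the error equation introduced in chapter 5 includes a term that reflects this variance. Specifically, if we define the grand bias as the average error over all cases, and if the error is not the same for every case, then there is a variance of case biases. The equation becomes: $\text{MSE} = \text{Grand Bias}^2 + \text{Variance of Case Biases} + \text{System Noise}^2$.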
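The extended equation can be checked numerically. The sketch below assumes that every case is judged by the same number of judges, so that the grand bias is the mean of the case biases, and it takes system noise squared to be the average within-case variance of errors. The function name and the error values are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def decompose(errors_by_case):
    """errors_by_case[c] holds each judge's error on case c.

    Returns (MSE, grand bias, variance of case biases, system noise squared).
    Assumes the same number of judges for every case.
    """
    all_errors = [e for case in errors_by_case for e in case]
    mse = mean([e ** 2 for e in all_errors])
    case_biases = [mean(case) for case in errors_by_case]
    grand_bias = mean(case_biases)
    case_bias_variance = mean([(b - grand_bias) ** 2 for b in case_biases])
    # System noise squared: average, over cases, of the variance across judges.
    system_noise_sq = mean(
        [mean([(e - b) ** 2 for e in case])
         for case, b in zip(errors_by_case, case_biases)]
    )
    return mse, grand_bias, case_bias_variance, system_noise_sq


# Two hypothetical cases, three judges each.
mse, grand_bias, case_var, noise_sq = decompose([[1.0, 3.0, -1.0], [3.0, 5.0, 1.0]])
print(mse, grand_bias ** 2 + case_var + noise_sq)
```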

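The average sentence was 7 years: The figures mentioned in this chapter are derived from the original study as follows. First, the authors report that the main effects of the offense and of the offender account for 45% of total variance [John Bartolomeo et al., “Sentence Decisionmaking: The Logic of Sentence Decisions and the Extent and Sources of Sentence Disparity,” Journal of Criminal Law and Criminology 72, no. 2 (1981), table 6]. However, we are concerned here more broadly with the effect of each case, which includes all the features of the case presented to the judges, such as whether the defendant has a criminal record or whether a weapon was used in the crime. By our definition, all these features are part of true case variance, not noise. Accordingly, we reassigned to case variance the interactions between case features (these account for 11% of total variance; see Bartolomeo et al., table 10). As a result, case variance accounts for 56% of total variance, the main effect of judges (level noise) for 21%, and the remaining interactions for 23%. System noise is therefore 44% of total variance. The variance of just sentences can be computed from the table (see Bartolomeo et al., 89) that lists the mean sentence for each case: that variance is 15. If this is 56% of total variance, then total variance is 26.79 and the variance of system noise is 11.79. The square root of that variance, 3.4 years, is the standard deviation for a representative case. The main effect of judges, or level noise, is 21% of total variance. The square root of that variance, 2.4 years, is the standard deviation attributable to judge level noise.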
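The arithmetic in this note can be retraced step by step. The sketch below uses only the figures quoted in the note (a case variance of 15 and shares of 56% and 21%); the variable names are illustrative.

```python
from math import sqrt

case_variance = 15.0  # variance of the mean sentences across cases
case_share = 0.56     # case variance as a share of total variance
level_share = 0.21    # main effect of judges (level noise) as a share of total

total_variance = case_variance / case_share             # about 26.79
system_noise_variance = total_variance - case_variance  # about 11.79 (44% of total)
system_noise = sqrt(system_noise_variance)               # about 3.4 years
level_noise = sqrt(level_share * total_variance)         # about 2.4 years

print(round(total_variance, 2), round(system_noise_variance, 2))
print(round(system_noise, 1), round(level_noise, 1))
```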

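System noise is 3.4 years: This value is the square root of the average of the variances of the 16 cases. For its calculation, see the preceding note.

Simple additive logic: The additivity hypothesis assumes that a judge's severity adds a constant amount to every sentence. This hypothesis is unlikely to be correct: a judge's severity is more likely to add an amount proportional to the average sentence. The issue was ignored in the original report, so its importance cannot be assessed.

Patterned differences in sentencing: Bartolomeo et al., “Sentence Decisionmaking,” 23.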

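Pattern noise and level noise contribute almost equally: The following equation also holds: $\text{System Noise}^2 = \text{Level Noise}^2 + \text{Pattern Noise}^2$. Figure 6-1 shows that system noise is 3.4 years and level noise is 2.4 years, which implies that pattern noise is also about 2.4 years. This calculation is only illustrative: because of rounding errors, the actual values differ slightly.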
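Solving the equation for pattern noise with the rounded figures from the note gives the quoted value. A brief sketch, with illustrative variable names:

```python
from math import sqrt

system_noise = 3.4  # years, rounded value quoted in the note
level_noise = 2.4   # years, rounded value quoted in the note

# Pattern Noise^2 = System Noise^2 - Level Noise^2
pattern_noise = sqrt(system_noise ** 2 - level_noise ** 2)
print(round(pattern_noise, 1))
```

Because the inputs are already rounded, the result is approximate, as the note itself cautions.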

Chapter 7: Occasion Noise, Influencing Our Judgments at Every Moment

Experts tasted the same wine twice: R. T. Hodgson, “An Examination of Judge Reliability at a Major U.S. Wine Competition,” Journal of Wine Economics 3, no. 2 (2008): 105–113.

Experienced software consultants: Stein Grimstad, Magne Jørgensen, “Inconsistency of Expert Judgment-Based Estimates of Software Development Effort,” Journal of Systems and Software 80, no. 11 (2007): 1770–1777.

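They tend to agree with themselves: Robert H. Ashton, “A Review and Analysis of Research on the Test-Retest Reliability of Professional Judgment,” Journal of Behavioral Decision Making 13, no. 3 (2000): 277–294. The author notes in passing that none of the 41 studies reviewed was designed to measure occasion noise; in all of them, he observes, the measurement of stability was a by-product of other research objectives (Ashton, 279). This remark suggests that interest in occasion noise is relatively recent.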

The correct answer is about 32%: Central Intelligence Agency, The World Factbook (Washington, DC: Central Intelligence Agency, 2020). The figure includes all airports and airfields recognizable from the air, with paved or unpaved runways, including closed or abandoned installations.

Edward Vul and Harold Pashler: Edward Vul, Harold Pashler, “Measuring the Crowd Within: Probabilistic Representations Within Individuals.”

The first answer was closer to the truth than the second: James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations (New York: Doubleday, 2004).

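But it produces less noise: The standard deviation of the averaged judgments (our measure of noise) decreases in proportion to the square root of the number of judgments being averaged.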
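As a rough illustration of the square-root rule, assuming the judgments are independent and equally noisy (an assumption added here; the note does not spell it out), the noise of an average can be sketched as follows. The function name is hypothetical.

```python
from math import sqrt


def noise_of_average(single_judgment_sd, n_judgments):
    """Standard deviation of the mean of n independent, equally noisy judgments."""
    if n_judgments < 1:
        raise ValueError("need at least one judgment")
    return single_judgment_sd / sqrt(n_judgments)


# Averaging 4 judgments halves the noise; averaging 100 divides it by 10.
print(noise_of_average(10.0, 4), noise_of_average(10.0, 100))
```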

The benefit of answering the same question twice is one-tenth of the benefit of asking another, independent person for an opinion: Vul, Pashler, “Measuring the Crowd Within,” 646.

Stefan Herzog and Ralph Hertwig: Stefan M. Herzog, Ralph Hertwig, “Think Twice and Then: Combining or Choosing in Dialectical Bootstrapping?,” Journal of Experimental Psychology: Learning, Memory, and Cognition 40, no. 1 (2014): 218–232.

Participants' responses are drawn from an internal probability distribution: Vul, Pashler, “Measuring the Crowd Within,” 647.

Joseph Forgas: Joseph P. Forgas, “Affective Influences on Interpersonal Behavior,” Psychological Inquiry 13, no. 1 (2002): 1–28.

Faced with the same smile, a person in a positive mood sees friendliness: Forgas, “Affective Influences,” 10.

Negotiators who shift from a good mood to anger during a negotiation may also obtain better results: A. Filipowicz, S. Barsade, S. Melwani, “Understanding Emotional Transitions: The Interpersonal Consequences of Changing Emotions in Negotiations,” Journal of Personality and Social Psychology 101, no. 3 (2011): 541–556.

The experimenters asked participants to read a short philosophical essay: Joseph P. Forgas, “She Just Doesn’t Look like a Philosopher...? Affective Influences on the Halo Effect in Impression Formation,” European Journal of Social Psychology 41, no. 7 (2011): 812–817.

Gordon Pennycook and his colleagues conducted a series of studies of how people react to meaningless statements that sound profound but are in fact empty: Gordon Pennycook, James Allan Cheyne, Nathaniel Barr, Derek J. Koehler, Jonathan A. Fugelsang, “On the Reception and Detection of Pseudo-Profound Bullshit,” Judgment and Decision Making 10, no. 6 (2015): 549–563.

Since Harry Frankfurt, bullshit has become a technical term: Harry Frankfurt, On Bullshit (Princeton, NJ: Princeton University Press, 2005).

They may be swayed by assertions that merely seem impressive: Pennycook et al., “Pseudo-Profound Bullshit,” 549.

Inducing a good mood makes people more receptive to bullshit and more gullible: Joseph P. Forgas, “Happy Believers and Sad Skeptics? Affective Influences on Gullibility,” Current Directions in Psychological Science 28, no. 3 (2019): 306–313.

Eyewitnesses in a bad mood are more likely to disregard misleading information when they encounter it, and thus to avoid giving false testimony: Joseph P. Forgas, “Mood Effects on Eyewitness Memory: Affective Influences on Susceptibility to Misinformation,” Journal of Experimental Social Psychology 41, no. 6 (2005): 574–588.

The footbridge problem: Piercarlo Valdesolo, David Desteno, “Manipulations of Emotional Context Shape Moral Judgment,” Psychological Science 17, no. 6 (2006): 476–477.

Physicians are significantly more likely to prescribe opioids at the end of a long day: Hannah T. Neprash, Michael L. Barnett, “Association of Primary Care Clinic Appointment Time with Opioid Prescribing,” JAMA Network Open 2, no. 8 (2019); Lindsey M. Philpot, Bushra A. Khokhar, Daniel L. Roellinger, Priya Ramar, Jon O. Ebbert, “Time of Day Is Associated with Opioid Prescribing for Low Back Pain in Primary Care,” Journal of General Internal Medicine 33 (2018): 1828.

Toward the end of the day, physicians are more likely to prescribe antibiotics: Jeffrey A. Linder, Jason N. Doctor, Mark W. Friedberg, Harry Reyes Nieva, Caroline Birks, Daniella Meeker, Craig R. Fox, “Time of Day and the Decision to Prescribe Antibiotics,” JAMA Internal Medicine 174, no. 12 (2014): 2029–2031.

Less likely to prescribe flu shots: Rebecca H. Kim, Susan C. Day, Dylan S. Small, Christopher K. Snider, Charles A. L. Rareshide, Mitesh S. Patel, “Variations in Influenza Vaccination by Clinic Appointment Time and an Active Choice Intervention in the Electronic Health Record to Increase Influenza Vaccination,” JAMA Network Open 1, no. 5 (2018): 1–10.

Bad weather is associated with improved memory: Joseph P. Forgas, Liz Goldenberg, Christian Unkelbach, “Can Bad Weather Improve Your Memory? An Unobtrusive Field Study of Natural Mood Effects on Real-Life Memory,” Journal of Experimental Social Psychology 45, no. 1 (2008): 254–257.

Sunny weather affects the stock market: David Hirshleifer, Tyler Shumway, “Good Day Sunshine: Stock Returns and the Weather,” Journal of Finance 58, no. 3 (2003): 1009–1032.

Clouds make nerds look good: Uri Simonsohn, “Clouds Make Nerds Look Good: Field Evidence of the Impact of Incidental Factors on Decision Making,” Journal of Behavioral Decision Making 20, no. 2 (2007): 143–152.

The gambler's fallacy: Daniel Chen et al., “Decision Making Under the Gambler’s Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires,” Quarterly Journal of Economics 131, no. 3 (2016): 1181–1242.

In the United States, an asylum application is 19% less likely to be approved when the asylum judge has approved the two previous cases: Jaya Ramji-Nogales, Andrew I. Schoenholtz, Philip Schrag, “Refugee Roulette: Disparities in Asylum Adjudication,” Stanford Law Review 60, no. 2 (2007).

Michael Kahana and his colleagues studied memory performance: Michael J. Kahana et al., “The Variability Puzzle in Human Memory,” Journal of Experimental Psychology: Learning, Memory, and Cognition 44, no. 12 (2018): 1857–1863.

Chapter 8: How Groups Amplify Noise

Matthew Salganik and his coauthors ran a large study of music downloads: Matthew J. Salganik, Peter Sheridan Dodds, Duncan J. Watts, “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market,” Science 311 (2006): 854–856. See also: Matthew Salganik, Duncan Watts, “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market,” Social Psychology Quarterly 71 (2008): 338–355; Matthew Salganik, Duncan Watts, “Web-Based Experiments for the Study of Collective Social Dynamics in Cultural Markets,” Topics in Cognitive Science 1 (2009): 439–468.

Popularity is self-reinforcing: Salganik, Watts, “Leading the Herd Astray.”

Similar results have appeared in other domains: Michael Macy et al., “Opinion Cascades and the Unpredictability of Partisan Polarization,” Science Advances (2019): 1–8. See also: Helen Margetts et al., Political Turbulence (Princeton: Princeton University Press, 2015).

The sociologist Michael Macy: Michael Macy et al., “Opinion Cascades.”

How people judge comments online: Lev Muchnik et al., “Social Influence Bias: A Randomized Experiment,” Science 341, no. 6146 (2013): 647–651.

Some studies have shown this: Jan Lorenz et al., “How Social Influence Can Undermine the Wisdom of Crowd Effect,” Proceedings of the National Academy of Sciences 108, no. 22 (2011): 9020–9025.

Consider an experiment that compared real-world juries with “statistical juries”: Daniel Kahneman, David Schkade, Cass R. Sunstein, “Shared Outrage and Erratic Awards: The Psychology of Punitive Damages,” Journal of Risk and Uncertainty 16 (1998): 49–86.

They were formed into more than 500 six-person juries: David Schkade, Cass R. Sunstein, Daniel Kahneman, “Deliberating about Dollars: The Severity Shift,” Columbia Law Review 100 (2000): 1139–1175.

Part III: Noise in Predictive Judgments

Percent concordant (PC): The percent concordant is closely related to Kendall's coefficient of concordance (Kendall’s W).

The PC between foot size and height in adult men is 71%: Kanwal Kamboj et al., “A Study on the Correlation Between Foot Length and Height of an Individual and to Derive Regression Formulae to Estimate the Height from Foot Length of an Individual,” International Journal of Research in Medical Sciences 6, no. 2 (2018): 528.

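Correspondence between the correlation coefficient and the percent concordant (PC): The percent concordant is computed on the assumption that the joint distribution is bivariate normal. The values in table 1 are approximations based on that assumption. We thank Julian Parris for producing the table.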
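The note does not state the formula behind the table. One standard closed form for the probability that a random pair of cases is ordered the same way on both variables, under a bivariate normal distribution with correlation r, is 1/2 + arcsin(r)/π. The sketch below assumes that this is the relationship intended; the function name is hypothetical, and results may differ from the printed table by a percentage point because of rounding.

```python
from math import asin, pi


def percent_concordant(r):
    """Probability of concordance for a bivariate normal with correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 0.5 + asin(r) / pi


# Correlations quoted in later notes: 0.22 and 0.25.
for r in (0.0, 0.15, 0.22, 0.25, 1.0):
    print(r, round(100 * percent_concordant(r)))
```

With this formula, correlations of 0.22 and 0.25 give 57% and 58%, which agrees with the PC values quoted for those correlations in the notes to chapters 10 and 12.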

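Chapter 9: Judgments and Models, Simple Models Generally Beat Human Judgment

A real study of performance prediction: Martin C. Yu, Nathan R. Kuncel, “Pushing the Limits for Judgmental Consistency: Comparing Random Weighting Schemes with Expert Judgments,” Personnel Assessment and Decisions 6, no. 2 (2020): 1–10. The correlation of 0.15 reported for the experts in this book is the unweighted average of the correlations in three samples (847 cases in total). The real study differs in several respects from the simplified description given here.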

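It is a method that produces a predicted score as a weighted average of the predictors: A weighted average requires that all predictors be measured in comparable units. The example we present meets this requirement, because all predictors are rated on a scale from 0 to 10. In some cases, however, the requirement cannot be met. For example, the predictors of job performance might include an interviewer's rating on a 0-to-10 scale, the number of years of relevant work experience, and the score on a written test of job skills. Before combining such variables, a multiple regression program converts them into standard scores. A standard score measures the distance of an observation from the group mean, in units of the standard deviation. For example, if the mean score on the skills test is 55 and the standard deviation is 8, a standard score of +1.5 corresponds to a test score of 67. Note that standardizing each individual's data removes any trace of error in the mean or in the variance of that individual's judgments.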
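The conversion described in this note is a one-line calculation in each direction. The sketch below uses the note's own numbers (mean 55, standard deviation 8); the function names are hypothetical.

```python
def standard_score(value, mean, sd):
    """Distance of a value from the mean, in standard deviation units."""
    return (value - mean) / sd


def from_standard_score(z, mean, sd):
    """Raw score corresponding to a given standard score."""
    return mean + z * sd


# The note's example: a standard score of +1.5 on the skills test.
print(from_standard_score(1.5, mean=55, sd=8))  # raw test score
print(standard_score(67, mean=55, sd=8))        # back to the standard score
```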

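You might think that the more closely a predictor is correlated with the target variable, the larger its weight should be: The defining feature of multiple regression is that the optimal weight of each predictor depends on the other predictors. If one predictor is highly correlated with another, the two should not both receive large weights, because that would amount to counting the same variable twice.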
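For the special case of two standardized predictors, the optimal regression weights have a simple textbook closed form, which makes the point of this note concrete. The sketch below is an illustration with hypothetical correlations, not a reconstruction of any analysis in the book.

```python
def standardized_weights(r1y, r2y, r12):
    """Optimal standardized weights for two predictors.

    r1y, r2y: correlation of each predictor with the outcome.
    r12: correlation between the two predictors.
    """
    denom = 1.0 - r12 ** 2
    if denom <= 0.0:
        raise ValueError("predictors must not be perfectly correlated")
    beta1 = (r1y - r12 * r2y) / denom
    beta2 = (r2y - r12 * r1y) / denom
    return beta1, beta2


# Two predictors that each correlate 0.5 with the outcome.
print(standardized_weights(0.5, 0.5, 0.0))  # uncorrelated predictors
print(standardized_weights(0.5, 0.5, 0.8))  # highly correlated predictors
```

When the predictors are uncorrelated, each keeps a weight of 0.5; when they correlate 0.8 with each other, each weight shrinks to about 0.28, because much of what one predictor says is already said by the other.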

The workhorse of judgment and decision-making research: Robin M. Hogarth, Natalia Karelaia, “Heuristic and Linear Models of Judgment: Matching Rules and Environments,” Psychological Review 114, no. 3 (2007): 734.

Both share the following simple structure: The discussion here draws mainly on the framework of the lens model of judgment, which has been widely used in personnel assessment settings. See: Kenneth R. Hammond, “Probabilistic Functioning and the Clinical Method,” Psychological Review 62, no. 4 (1955): 255–262; Natalia Karelaia, Robin M. Hogarth, “Determinants of Linear Judgment: A Meta-Analysis of Lens Model Studies,” Psychological Bulletin 134, no. 3 (2008): 404–426.

Paul Meehl: Paul E. Meehl, Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence (Minneapolis: University of Minnesota Press, 1954).

A picture of Freud: Paul E. Meehl, Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence (Northvale, NJ: Aronson, 1996), preface.

Meehl was not only an academic researcher but also a psychoanalytically oriented therapist with extensive clinical experience: “Paul E. Meehl,” in Ed Lindzey (Ed.), A History of Psychology in Autobiography, 1989.

“Massive and consistent”: “Paul E. Meehl,” in A History of Psychology in Autobiography, ed. Ed Lindzey (Washington, DC: American Psychological Association, 1989), 362.

A review of 136 studies, published in 2000: William M. Grove et al., “Clinical Versus Mechanical Prediction: A Meta-Analysis,” Psychological Assessment 12, no. 1 (2000): 19–30.

Humans also had an unfair advantage, because they had access to “private” information not supplied to the computer model: William M. Grove, Paul E. Meehl, “Comparative Efficiency of Informal (Subjective, Impressionistic) and Formal (Mechanical, Algorithmic) Prediction Procedures: The Clinical-Statistical Controversy,” Psychology, Public Policy, and Law 2, no. 2 (1996): 293–323.

In the late 1960s, building on earlier work by Hoffman, Goldberg began to study statistical models that describe the judgments of an individual: Lewis Goldberg, “Man Versus Model of Man: A Rationale, plus Some Evidence, for a Method of Improving on Clinical Inferences,” Psychological Bulletin 73, no. 6 (1970): 422–432.

Expert billiard players describing a shot behave as if they had solved complex equations, although they have done no such thing: Milton Friedman, Leonard J. Savage, “The Utility Analysis of Choices Involving Risk,” Journal of Political Economy 56, no. 4 (1948): 279–304.

Although the correlation is not perfect, it is high enough to support the so-called “as if” theory: Karelaia, Hogarth, “Determinants of Linear Judgment,” 411, table 1.

An early study of the prediction of graduate school success confirmed Goldberg's conclusion: Nancy Wiggins, Eileen S. Kohen, “Man Versus Model of Man Revisited: The Forecasting of Graduate School Success,” Journal of Personality and Social Psychology 19, no. 1 (1971): 100–106.

A review of nearly 50 years of research reached the same conclusion: Karelaia, Hogarth, “Determinants of Linear Judgment.”

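Removing noise from your judgments will generally improve your predictive accuracy: A correlation coefficient can be corrected for the imperfect reliability of the predictor, a procedure known as correction for attenuation. The formula is: corrected $r_{xy} = r_{xy} / \sqrt{r_{xx}}$, where $r_{xx}$ is the reliability coefficient (the proportion of the predictor's observed variance that is true variance).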
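The correction formula translates directly into code. In the sketch below the function name and the example values (an observed correlation of 0.30 and a reliability of 0.64) are hypothetical.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx):
    """Correlation corrected for the unreliability of the predictor.

    r_xy: observed correlation between predictor and outcome.
    r_xx: reliability of the predictor (share of its variance that is true variance).
    """
    if not 0.0 < r_xx <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return r_xy / sqrt(r_xx)


# A predictor with reliability 0.64 and an observed correlation of 0.30.
print(correct_for_attenuation(0.30, 0.64))
```

A perfectly reliable predictor (reliability 1) leaves the correlation unchanged; the noisier the predictor, the larger the upward correction.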

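Martin Yu and Nathan Kuncel reported a study more radical than Goldberg's: Yu, Kuncel, “Judgmental Consistency.”

Random formulas: We discuss equal-weight and random-weight models in more detail in the next chapter. The weights are restricted to a limited range of values and are constrained to have the appropriate sign.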

Chapter 10: Noiseless Rules

These equal-weight models are about as accurate as the best-fitting regression models and far superior to clinical judgments: Robyn M. Dawes, Bernard Corrigan, “Linear Models in Decision Making,” Psychological Bulletin 81, no. 2 (1974): 95–106. Robyn Dawes and Bernard Corrigan also proposed using random weights. The study of managerial performance prediction described in chapter 9 applies this idea.

“Contrary to statistical intuition”: Jason Dana, “What Makes Improper Linear Models Tick?,” in Rationality and Social Responsibility: Essays in Honor of Robyn M. Dawes, ed. Joachim I. Krueger, 71–89 (New York: Psychology Press, 2008), 73.

Many other studies have found similar results: Jason Dana, Robyn M. Dawes, “The Superiority of Simple Alternatives to Regression for Social Sciences Prediction,” Journal of Educational and Behavioral Statistics 29 (2004): 317–331; Dana, “What Makes Improper Linear Models Tick?”

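It does not matter: Howard Wainer, “Estimating Coefficients in Linear Models: It Don’t Make No Nevermind,” Psychological Bulletin 83, no. 2 (1976): 213–217.

We do not need models more precise than our measurements: Dana, “What Makes Improper Linear Models Tick?,” 72.

Its correlation with the outcome would be 0.25 (PC = 58%): Martin C. Yu, Nathan R. Kuncel, “Pushing the Limits for Judgmental Consistency: Comparing Random Weighting Schemes with Expert Judgments,” Personnel Assessment and Decisions 6, no. 2 (2020): 1–10. As in the previous chapter, the correlation reported here is the unweighted average of the correlations in three samples. The comparison holds in each of the three samples: the validity of clinical expert judgment was 0.17, 0.16, and 0.13, and the validity of the equal-weight model was 0.19, 0.33, and 0.22, respectively.

Robust beauty: Robyn M. Dawes, “The Robust Beauty of Improper Linear Models in Decision Making,” American Psychologist 34, no. 7 (1979): 571–582.

The whole trick in applying an equal-weight model is to decide which variables to look at and to know how to add them up: Dawes, Corrigan, “Linear Models in Decision Making,” 105.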

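A research team published a study in 2020 in which they applied frugal models to a range of real-world problems: Jongbin Jung, Conner Concannon, Ravi Shroff, Sharad Goel, Daniel G. Goldstein, “Simple Rules to Guide Expert Classifications,” Journal of the Royal Statistical Society, Statistics in Society 183 (2020): 771–800.

Another research team studied a similar but distinct judicial problem: Julia Dressel, Hany Farid, “The Accuracy, Fairness, and Limits of Predicting Recidivism,” Science Advances 4, no. 1 (2018): 1–6.

A model using only two inputs can match the predictive validity of a model using 137 variables: Both examples are linear models based on a very small set of variables (and, in the bail model, the linear weights are approximated by rounding, which turns the model into a rough back-of-the-envelope calculation). Another kind of “improper model” is the single-variable rule, which considers only one predictor and ignores all the others. See: Peter M. Todd, Gerd Gigerenzer, “Précis of Simple Heuristics That Make Us Smart,” Behavioral and Brain Sciences 23, no. 5 (2000): 727–741.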

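And abundant evidence shows that they are also closely related to criminal behavior: P. Gendreau, T. Little, C. Goggin, “A Meta-Analysis of the Predictors of Adult Offender Recidivism: What Works!,” Criminology 34 (1996).

Very large data sets: “Large” here should be understood as the ratio of the number of observations to the number of predictors. In “Robust Beauty,” Robyn Dawes suggested that this ratio may need to be at least 15 to 1, or even 20 to 1, before optimal weights outperform unit weights on cross-validation. Drawing on many more case studies, Jason Dana and Robyn Dawes (“The Superiority of Simple Alternatives”) argued that the ratio should exceed 100 to 1.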

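Another team, led by Sendhil Mullainathan: J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, S. Mullainathan, “Human Decisions and Machine Predictions,” Quarterly Journal of Economics 133 (2018): 237–293.

The researchers used these data to train a machine-learning algorithm: The algorithm was trained on one part of the data and then evaluated on its ability to predict outcomes in a new, randomly selected data set.

The machine-learning algorithm found important signals in combinations of variables that a linear model would miss: Kleinberg et al., “Human Decisions,” 16.

System noise includes some level noise, that is, differences in average severity, but most of it (79%) is pattern noise: Gregory Stoddard, Jens Ludwig, and Sendhil Mullainathan, email exchanges with the authors, June–July 2020.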

Bo Cowgill, a professor at Columbia Business School, studied the hiring of software engineers at a large technology company: B. Cowgill, “Bias and Productivity in Humans and Algorithms: Theory and Evidence from Résumé Screening,” Smith Entrepreneurship Research Conference, College Park, MD, April 21, 2018.

A 1996 paper: William M. Grove, Paul E. Meehl, “Comparative Efficiency of Informal (Subjective, Impressionistic) and Formal (Mechanical, Algorithmic) Prediction Procedures: The Clinical-Statistical Controversy,” Psychology, Public Policy, and Law 2, no. 2 (1996): 293–323.

When choosing between advice from a human and advice from an algorithm, people often prefer the latter: Jennifer M. Logg, Julia A. Minson, Don A. Moore, “Algorithm Appreciation: People Prefer Algorithmic to Human Judgment,” Organizational Behavior and Human Decision Processes 151 (April 2018): 90–103.

But once they see it make a mistake, they stop trusting it: B. J. Dietvorst, J. P. Simmons, C. Massey, “Algorithm Aversion: People Erroneously Avoid Algorithms After Seeing Them Err,” Journal of Experimental Psychology: General 144 (2015): 114–126. See also: A. Prahl, L. Van Swol, “Understanding Algorithm Aversion: When Is Advice from Automation Discounted?,” Journal of Forecasting 36 (2017): 691–702.

We expect machines to be perfect; if a machine is not perfect, we discard it: M. T. Dzindolet, L. G. Pierce, H. P. Beck, L. A. Dawe, “The Perceived Utility of Human and Automated Aids in a Visual Detection Task,” Human Factors: The Journal of the Human Factors and Ergonomics Society 44, no. 1 (2002): 79–94; K. A. Hoff, M. Bashir, “Trust in Automation: Integrating Empirical Evidence on Factors That Influence Trust,” Human Factors: The Journal of the Human Factors and Ergonomics Society 57, no. 3 (2015): 407–434; P. Madhavan, D. A. Wiegmann, “Similarities and Differences Between Human–Human and Human–Automation Trust: An Integrative Review,” Theoretical Issues in Ergonomics Science 8, no. 4 (2007): 277–301.

Chapter 11: Wherever There Is Prediction, There Is Objective Ignorance

Research on managerial decision making: E. Dane, M. G. Pratt, “Exploring Intuition and Its Role in Managerial Decision Making,” Academy of Management Review 32, no. 1 (2007): 33–54; Cinla Akinci, Eugene Sadler-Smith, “Intuition in Management Research: A Historical Review,” International Journal of Management Reviews 14 (2012): 104–122; Gerard P. Hodgkinson et al., “Intuition in Organizations: Implications for Strategic Management,” Long Range Planning 42 (2009): 277–297.

A review of intuition in managerial decision making defines intuition as a judgment for a given course of action: Hodgkinson et al., “Intuition in Organizations,” 279.

A recent report on personnel selection: Nathan Kuncel et al., “Mechanical Versus Clinical Data Combination in Selection and Admissions Decisions: A Meta-Analysis,” Journal of Applied Psychology 98, no. 6 (2013): 1060–1072. For more on personnel decisions, see chapter 24.

Overconfidence is one of the most extensively documented cognitive biases: Don A. Moore, Perfectly Confident: How to Calibrate Your Decisions Wisely (New York: HarperCollins, 2020).

Comment or offer advice on political and economic trends: Philip E. Tetlock, Expert Political Judgment: How Good Is It? How Can We Know? (Princeton, NJ: Princeton University Press, 2005), 239, 233.

A review of 136 studies: William M. Grove et al., “Clinical Versus Mechanical Prediction: A Meta-Analysis,” Psychological Assessment 12, no. 1 (2000): 19–30.

Another study, by Sendhil Mullainathan and Ziad Obermeyer, modeled the diagnosis of heart attacks: Sendhil Mullainathan, Ziad Obermeyer, “Who Is Tested for Heart Attack and Who Should Be: Predicting Patient Risk and Physician Error,” 2019. NBER Working Paper 26168, National Bureau of Economic Research.

Where ignorance abounds, the denial of ignorance is all the more tempting: Weston Agor, “The Logic of Intuition: How Top Executives Make Important Decisions,” Organizational Dynamics 14, no. 3 (1986): 5–18; Lisa A. Burke, Monica K. Miller, “Taking the Mystery Out of Intuitive Decision Making,” Academy of Management Perspectives 13, no. 4 (1999): 91–99.

People are more willing to trust an algorithm when it achieves higher accuracy: Poornima Madhavan, Douglas A. Wiegmann, “Effects of Information Source, Pedigree, and Reliability on Operator Interaction with Decision Support Systems,” Human Factors: The Journal of the Human Factors and Ergonomics Society 49, no. 5 (2007).

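Chapter 12: The Valley of the Normal: Events Cannot Be Predicted, but They Can Be Understood

An unusual paper: Matthew J. Salganik et al., “Measuring the Predictability of Life Outcomes with a Scientific Mass Collaboration,” Proceedings of the National Academy of Sciences 117, no. 15 (2020): 8398–8403.

In the first stage of the competition, participating teams had access to all the data for half of the sample, including the six outcomes: The study used data from only 4,242 families. Because of privacy concerns, some families in the Fragile Families study were excluded from the analysis.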

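In predicting evictions, the best model achieved a correlation of only 0.22 (PC = 57%): To score accuracy, the organizers of the competition used mean squared error (MSE), the same metric we introduced in chapter 5. For purposes of comparison, the study also defined a benchmark model against which the MSE of every other model could be compared. The benchmark is a “useless” prediction strategy: a single value fitted to all the data, that is, a prediction for every case equal to the mean of the training data. For convenience, we converted the results into correlations. MSE and the correlation are related by: $r^2 = (\mathrm{Var}(Y) - \mathrm{MSE}) / \mathrm{Var}(Y)$, where $\mathrm{Var}(Y)$ is the variance of the outcome variable and $(\mathrm{Var}(Y) - \mathrm{MSE})$ is the variance of the outcome that is predicted.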
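The conversion from MSE to a correlation can be sketched as below. The function name is hypothetical, the example numbers are chosen only to reproduce r = 0.22, and clipping a negative value to zero (a model worse than the benchmark) is a choice made for this sketch, not something the note specifies.

```python
from math import sqrt


def correlation_from_mse(mse, outcome_variance):
    """Correlation implied by r^2 = (Var(Y) - MSE) / Var(Y)."""
    if outcome_variance <= 0:
        raise ValueError("outcome variance must be positive")
    r_squared = (outcome_variance - mse) / outcome_variance
    # Assumption of this sketch: a model no better than the benchmark maps to r = 0.
    return sqrt(max(r_squared, 0.0))


# The benchmark that always predicts the mean has MSE equal to Var(Y), hence r = 0.
print(correlation_from_mse(1.0, 1.0))
# A model that removes 4.84% of the outcome variance has r = 0.22.
print(correlation_from_mse(0.9516, 1.0))
```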

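A review of research in social psychology: F. D. Richard et al., “One Hundred Years of Social Psychology Quantitatively Described,” Review of General Psychology 7, no. 4 (2003): 331–363.

A review of 708 studies in the behavioral and cognitive sciences found that only 3% of the reported correlations were greater than 0.5: Gilles E. Gignac, Eva T. Szodorai, “Effect Size Guidelines for Individual Differences Researchers,” Personality and Individual Differences 102 (2016): 74–78.

“Researchers must recognize that although they understand the life trajectories of fragile families, none of the predictions was very accurate”: Note that the study deliberately used an existing descriptive data set, which is very large but was not created to predict specific outcomes. This differs from Tetlock's study, which drew on every variable that seemed likely to have predictive power. It might be possible to find predictors that are conceivable but absent from the existing data set. On the basis of the existing data set, therefore, the study cannot prove that eviction is intrinsically unpredictable, nor that the other outcomes are.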

To understand is to describe a causal relationship, and predictive power is a measure of whether that relationship holds: Jake M. Hofman et al., “Prediction and Explanation in Social Systems,” Science 355 (2017): 486–488; Duncan J. Watts et al., “Explanation, Prediction, and Causality: Three Sides of the Same Coin?,” October 2018: 1–14.

One mode of thinking comes to mind spontaneously: A related distinction contrasts extensional thinking with nonextensional thinking, also called intensional thinking. Amos Tversky, Daniel Kahneman, “Extensional Versus Intuitive Reasoning: The Conjunction Fallacy in Probability Judgment,” Psychological Review 90, no. 4 (1983): 293–315.

This is because the process of understanding reality is backward-looking: Daniel Kahneman, Dale T. Miller, “Norm Theory: Comparing Reality to Its Alternatives,” Psychological Review 93, no. 2 (1986): 136–153.

Classic research on “hindsight”: Baruch Fischhoff, “An Early History of Hindsight Research,” Social Cognition 25, no. 1 (2007): 10–13; Baruch Fischhoff, “Hindsight Is Not Equal to Foresight: The Effect of Outcome Knowledge on Judgment Under Uncertainty,” Journal of Experimental Psychology: Human Perception and Performance 1, no. 3 (1975): 288.

Unlike causal thinking, statistical thinking is usually effortful; it requires attentional resources that are available only when System 2, the slow and deliberate mode of thinking, is engaged: Daniel Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2011).

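Part IV: How Noise Happens

Chapter 13: Heuristics, Biases, and Noise

Thinking, Fast and Slow reviewed the first four decades of this research program and explored the psychological mechanisms that explain “the marvels and the flaws of intuitive thinking”: Daniel Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2011).

The evidence shows that the probabilities estimated by the two groups differ so little that the difference is negligible: A caveat is in order. Psychologists who study judgment biases do not run experiments with only five participants in each group (see figure 6-2), because judgments are noisy and the results of different experimental groups do not always cluster as closely as figure 10-1 suggests. People also differ in their susceptibility to each bias, and they do not completely neglect relevant variables. For example, with a large number of participants you could confirm that scope insensitivity is imperfect: the estimated probability that Gambardi will have left the position within three years is only slightly higher than the estimate for two years. It is still appropriate to describe this as scope insensitivity, because the difference is far smaller than it should be.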

Many experiments have shown that people give the same answers to these two kinds of questions: Daniel Kahneman et al., eds., Judgment Under Uncertainty: Heuristics and Biases (New York: Cambridge University Press, 1982), chap. 6; Daniel Kahneman, Amos Tversky, “On the Psychology of Prediction,” Psychological Review 80, no. 4 (1973): 237–251.

The turnover rate of CEOs at US companies is about 15% per year; see: Steven N. Kaplan, Bernadette A. Minton, “How Has CEO Turnover Changed?,” International Review of Finance 12, no. 1 (2012): 57–87; Dirk Jenter, Katharina Lewellen, “Performance-Induced CEO Turnover,” Harvard Law School Forum on Corporate Governance, September 2, 2020.

At a critical stage in the writing of the screenplay for Return of the Jedi, the third Star Wars film, George Lucas, the creator of the series, had a heated debate with his brilliant collaborator Lawrence Kasdan: J. W. Rinzler, The Making of Star Wars Return of the Jedi: The Definitive Story (New York: Del Rey, 2013), 64.

This example illustrates another type of bias, which we call conclusion bias, or prejudgment: Cass Sunstein, The World According to Star Wars (New York: HarperCollins, 2016).

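In that case, the evidence is selective and distorted: The example described here is a simple case of prejudgment. In fact, even without prejudgment, a bias toward a particular conclusion can develop as evidence accumulates, because we tend to favor simplicity and coherence. Once a tentative conclusion has formed, confirmation bias inclines us to collect or interpret new evidence in ways that support it.

You will tend to accept any argument that appears to support the belief, even if the reasoning is wrong: This phenomenon is also called belief bias. See: J. St. B. T. Evans, Julie L. Barston, Paul Pollard, “On the Conflict between Logic and Belief in Syllogistic Reasoning,” Memory & Cognition 11, no. 3 (1983): 295–306.

In a typical demonstration, you might be shown a number of items whose price is not easy to guess: Dan Ariely, George Loewenstein, Drazen Prelec, “‘Coherent Arbitrariness’: Stable Demand Curves Without Stable Preferences,” Quarterly Journal of Economics 118, no. 1 (2003): 73–105.

Anchoring is an extremely robust effect, and it is often used deliberately in negotiations: Adam D. Galinsky, T. Mussweiler, “First Offers as Anchors: The Role of Perspective-Taking and Negotiator Focus,” Journal of Personality and Social Psychology 81, no. 4 (2001): 657–669.

This experiment illustrates excessive coherence: Solomon E. Asch, “Forming Impressions of Personality,” Journal of Abnormal and Social Psychology 41, no. 3 (1946): 258–290. This paper was the first to use series of adjectives presented in different orders to illustrate the phenomenon.

One study offers some insight: Steven K. Dallas et al., “Don’t Count Calorie Labeling Out: Calorie Counts on the Left Side of Menu Items Lead to Lower Calorie Food Choices,” Journal of Consumer Psychology 29, no. 1 (2019): 60–69.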

Chapter 14: Matching, Finding the Consensus That Best Matches Your Prediction

Matching the intensities of two entirely unrelated dimensions: S. S. Stevens, “On the Operation Known as Judgment,” American Scientist 54, no. 4 (December 1966): 385–401. Our use of the term matching is broader than Stevens's, which was restricted to ratio scales. We discuss ratio scales in chapter 15.

Systematic errors of judgment: This example first appeared in Daniel Kahneman's Thinking, Fast and Slow.

The answers you get will be exactly the same two numbers: Daniel Kahneman, Amos Tversky, “On the Psychology of Prediction,” Psychological Review 80, no. 4 (1973): 237–251.

“The Magical Number Seven, Plus or Minus Two”: G. A. Miller, “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information,” Psychological Review 63, no. 2 (1956): 81–97.

Even so, forcing comparative judgments may still reduce noise: R. D. Goffin, J. M. Olson, “Is It All Relative? Comparative Judgments and the Possible Improvement of Self-Ratings and Ratings of Others,” Perspectives on Psychological Science 6 (2011): 48–60.

第15章　选取精确的量表，并多用相对判断

在1998年报告过的一项研究：Daniel Kahneman, David Schkade, Cass Sunstein, “Shared Outrage and Erratic Awards: The Psychology of Punitive Damages,” Journal of Risk and Uncertainty 16（1998）: 49–86; Cass Sunstein, Daniel Kahneman, David Schkade, “Assessing Punitive Damages（with Notes on Cognition and Valuation in Law）,” Yale Law Journal 107, no. 7（May 1998）: 2071–2153。该研究的费用由Exxon公司一次性承担。不过Exxon公司并未对研究者支付报酬，也未控制研究数据、研究结果或文章发表等流程。

高度怀疑：A. Keane, P. McKeown, The Modern Law of Evidence（New York: Oxford University Press, 2014）。

不太可能发生：Andrew Mauboussin, Michael J. Mauboussin, “If You Say Something Is ‘Likely,’ How Likely Do People Think It Is?,” Harvard Business Review, July 3, 2018。

一家汽车经销商被处以400万美元的惩罚性损害赔偿，理由是该公司未告知原告，他们售卖的新宝马车是重新喷过漆的：BMW vs. Gore, 517 U.S. 559（1996）。

这种高相关性支持了愤怒假设，即愤怒情绪是惩罚倾向的主要决定因素。要进一步了解情绪在道德判断中的作用，参见：J. Haidt, “The Emotional Dog and Its Rational Tail: A Social Intuitionist Approach to Moral Judgment,” Psychological Review 108, no. 4（2001）: 814–834; Joshua Greene, Moral Tribes: Emotion, Reason, and the Gap Between Us and Them（New York: Penguin Press, 2014）。

图15-1显示了分析结果：愤怒程度和惩罚倾向之间的高相关性（相关系数为0.98）支持了愤怒假设，这可能会让你感到很困惑，因为评分中存在大量的噪声。不过，当你想起来相关系数计算的是判断的平均分时，你就不会那么困惑了。如果是100个人判断的平均值，其噪声（判断的标准差）就会变为原来的1/10。如果许多判断被整合在一起，噪声就不会产生很大的影响了，详见第21章。
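
上面提到的“1/10”来自一条一般规律：n个相互独立的判断取平均后，其噪声（标准差）缩小为单个判断的1/√n。下面是一个最简示意，其中函数名 noise_of_mean 和单个判断的标准差 10.0 都是为说明而假设的，并非原研究的数据。

```python
import math


def noise_of_mean(single_sd: float, n: int) -> float:
    """返回 n 个相互独立判断的平均值的标准差。"""
    if n <= 0:
        raise ValueError("n 必须为正整数")
    return single_sd / math.sqrt(n)


single_sd = 10.0  # 假设值：单个判断的标准差
ratio = noise_of_mean(single_sd, 100) / single_sd
print(ratio)  # 0.1，即原来的 1/10
```

这一关系只在各判断的误差相互独立时成立；如果判断者之间互相影响，噪声的缩减幅度会更小。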

人们对许多主观体验和态度的强度比例都有着强烈的直觉：S. S. Stevens, Psychophysics: Introduction to Its Perceptual, Neural and Social Prospects（New York: John Wiley & Sons, 1975）。

任意连贯性：Dan Ariely, George Loewenstein, Drazen Prelec, “‘Coherent Arbitrariness’: Stable Demand Curves Without Stable Preferences,” Quarterly Journal of Economics 118, no. 1（2003）: 73–106, 197。

将赔偿金额转换为排序：转换成排序的同时必然会损失信息，因为判断间的距离信息并未被保留。假设只有3个案件，一位陪审员建议3个案件的损害赔偿金额分别为1000万美元、200万美元和100万美元。很显然，这位陪审员计划中对第一个案件和第二个案件的赔偿金额之间的差异要远远高于第二个和第三个案件赔偿金额之间的差异。但是一旦被转换成排名，上述差异就变得一样了——两者都只差一名。如果是将判断转换成标准分，就可以解决这一问题。
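
注释中三个案件的例子可以用几行代码来核对：排序抹平了金额之间的距离，而标准分保留了距离的比例。以下只是一个示意，函数名 to_ranks、to_z_scores 以及使用总体标准差的做法都是假设，并非原研究采用的方法。

```python
import statistics

# 注释中的三个案件，单位：美元
awards = [10_000_000, 2_000_000, 1_000_000]


def to_ranks(values):
    """金额越高名次越靠前（1 表示最高）。"""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def to_z_scores(values):
    """标准分：(x - 均值) / 标准差（此处假设使用总体标准差）。"""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


ranks = to_ranks(awards)
z = to_z_scores(awards)

# 相邻案件之间的差距
rank_gaps = (ranks[1] - ranks[0], ranks[2] - ranks[1])
z_gaps = (z[0] - z[1], z[1] - z[2])

print(rank_gaps)  # (1, 1)：排序后两个差距相同
print(z_gaps[0] / z_gaps[1])  # 标准分仍保留 800 万与 100 万之间约 8 倍的差距
```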

第16章　模式噪声的构成

大量证据已经证明，确实存在这样一种知觉过程：R. Blake, N. K. Logothetis, “Visual competition,” Nature Reviews Neuroscience 3（2002）: 13–21; M. A. Gernsbacher, M. E. Faust, “The Mechanism of Suppression: A Component of General Comprehension Skill,” Journal of Experimental Psychology: Learning, Memory, and Cognition 17（March 1991）: 245–262; M. C. Stites, K. D. Federmeier, “Subsequent to Suppression: Downstream Comprehension Consequences of Noun/Verb Ambiguity in Natural Reading,” Journal of Experimental Psychology: Learning, Memory, and Cognition 41（September 2015）: 1497–1515。

但大多数时候我们的信心都高于应有的程度：D. A. Moore, D. Schatz, “The three faces of overconfidence,” Social and Personality Psychology Compass 11, no. 8（2017）, e12331。

不同的专业人士将会考虑不同方面并相互补充：S. Highhouse, A. Broadfoot, J. E. Yugo, S. A. Devendorf, “Examining Corporate Reputation Judgments with Generalizability Theory,” Journal of Applied Psychology 94（2009）: 782–789。我们要感谢斯科特·海浩斯和艾莉森·布罗德福特，他们为我们提供了原始数据，还要感谢朱利安·帕里斯帮我们做了进一步分析。

早期研究尝试在字典中搜寻能够描述一个人的词语：阿尔波特和奥德贝特（1936）有关人格特质的英文词典的研究，引用于Oliver P. John, Sanjay Srivastava, “The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives,” Handbook of Personality: Theory and Research, 2nd ed., ed. L. Pervin, Oliver P. John（New York: Guilford, 1999）。

已经很高了：Ian W. Eisenberg, Patrick G. Bissett, A. Zeynep Enkavi et al., “Uncovering the structure of self-regulation through data-driven ontology discovery,” Nature Communications 10（2019）: 2319。

在身体受到威胁时：Walter Mischel, “Toward an Integrative Science of the Person,” Annual Review of Psychology 55（2004）: 1–22。

第17章　噪声源，偏差是引人注目的图形，而噪声是不受我们关注的背景

现在，你可以看到均方误差如何被分解为偏差以及我们曾讨论过的3种噪声成分的平方：如何分解偏差和噪声没有一个固定的原则，图17-1粗略地展示了某些我们讨论过的真实案例或虚拟案例的分解成分。具体而言，在该图中，偏差和噪声是等量的，就像GoodSell对销售量的预测那样。水平噪声的平方是系统噪声平方的37%，就像在惩罚性损害赔偿案例中那样。如图17-1所示，情境噪声的平方约为模式噪声平方的35%。
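
按照注释给出的比例，可以把均方误差逐层拆开来核对。以下是一个示意，假设均方误差被归一化为1，并假设模式噪声的平方等于稳定模式噪声的平方与情境噪声的平方之和；函数名和字典的键名都是为说明而设的。

```python
def decompose_mse(mse=1.0, level_share=0.37, occasion_share=0.35):
    """按注释中的比例分解均方误差，各成分均以平方形式给出。"""
    bias_sq = mse / 2                      # 偏差与噪声等量
    system_sq = mse - bias_sq              # 系统噪声的平方
    level_sq = level_share * system_sq     # 水平噪声的平方占系统噪声平方的 37%
    pattern_sq = system_sq - level_sq      # 其余为模式噪声的平方
    occasion_sq = occasion_share * pattern_sq  # 情境噪声的平方约占模式噪声平方的 35%
    stable_pattern_sq = pattern_sq - occasion_sq  # 假设：余下部分为稳定模式噪声
    return {
        "bias_sq": bias_sq,
        "level_sq": level_sq,
        "stable_pattern_sq": stable_pattern_sq,
        "occasion_sq": occasion_sq,
    }


parts = decompose_mse()
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
print(f"total: {sum(parts.values()):.3f}")
```

四个成分相加应当还原出均方误差，这也是检查分解是否自洽的最简单办法。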

针对专利局的研究发现：参见引言中的参考文献，Mark A. Lemley, Bhaven Sampat, “Examiner Characteristics and Patent Office Outcomes,” Review of Economics and Statistics 94, no. 3（2012）: 817–827。也可参见：Iain Cockburn, Samuel Kortum, Scott Stern, “Are All Patent Examiners Equal? The Impact of Examiner Characteristics,” Working paper 8980, June 2002; Michael D. Frakes, Melissa F. Wasserman, “Is the Time Allocated to Review Patent Applications Inducing Examiners to Grant Invalid Patents? Evidence from Microlevel Application Data,” Review of Economics and Statistics 99, no. 3（July 2017）: 550–563。

儿童保护部门的官员决定将儿童送到寄养家庭或寄养机构的倾向性也不同：Joseph J. Doyle Jr., “Child Protection and Child Outcomes: Measuring the Effects of Foster Care,” American Economic Review 95, no. 5（December 2007）: 1583–1610。

法官在是否提供庇护的裁决中出现的令人震惊的巨大变异：Andrew I. Schoenholtz, Jaya Ramji-Nogales, Philip G. Schrag, “Refugee Roulette: Disparities in Asylum Adjudication,” Stanford Law Review 60, no. 2（2007）。

同一法官在不同情境中对同一案件量刑的平均差异达到2.8年左右：在第6章中，我们估计交互项的方差占总方差的23%，这一量刑差异正是据此估算出来的。其依据是：假设判刑时间服从正态分布，两个随机抽取的观测值之间绝对差异的期望值为1.128个标准差。
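
系数1.128可以这样得到：如果两个观测值独立地来自同一个标准差为σ的正态分布，它们绝对差异的期望值为2σ/√π。下面的示意先给出解析值，再用模拟粗略验证；函数名、样本量和随机种子都是假设。

```python
import math
import random


def expected_abs_diff(sd: float) -> float:
    """两个独立同分布正态观测值之间绝对差异的期望值：2 * sd / sqrt(pi)。"""
    return 2 * sd / math.sqrt(math.pi)


def simulated_abs_diff(sd: float, n_pairs: int = 100_000, seed: int = 0) -> float:
    """用随机抽样估计同一个量，仅用于核对解析结果。"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        total += abs(rng.gauss(0.0, sd) - rng.gauss(0.0, sd))
    return total / n_pairs


print(round(expected_abs_diff(1.0), 3))  # 1.128
print(simulated_abs_diff(1.0))           # 模拟值应接近解析值

# 反过来，2.8 年的平均差异所对应的标准差
implied_sd = 2.8 / expected_abs_diff(1.0)
print(implied_sd)
```

按这一关系反推，2.8年的平均差异对应约2.5年的标准差。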

亚历山大·托多罗夫带领普林斯顿大学的一组研究人员设计了一个巧妙的实验范式：J. E. Martinez, B. Labbree, S. Uddenberg, A. Todorov, “Meaningful ‘Noise’: Comparative Judgments Contain Stable Idiosyncratic Contributions”（未发表的手稿）。

噪声的最主要成分还是稳定的模式噪声：Gregory Stoddard, Jens Ludwig, Sendhil Mullainathan，他们在2020年6月至7月与本书作者有邮件往来。

对保释法官的大规模研究：J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, S. Mullainathan, “Human Decisions and Machine Predictions,” Quarterly Journal of Economics 133（2018）: 237–293。

运用这些模型来模拟法官对141 833个案例做出的判决：该模型不仅生成了每一位法官对141 833个案件判决的排序，还生成了每一位法官是否判决保释的一个阈值。水平噪声就是这个阈值的变异性，而模式噪声就是案件排序的变异性。

菲尔·罗森茨威格坚定地认为：Phil Rosenzweig, Left Brain, Right Stuff: How Leaders Make Winning Decisions（New York: Public Affairs, 2014）。

第五部分　决策卫生，提升五大人类判断力

第18章　卓越的判断者，卓越的判断力

该群体由高能力个体组成：Albert E. Mannes et al., “The Wisdom of Select Crowds,” Journal of Personality and Social Psychology 107, no. 2（2014）: 276–299; Jason Dana et al., “The Composition of Optimally Wise Crowds,” Decision Analysis 12, no. 3（2015）: 130–143。

自信启发式：Briony D. Pulford, Andrew M. Colman, Eike K. Buabang, Eva M. Krockow, “The Persuasive Power of Knowledge: Testing the Confidence Heuristic,” Journal of Experimental Psychology: General 147, no. 10（2018）: 1431–1444。

也与更好的工作绩效相关：Nathan R. Kuncel, Sarah A. Hezlett, “Fact and Fiction in Cognitive Ability Testing for Admissions and Hiring Decisions,” Current Directions in Psychological Science 19, no. 6（2010）: 339–345。

人们对智力的本质也一直存在误解：Kuncel, Hezlett, “Fact and Fiction”。

就像一篇评论文章所指出的：Frank L. Schmidt, John Hunter, “General Mental Ability in the World of Work: Occupational Attainment and Job Performance,” Journal of Personality and Social Psychology 86, no. 1（2004）: 162。

责任心和毅力：Angela L. Duckworth, David Weir, Eli Tsukayama, David Kwok, “Who Does Well in Life? Conscientious Adults Excel in Both Objective and Subjective Success,” Frontiers in Psychology 3（September 2012）。关于毅力，请参阅：Angela L. Duckworth, Christopher Peterson, Michael D. Matthews, Dennis Kelly, “Grit: Perseverance and Passion for Long-Term Goals,” Journal of Personality and Social Psychology 92, no. 6（2007）: 1087–1101。

流体智力：Richard E. Nisbett et al., “Intelligence: New Findings and Theoretical Developments,” American Psychologist 67, no. 2（2012）: 130–159, 229。

GMA的预测效力比心理学研究中的大部分测量方法都好：Schmidt, Hunter, “Occupational Attainment,” 162。

标准化测验分数与工作绩效之间的相关系数达到0.5（PC=67%）：Kuncel, Hezlett, “Fact and Fiction”。
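
相关系数与一致性百分比（PC）之间的换算可以用一个公式表示。以下是一个示意，假设两个变量服从二元正态分布，此时 PC = 0.5 + arcsin(r)/π；函数名 percent_concordant 是为说明而设的。

```python
import math


def percent_concordant(r: float) -> float:
    """在二元正态假设下，由相关系数 r 计算一致性比例（0 到 1 之间）。"""
    if not -1.0 <= r <= 1.0:
        raise ValueError("相关系数必须在 -1 和 1 之间")
    return 0.5 + math.asin(r) / math.pi


print(round(100 * percent_concordant(0.5)))  # 67
print(round(100 * percent_concordant(0.0)))  # 50：不相关时等同于抛硬币
```

也就是说，随机抽取两个人，测验分数较高的那个人工作绩效也较高的概率约为三分之二。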

按照社会科学的标准，相关系数达到0.5代表非常强的预测力：这些相关系数来自元分析（meta-analyses），这些分析针对效标的测量误差和全距限制（range restriction）对观测到的相关系数进行了修正。对于这些修正是否夸大了GMA的预测价值，研究者之间存在一些争论。然而，由于这些方法论上的争论同样适用于其他预测因素，专家们普遍认为GMA（连同工作样本测试，见第24章）是预测工作成功与否的最佳指标。参见Kuncel, Hezlett, “Fact and Fiction”。

在律师、化学家或工程师这些职业中，几乎没有GMA低于平均水平的人：Schmidt, Hunter, “Occupational Attainment,” 162。

即使认知能力的测试成绩处于前1%的群体，他们能获得的突出成就也与GMA高度相关：David Lubinski, “Exceptional Cognitive Ability: The Phenotype,” Behavior Genetics 39, no. 4（2009）: 350–358。

2013年的一项研究重点调查了《财富》500强企业的CEO：Jonathan Wai, “Investigating America’s Elite: Cognitive Ability, Education, and Sex Differences,” Intelligence 41, no. 4（2013）: 203–211。

研究人员建议使用的其他测量问题包括：Keela S. Thomson, Daniel M. Oppenheimer, “Investigating an Alternate Form of the Cognitive Reflection Test,” Judgment and Decision Making 11, no. 1（2016）: 99–113。

低CRT得分与现实生活中的一些判断和信念有关：Gordon Pennycook et al., “Everyday Consequences of Analytic Thinking,” Current Directions in Psychological Science 24, no. 6（2015）: 425–432。

CRT得分还可以预测人们是否会因为明显不准确的“假信息”而上当：Gordon Pennycook, David G. Rand, “Lazy, Not Biased: Susceptibility to Partisan Fake News Is Better Explained by Lack of Reasoning than by Motivated Reasoning,” Cognition 188（June 2018）: 39–50。

该测试的得分甚至与人们使用智能手机的程度有关：Nathaniel Barr et al., “The Brain in Your Pocket: Evidence That Smartphones Are Used to Supplant Thinking,” Computers in Human Behavior 48（2015）: 473–480。

人们是否会习惯性地运用反射性或冲动性思维过程：Niraj Patel, S. Glenn Baker, Laura D. Scherer, “Evaluating the Cognitive Reflection Test as a Measure of Intuition/Reflection, Numeracy, and Insight Problem Solving, and the Implications for Understanding Real-World Judgments and Beliefs,” Journal of Experimental Psychology: General 148, no. 12（2019）: 2129–2153。

认知需求量表：John T. Cacioppo, Richard E. Petty, “The Need for Cognition,” Journal of Personality and Social Psychology 42, no. 1（1982）: 116–131。

认知需求高的人不太容易出现已知的认知偏差：Stephen M. Smith, Irwin P. Levin, “Need for Cognition and Choice Framing Effects,” Journal of Behavioral Decision Making 9, no. 4（1996）: 283–290。

那些在认知需求量表上得分低的人，更偏爱剧透：Judith E. Rosenbaum, Benjamin K. Johnson, “Who’s Afraid of Spoilers? Need for Cognition, Need for Affect, and Narrative Selection and Enjoyment,” Psychology of Popular Media Culture 5, no. 3（2016）: 273–289。

成人决策能力量表：Wändi Bruine de Bruin et al., “Individual Differences in Adult Decision-Making Competence,” Journal of Personality and Social Psychology 92, no. 5（2007）: 938–956。

哈尔彭批判性思维测试：Heather A. Butler, “Halpern Critical Thinking Assessment Predicts Real-World Outcomes of Critical Thinking,” Applied Cognitive Psychology 26, no. 5（2012）: 721–729。

可以作为人的预测能力指标的认知风格：Uriel Haran, Ilana Ritov, Barbara Mellers, “The Role of Actively Open-Minded Thinking in Information Acquisition, Accuracy, and Calibration,” Judgment and Decision Making 8, no. 3（2013）: 188–201。

积极开放性思维：Haran, Ritov, Mellers, “Role of Actively Open-Minded Thinking”。

开放性思维是一种可习得的技能：J. Baron, “Why Teach Thinking? An Essay,” Applied Psychology: An International Review 42（1993）: 191–214；J. Baron, The Teaching of Thinking: Thinking and Deciding, 2nd ed.（New York: Cambridge University Press, 1994）, 127–148。

第19章　消除偏差与决策卫生

他们的核心发现：Jack B. Soll et al., “A User’s Guide to Debiasing,” in The Wiley Blackwell Handbook of Judgment and Decision Making, ed. Gideon Keren, George Wu, vol. 2（New York: John Wiley & Sons, 2015）, 684。

《绿皮书》：HM Treasury, The Green Book: Central Government Guidance on Appraisal and Evaluation（London: UK Crown, 2018）。

助推：Richard H. Thaler, Cass R. Sunstein, Nudge: Improving Decisions about Health, Wealth, and Happiness（New Haven, CT: Yale University Press, 2008）。

助力：Ralph Hertwig, Till Grüne-Yanoff, “Nudging and Boosting: Steering or Empowering Good Decisions,” Perspectives on Psychological Science 12, no. 6（2017）。

教育人们克服偏差是一项崇高的事业，而且很有用：Geoffrey T. Fong et al., “The Effects of Statistical Training on Thinking About Everyday Problems,” Cognitive Psychology 18, no. 3（1986）: 253–292。

当被问及常识性问题时，他们可能和其他人一样过分自信：Willem A. Wagenaar and Gideon B. Keren, “Does the Expert Know? The Reliability of Predictions and Confidence Ratings of Experts,” Intelligent Decision Support in Process Environments（1986）: 87–103。

这些游戏都使得参与者被问及类似问题时的犯错次数降低：Carey K. Morewedge et al., “Debiasing Decisions: Improved Decision Making with a Single Training Intervention,” Policy Insights from the Behavioral and Brain Sciences 2, no. 1（2015）: 129–140。

应用学到的知识来解决商业问题：Anne-Laure Sellier et al., “Debiasing Training Transfers to Improve Decision Making in the Field,” Psychological Science 30, no. 9（2019）: 1371–1379。

偏差盲点：Emily Pronin et al., “The Bias Blind Spot: Perceptions of Bias in Self Versus Others,” Personality and Social Psychology Bulletin 28, no. 3（2002）: 369–381。

可能影响提案产生过程的偏差：Daniel Kahneman, Dan Lovallo, Olivier Sibony, “Before You Make That Big Decision...,” Harvard Business Review 89, no. 6（June 2011）: 50–60。

检查清单在改善高风险环境中的决策方面有着悠久的历史：Atul Gawande, The Checklist Manifesto: How to Get Things Right（New York: Metropolitan Books, 2010）。

一份简单的检查清单：Office of Information and Regulatory Affairs, “Agency Checklist: Regulatory Impact Analysis,” no date。

我们在附录2中展示了一个偏差检查清单：该清单部分改编自丹尼尔·卡尼曼等人，“Before You Make That Big Decision”, Harvard Business Review。

便于应用：Gawande, Checklist Manifesto。

第20章　司法科学，信息排序是最大的噪声

错误是人为所致：R. Stacey, “A Report on the Erroneous Fingerprint Individualisation in the Madrid Train Bombing Case,” Journal of Forensic Identification 54（2004）: 707–718。

当时的FBI网站就坚称：Michael Specter, “Do Fingerprints Lie?,” The New Yorker, May 27, 2002。

正如德鲁尔所说：I. E. Dror, R. Rosenthal, “Meta-analytically Quantifying the Reliability and Biasability of Forensic Experts,” Journal of Forensic Sciences 53（2008）: 900–903。

在第一项研究中：I. E. Dror, D. Charlton, A. E. Péron, “Contextual Information Renders Experts Vulnerable to Making Erroneous Identifications,” Forensic Science International 156（2006）: 74–78。

在第二项研究中：I. E. Dror, D. Charlton, “Why Experts Make Errors,” Journal of Forensic Identification 56（2006）: 600–616。

指纹鉴定专家往往是根据背景环境做出决策的：I. E. Dror, S. A. Cole, “The Vision in ‘Blind’ Justice: Expert Perception, Judgment, and Visual Cognition in Forensic Pattern Recognition,” Psychonomic Bulletin and Review 17（2010）: 161–167, 165。另参见I. E. Dror, “A Hierarchy of Expert Performance（HEP）,” Journal of Applied Research in Memory and Cognition（2016）: 1–6。

在另一项独立研究中：I. E. Dror et al., “Cognitive Issues in Fingerprint Analysis: Inter- and Intra-Expert Consistency and the Effect of a ‘Target’ Comparison,” Forensic Science International 208（2011）: 10–17。

随后的一项独立研究：B. T. Ulery, R. A. Hicklin, M. A. Roberts, J. A. Buscaglia, “Changes in Latent Fingerprint Examiners’ Markup Between Analysis and Comparison,” Forensic Science International 247（2015）: 54–61。

即使是被普遍视为司法科学新黄金标准的DNA分析，也容易受到证实性偏差的影响：I. E. Dror, G. Hampikian, “Subjectivity and Bias in Forensic DNA Mixture Interpretation,” Science and Justice 51（2011）: 204–208。

鉴定人员经常会在随证据一起提交给他们的传送信函中收到此类信息：M. J. Saks, D. M. Risinger, R. Rosenthal, W. C. Thompson, “Context Effects in Forensic Science: A Review and Application of the Science of Science to Crime Laboratory Practice in the United States,” Science & Justice 43（2003）: 77–90。

执行核实工作的鉴定人员知道最初的结论：President’s Council of Advisors on Science and Technology（PCAST）, Report to the President: Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods（Washington, DC: Executive Office of the President, PCAST, 2016）。

针对这一错误展开的调查：Stacey, “Erroneous Fingerprint”。

即使是一位备受尊敬的独立专家：Dror, Cole, “Vision in ‘Blind’ Justice”。

一系列偏差：I. E. Dror, “Biases in Forensic Experts,” Science 360（2018）: 243。

他们有时也会对自己先前见过的一组指纹改变看法：Dror, Charlton, “Why Experts Make Errors”。

2012年，FBI委托进行的一项研究：B. T. Ulery, R. A. Hicklin, J. A. Buscaglia, M. A. Roberts, “Repeatability and Reproducibility of Decisions by Latent Fingerprint Examiners,” PLoS One 7（2012）。

无辜者计划：Innocence Project, “Overturning Wrongful Convictions Involving Misapplied Forensics,”Misapplication of Forensic Science（2018）: 1–7。还可参见S. M. Kassin, I. E. Dror, J. Kukucka, L. Butt, “The Forensic Confirmation Bias: Problems, Perspectives, and Proposed Solutions,” Journal of Applied Research in Memory and Cognition 2（2013）: 42–52。

对刑事法庭中的司法鉴定进行了全面回顾：PCAST, Report to the President。

大规模指纹识别准确性研究：B. T. Ulery, R. A. Hicklin, J. Buscaglia, M. A. Roberts, “Accuracy and Reliability of Forensic Latent Fingerprint Decisions,” Proceedings of the National Academy of Sciences 108（2011）: 7733–7738。

这一比例要比普通公众乃至大部分陪审员认为的高很多：PCAST, Report to the President, 95。

在佛罗里达州进行的一项后续研究：Igor Pacheco, Brian Cerchiai, Stephanie Stoiloff, “Miami-Dade Research Study for the Reliability of the ACE-V Process: Accuracy & Precision in Latent Fingerprint Examinations,” final report, Miami-Dade Police Department Forensic Services Bureau, 2014。

在大多数案件中，“排除”与“无法确认”对案件本身产生的影响是一样的：B. T. Ulery, R. A. Hicklin, M. A. Roberts, J. A. Buscaglia, “Factors Associated with Latent Fingerprint Exclusion Determinations,” Forensic Science International 275（2017）: 65–75。

鉴定人员做出的假阳性判断（错误识别）也要少得多：R. N. Haber, I. Haber, “Experimental Results of Fingerprint Comparison Validity and Reliability: A Review and Critical Analysis,” Science & Justice 54（2014）: 375–389。

他们容易受到偏差的影响：Dror, “Hierarchy of Expert Performance,” 3。

他应该去迪士尼工作：M. Leadbetter, letter to the editor, Fingerprint World 33（2007）: 231。

而不会真正改变他们的判断：L. Butt, “The Forensic Confirmation Bias: Problems, Perspectives and Proposed Solutions—Commentary by a Forensic Examiner,” Journal of Applied Research in Memory and Cognition 2（2013）: 59–60。

就连FBI在梅菲尔德案的内部调查中都强调：Stacey, “Erroneous Fingerprint,” 713。

一项对21个国家400名鉴定专家展开的调查：J. Kukucka, S. M. Kassin, P. A. Zapf, I. E. Dror, “Cognitive Bias and Blindness: A Global Survey of Forensic Science Examiners,” Journal of Applied Research in Memory and Cognition 6（2017）。

线性序列揭露：I. E. Dror et al., letter to the editor: “Context Management Toolbox: A Linear Sequential Unmasking（LSU）Approach for Minimizing Cognitive Bias in Forensic Decision Making,” Journal of Forensic Sciences 60（2015）: 1111–1112。

第21章　甄选与汇总，超级预测的两大策略

官方机构在对预算进行预测时，会表现出不切实际的乐观：Jeffrey A. Frankel, “Over-optimism in Forecasts by Official Budget Agencies and Its Implications,” working paper 17239, National Bureau of Economic Research, December 2011。

预测者往往过于自信：H. R. Arkes, “Overconfidence in Judgmental Forecasting,” in Principles of Forecasting: A Handbook for Researchers and Practitioners, ed. Jon Scott Armstrong, vol. 30, International Series in Operations Research & Management Science（Boston: Springer, 2001）。

一项正在进行的季度调查：Itzhak Ben-David, John Graham, Campbell Harvey, “Managerial Miscalibration,” The Quarterly Journal of Economics 128, no. 4（November 2013）: 1547–1584。

不可靠性也是判断预测的误差来源之一：T. R. Stewart, “Improving Reliability of Judgmental Forecasts,” in Principles of Forecasting: A Handbook for Researchers and Practitioners, ed. Jon Scott Armstrong, vol. 30, International Series in Operations Research & Management Science（Boston: Springer, 2001）（以下简称Principles of Forecasting）, 82。

让法学教授预测最高法院的裁决：Theodore W. Ruger, Pauline T. Kim, Andrew D. Martin, Kevin M. Quinn, “The Supreme Court Forecasting Project: Legal and Political Science Approaches to Predicting Supreme Court Decision-Making,” Columbia Law Review 104（2004）: 1150–1209。

空气污染管理制度：Cass Sunstein, “Maximin,” Yale Journal of Regulation（草稿；May 3, 2020）。

许多存在噪声的关于预测的例子：Armstrong, Principles of Forecasting。

对多次预测取平均值会大大提高预测的准确性：Jon Scott Armstrong, “Combining Forecasts,”in Principles of Forecasting, 417–439。

一组预测者的未加权平均值优于大多数个体乃至所有个体的预测：T. R. Stewart, “Improving Reliability of Judgmental Forecasts,” in Principles of Forecasting, 95。

综合预测平均减少了12.5%的误差：Armstrong, “Combining Forecasts”。

根据近期判断的准确性来选择最好的判断者：Albert E. Mannes et al., “The Wisdom of Select Crowds,” Journal of Personality and Social Psychology 107, no.2（2014）: 276–299。

预测市场的表现非常好：Justin Wolfers, Eric Zitzewitz, “Prediction Markets,” Journal of Economic Perspectives 18（2004）: 107–126。

利用预测市场来汇总不同的观点：Cass R. Sunstein, Reid Hastie, Wiser: Getting Beyond Group-think to Make Groups Smarter（Boston: Harvard Business Review Press, 2014）。

德尔菲法：Gene Rowe, George Wright, “The Delphi Technique as a Forecasting Tool: Issues and Analysis,” International Journal of Forecasting 15（1999）: 353–375。另参见Dan Bang, Chris D. Frith, “Making Better Decisions in Groups,” Royal Society Open Science 4, no. 8（2017）。

实施起来却有一定的挑战性：R. Hastie, “Review Essay: Experimental Evidence on Group Accuracy,” in B. Grofman & G. Owen, eds., Information Pooling and Group Decision Making（Greenwich, CT: JAI Press, 1986）, 129–157。

迷你德尔菲法：Andrew H. Van De Ven, André L. Delbecq, “The Effectiveness of Nominal, Delphi, and Interacting Group Decision Making Processes,” Academy of Management Journal 17, no. 4（1974）。

好于能够阅读情报和其他秘密数据的情报界分析师的平均水平：Superforecasting, 95。

超级预测者：Superforecasting, 231。

尝试，失败，分析：Superforecasting, 273。

一种复杂的统计技术：Ville A. Satopää, Marat Salikhov, Philip E. Tetlock, Barb Mellers, “Bias, Information, Noise: The BIN Model of Forecasting,” February 19, 2020, 23。

干预措施提高准确性：Satopää et al., “Bias, Information, Noise,” 23。

与培训方式不同的是，通过团队合作……预测者可以利用这些信息：Satopää et al., 22。

“超级预测者”的成功主要归功于他们在控制测量误差方面出色的能力：Satopää et al., 24。

通过汇总既独立又互补的判断，我们可以获得准确度上的进一步提高：Clintin P. Davis-Stober, David V. Budescu, Stephen B. Broomell, Jason Dana, “The Composition of Optimally Wise Crowds,” Decision Analysis 12, no. 3（2015）: 130–143。

第22章　医疗决策，用科学的诊断指南减少噪声

量化肌腱退化的程度时，医生的诊断产生的噪声就很小：Laura Horton et al., “Development and Assessment of Inter- and Intra-Rater Reliability of a Novel Ultrasound Tool for Scoring Tendon and Sheath Disease: A Pilot Study,” Ultrasound 24, no. 3（2016）: 134。

当病理学家评估乳腺病灶的穿刺活检结果时：Laura C. Collins et al., “Diagnostic Agreement in the Evaluation of Image-guided Breast Core Needle Biopsies,” American Journal of Surgical Pathology 28（2004）: 126。

即便有快速抗原检测结果：Julie L. Fierro et al., “Variability in the Diagnosis, Treatment of Group A Streptococcal Pharyngitis by Primary Care Pediatricians,” Infection Control and Hospital Epidemiology 35, no. S3（2014）: S79。

就会被诊断为患有糖尿病：Diabetes Tests, Centers for Disease Control and Prevention。

在有些医院里，第二诊疗意见是必须给出的：Joseph D. Kronz et al., “Mandatory Second Opinion Surgical Pathology at a Large Referral Hospital,” Cancer 86（1999）: 2426。

达特茅斯·阿特拉斯项目：大部分相关材料可以在网上找到。一份篇幅相当于一本书的概述见：Dartmouth Medical School, The Quality of Medical Care in the United States: A Report on the Medicare Program；the Dartmouth Atlas of Health Care 1999（American Hospital Publishers, 1999）。

许多国家也存在医疗资源分配不均的情况：OECD, Geographic Variations in Health Care: What Do We Know and What Can Be Done to Improve Health System Performance?（Paris: OECD Publishing, 2014）, 137–169; Michael P. Hurley et al., “Geographic Variation in Surgical Outcomes and Cost Between the United States and Japan,” American Journal of Managed Care 22（2016）: 600; John Appleby, Veena Raleigh, Francesca Frosini, Gwyn Bevan, Haiyan Gao, Tom Lyscom, Variations in Health Care: The Good, the Bad and the Inexplicable（London: The King’s Fund, 2011）。

一项针对放射科医生做出肺炎诊断的研究发现：David C. Chan Jr. et al., “Selection with Variation in Diagnostic Skill: Evidence from Radiologists,” National Bureau of Economic Research, NBER Working Paper No. 26467, November 2019。

在医疗领域也是如此：P. J. Robinson, “Radiology’s Achilles’ Heel: Error and Variation in the Interpretation of the Röntgen Image,” British Journal of Radiology 70（1997）: 1085。与之相关的一项研究是Yusuke Tsugawa et al., “Physician Age and Outcomes in Elderly Patients in Hospital in the US: Observational Study,” BMJ 357（2017）。该研究发现，医生结束培训的时间越久，其患者的治疗结果就越差。因此，对医生来说，在积累实践经验与熟悉最新的证据和指南之间存在一种权衡。这项研究还发现，结束住院医师培训后头几年的医生表现最好，因为他们对这些证据仍记忆犹新。

如放射学和病理学：Robinson, “Radiology’s Achilles’ Heel”。

kappa统计量：与相关系数类似，kappa值可以是负的，不过在实践中很少出现。以下是对不同kappa值的一种解读：轻度吻合（κ=0～0.2），一般吻合（κ=0.21～0.4），中等吻合（κ=0.41～0.6），高度吻合（κ=0.61～0.8），近乎完全吻合（κ＞0.8）。参见Ron Wald, Chaim M. Bell, Rosane Nisenbaum, Samuel Perrone, Orfeas Liangos, Andreas Laupacis, Bertrand L. Jaber, “Interobserver Reliability of Urine Sediment Interpretation,” Clinical Journal of the American Society of Nephrology 4, no. 3（March 2009）: 567–571。

药物之间的相互作用：Howard R. Strasberg et al., “Inter-Rater Agreement Among Physicians on the Clinical Significance of Drug-Drug Interactions,” AMIA Annual Symposium Proceedings（2013）: 1325。
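
对于两位评估者的情形，kappa值可以这样计算：先求实际一致率，再减去按各自判断比例推算的偶然一致率，最后除以偶然一致率之外的剩余空间。下面是一个示意，采用的是两位评估者的Cohen's kappa（所引研究使用的具体变体在此未作考证），示例数据纯属虚构，describe 函数沿用注释中的分档。

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """计算两位评估者对同一批病例所做分类判断的 Cohen's kappa。"""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("两位评估者必须对同一批非空病例做出判断")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # 偶然一致率：两人各自独立、按自己的比例判断时预期的一致比例
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("两位评估者都只使用了同一个类别，kappa 无定义")
    return (observed - expected) / (1 - expected)


def describe(kappa):
    """按注释中的分档给出文字描述。"""
    if kappa < 0:
        return "低于偶然水平"
    if kappa <= 0.2:
        return "轻度吻合"
    if kappa <= 0.4:
        return "一般吻合"
    if kappa <= 0.6:
        return "中等吻合"
    if kappa <= 0.8:
        return "高度吻合"
    return "近乎完全吻合"


# 虚构数据：1 表示“阳性”，0 表示“阴性”
doctor_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
doctor_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

kappa = cohens_kappa(doctor_a, doctor_b)
print(kappa, describe(kappa))
```

在这组数据中，两位医生在10个病例中有8个判断一致，但扣除偶然一致之后，kappa值只落在“中等吻合”一档，这正是kappa值比简单一致率更严格的地方。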

肾病专家们在基于肾病患者的标准化检测结果做出诊断时：Wald et al., “Interobserver Reliability of Urine Sediment Interpretation”。

乳腺病变是否为癌变：Juan P. Palazzo et al., “Hyperplastic Ductal and Lobular Lesions and Carcinomas in Situ of the Breast: Reproducibility of Current Diagnostic Criteria Among Community-and Academic-Based Pathologists,” Breast Journal 4（2003）: 230。

在诊断乳腺增生病变时：Rohit K. Jain et al., “Atypical Ductal Hyperplasia: Interobserver and Intraobserver Variability,” Modern Pathology 24（2011）: 917。

判断椎管的狭窄程度：Alex C. Speciale et al., “Observer Variability in Assessing Lumbar Spinal Stenosis Severity on Magnetic Resonance Imaging and Its Relation to Cross-Sectional Spinal Canal Area,”Spine 27（2002）: 1082。

在美国，心脏病是男性和女性的主要致死原因：Centers for Disease Control and Prevention, “Heart Disease Facts,” accessed June 16, 2020。

这种情况发生的可能性为31%：Timothy A. DeRouen et al., “Variability in the Analysis of Coronary Arteriograms,” Circulation 55（1977）: 324。

这些医生在判断子宫内膜异位病灶的数量和位置时，产生了很大分歧：Olaf Buchweltz et al., “Interobserver Variability in the Diagnosis of Minimal and Mild Endometriosis,” European Journal of Obstetrics & Gynecology and Reproductive Biology 122（2005）: 213。

肺结核诊断中依然存在显著的变异性：Jean-Pierre Zellweger et al., “Intra-observer and Overall Agreement in the Radiological Assessment of Tuberculosis,” International Journal of Tuberculosis & Lung Disease 10（2006）: 1123。关于“一般”诊断一致性的内容，参见Yanina Balabanova et al., “Variability in Interpretation of Chest Radiographs Among Russian Clinicians and Implications for Screening Programmes: Observational Study,” BMJ 331（2005）: 379。

不同国家的放射科医生在肺结核的诊断上也存在差异：Shinsaku Sakurada et al., “Inter-Rater Agreement in the Assessment of Abnormal Chest X-Ray Findings for Tuberculosis Between Two Asian Countries,” BMC Infectious Diseases 12, article 31（2012）。

研究人员要求8位病理学家对每个病例进行诊断：Evan R. Farmer et al., “Discordance in the Histopathologic Diagnosis of Melanoma and Melanocytic Nevi Between Expert Pathologists,” Human Pathology 27（1996）: 528。

还有一项研究发现：Alfred W. Kopf, M. Mintzis, R. S. Bart, “Diagnostic Accuracy in Malignant Melanoma,” Archives of Dermatology 111（1975）: 1291。

这项研究的作者总结道：Maria Miller, A. Bernard Ackerman, “How Accurate Are Dermatologists in the Diagnosis of Melanoma? Degree of Accuracy and Implications,” Archives of Dermatology 128（1992）: 559。

假阳性率也为1%～64%：Craig A. Beam et al., “Variability in the Interpretation of Screening Mammograms by US Radiologists,” Archives of Internal Medicine 156（1996）: 209。

有时候，放射科医生两次评估同一张影像片子时会给出不同的意见：P. J. Robinson et al., “Variation Between Experienced Observers in the Interpretation of Accident and Emergency Radiographs,” British Journal of Radiology 72（1999）: 323。

血管造影显示的血管阻塞程度：Katherine M. Detre et al., “Observer Agreement in Evaluating Coronary Angiograms,” Circulation 52（1975）: 979。

在那些标准模糊和判断情境复杂的领域中：Horton et al., “Inter- and Intra-Rater Reliability”; Megan Banky et al., “Inter- and Intra-Rater Variability of Testing Velocity When Assessing Lower Limb Spasticity,” Journal of Rehabilitation Medicine 51（2019）。

另一项不涉及诊断的研究：Esther Y. Hsiang et al., “Association of Primary Care Clinic Appointment Time with Clinician Ordering and Patient Completion of Breast and Colorectal Cancer Screening,” JAMA Network Open 51（2019）。

还有一个例子也能说明临床医生会受到疲劳的影响：Hengchen Dai et al., “The Impact of Time at Work and Time Off from Work on Rule Compliance: The Case of Hand Hygiene in Health Care,”Journal of Applied Psychology 100（2015）: 846。

对医学领域意义重大：Ali S. Raja, “The HEART Score Has Substantial Interrater Reliability,”NEJM J Watch, December 5, 2018 [reviewing Colin A. Gershon et al., “Inter-rater Reliability of the HEART Score,” Academic Emergency Medicine 26（2019）: 552]。

我们在前面提到，培训可以提高医生的技能：Jean-Pierre Zellweger et al., “Intra-observer and Overall Agreement in the Radiological Assessment of Tuberculosis,” International Journal of Tuberculosis & Lung Disease 10（2006）: 1123；Ibrahim Abubakar et al., “Diagnostic Accuracy of Digital Chest Radiography for Pulmonary Tuberculosis in a UK Urban Population,” European Respiratory Journal 35（2010）: 689。

汇总多个专家的判断也能减少噪声：Michael L. Barnett et al., “Comparative Accuracy of Diagnosis by Collective Intelligence of Multiple Physicians vs Individual Physicians,” JAMA Network Open 2（2019）: e19009；Kimberly H. Allison et al., “Understanding Diagnostic Variability in Breast Pathology: Lessons Learned from an Expert Consensus Review Panel,” Histopathology 65（2014）: 240。

最好的算法：Babak Ehteshami Bejnordi et al., “Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women with Breast Cancer,” JAMA 318（2017）: 2199。

深度学习算法：Varun Gulshan et al., “Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs,” JAMA 316（2016）: 2402。

几乎和放射科医生一样出色：Mary Beth Massat, “A Promising Future for AI in Breast Cancer Screening,” Applied Radiology 47（2018）: 22；Alejandro Rodriguez-Ruiz et al., “Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison with 101 Radiologists,” Journal of the National Cancer Institute 111（2019）: 916。

表22-1，阿普加评分指南：Medline Plus（last accessed February 4, 2020）。

阿普加评分产生的噪声很小：L. R. Foster et al., “The Interrater Reliability of Apgar Scores at 1 and 5 Minutes,” Journal of Investigative Medicine 54, no. 1（2006）: 293。

使用该量表进行评估和评分相对直接：Warren J. McIsaac et al., “Empirical Validation of Guidelines for the Management of Pharyngitis in Children and Adults,” JAMA 291（2004）: 1587。

一项研究发现，BI-RADS提升了乳房X线片的评估者之间的一致性：Emilie A. Ooms et al., “Mammography: Interobserver Variability in Breast Density Assessment,” Breast 16（2007）: 568。

在病理学领域：Frances P. O’Malley et al., “Interobserver Reproducibility in the Diagnosis of Flat Epithelial Atypia of the Breast,” Modern Pathology 19（2006）: 172。

至少从20世纪40年代起，减少噪声就成为精神病学界的头等大事：Ahmed Aboraya et al., “The Reliability of Psychiatric Diagnosis Revisited,” Psychiatry（Edgmont） 3（2006）: 41。有关概述，详见N. Kreitman, “The Reliability of Psychiatric Diagnosis,” Journal of Mental Science 107（1961）: 876–886。

1964年，一项针对91名患者和10名有经验的精神科医生的研究：Aboraya et al., “Reliability of Psychiatric Diagnosis Revisited,” 43。

为了初步揭示其中的原因：C. H. Ward et al., “The Psychiatric Nomenclature: Reasons for Diagnostic Disagreement,” Archives of General Psychiatry 7（1962）: 198。

具有生物医学背景的另外一位临床医生：Aboraya et al., “Reliability of Psychiatric Diagnosis Revisited”。

DSM Ⅲ促使大量的研究关注诊断是否存在噪声：Samuel M. Lieblich, David J. Castle, Christos Pantelis, Malcolm Hopwood, Allan Hunter Young, and Ian P. Everall, “High Heterogeneity and Low Reliability in the Diagnosis of Major Depression Will Impair the Development of New Drugs,” British Journal of Psychiatry Open 1（2015）: e5–e7。

但这本手册远没有达到完美：Lieblich et al., “High Heterogeneity”。

即使在2000年对第4版——DSM Ⅳ进行了重大修订之后：Elie Cheniaux et al., “The Diagnoses of Schizophrenia, Schizoaffective Disorder, Bipolar Disorder and Unipolar Depression: Interrater Reliability and Congruence Between DSM IV and ICD 10,” Psychopathology 42（2009）: 296–298，特别是第293页；Michael Chmielewski et al., “Method Matters: Understanding Diagnostic Reliability in DSM IV and DSM Ⅴ,” Journal of Abnormal Psychology 124（2015）: 764, 768–769。

提高精神疾病诊断的可靠性：Aboraya et al., “Reliability of Psychiatric Diagnosis Revisited,” 47。

存在着一个严重的风险：Aboraya et al., 47。

该手册的第5版：Chmielewski et al., “Method Matters”。

美国精神病学学会：Helena Chmura Kraemer et al., “DSM-5: How Reliable Is Reliable Enough?,” American Journal of Psychiatry 169（2012）: 13–15。

精神科医生的诊断仍然表现出明显的噪声：Lieblich et al., “High Heterogeneity”。

精神科医生就患者是否患有重度抑郁症很难达成一致：Lieblich et al., “High Heterogeneity,” e5。

DSM Ⅴ的现场试验发现：Lieblich et al., e5。

另外一些现场试验表明：Lieblich et al., e6。

使用指南之所以很难取得成功：Aboraya et al., “Reliability of Psychiatric Diagnosis Revisited,” 47。

明确诊断标准，舍弃模糊标准：Aboraya et al., “Reliability of Psychiatric Diagnosis Revisited”。

一位观察者曾说：Aboraya et al., “Reliability of Psychiatric Diagnosis Revisited”。

医学界需要更多的指南：一些有价值的注意事项参见Christopher Worsham, Anupam B. Jena, “The Art of Evidence-Based Medicine,” Harvard Business Review, January 30, 2019。

第23章　绩效评估，用基于外部视角的共识框架做出量化判断

正如有家报纸的标题所示：Jena McGregor, “Study Finds That Basically Every Single Person Hates Performance Reviews,” Washington Post, January 27, 2014。

以判断为基础的绩效评估无处不在：许多组织正在经历数字化转型，这或许会为绩效评估创造新的可能性。从理论上讲，公司现在可以收集关于每位员工绩效的大量详细、实时的信息；基于这些数据，对某些职位完全依据算法进行绩效评估将成为可能。然而，我们此处关注的是那些在绩效评估中无法完全排除判断的职位。参见：E. D. Pulakos, R. Mueller-Hanson, S. Arad, “The Evolution of Performance Management: Searching for Value,” Annual Review of Organizational Psychology and Organizational Behavior 6（2018）: 249–271。

其中大多数人都发现这些评估充满了噪声：S. E. Scullen, M. K. Mount, M. Goff, “Understanding the Latent Structure of Job Performance Ratings,” Journal of Applied Psychology 85（2000）: 956–970。

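The remaining 70% to 80% of the variance is system noise: A small component (10% of total variance in some studies) comes from what researchers call the rater perspective, or level effect, where “level” means level in the organizational hierarchy, not the level noise defined in this book. The rater perspective reflects the fact that, in rating the same person, a boss differs systematically from a peer, and a peer from a subordinate. Under a charitable reading of the results of 360-degree rating systems, one could argue that this is not noise. If people at different levels of the organization systematically see different facets of the same person’s performance, their judgments of that person should differ systematically, and their ratings should reflect it.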

Multiple studies of the variability of job performance ratings: Scullen, Mount, Goff, “Latent Structure”; C. Viswesvaran, D. S. Ones, F. L. Schmidt, “Comparative Analysis of the Reliability of Job Performance Ratings,” Journal of Applied Psychology 81 (1996): 557–574; G. J. Greguras, C. Robie, “A New Look at Within-Source Interrater Reliability of 360-Degree Feedback Ratings,” Journal of Applied Psychology 83 (1998): 960–968; G. J. Greguras, C. Robie, D. J. Schleicher, M. A. Goff, “A Field Study of the Effects of Rating Purpose on the Quality of Multisource Ratings,” Personnel Psychology 56 (2003): 1–21; C. Viswesvaran, F. L. Schmidt, D. S. Ones, “Is There a General Factor in Ratings of Job Performance? A Meta-Analytic Framework for Disentangling Substantive and Error Influences,” Journal of Applied Psychology 90 (2005): 108–131; B. Hoffman, C. E. Lance, B. Bynum, W. A. Gentry, “Rater Source Effects Are Alive and Well After All,” Personnel Psychology 63 (2010): 119–151.

The relationship between employees’ true job performance and ratings of that performance: K. R. Murphy, “Explaining the Weak Relationship Between Job Performance and Ratings of Job Performance,” Industrial and Organizational Psychology 1 (2008): 148–160, especially 151.

Employees’ true performance: In discussing the sources of noise, we ignored the possibility of case noise arising from systematic biases in the rating of certain employees or categories of employees. None of the studies we could find on the variability of performance ratings compared them with an externally assessed “true” performance.

Rating employees “strategically”: E. D. Pulakos, R. S. O’Leary, “Why Is Performance Management Broken?,” Industrial and Organizational Psychology 4 (2011): 146–164; M. M. Harris, “Rater Motivation in the Performance Appraisal Context: A Theoretical Framework,” Journal of Management 20 (1994): 737–756; K. R. Murphy, J. N. Cleveland, Understanding Performance Appraisal: Social, Organizational, and Goal-Based Perspectives (Thousand Oaks, CA: Sage, 1995).

To help someone who has been seeking a promotion: Greguras et al., “Field Study.”

It can predict objective, quantifiable performance: P. W. Atkins, R. E. Wood, “Self-Versus Others’ Ratings as Predictors of Assessment Center Ratings: Validation Evidence for 360-Degree Feedback Programs,” Personnel Psychology (2002).

Overengineered questionnaires: Atkins, Wood, “Self-Versus Others’ Ratings.”

98% of managers: Olson, Davis, cited in Peter G. Dominick, “Forced Ranking: Pros, Cons and Practices,” in Performance Management: Putting Research into Action, ed. James W. Smither, Manuel London (San Francisco: Jossey-Bass, 2009), 411–443.

Forced ranking: Dominick, “Forced Ranking.”

This relationship has also been shown to hold for performance ratings: Barry R. Nathan, Ralph A. Alexander, “A Comparison of Criteria for Test Validation: A Meta-Analytic Investigation,” Personnel Psychology 41, no. 3 (1988): 517–535.

Figure 23-1: Adapted from Richard D. Goffin, James M. Olson, “Is It All Relative? Comparative Judgments and the Possible Improvement of Self-Ratings and Ratings of Others,” Perspectives on Psychological Science 6, no. 1 (2011): 48–60.

Deloitte calculated: M. Buckingham, A. Goodall, “Reinventing Performance Management,” Harvard Business Review, April 1, 2015, 1–16, ISSN 0017-8012.

One study found: Corporate Leadership Council, cited in S. Adler et al., “Getting Rid of Performance Ratings: Genius or Folly? A Debate,” Industrial and Organizational Psychology 9 (2016): 219–252.

As one review article concluded: Pulakos, Mueller-Hanson, Arad, “Evolution of Performance Management,” 250.

The performance management revolution: A. Tavis, P. Cappelli, “The Performance Management Revolution,” Harvard Business Review, October 2016, 1–17.

There is evidence that behaviorally anchored rating scales are not sufficient to eliminate noise: Frank J. Landy, James L. Farr, “Performance Rating,” Psychological Bulletin 87, no. 1 (1980): 72–107.

Practicing performance ratings on videotaped cases: D. J. Woehr, A. I. Huffcutt, “Rater Training for Performance Appraisal: A Quantitative Review,” Journal of Occupational and Organizational Psychology 67 (1994): 189–205; S. G. Roch, D. J. Woehr, V. Mishra, U. Kieszczynska, “Rater Training Revisited: An Updated Meta-Analytic Review of Frame-of-Reference Training,” Journal of Occupational and Organizational Psychology 85 (2012): 370–395; M. H. Tsai, S. Wee, B. Koh, “Restructured Frame-of-Reference Training Improves Rating Accuracy,” Journal of Organizational Behavior (2019): 1–18.

Figure 23-2: Left panel adapted from Richard Goffin, James M. Olson, “Is It All Relative? Comparative Judgments and the Possible Improvement of Self-Ratings and Ratings of Others,” Perspectives on Psychological Science 6, no. 1 (2011): 48–60.

Most studies of frame-of-reference training: Roch et al., “Rater Training Revisited.”

Superstars: Ernest O’Boyle, Herman Aguinis, “The Best and the Rest: Revisiting the Norm of Normality of Individual Performance,” Personnel Psychology 65, no. 1 (2012): 79–119; and Herman Aguinis, Ernest O’Boyle, “Star Performers in Twenty-First Century Organizations,” Personnel Psychology 67, no. 2 (2014): 313–350.

Chapter 24. Hiring: Measuring Talent with Structured Criteria

Few people are hired without an interview: A. I. Huffcutt, S. S. Culbertson, “Interviews,” in S. Zedeck, ed., APA Handbook of Industrial and Organizational Psychology (Washington, DC: American Psychological Association, 2010), 185–203.

Relying to some degree on intuitive judgment: N. R. Kuncel, D. M. Klieger, D. S. Ones, “In Hiring, Algorithms Beat Instinct,” Harvard Business Review 92, no. 5 (2014): 32.

The supreme problem: R. E. Ployhart, N. Schmitt, N. T. Tippins, “Solving the Supreme Problem: 100 Years of Selection and Recruitment at the Journal of Applied Psychology,” Journal of Applied Psychology 102 (2017): 291–304.

Other studies report correlations of 0.2 to 0.33: M. McDaniel, D. Whetzel, F. L. Schmidt, S. Maurer, “Meta Analysis of the Validity of Employment Interviews,” Journal of Applied Psychology 79 (1994): 599–616; A. Huffcutt, W. Arthur, “Hunter and Hunter (1984) Revisited: Interview Validity for Entry-Level Jobs,” Journal of Applied Psychology 79 (1994): 2; F. L. Schmidt, J. E. Hunter, “The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 85 Years of Research Findings,” Psychological Bulletin 124 (1998): 262–274; F. L. Schmidt, R. D. Zimmerman, “A Counterintuitive Hypothesis About Employment Interview Validity and Some Supporting Evidence,” Journal of Applied Psychology 89 (2004): 553–561. Note that validity is higher when certain subsets of studies are considered, especially when the studies use performance ratings created specifically for this purpose rather than existing administrative ratings.

Objective ignorance: S. Highhouse, “Stubborn Reliance on Intuition and Subjectivity in Employee Selection,” Industrial and Organizational Psychology 1 (2008): 333–342; D. A. Moore, “How to Improve the Accuracy and Reduce the Cost of Personnel Selection,” California Management Review 60 (2017): 8–17.

Similar cultural backgrounds or something in common: L. A. Rivera, “Hiring as Cultural Matching: The Case of Elite Professional Service Firms,” American Sociological Review 77 (2012): 999–1022.

The correlation between two interviewers’ ratings of the same candidate: Schmidt, Zimmerman, “Counterintuitive Hypothesis”; Timothy A. Judge, Chad A. Higgins, Daniel M. Cable, “The Employment Interview: A Review of Recent Research and Recommendations for Future Research,” Human Resource Management Review 10 (2000): 383–406; and A. I. Huffcutt, S. S. Culbertson, W. S. Weyhrauch, “Employment Interview Reliability: New Meta-Analytic Estimates by Structure and Format,” International Journal of Selection and Assessment 21 (2013): 264–276.

First impressions matter a great deal: M. R. Barrick et al., “Candidate Characteristics Driving Initial Impressions During Rapport Building: Implications for Employment Interview Validity,” Journal of Occupational and Organizational Psychology 85 (2012): 330–352; M. R. Barrick, B. W. Swider, G. L. Stewart, “Initial Evaluations in the Interview: Relationships with Subsequent Interviewer Evaluations and Employment Offers,” Journal of Applied Psychology 95 (2010): 1163.

The quality of a handshake: G. L. Stewart, S. L. Dustin, M. R. Barrick, T. C. Darnold, “Exploring the Handshake in Employment Interviews,” Journal of Applied Psychology 93 (2008): 1139–1146.

Positive first impressions: T. W. Dougherty, D. B. Turban, J. C. Callender, “Confirming First Impressions in the Employment Interview: A Field Study of Interviewer Behavior,” Journal of Applied Psychology 79 (1994): 659–665.

A hard-to-believe experiment: J. Dana, R. Dawes, N. Peterson, “Belief in the Unstructured Interview: The Persistence of an Illusion,” Judgment and Decision Making 8 (2013): 512–520.

The vast majority of HR professionals favor clinical aggregation: Nathan R. Kuncel et al., “Mechanical versus Clinical Data Combination in Selection and Admissions Decisions: A Meta-Analysis,” Journal of Applied Psychology 98, no. 6 (2013): 1060–1072.

A correlation of zero: Laszlo Bock, interview with Adam Bryant, The New York Times, June 19, 2013. See also Laszlo Bock, Work Rules!: Insights from Inside Google That Will Transform How You Live and Lead (New York: Hachette, 2015).

A prominent headhunter: C. Fernández-Aráoz, “Hiring Without Firing,” Harvard Business Review, July 1, 1999.

Structured behavioral interviews: For guidance on structured interviews, see Michael A. Campion, David K. Palmer, James E. Campion, “Structuring Employment Interviews to Improve Reliability, Validity and Users’ Reactions,” Current Directions in Psychological Science 7, no. 3 (1998): 77–82.

Exactly what counts as a structured interview: J. Levashina, C. J. Hartwell, F. P. Morgeson, M. A. Campion, “The Structured Employment Interview: Narrative and Quantitative Review of the Research Literature,” Personnel Psychology 67 (2014): 241–293.

Structured interviews predict candidates’ future performance better than traditional unstructured interviews: McDaniel et al., “Meta Analysis”; Huffcutt, Arthur, “Hunter and Hunter (1984) Revisited”; Schmidt, Hunter, “Validity and Utility”; Schmidt, Zimmerman, “Counterintuitive Hypothesis.”

Work sample tests: Schmidt, Hunter, “Validity and Utility.”

The Israeli army: Kahneman, Thinking, Fast and Slow, 229.

Practical advice and guidance: Kuncel, Klieger, Ones, “Algorithms Beat Instinct.” See also Campion, Palmer, Campion, “Structuring Employment Interviews.”

Persistence of an illusion: Dana, Dawes, Peterson, “Belief in the Unstructured Interview.”

Chapter 25. The Mediating Assessments Protocol: A Core Method for Making Sound Decisions

The mediating assessments protocol: Daniel Kahneman, Dan Lovallo, Olivier Sibony, “A Structured Approach to Strategic Decisions: Reducing Errors in Judgment Requires a Disciplined Process,” MIT Sloan Management Review 60 (2019): 67–73.

Estimate-talk-estimate: Andrew H. Van De Ven and André Delbecq, “The Effectiveness of Nominal, Delphi, and Interacting Group Decision Making Processes,” Academy of Management Journal 17, no. 4 (1974): 605–621. See also chapter 21 of this book.

Part VI. The Optimal Level of Noise

In their view, no mechanical solution can satisfy the demands of justice: Kate Stith, José A. Cabranes, Fear of Judging: Sentencing Guidelines in the Federal Courts (Chicago: University of Chicago Press, 1998), 177.

Chapter 26. The Costs of Noise Reduction

Such reforms could be counterproductive: Albert O. Hirschman, The Rhetoric of Reaction: Perversity, Futility, Jeopardy (Cambridge, MA: Belknap Press, 1991).

Václav Havel: Stith, Cabranes, Fear of Judging.

Three strikes: See Stanford Law School, “Three Strikes Basics.”

Woodson v. North Carolina: 428 U.S. 280 (1976).

Relying on big data and algorithms to make decisions can produce bias: Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (New York: Crown, 2016).

Potentially biased mathematical models are reshaping our lives: Will Knight, “Biased Algorithms Are Everywhere, and No One Seems to Care,” MIT Technology Review, July 12, 2017.

ProPublica: Jeff Larson, Surya Mattu, Lauren Kirchner, Julia Angwin, “How We Analyzed the COMPAS Recidivism Algorithm,” ProPublica, May 23, 2016. The claim of bias in this example is disputed, and different definitions of bias may lead to opposite conclusions. For views on this case and for discussion of how algorithmic bias is defined and measured, see the note “Exactly how to test” below.

Predictive policing: Aaron Shapiro, “Reform Predictive Policing,” Nature 541, no. 7638 (2017): 458–460.

In fact, in this respect, algorithms could be worse: Although this concern has regained attention in the context of AI-based models, it is not specific to AI. As early as 1972, Paul Slovic noted that modeling intuition would preserve, reinforce, and perhaps even magnify existing cognitive biases. Paul Slovic, “Psychological Study of Human Judgment: Implications for Investment Decision Making,” Journal of Finance 27 (1972): 779.

Exactly how to test: For an introduction to this debate in the context of the controversy over the COMPAS recidivism-prediction algorithm, see Larson et al., “COMPAS Recidivism Algorithm”; William Dieterich et al., “COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity,” Northpointe, Inc., July 8, 2016; Julia Dressel, Hany Farid, “The Accuracy, Fairness, and Limits of Predicting Recidivism,” Science Advances 4, no. 1 (2018): 1–6; Sam Corbett-Davies et al., “A Computer Program Used for Bail and Sentencing Decisions Was Labeled Biased Against Blacks. It’s Actually Not That Clear,” Washington Post, October 17, 2016; Alexandra Chouldechova, “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments,” Big Data 5, no. 2 (2017): 153; Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” Leibniz International Proceedings in Informatics, January 2017.

Chapter 27. Dignity: An Important Value of Being Human

They may know that their practices are noisy: Tom R. Tyler, Why People Obey the Law, 2nd ed. (New Haven, CT: Yale University Press, 2020).

A famously puzzling decision in American constitutional law can help us understand this point: Cleveland Bd. of Educ. v. LaFleur, 414 U.S. 632 (1974).

Influential commentators at the time defended the Court’s decision: Laurence H. Tribe, “Structural Due Process,” Harvard Civil Rights-Civil Liberties Law Review 10, no. 2 (Spring 1975): 269.

Recall the strong objections of some judges to the sentencing guidelines, discussed earlier: Stith, Cabranes, Fear of Judging, 177.

A series of influential works: See Philip K. Howard, The Death of Common Sense: How Law Is Suffocating America (New York: Random House, 1995); and Philip K. Howard, Try Common Sense: Replacing the Failed Ideologies of Right and Left (New York: W. W. Norton & Company, 2019).

Chapter 28. Rules or Standards

What Facebook specified in its 2020 Community Standards: Hate Speech, Facebook: Community Standards.

The New Yorker: Andrew Marantz, “Why Facebook Can’t Fix Itself,” The New Yorker, October 12, 2020.

Bureaucratic justice: Jerry L. Mashaw, Bureaucratic Justice (New Haven, CT: Yale University Press, 1983).

On the contrary, the judge should adjudicate according to each person’s character and circumstances, or according to the equity and appropriateness of the concrete result: David M. Trubek, “Max Weber on Law and the Rise of Capitalism,” Wisconsin Law Review 720 (1972): 733, n. 22 [quoting Max Weber, The Religion of China (1951), 149].

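(1) We suspect that judges may also be more lenient when the sentencing date happens to fall on their own birthday, but this hypothesis has not yet been tested.

(2) This allusion comes from the fairy tale The Story of the Three Bears by the English writer Robert Southey. Goldilocks, having lost her way, wanders into the home of the three bears, eats the porridge in the small bowl, sits in the smallest chair, and sleeps in the smallest bed, because those suit her best. A “Goldilocks price” is therefore the price that is just right. (Editor’s note)

(3) The highest free-throw percentages among NBA players are slightly above 90%. At the time of writing, the top three were Stephen Curry, Steve Nash, and Mark Price. The lowest percentages are around 50%; the famous American player Shaquille O’Neal made only about 53%.

(4) The technical term is “test-retest reliability,” or “reliability” for short.

(5) One pound is approximately 453.59 grams. (Editor’s note)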

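(6) Mullainathan, a tenured professor at Harvard University and recipient of a MacArthur “genius grant,” and Eldar Shafir, a professor of psychology at Princeton University, wrote the major behavioral economics book Scarcity. It first proposed that “bandwidth = cognitive capacity + executive control” and argued that the minds of people in a state of scarcity are captured by a scarcity mindset, which lowers both cognitive capacity and executive control. The book also proposes some ways to ease the scarcity mindset. The simplified Chinese edition was introduced by 湛庐 (Cheers Publishing) and published by Zhejiang People’s Publishing House in 2018. (Editor’s note)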

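(7) A Venn diagram is a kind of sketch used to show the approximate relationships among different sets. (Editor’s note)

(8) A website where people can earn payment by providing short-term services, such as completing questionnaires.

(9) Today, GMA is used more commonly than IQ.

(10) This is similar to a teacher giving a student a score above 60 to avoid the trouble of writing a makeup exam. (Translator’s note)

(11) The following passage is taken from Shakespeare’s The Merchant of Venice, translated by Zhu Shenghao and published by Yilin Press in 2018. (Editor’s note)

(12) “Lack of commonality” here means that the residual errors, unlike bias, do not all point in the same direction. (Translator’s note)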