by Michele Loi and Markus Christen (University of Zurich)
The use of machine learning in decision-making has triggered an intense debate about “fair algorithms”. Given that fairness intuitions differ and can lead to conflicting technical requirements, there is a pressing need to integrate ethical thinking into the research and design of machine learning. We outline a framework showing how this can be done.
One of the worst things to be accused of these days is discrimination. This is evident in some of the fierce responses on social media to recently published reports about the increasing proliferation of “discriminatory” classification and decision support algorithms; for example, the debate about the COMPAS prognosis instrument used to assess the relapse risk of prisoners.
This shows that researching and designing machine learning models for supporting or even replacing human decision-making in socially sensitive situations, such as hiring decisions, credit rating or criminal justice, is not solely a technical endeavor. The COMPAS case, as pointed out by the US-based investigative journalism NGO ProPublica, is a good illustration. ProPublica showed that COMPAS made racist predictions, even though information about the race of the offender is not included in the calculation: based on their risk scores, African Americans charged with a crime who do not reoffend within two years (out of prison) are predicted to be at significantly higher risk than whites who also do not reoffend within two years. The algorithm thus violates the following idea of fairness: individuals who do not actually relapse should have the same probability of being (unjustly?) denied parole, whatever their race.
A standard research ethics answer to such a diagnosis would be the following: the result may point to discriminatory practice in designing COMPAS. The remedy would then be to increase the ethical integrity of the machine learning experts, e.g. through better training or by increasing diversity in the team.
Unfortunately, the story is more complicated. The mathematicians who analysed the problem after the ProPublica revelations showed that a form of discriminatory distortion is inevitable, even if you program with a clear conscience and the data is free from bias. Indeed, COMPAS was tested for discrimination, but against a different fairness criterion, which it met: people who are denied (or granted) parole should have the same likelihood of relapse, regardless of race. This is achieved by “calibrated risk-scores” and by using the same risk-score threshold for deciding whom to release. (Notice that using different thresholds for different groups also seems discriminatory.) The result is an algorithm that achieves “predictive value parity”, i.e. the ratio of false positives to predicted positives and the ratio of false negatives to predicted negatives (also known as Conditional Use Error) are the same for both groups. This also seems intuitively required by fairness.
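To make the two criteria discussed so far concrete, the following minimal Python sketch computes them from per-group confusion-matrix counts. The class name `GroupOutcomes`, the group labels and all counts are invented for illustration; they are not COMPAS data and not the method used by ProPublica or the COMPAS developers.

```python
from dataclasses import dataclass


@dataclass
class GroupOutcomes:
    """Confusion-matrix counts for one group (hypothetical numbers)."""
    tp: int  # predicted high risk, did reoffend
    fp: int  # predicted high risk, did not reoffend
    fn: int  # predicted low risk, did reoffend
    tn: int  # predicted low risk, did not reoffend

    def false_positive_rate(self) -> float:
        # Share of non-reoffenders wrongly flagged as high risk
        # (the quantity behind the criterion ProPublica checked).
        return self.fp / (self.fp + self.tn)

    def positive_predictive_value(self) -> float:
        # Share of people flagged as high risk who do reoffend.
        return self.tp / (self.tp + self.fp)

    def negative_predictive_value(self) -> float:
        # Share of people flagged as low risk who do not reoffend.
        return self.tn / (self.tn + self.fn)


# Two made-up groups with different relapse rates (40% and 30%).
group_a = GroupOutcomes(tp=300, fp=200, fn=100, tn=400)
group_b = GroupOutcomes(tp=150, fp=100, fn=150, tn=600)

for name, g in (("A", group_a), ("B", group_b)):
    print(
        f"group {name}: "
        f"PPV={g.positive_predictive_value():.3f} "
        f"NPV={g.negative_predictive_value():.3f} "
        f"FPR={g.false_positive_rate():.3f}"
    )
```

In this made-up example both groups have the same predictive values, so predictive value parity holds, yet the share of non-reoffenders who are wrongly flagged as high risk differs between the groups.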
It turned out that it is mathematically impossible (except for irrelevant borderline cases, such as identical relapse rates in both groups or perfect prediction) to meet both fairness conditions simultaneously. In other words, you can either ensure that the people you release on parole are equally likely to commit crimes again, regardless of their race (“COMPAS Fairness”), or you can ensure that those who do not commit crimes are equally likely to be released from prison, regardless of their race (“ProPublica Fairness”), but not both.
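The reason for this conflict can be sketched with a short derivation, along the lines of the relation given in Chouldechova's paper listed in the references. The notation is an assumption introduced here, not taken from the article: p is the relapse rate (base rate) within a group, FPR the false positive rate, FNR the false negative rate, and PPV the positive predictive value. Writing the PPV in terms of the other quantities and solving for the FPR gives:

```latex
\begin{align*}
\mathrm{PPV} &= \frac{(1-\mathrm{FNR})\,p}{(1-\mathrm{FNR})\,p + \mathrm{FPR}\,(1-p)} \\[4pt]
\Longrightarrow\quad
\mathrm{FPR} &= \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\end{align*}
```

If two groups have the same PPV (below 1) but different relapse rates p, the factor p/(1-p) differs between them, so the FPR and the FNR cannot both be equal across the groups.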
From a research ethics point of view this means that teaching the norm “avoid discrimination” to machine learning researchers would not work, since some form of discrimination is inevitable. A further difficulty in the assessment of fairness norms is that an algorithm that fulfills a clear fairness constraint for individual decisions may have unexpected implications in context. For example, how do judges respond to algorithmic recommendations if they know that predictive value parity is not obtained? How do risk scores evolve dynamically if more minority citizens are given loans that they are not able to repay?
The role of ethics in such a setting thus goes beyond transmitting norms about what is the right thing to do; it concerns increasing the moral sensitivity of the machine learning researchers involved, such that they can identify broader effects of the systems they create. This task cannot simply be left to those researchers. Rather, we should create working environments (“labs”) where computer scientists collaborate more closely with ethicists and domain experts. The rational analysis of the conceptual relationship between commonsense ideas of fairness and statistical properties of predictions is anything but a trivial task. And we need to answer questions like: What new skills do such ethicists need?
A. Chouldechova. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” arXiv:1610.07524 [cs, stat], 2016. http://arxiv.org/abs/1610.07524
R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. “Fairness in Criminal Justice Risk Assessments: The State of the Art.” arXiv:1703.09207 [stat], March 2017. http://arxiv.org/abs/1703.09207
J. Kleinberg, S. Mullainathan, and M. Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” arXiv:1609.05807 [cs, stat], 2016. http://arxiv.org/abs/1609.05807
Michele Loi and Markus Christen
DSI Digital Ethics Lab, University of Zurich, Switzerland