The ineradicable bias at the heart of algorithm design

In-depth

Last year, a study by the Pew Research Center found that 58% of Americans believe that computer programs will always reflect the biases of their designers. It seems the public has cottoned on to the fact that artificial intelligences are only as woke as the societies from which they emerge.

Headline-grabbing stories of facial recognition software with seemingly racist blind spots, apparently sexist CV-screening programs and predictive risk models which tend to penalise the poor are fuelling a widespread skepticism about the use of automated decision-making systems.

As far back as 2004, Google had to issue an explanation when the algorithm that delivers its search engine results returned neo-Nazi sites in prominent positions for the search query ‘Jew’. Back then, the Mountain View company was still insisting that its search engine results were “generated completely objectively and are independent of the beliefs and preferences of those who work at Google”.

In 2009, Hewlett Packard made the news when a YouTube video appeared to show that one of its webcams, which was supposed to move with the user’s face to keep it in the centre of the frame, couldn’t see black faces.

In 2015, Google Photos ran into controversy when its auto-tagging feature, which makes use of a technology called deep convolutional neural networks to tag users’ images with meaningful metadata, mistakenly labelled some images of black people with the tag ‘Gorillas’.

In 2017, Google Translate was discovered to display a gender bias when translating third-person pronouns from languages where gender isn’t specified into English. In one example, researchers found that the translation algorithm preferred ‘he’ when coupled with the adjective ‘hardworking’ and ‘she’ when paired with ‘lazy’, prompting some hard-hitting poetry.

Last year we heard how machine-learning specialists at Amazon noticed that their internal candidate-screening system for job applicants was downgrading CVs which indicated that the candidate was female. After scrapping the project, the tech giant claimed that the tool “was never used by Amazon recruiters to evaluate candidates”, making everyone wonder why they built it.

It’s no surprise, then, that the newly forming consensus is that algorithms (increasingly used as a catch-all term for systems which make use of machine learning) can’t be trusted not to discriminate. And where majority opinion plants its flag, the opinions of politicians are sure to follow. Hot on the heels of the Pew study’s publication, newly elected congresswoman for New York Alexandria Ocasio-Cortez told an audience at a Martin Luther King Jr. Day event on January 21st that she also believes that the racial biases of algorithm creators are always encoded in their products:

“Algorithms are still made by human beings, and those algorithms are still pegged to basic human assumptions,” said Ocasio-Cortez in a panel discussion at the annual MLK Now event. “They’re just automated assumptions. And if you don’t fix the bias, then you are just automating the bias.”

So, with the argument about the existence of algorithmic bias apparently settled, the next logical questions are: how do such biases creep, undetected, into the work of presumably well-meaning data scientists and programmers; and can such biases be engineered out?

What we talk about when we talk about algorithms

In simplest possible terms, an algorithm is merely a step-by-step formula for solving a particular class of problems. They’ve been around since the classical age, when Greek mathematicians used algorithms for finding prime numbers or the greatest common divisor of two different numbers. They get their name from the seminal Persian mathematician Muḥammad ibn Mūsā al-Khwārizmī (or Algorithmi in Latin), who was chief astronomer to the Abbasid Caliphate in the 9th century and wrote a popular treatise on algebra which included (what we now call) algorithms for solving linear and quadratic equations.

Bias can easily creep into seemingly objective algorithms due to the selective nature of the training data

In the current vernacular, though, when people talk about algorithms, especially in reference to algorithmic bias, they’re normally referring to software which is used to return lists of personalised recommendations to the user, populate a list of search results, or calculate some other important metric – like a credit score – based on a large number of constantly changing variables.

These kinds of algorithms are increasingly based on machine learning technology, where a program is ‘taught’ how to solve a class of problems (like identifying human faces in images or producing an accurate translation) through a massively accelerated process of trial and error, modifying its own formula after each mistake to increase its success rate in future attempts. Depending on the particular techniques used, this kind of technology might also be called deep learning, backpropagation, or the aforementioned convolutional neural networks, but they all have one thing in common: they rely on large quantities of training data which the software uses to optimise its algorithm to successfully solve the problem at hand.
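That trial-and-error loop can be sketched in a few lines. The toy example below (not any real product’s code) ‘learns’ a single multiplier by nudging it after every mistake, the same process that deep learning runs over millions of parameters at once:

```python
# A toy 'model' with one adjustable parameter. It is shown
# input/answer pairs and adjusts its formula after every error.
samples = [(1, 3), (2, 6), (3, 9), (4, 12)]   # the hidden rule is y = 3x

w = 0.0                            # the model's current formula: y = w * x
for _ in range(100):               # repeated passes over the training data
    for x, y in samples:
        error = y - w * x          # how wrong was the guess?
        w += 0.01 * error * x      # adjust in proportion to the error

print(f"learned multiplier: {w:.2f}")   # converges towards 3.0
```

The model ends up encoding the rule ‘multiply by three’ without ever being told it; it only ever saw raw numbers and its own mistakes.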

Imagine a program designed to recognise human faces in photographs, which gives a simple ‘yes’ or ‘no’ answer when fed an image. During training, such a program might be fed hundreds of thousands, even millions, of pictures which are known to contain human faces, and an equal number of images which are known not to. The software doesn’t know anything about the images it’s fed except their basic informational content (e.g. the hue, saturation and brightness value of each pixel). It doesn’t know anything about human skin tones, or eye colours, or that a nose generally has two nostrils. It may eventually encode rules or subroutines which kind of, sort of, map onto these real world concepts, but in reality, the software is only writing a set of abstract rules that it can apply to the raw data which will output a correct answer with as high a success rate as possible.

So while an ‘algorithm’ may theoretically be based entirely on mathematics, and therefore believed to output objectively true answers, bias can easily creep in due to the selective nature of the data that the program was fed during the training stage. Returning to our hypothetical facial recognition software, imagine that the data scientists who were training it fed it only images of white people as examples of images containing human faces. It is likely that the software would have a much higher rate of error when asked to correctly label images depicting people with different skin tones.

That’s a bias that could, in theory, easily be corrected by retraining the software on an appropriate mix of images containing faces from a wide variety of ethnic backgrounds. But what if the data available for training is itself biased in ways that aren’t immediately apparent? That, as it turns out, happens to be the case in many realms where software is being used to automate human decision-making processes, often with severely negative consequences.

Past performance is not an indicator of future outcomes

This sort of bias, the kind introduced by biased training data, is illustrated by two examples, both from the US, where systems that were already biased against certain marginalised groups produced training data for automated software that inherited the same bias.

Training data which represents the history of a racially-biased system could only ever produce an algorithm that perpetuates this bias

The first story, reported by ProPublica in 2016, is set in the criminal justice system, where software is increasingly used by courts and prisons to carry out ‘risk assessments’ on suspected and convicted criminals. The programs, often built using machine learning technology, assign scores to an individual indicating the probability that they will commit future crimes – an important consideration when sentencing or determining bail conditions. The scores are based on large numbers of data points gleaned from the person’s criminal record or from their answers to a questionnaire.

After US Attorney General Eric Holder warned that risk assessments might be biased, ProPublica conducted a study of risk scores assigned to more than 7,000 people arrested in Broward County, Florida, in 2013 and 2014. They checked to see how many were charged with new crimes over the next two years, the same benchmark used by the creators of the risk-scoring software. In addition to finding that risk scores were only “somewhat more accurate than a coin flip”, the study also uncovered a distinct racial bias. The algorithm was twice as likely to wrongly label black defendants as future offenders, and was more likely to wrongly categorise white defendants as low risk. This was not a disparity that could be explained by the individuals’ prior crimes or the types of crimes they’d been arrested for:

“We ran a statistical test that isolated the effect of race from criminal history and recidivism, as well as from defendants’ age and gender. Black defendants were still 77 percent more likely to be pegged as at higher risk of committing a future violent crime and 45 percent more likely to be predicted to commit a future crime of any kind.”

Proponents of risk scoring argue, with some evidence to back them up, that the use of risk scoring algorithms reduces the rate of incarceration by helping flag low-risk convicts and diverting them to community-based supervision rather than prison. They also argue that judges might already have biases based on race, class, gender or sexuality, and that, since judges can overrule any AI tools, those programs don’t make pre-existing biases in the system worse. But risk scores hide their in-built biases behind a quantitative and seemingly objective assessment of the likelihood that someone will go on to commit further crimes, providing a veneer of scientific rigour to the reproduction of already-existing racial biases.  In a country which already locks up African Americans at five times the rate of white people, this is deeply troubling.

The for-profit company which provides the software under scrutiny, Northpointe, isn’t sharing its code, so we can only speculate about how this racial bias emerged. Race isn’t an input into the risk-scoring algorithm, but it’s likely that a number of inputs do correlate with race (e.g. the answer to the question “Was one of your parents ever sent to jail or prison?”). Training data which represents the history of a racially-biased criminal justice system could only ever produce an algorithm that perpetuates this bias.
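The proxy effect is easy to demonstrate on invented data. In the hypothetical population below (none of these numbers come from Northpointe, whose code and data remain closed), the protected attribute is never an input, true reoffending rates are identical across groups, and the only bias is in how past offences were recorded; the questionnaire proxy still carries that bias into the scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# A hypothetical synthetic population.
group = rng.integers(0, 2, n)       # protected attribute, NEVER an input
reoffend = rng.random(n) < 0.3      # identical true rate in both groups

# The questionnaire proxy: correlated with group because of historically
# biased policing, not because of any difference in behaviour.
parent_jailed = rng.random(n) < np.where(group == 1, 0.6, 0.05)

# Biased historical labels: reoffending by group 1 was recorded
# (i.e. led to arrest) far more often than reoffending by group 0.
arrested = reoffend & (rng.random(n) < np.where(group == 1, 0.9, 0.5))

# A one-variable 'risk model': P(arrest | proxy), fitted to history.
risk = {pj: float(arrested[parent_jailed == pj].mean())
        for pj in (False, True)}
scores = np.where(parent_jailed, risk[True], risk[False])

for g in (0, 1):
    print(f"group {g}: mean risk score {scores[group == g].mean():.3f}")
```

Despite identical underlying behaviour, group 1 receives systematically higher risk scores, because the model was trained to predict the biased historical record rather than reality.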

The second illustration of biases introduced by the selection of training data comes from Allegheny County, Pennsylvania, where the Office of Children, Youth and Families (CYF) child hotline uses a predictive risk model called AFST (Allegheny Family Screening Tool) to determine the likelihood that a child will be the victim of future abuse or neglect. It’s a sophisticated statistical model which takes as its training data Pennsylvania’s vast data warehouse. As Wired reported in 2018:

“The warehouse contains more than a billion records—an average of 800 for every resident of the county—provided by regular data extracts from a variety of public agencies, including child welfare, drug and alcohol services, Head Start, mental health services, the county housing authority, the county jail, the state’s Department of Public Welfare, Medicaid, and the Pittsburgh public schools.”

The tool, like many such algorithms, assigns a weighting to each of these factors to generate a risk score. These weightings are determined during the algorithm’s training phase, where they are fine-tuned so that, applied to the historical data, they would have predicted prior positives as accurately as possible. Statistically speaking, the algorithm is watertight. But when the team at CYF started using it on real, live cases they found that it often contradicted their own views, based on decades of experience and actual contact with their subjects. It frequently gave high risk scores to children who the team deemed to be in no imminent danger and low scores to children who the team suspected were already victims of abuse.

The algorithm was clearly biased in a number of ways. One, as Wired explains, was that the available data on actual cases of neglect and abuse was poor. Instead of using actual harm as the yardstick against which the algorithm was measured, the system used proxy variables, like a second call about the same child within a two-year period, or placement with a foster family. But perhaps more important was the social bias inherent in the makeup of the training data:

“[T]he system can only model outcomes based on the data it collects… A quarter of the variables that the AFST uses to predict abuse and neglect are direct measures of poverty: they track use of means-tested programs such as TANF, Supplemental Security Income, SNAP, and county medical assistance. Another quarter measure interaction with juvenile probation and CYF itself, systems that are disproportionately focused on poor and working-class communities, especially communities of color. Though it has been billed as a crystal ball for predicting child harm, in reality the AFST mostly just reports how many public resources families have consumed.

 “Allegheny County has an extraordinary amount of information about the use of public programs. But the county has no access to data about people who do not use public services. Parents accessing private drug treatment, mental health counseling, or financial support are not represented in DHS data. Because variables describing their behavior have not been defined or included in the regression, crucial pieces of the child maltreatment puzzle are omitted from the AFST.”

There is an old joke, popular in the scientific community, about a man searching for his keys under a lamppost at night. When asked where he dropped them, the man points off into the darkness. When asked why he is looking for them under the lamppost, the man replies “because this is where the light is”. An algorithm constructed by machine learning techniques might represent a very powerful set of eyes, but it can only envisage answers in areas that are illuminated by its training data. While it might see false positives under the lamppost, it will never find the keys lost in the darkness.

Thankfully, at least at the time the Wired article was written, the team at CYF were only using the AFST system alongside their tried and trusted methodologies for identifying at-risk children. While the biases of the system against poor families and families of colour are again masked by a seemingly objective and mathematically sound risk score, they can be ignored by human call screeners using their experience and intuition. Worryingly, such a fail-safe system does not seem to figure in the plans of at least five local authorities in the UK who are developing their own predictive analytics systems for child safeguarding. As the Guardian reported, this is happening at a time when, under austerity policies which have seen huge cuts to local authority funding, councils are slashing their budgets for services for children and vulnerable adults, raising fears that biased algorithms could soon be making life-altering decisions without human intervention.

Combating biased algorithms via disparate impact

Biases introduced by training data aren’t the only kind. In the construction of Allegheny County’s screening tool, the choice to use proxy variables instead of actual cases of neglect and abuse was made by a human, and countless similar decisions will be made during the design of any reasonably complex algorithm. Because of their all-too-human nature, the people designing algorithms will have their own, unrecognised cognitive biases which, through no fault of their own, will creep into their models. They and their supervisors may be under pressure to produce results quickly, tempting them to take shortcuts and design a model which seems to output sensible answers in a demo, but which wreaks havoc when used at scale. Or they might be susceptible to confirmation bias, and be convinced of the ability of their model to solve problems to which it is really not suited. As the Nobel laureate psychologist Daniel Kahneman notes in his book Thinking, Fast and Slow:

“I have always believed that scientific research is another domain where a form of optimism is essential to success: I have yet to meet a successful scientist who lacks the ability to exaggerate the importance of what he or she is doing, and I believe that someone who lacks a delusional sense of significance will wilt in the face of repeated experiences of multiple small failures and rare successes, the fate of most researchers.”

With automation and personalisation making rapid progress into almost every aspect of human society, a number of parties are working on approaches for mitigating the potential social costs of hidden algorithmic bias. These approaches fall into two main camps: those which aim to make the process of algorithm design more inclusive so that less harmful bias makes its way into software in the first place; and those which aim to regulate algorithms based on their real-world performance.

Joy Buolamwini, a research assistant at MIT Media Lab, founded the Algorithmic Justice League after working with facial recognition software that failed to ‘see’ her darker skin tones, despite being able to detect the faces of lighter-skinned colleagues without too much trouble. AJL is on a mission to educate people about bias in algorithms, help people voice their concerns and share their experiences, and “develop practices for accountability during the design, development, and deployment phases of coded systems”. Buolamwini is also trying to gather support for an Inclusive Coding movement. This aims to ensure that the people coding algorithms come from a wider variety of backgrounds so they can check each other’s blind spots, that they factor in fairness as they develop systems and that algorithm design in general promotes social change. As she said in a TEDx talk in 2016:

“We’ve used tools of computational creation to unlock immense wealth. We now have the opportunity to unlock even greater equality, if we make social change a priority and not an afterthought.”

While improving the diversity and intentions of those coding algorithms would be a step in the right direction, the examples above demonstrate not only that deep learning is capable of conjuring biases out of seemingly innocuous data, but also that those biases can remain hidden from the people in charge of their development. As the authors of Big Data’s Disparate Impact point out:

“Even in situations where data miners are extremely careful, they can still affect discriminatory results with models that, quite unintentionally, pick out proxy variables for protected classes.”

Remember, race wasn’t an input into the algorithm that assigned risk scores to defendants on trial in Broward County, Florida, but the algorithm nevertheless managed to construct some internal proxy for it. The purported purpose of the algorithm was to reduce incarceration of ‘low-risk’ defendants, so it arguably meets Buolamwini’s test of promoting social change. And it’s not clear that a more diverse team of algorithm designers would have avoided the same pitfalls.

The concept of ‘disparate impact’ in US employment law refers to situations where rules, practices and policies that are, on the surface, neutral, lead to outcomes which discriminate against protected groups. In the UK ‘indirect discrimination’, enshrined in the Equality Act 2010, is roughly equivalent. These concepts could be of great benefit to those seeking redress from discrimination due to automated systems that incorporate hidden, unintentional bias, but we’ve yet to see any major test cases against large tech firms or government agencies. When they arrive, they will be hotly contested, much talked about, and could even rewrite fundamental concepts like ‘discrimination’ and ‘fairness’. The battle for the soul of artificial intelligence is just getting started.
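One concrete yardstick already exists in this area: the ‘four-fifths rule’ used by the US Equal Employment Opportunity Commission, under which a selection rate for a protected group below 80% of the highest group’s rate is treated as evidence of adverse impact. Computing it is trivial (the figures below are invented for illustration):

```python
# Invented selection figures for two groups of applicants.
selected = {"group_a": 90, "group_b": 40}
applicants = {"group_a": 200, "group_b": 150}

rates = {g: selected[g] / applicants[g] for g in selected}
ratio = min(rates.values()) / max(rates.values())

print(f"selection rates: {rates}")
print(f"disparate impact ratio: {ratio:.2f}")
if ratio < 0.8:  # the EEOC's four-fifths guideline
    print("potential adverse impact under the four-fifths rule")
```

The hard part, of course, is not the arithmetic but deciding which outcomes to measure, for which groups, and what to do when the ratio falls short; that is exactly the territory the coming test cases will fight over.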