How SNCF Réseau And Olexya Migrated A Caffe2 Vision Pipeline To Managed Spot Training In Amazon SageMaker

“Being good is easy, what is difficult is being just.” ― Victor Hugo

“We need to defend the interests of those whom we’ve never met and never will.” ― Jeffrey D. Sachs

Note: This article is intended for a general audience to try and elucidate the complicated nature of unfairness in machine learning algorithms. As such, I have tried to explain concepts in an accessible way with minimal use of mathematics, in the hope that everyone can get something out of reading this.

Supervised machine learning algorithms are inherently discriminatory. They are discriminatory in the sense that they use information embedded in the features of data to separate instances into distinct categories — indeed, this is their designated purpose in life. This is reflected in the name for these algorithms which are often referred to as discriminative algorithms (splitting data into categories), in contrast to generative algorithms (generating data from a given category). When we use supervised machine learning, this “discrimination” is used as an aid to help us categorize our data into distinct categories within the data distribution, as illustrated below.

Illustration of discriminative vs. generative algorithms. Notice that generative algorithms draw data from a probability distribution constrained to a specific category (for example, the blue distribution), whereas discriminative algorithms aim to discern the optimal boundary between these distributions. Source: Stack Overflow

Whilst this occurs whenever we apply discriminative algorithms to a dataset, whether support vector machines, forms of parametric regression (e.g. vanilla linear regression), or non-parametric regression (e.g. random forest, neural networks, boosting), the outcomes do not necessarily have any moral implications. For example, using last week’s weather data to try and predict the weather tomorrow has no moral valence attached to it. However, when our dataset describes people, either directly or indirectly, this can inadvertently result in discrimination on the basis of group affiliation.

Clearly then, supervised learning is a dual-use technology. It can be used to our benefit, such as for information (e.g. predicting the weather) and protection (e.g. analyzing computer networks to detect attacks and malware). On the other hand, it has the potential to be weaponized to discriminate at essentially any level. This is not to say that the algorithms are evil for doing this. They are merely learning the representations present in the data, which may themselves have embedded within them the manifestations of historical injustices, as well as individual biases and proclivities. A common adage in data science is “garbage in = garbage out”, which refers to models being highly dependent on the quality of the data supplied to them. This can be stated analogously in the context of algorithmic fairness as “bias in = bias out”.

Data Fundamentalism

Some proponents believe in data fundamentalism, that is to say, that the data reflects the objective truth of the world through empirical observations.

“with enough data, the numbers speak for themselves.” — Former Wired editor-in-chief Chris Anderson (a data fundamentalist)

Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves. — Kate Crawford, principal researcher at Microsoft Research Social Media Collective

Superficially, this seems like a reasonable hypothesis, but Kate Crawford provides a good counterargument in a Harvard Business Review article:

Boston has a problem with potholes, patching approximately 20,000 every year. To help allocate its resources efficiently, the City of Boston released the excellent StreetBump smartphone app, which draws on accelerometer and GPS data to help passively detect potholes, instantly reporting them to the city. While certainly a clever approach, StreetBump has a signal problem. People in lower income groups in the US are less likely to have smartphones, and this is particularly true of older residents, where smartphone penetration can be as low as 16%. For cities like Boston, this means that smartphone data sets are missing inputs from significant parts of the population — often those who have the fewest resources. — Kate Crawford, principal researcher at Microsoft Research

Essentially, the StreetBump app picked up a preponderance of data from wealthy neighborhoods and relatively little from poorer neighborhoods. Naturally, the first conclusion you might draw from this is that the wealthier neighborhoods had more potholes, but in reality, there was just a lack of data from poorer neighborhoods because residents there were less likely to have smartphones and thus less likely to have downloaded the StreetBump app. Often, it is the data that we do not have in our dataset that can have the biggest impact on our results. This example illustrates a subtle form of discrimination on the basis of income. As a result, we should be cautious when drawing conclusions from data that may suffer from a ‘signal problem’. This signal problem is often characterized as sampling bias.
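To make the signal problem concrete, here is a minimal sketch in Python. All of the numbers are hypothetical (the 16% figure simply echoes the penetration rate quoted above), and the function names are ours for illustration. They are not part of StreetBump. The sketch assumes both neighborhoods have exactly the same number of potholes and that only smartphone owners can report them.

```python
def expected_reports(true_potholes, smartphone_share):
    """Potholes that get reported if only smartphone owners can report them."""
    return true_potholes * smartphone_share


def corrected_estimate(reports, smartphone_share):
    """Undo the sampling bias by dividing by the share of people who can report."""
    return reports / smartphone_share


# Hypothetical inputs: both neighborhoods have the same number of potholes.
true_potholes = 100
wealthy_share = 0.80  # assumed smartphone penetration
poorer_share = 0.16   # assumed smartphone penetration

wealthy_reports = expected_reports(true_potholes, wealthy_share)
poorer_reports = expected_reports(true_potholes, poorer_share)

# The raw report counts suggest the wealthy neighborhood has far more potholes.
naive_ratio = wealthy_reports / poorer_reports

# Accounting for who is able to report recovers the true count.
poorer_corrected = corrected_estimate(poorer_reports, poorer_share)
```

In this toy setup the raw reports differ by a factor of five even though the streets are identical. In practice the penetration rates themselves would have to be estimated, which is a problem in its own right.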

Another notable example is the “Correctional Offender Management Profiling for Alternative Sanctions” algorithm or COMPAS for short. This algorithm is used by a number of states across the United States to predict recidivism — the likelihood that a former criminal will re-offend. Analysis of this algorithm by ProPublica, an investigative journalism organization, sparked controversy when it seemed to suggest that the algorithm was discriminating on the basis of race — a protected class in the United States. To give us a better idea of what is going on, the algorithm used to predict recidivism looks something like this:

Recidivism Risk Score = (age * −w) + (age-at-first-arrest * −w) + (history of violence * w) + (vocation education * w) + (history of noncompliance * w)

It should be clear that race is not one of the variables used as a predictor. However, the data distribution between two given races may be significantly different for some of these variables, such as the ‘history of violence’ and ‘vocation education’ factors, based on historical injustices in the United States as well as demographic, social, and law enforcement statistics (which are often another target for criticism since they often use algorithms to determine which neighborhoods to patrol). The mismatch between these data distributions can be leveraged by an algorithm, leading to disparities between races and thus to some extent a result that is moderately biased towards or against certain races. These entrenched biases will then be operationalized by the algorithm and continue to persist as a result, leading to further injustices. This loop is essentially a self-fulfilling prophecy.

Historical Injustices → Training Data → Algorithmic Bias in Production

This leads to some difficult questions — do we remove these problematic variables? How do we determine whether a feature will lead to discriminatory results? Do we need to engineer a metric that provides a threshold for ‘discrimination’? One could take this to the extreme and remove almost all variables, but then the algorithm would be of no use. This paints a bleak picture, but fortunately, there are ways to tackle these issues that will be discussed later in this article.

These examples are not isolated incidents. Even breast cancer prediction algorithms show a level of unfair discrimination. Deep learning algorithms to predict breast cancer from mammograms are much less accurate for black women than white women. This is partly because the dataset used to train these algorithms is predominantly based on mammograms of white women, but also because the data distribution for breast cancer between black women and white women likely has substantial differences. According to the Centers for Disease Control and Prevention (CDC), “Black women and white women get breast cancer at about the same rate, but black women die from breast cancer at a higher rate than white women”.

Motives

These issues raise questions about the motives of algorithm developers. Did the individuals who designed these models do so knowingly? Do they have an agenda that they are trying to push and hide inside gray box machine learning models?

Although these questions are impossible to answer with certainty, it is useful to consider Hanlon’s razor when asking such questions:

Never attribute to malice that which is adequately explained by stupidity — Robert J. Hanlon

In other words, there are not that many evil people in the world (thankfully), and there are certainly fewer evil people in the world than there are incompetent people. On average, we should assume that when things go wrong it is more likely attributable to incompetence, naivety, or oversight than to outright malice. Whilst there are likely some malicious actors who would like to push discriminatory agendas, these are likely a minority.

Based on this assumption, what could have gone wrong? One could argue that statisticians, machine learning practitioners, data scientists, and computer scientists are not adequately taught how to develop supervised learning algorithms that control and correct for prejudicial proclivities.

Why is this the case?

In truth, techniques that achieve this do not exist. Machine learning fairness is a young subfield of machine learning that has been growing in popularity over the last few years in response to the rapid integration of machine learning into social realms. Computer scientists, unlike doctors, are not necessarily trained to consider the ethical implications of their actions. It is only relatively recently (one could argue since the advent of social media) that the designs or inventions of computer scientists were able to take on an ethical dimension.

This is demonstrated in the fact that most computer science journals do not require ethical statements or considerations for submitted manuscripts. If you take an image database full of millions of images of real people, this can without a doubt have ethical implications. By virtue of physical distance and the size of the dataset, computer scientists are so far removed from the data subjects that the implications on any one individual may be perceived as negligible and thus disregarded. In contrast, if a sociologist or psychologist performs a test on a small group of individuals, an entire ethical review board is set up to review and approve the experiment to ensure it does not transgress across any ethical boundaries.

On the bright side, this is slowly beginning to change. More data science and computer science programs are starting to require students to take classes on data ethics and critical thinking, and journals are beginning to recognize that ethical reviews through IRBs and ethical statements in manuscripts may be a necessary addition to the peer-review process. The rising interest in the topic of machine learning fairness is only strengthening this position.

Fairness in Machine Learning

Machine learning fairness has become a hot topic in the past few years. Image Source: CS 294: Fairness in Machine Learning course taught at UC Berkeley.

As mentioned previously, widespread adoption of supervised machine learning algorithms has raised concerns about algorithmic fairness. The more widely these algorithms are adopted, and the more control they have over our lives, the more pressing these concerns will become. The machine learning community is well aware of these challenges, and algorithmic fairness is now a rapidly developing subfield of machine learning with many excellent researchers such as Moritz Hardt, Cynthia Dwork, Solon Barocas, and Michael Feldman.

That being said, there are still major hurdles to overcome before we can achieve truly fair algorithms. It is fairly easy to prevent disparate treatment in algorithms, that is, the explicit differential treatment of one group over another. One way is to remove the variables that correspond to protected attributes (e.g. race, gender) from the dataset. However, it is much harder to prevent disparate impact, the implicit differential treatment of one group over another, which is usually caused by something called redundant encodings in the data.

Illustration of disparate impact — in this diagram the data distribution of two groups is very different, which leads to differences in the output of the algorithm without any explicit association of the groups. Source: KdNuggets

A redundant encoding tells us information about a protected attribute, such as race or gender, based on features present in our dataset that correlate with these attributes. For example, buying certain products online (such as makeup) may be highly correlated with gender, and certain zip codes may have different racial demographics that an algorithm might pick up on.
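A small sketch can show why dropping the protected column is not enough. The records below are entirely made up, and `proxy_accuracy` is a hypothetical helper of ours. It measures how often the protected attribute can be guessed from a single remaining feature (here, a zip code) by always predicting the most common group for that feature value.

```python
from collections import Counter, defaultdict


def proxy_accuracy(records):
    """Share of records whose protected attribute is guessed correctly
    when we predict the most common attribute value for each proxy value."""
    by_proxy = defaultdict(Counter)
    for proxy, protected in records:
        by_proxy[proxy][protected] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_proxy.values())
    return correct / len(records)


def baseline_accuracy(records):
    """Share guessed correctly when we ignore the proxy and always predict
    the overall most common attribute value."""
    overall = Counter(protected for _, protected in records)
    return overall.most_common(1)[0][1] / len(records)


# Hypothetical data: (zip code, protected group) pairs.
records = (
    [("zip_A", "group_1")] * 45
    + [("zip_A", "group_2")] * 5
    + [("zip_B", "group_1")] * 10
    + [("zip_B", "group_2")] * 40
)

with_proxy = proxy_accuracy(records)        # 85 of 100 correct
without_proxy = baseline_accuracy(records)  # 55 of 100 correct
```

In this made-up dataset the zip code alone recovers the group for 85% of people, compared with 55% when guessing blindly, so a model trained without the protected column can still reconstruct much of it.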

Although an algorithm is not trying to discriminate along these lines, it is inevitable that data-driven algorithms that surpass human performance on pattern recognition tasks will pick up on these associations embedded within data, however small they may be. Additionally, if these associations were non-informative (i.e. they did not increase the accuracy of the algorithm) then the algorithm would ignore them, meaning that some information is clearly embedded in these protected attributes. This raises many challenges for researchers, such as:

Is there a fundamental tradeoff between fairness and accuracy? Are we able to extract relevant information from protected features without them being used in a discriminatory way?
What is the best statistical measure to embed the notion of ‘fairness’ within algorithms?
How can we ensure that governments and companies produce algorithms that protect individual fairness?
What biases are embedded in our training data and how can we mitigate their influence?

We will touch upon some of these questions in the remainder of the article.

The Problem with Data

In the last section, it was mentioned that redundant encodings can lead to features correlating with protected attributes. As our data set scales in size, the likelihood of the presence of these correlations scales accordingly. In the age of big data, this presents a big problem: the more data we have access to, the more information we have at our disposal to discriminate. This discrimination does not have to be purely race- or gender-based, it could manifest as discrimination against individuals with pink hair, against web developers, against Starbucks coffee drinkers, or a combination of all of these groups. In this section, several biases present in training data and algorithms are presented that complicate the creation of fair algorithms.

The Majority Bias

Algorithms have no affinity for any particular group. However, they do have a proclivity for the majority group due to their statistical basis. As outlined by Professor Moritz Hardt in a Medium article, classifiers generally improve with the number of data points used to train them since the error scales with the inverse square root of the number of samples, as shown below.

The error of a classifier often decreases as the inverse square root of the sample size. Four times as many samples means halving the error rate.
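The scaling rule can be checked with a few lines of Python. This is only a sketch of the rule of thumb, not a statement about any particular classifier: `approx_error` and its `constant` argument are hypothetical, and the sample sizes are made up.

```python
import math


def approx_error(n_samples, constant=1.0):
    """Rule-of-thumb error that shrinks as 1 / sqrt(n).
    `constant` is a hypothetical problem-dependent scale factor."""
    return constant / math.sqrt(n_samples)


# Quadrupling the sample size halves the error.
error_small = approx_error(2_500)
error_large = approx_error(10_000)

# Hypothetical majority and minority groups with very different sample sizes.
majority_error = approx_error(10_000)
minority_error = approx_error(625)
error_ratio = minority_error / majority_error
```

Under this rule, a group with 16 times fewer samples ends up with roughly 4 times the error, which is the gap the majority bias refers to.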

This leads to an unsettling reality: since there will, by definition, always be less data available about minorities, our models will tend to perform worse on those groups than on the majority. This is only true if the majority and minority groups are drawn from separate distributions. If they are drawn from a single distribution, then increasing the sample size will be equally beneficial to both groups.

An example of this is the breast cancer detection algorithms we discussed previously. For this deep learning model, developed by researchers at MIT, of the 60,000 mammogram images in the dataset used to train the neural network, only 5% were mammograms of black women, who are 43% more likely to die from breast cancer. As a result, the algorithm performed more poorly when tested on black women, and on minority groups in general. This could partially be accounted for by the fact that breast cancer often manifests at an earlier age among women of color, which indicates a disparate impact because the data distribution for women of color was underrepresented in the training set.

This also presents another important question. Is accuracy a suitable proxy for fairness? In the above example, we assumed that a lower classification accuracy on a minority group corresponds to unfairness. However, due to the widely differing definitions and the somewhat ambiguous nature of fairness, it can sometimes be difficult to ensure that the variable we are measuring is a good proxy for fairness. For example, our algorithm may have 50% accuracy for both black and white women, but if there are 30% false positives for white women and 30% false negatives for black women, this would also be indicative of disparate impact.
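The point that equal accuracy can hide unequal errors is easy to verify. The confusion-matrix counts below are hypothetical and were chosen only so that both groups reach 50% accuracy. The `summarize` helper is ours for illustration.

```python
def summarize(tp, tn, fp, fn):
    """Accuracy plus the share of all predictions that are false positives
    or false negatives, computed from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_share": fp / total,
        "false_negative_share": fn / total,
    }


# Hypothetical counts for two groups of 100 people each.
group_a = summarize(tp=30, tn=20, fp=30, fn=20)
group_b = summarize(tp=20, tn=30, fp=20, fn=30)

same_accuracy = group_a["accuracy"] == group_b["accuracy"]
```

Both groups score 50% accuracy, yet group A's errors are mostly false positives while group B's are mostly false negatives. A single accuracy number would not reveal this.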

From this example, it seems almost intuitive that this is a form of discrimination since there is differential treatment on the basis of group affiliation. However, there are times when this group affiliation is informative to our prediction. For example, for an e-commerce website trying to decide what content to show its users, having an idea of the individual’s gender, age, or socioeconomic status is incredibly helpful. This implies that if we merely remove protected fields from our data, we will decrease the accuracy (or some other performance metric) of our model. Similarly, if we had sufficient data on both black and white women for the breast cancer model, we could develop an algorithm that used race as one of the inputs. Due to the differences in data distributions between the races, it is likely that the accuracy would have increased for both groups.

Thus, the ideal case would be to have an algorithm that contains these protected features and uses them to make algorithmic generalizations but is constrained by fairness metrics to prevent the algorithm from discriminating.

This is an idea proposed by Moritz Hardt, Eric Price, and Nathan Srebro in ‘Equality of Opportunity in Supervised Learning’. It has several advantages over other metrics, such as statistical parity and equalized odds, and we will discuss all three of these methods in the next section.

Definitions of Fairness

In this section we analyze some of the notions of fairness that have been proposed by machine learning fairness researchers: statistical parity, and then more nuanced variants of it such as equality of opportunity and equalized odds.

Statistical Parity

Statistical parity is the oldest and simplest method of enforcing fairness. It is expanded upon greatly in the arXiv article “Algorithmic decision making and the cost of fairness”. The formula for statistical parity is shown below.

The formula for statistical parity. In words, the outcome y is independent of the group attribute p: p has no impact on the outcome probability.
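Since the formula itself appears only as an image, here is one common way to write the condition for a binary outcome and two groups. The notation is our assumption: \hat{y} is the predicted outcome and p is the group attribute from the caption above.

```latex
P(\hat{y} = 1 \mid p = 0) = P(\hat{y} = 1 \mid p = 1)
```

Equivalently, P(\hat{y} \mid p) = P(\hat{y}), which is the statement that the prediction is independent of group membership.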

For statistical parity, the outcome will be independent of my group affiliation. What does this mean intuitively? It means that the same proportion of each group will be classified as positive or negative. For this reason, we can also describe statistical parity as demographic parity. For all demographic groups subsumed within p, statistical parity will be enforced.

For a dataset that has not had statistical parity applied, we can measure how far our predictions deviate from statistical parity by calculating the statistical parity distance shown below.

The statistical parity distance can be used to quantify the extent to which a prediction deviates from statistical parity.
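The distance formula is likewise not reproduced in the text. A natural form, assuming the same notation as for statistical parity (\hat{y} for the predicted outcome, p for the group attribute), is the absolute difference in positive prediction rates between the two groups. Other references may define or name it slightly differently.

```latex
\text{SPD} = \left| \, P(\hat{y} = 1 \mid p = 0) - P(\hat{y} = 1 \mid p = 1) \, \right|
```

A value of 0 means statistical parity holds exactly, and larger values indicate a larger gap between the groups.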

This distance can provide us with a metric for how fair or unfair a given dataset is based on the group affiliation p.

What are the tradeoffs of using statistical parity?

Statistical parity doesn’t ensure fairness.

As you may have noticed though, statistical parity says nothing about the accuracy of these predictions. One group may be much more likely to be predicted as positive than another, and hence we might obtain large disparities between the false positive and true positive rates for each group. This itself can cause a disparate impact as qualified individuals from one group (p=0) may be missed out in favor of unqualified individuals from another group (p=1). In this sense, statistical parity is more akin to equality of outcome.

The figures below illustrate this nicely. If we have two groups — one with 10 people (group A=1), and one with 5 people (group A=0) — and we determine that 8 people (80%) in group A=1 achieved a score of Y=1, then 4 people (80%) in group A=0 would also have to be given a score of Y=1, regardless of other factors.
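The worked example above can be reproduced in a few lines. This is a sketch only: the helper names are ours, the distance is taken to be the absolute difference in positive rates, and the second comparison uses a made-up set of predictions that violates parity.

```python
def positive_rate(predictions):
    """Share of a group that receives the positive outcome (Y=1)."""
    return sum(predictions) / len(predictions)


def statistical_parity_distance(group_0, group_1):
    """Absolute gap between the positive rates of the two groups."""
    return abs(positive_rate(group_0) - positive_rate(group_1))


# Group A=1: 10 people, 8 of whom receive Y=1.
group_a1 = [1] * 8 + [0] * 2
# Group A=0: 5 people, 4 of whom receive Y=1.
group_a0 = [1] * 4 + [0] * 1

parity_gap = statistical_parity_distance(group_a0, group_a1)

# Hypothetical alternative: only 2 of the 5 people in A=0 receive Y=1.
group_a0_unfair = [1] * 2 + [0] * 3
unfair_gap = statistical_parity_distance(group_a0_unfair, group_a1)
```

Here `parity_gap` is zero because both groups have an 80% positive rate, while the alternative predictions leave a gap of 40 percentage points. Note that nothing in this calculation looks at whether the positive predictions were correct.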

Illustration of statistical parity. Source: Duke University Privacy & Fairness in Data Science Lecture Notes

Statistical parity reduces algorithmic accuracy

The second problem with statistical parity is that a protected class may provide some information that would be useful for a prediction, but we are unable to leverage that information because of the strict rule imposed by statistical parity. Gender might be very informative for making predictions about items that people might buy, but if we are prevented from using it, our model becomes weaker and accuracy is impacted. A better method would allow us to account for the differences between these groups without generating disparate impact. Clearly, statistical parity is misaligned with the fundamental goal of accuracy in machine learning — the perfect classifier may not ensure demographic parity.

For these reasons, statistical parity is no longer considered a credible option by several machine learning fairness researchers. However, statistical parity is a simple and useful starting point that other definitions of fairness have built upon.

There are slightly more nuanced versions of statistical parity, such as true positive parity, false positive parity, and positive rate parity.

True Positive Parity (Equality of Opportunity)

This is only possible for binary predictions and performs statistical parity on true positives (the prediction output was 1 and the true output was also 1).

Equality of opportunity is the same as equalized odds, but is focused on the y=1 label.

It ensures that in both groups, of all those who qualified (Y=1), an equal proportion of individuals will be classified as qualified (C=1). This is useful when we are only interested in parity over the positive outcome.
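Written as a formula, using Y for the true label, C for the classifier output, and A for the group as in the examples in this section, the condition is that the true positive rate is the same in both groups. The exact notation is our assumption.

```latex
P(C = 1 \mid Y = 1, A = 0) = P(C = 1 \mid Y = 1, A = 1)
```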

Illustration of true positive parity. Notice that in the first group, all those with Y=1 (blue boxes) were classified as positives (C=1). Similarly, in the second group, all those with Y=1 were also classified as positive, but there was an additional false positive. This false positive is not taken into account by true positive parity. Source: Duke University Privacy & Fairness in Data Science Lecture Notes

False Positive Parity

This is also only applicable to binary predictions and focuses on false positives (the prediction output was 1 but the true output was 0). This is analogous to true positive parity but provides parity across false positive results instead.

Positive Rate Parity (Equalized Odds)

This is a combination of statistical parity for true positives and false positives simultaneously and is also known as equalized odds.

Illustration of positive rate parity (equalized odds). Notice that in the first group, all those with Y=1 (blue boxes) were classified as positives (C=1). Similarly, in the second group, all those with Y=1 were also classified as positive. Of the population in A=1 that obtained Y=0, one was classified as C=1, giving a 50% false positive rate. Similarly, in the second group, two of the individuals with Y=0 were given C=1, corresponding to a 50% false positive rate. Source: Duke University Privacy & Fairness in Data Science Lecture Notes

Notice that for equal opportunity, we relax the condition of equalized odds that odds must be equal in the case that Y=0. Equalized odds and equality of opportunity are also more flexible and able to incorporate some of the information from the protected variable without resulting in disparate impact.
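The difference between the two criteria can be sketched in code. The labels below are hypothetical: they are loosely modeled on the illustrations (every Y=1 classified as positive, a 50% false positive rate in the first two groups), and the helper names and the tolerance are our own choices.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for one group."""
    tp = sum(1 for y, c in zip(y_true, y_pred) if y == 1 and c == 1)
    fp = sum(1 for y, c in zip(y_true, y_pred) if y == 0 and c == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def equal_opportunity(group_0, group_1, tol=1e-9):
    """True positive rates match across the two groups."""
    return abs(rates(*group_0)[0] - rates(*group_1)[0]) <= tol


def equalized_odds(group_0, group_1, tol=1e-9):
    """Both true positive and false positive rates match."""
    tpr_0, fpr_0 = rates(*group_0)
    tpr_1, fpr_1 = rates(*group_1)
    return abs(tpr_0 - tpr_1) <= tol and abs(fpr_0 - fpr_1) <= tol


# Each group is (true labels Y, classifier outputs C); all values are made up.
group_a1 = ([1, 1, 1, 0, 0], [1, 1, 1, 1, 0])        # TPR 1.0, FPR 0.5
group_a0 = ([1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0])  # TPR 1.0, FPR 0.5
group_b = ([1, 1, 0, 0], [1, 1, 0, 0])               # TPR 1.0, FPR 0.0

odds_hold = equalized_odds(group_a0, group_a1)
opportunity_only = equal_opportunity(group_b, group_a1)
odds_fail = equalized_odds(group_b, group_a1)
```

The first pair satisfies equalized odds. The pair involving `group_b` satisfies equality of opportunity but not equalized odds, because the false positive rates differ. That is exactly the condition equal opportunity relaxes.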

Notice that whilst all of these provide some form of a solution that can be argued to be fair, none of these are particularly satisfying. One reason for this is that there are many conflicting definitions of what fairness entails, and it is difficult to capture these in algorithmic form. These are good starting points but there is still much room for improvement.

Other Methods to Increase Fairness

Statistical parity, equalized odds, and equality of opportunity are all great starting points, but there are other things we can do to ensure that algorithms are not used to unduly discriminate against individuals. Two such solutions which have been proposed are human-in-the-loop and algorithmic transparency.

Human-in-the-Loop

This sounds like some kind of rollercoaster ride, but it merely refers to a paradigm whereby a human oversees the algorithmic process. Human-in-the-loop is often implemented in situations that have high risks if the algorithm makes a mistake. For example, missile detection systems that inform the military when a missile is detected allow individuals to review the situation and decide how to respond — the algorithm does not respond without human interaction. Just imagine the catastrophic consequences of running nuclear weapon systems with AI that had permission to fire when they detected a threat — one false positive and the entire world would be doomed.

Another example of this is the COMPAS system for recidivism: the system does not categorize you as a recidivist and make a legal judgment. Instead, the judge reviews the COMPAS score and uses this as a factor in their evaluation of the circumstances. This raises new questions, such as how humans interact with the algorithmic system. Studies using Amazon Mechanical Turk have shown that some individuals follow the algorithm’s judgment wholeheartedly because they perceive it to have greater knowledge than a human is likely to have, others take its output with a pinch of salt, and some ignore it completely. Research into human-in-the-loop is relatively novel, but we are likely to see more of it as machine learning becomes more pervasive in our society.

Another important and similar concept is human-on-the-loop. This is similar to human-in-the-loop, but instead of the human being actively involved in the process, they are passively involved in the algorithm’s oversight. For example, a data analyst might be in charge of monitoring sections of an oil and gas pipeline to ensure that all of the sensors and processes are running appropriately and there are no concerning signals or errors. This analyst is in an oversight position but is not actively involved in the process. Human-on-the-loop is inherently more scalable than human-in-the-loop since it requires less manpower, but it may be untenable in certain circumstances — such as looking after those nuclear missiles!

Algorithmic Transparency

The dominant position in the legal literature for fairness is through algorithmic interpretability and explainability via transparency. The argument is that if an algorithm is able to be viewed publicly and analyzed with scrutiny, then it can be ensured with a high level of confidence that there is no disparate impact built into the model. Whilst this is clearly desirable on many levels, there are some downsides to algorithmic transparency.

Proprietary algorithms by definition cannot be transparent.

From a commercial standpoint, this idea is untenable in most circumstances — trade secrets or proprietary information may be leaked if algorithms and business processes are provided for all to see. Imagine Facebook or Twitter being asked to release their algorithms to the world so they can be scrutinized to ensure there are no biasing issues. Most likely I could download their code and go and start my own version of Twitter or Facebook pretty easily. Full transparency is only really an option for algorithms used in public services, such as by the government (to some extent), healthcare, the legal system, etc. Since legal scholars are predominantly concerned with the legal system, it makes sense that this remains the consensus at the current time.

In the future, regulations on algorithmic fairness may be a more tenable solution than algorithmic transparency for private companies that have a vested interest in keeping their algorithms from the public eye. Andrew Tutt discusses this idea in his paper “An FDA For Algorithms”, which focuses on the development of a regulatory body similar to the FDA to regulate algorithms. Algorithms could be submitted to the regulatory body, or perhaps to third-party auditing services, and analyzed to ensure they are suitable to be used without resulting in disparate impact.

Clearly, such an idea would require large amounts of discussion, money, and expertise to implement, but this seems like a potentially workable solution from my perspective. There is still a long way to go to ensure our algorithms are free of both disparate treatment and disparate impact. With a combination of regulations, transparency, human-in-the-loop, human-on-the-loop, and new and improved variations of statistical parity, we are part of the way there, but this field is still young and there is much work to be done — watch this space.

Final Comments

In this article, we have discussed at length multiple biases present within training data due to the way in which it is collected and analyzed. We have also discussed several ways in which to mitigate the impact of these biases and to help ensure that algorithms remain non-discriminatory towards minority groups and protected classes.

Although machine learning, by its very nature, is always a form of statistical discrimination, the discrimination becomes objectionable when it places certain privileged groups at a systematic advantage and certain unprivileged groups at a systematic disadvantage. Biases in training data, due to either prejudice in labels or under-/over-sampling, yield models with unwanted bias.

Some might say that these decisions were previously made with less information and by humans, who can have many implicit and cognitive biases influencing their decisions. Automating these decisions provides more accurate results and to a large degree limits the extent of these biases. The algorithms do not need to be perfect, just better than what previously existed. The arc of history curves towards justice.

Some might say that algorithms are being given free rein to allow inequalities to be systematically instantiated, or that data itself is inherently biased. On this view, variables related to protected attributes should be removed from the data to help mitigate these issues, along with any variable correlated with them.

Both groups would be partially correct. However, we should not remain satisfied with unfair algorithms, since there is always room for improvement. Similarly, we should not waste all of the data we have by removing all variables, as this would make systems perform much worse and would render them much less useful. That being said, at the end of the day, it is up to the creators of these algorithms and oversight bodies, as well as those in charge of collecting data, to try to ensure that these biases are handled appropriately.

Data collection and sampling procedures are often glossed over in statistics classes, and are not well understood by the general public. Until such a time as a regulatory body appears, it is up to machine learning engineers, statisticians, and data scientists to ensure that equality of opportunity is embedded in our machine learning practices. We must be mindful of where our data comes from and what we do with it. Who knows who our decisions might impact in the future?

“The world isn’t fair, Calvin.”
“I know Dad, but why isn’t it ever unfair in my favor?”
― Bill Watterson, The Essential Calvin and Hobbes: A Calvin and Hobbes Treasury
