Breaking down Machine Bias#
This notebook was created by Jonathan Stray for the Algorithms course in 2017's summer Lede Program. The repository for the course is located here
This notebook explores the classic ProPublica story Machine Bias. It uses the original data that the reporters collected for the story, through FOIA requests to Broward County, Florida.
The COMPAS score uses answers to 137 questions to assign a risk score to defendants -- essentially a probability of re-arrest. The actual output is two-fold: a risk rating of 1-10 and a "low", "medium", or "high" risk label
This analysis is based on ProPublica's original notebook
There has been a lot of discussion about this story and its particular definition of fairness. The best overall reference is the Fairness in Machine Learning NIPS 2017 Tutorial by Solon Barocas and Moritz Hardt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from pandas.plotting import scatter_matrix
from sklearn import metrics
%matplotlib inline
This notebook is designed to let you select between data on arrests for non-violent or violent crimes. This allows quick comparisons of the difference between these two data sets.
There is some reason to suspect that arrest data for violent crime is both more accurate and less biased than non-violent crime data. See e.g. Skeem and Lowenkamp. Also, we do get more accurate predictors with the violent data (including COMPAS).
violent = False
if violent:
    fname = 'data/compas-scores-two-years-violent.csv'
    decile_col = 'v_decile_score'
    score_col = 'v_score_text'
else:
    fname = 'data/compas-scores-two-years.csv'
    decile_col = 'decile_score'
    score_col = 'score_text'
cv = pd.read_csv(fname)
cv.head()
cv.columns
Following ProPublica, we filter out certain rows which are missing data. As they put it:
- If the charge date of a defendants Compas scored crime was not within 30 days from when the person was arrested, we assume that because of data quality reasons, that we do not have the right offense.
- We coded the recidivist flag -- is_recid -- to be -1 if we could not find a compas case at all.
- In a similar vein, ordinary traffic offenses -- those with a c_charge_degree of 'O' -- will not result in Jail time are removed
- We filtered the underlying data from Broward county to include only those rows representing people who had either recidivated in two years, or had at least two years outside of a correctional facility.
cv = cv[
    (cv.days_b_screening_arrest <= 30) &
    (cv.days_b_screening_arrest >= -30) &
    (cv.is_recid != -1) &
    (cv.c_charge_degree != 'O') &
    (cv[score_col] != 'N/A')
]
cv.reset_index(inplace=True, drop=True) # renumber the rows from 0 again
cv.shape
1. A first look at the data#
Let's do some basic analysis on the demographics
# age value counts
cv.age_cat.value_counts()
# race value counts
cv.race.value_counts()
The COMPAS model predictions are in decile_score, from 1 to 10, and as low/medium/high labels in score_text (v_decile_score and v_score_text in the violent data set).
# COMPAS decile score value counts
cv[decile_col].value_counts()
# COMPAS text score value counts
cv[score_col].value_counts()
We can look at the decile scores for white and black defendants to get our first look at how the COMPAS algorithm handles different races.
# Histogram of decile scores for White
cv[cv.race == 'Caucasian'][decile_col].plot(kind='hist', title="White Defendants' Decile Scores")
# Histogram of decile scores for Black
cv[cv.race == 'African-American'][decile_col].plot(kind='hist', title="Black Defendants' Decile Scores")
Meanwhile the two_year_recid field records whether or not each person was re-arrested within two years (for a violent offense, in the violent data set), which is what COMPAS is trying to predict.
# recidivism value counts
cv.two_year_recid.value_counts()
Now we can start looking at the relationships between these variables. First, recidivism by race.
# recidivism rates by race
recid_race = pd.crosstab(cv.race, cv.two_year_recid)
recid_race['rate'] = recid_race[1] / recid_race.sum(axis=1)
recid_race
Similarly for sex:
# recidivism rates by sex
recid_sex = pd.crosstab(cv.sex, cv.two_year_recid)
recid_sex['rate'] = recid_sex[1] / recid_sex.sum(axis=1)
recid_sex
There are significant differences in recidivism in this population by race and gender. These are the "base rates" we will talk about more. However, there may also be significant differences in the composition of these populations -- they may have different ages, criminal histories, etc.
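As a small aside, the same kind of base rate can be read directly off a row-normalized crosstab. The sketch below uses a tiny made-up table (the `toy` frame, its `group` column and its values are hypothetical, not the Broward County data) and relies on the `normalize='index'` option of `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the COMPAS data)
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'two_year_recid': [1, 0, 0, 0, 1, 1, 1, 0],
})

# normalize='index' divides each row by its row total,
# so column 1 holds the base rate for each group
base_rates = pd.crosstab(toy.group, toy.two_year_recid, normalize='index')
print(base_rates[1])
```

This gives the rate column in one step, at the cost of no longer showing the raw counts next to it.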
Let's see how the COMPAS risk scores break down by race and gender.
# high risk rates by race
score_race = pd.crosstab(cv.race, cv[score_col])
score_race['High risk rate'] = score_race['High'] / score_race.sum(axis=1)
score_race
# high risk rates by sex
score_sex = pd.crosstab(cv.sex, cv[score_col])
score_sex['High risk rate'] = score_sex['High'] / score_sex.sum(axis=1)
score_sex
Generally, the fraction of people assigned a high risk is greater where the recidivism rates are also higher.
2. Predictive calibration and accuracy#
Being "accurate" in a predictive sense is only one type of "fairness," as we shall see, but it's still a desirable characteristic.
Let's start by looking at the proportion of people who are re-arrested in each decile score.
# probability of recidivism by decile
# select the outcome column before averaging, so non-numeric columns are never averaged
cv.groupby(decile_col)['two_year_recid'].mean().plot(kind='bar')
# probability of recidivism by decile and race
b = cv[cv.race=='African-American'].groupby(decile_col)['two_year_recid'].mean()
w = cv[cv.race=='Caucasian'].groupby(decile_col)['two_year_recid'].mean()
a = pd.concat([w,b], axis=1)
a.columns = ['White','Black']
a.plot.bar()
The outcome variable two_year_recid is the actually observed result in the world, and it is binary -- was this person re-arrested within two years of their initial arrest and risk score assignment? To work with this data further we're going to simplify the COMPAS scores by thresholding them into a binary variable as well. Following their methodology, ProPublica splits "low" from "medium or high" risk.
Using this binary prediction variable lets us compute a confusion matrix for the COMPAS algorithm.
# COMPAS recidivism confusion matrix
cv['guessed_recid'] = cv[score_col] != 'Low'
cv['actual_recid'] = cv.two_year_recid == 1
cm = pd.crosstab(cv.actual_recid, cv.guessed_recid)
cm # for "confusion matrix"
All of the information about binary classifier performance and error (for a particular group) is in a 2x2 confusion matrix (also called a contingency table.) But we're usually interested in rates as opposed to raw numbers, so we're going to convert this table into the following values:
- Accuracy: the fraction of guesses that were correct
- Precision or Positive Predictive Value: of the people we guessed would recidivate, what fraction did?
- False Positive Rate: of the people who didn't recidivate, what fraction did we guess would?
- False Negative Rate: of the people who did recidivate, what fraction did we guess would not?
There's a wonderful little diagram on the quantitative definitions of fairness page that shows how all of these relate, and Wikipedia is also a good reference.
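To make these four definitions concrete, here is a minimal sketch that computes them from the four cells of a confusion matrix. The helper name `rates` and the counts passed to it are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the COMPAS data.

```python
def rates(TP, FP, TN, FN):
    """Return accuracy, PPV, FPR and FNR from the four confusion-matrix counts."""
    P = TP + FN   # everyone who actually recidivated
    N = TN + FP   # everyone who actually did not
    return {
        'accuracy': (TP + TN) / (P + N),  # correct guesses over all guesses
        'ppv': TP / (TP + FP),            # of those guessed positive, fraction correct
        'fpr': FP / N,                    # of actual negatives, fraction guessed positive
        'fnr': FN / P,                    # of actual positives, fraction guessed negative
    }

# Hypothetical counts: 100 people in total
example = rates(TP=30, FP=20, TN=40, FN=10)
print(example)
```

Note that PPV is normalized by the number of positive guesses, while FPR and FNR are normalized by the number of actual negatives and positives. That difference in denominators matters a great deal when we compare groups.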
# The usual definitions. First index is predicted, second is actual
TN = cm[False][False]
TP = cm[True][True]
FN = cm[False][True]
FP = cm[True][False]
About 63% of those scored as medium or high risk end up getting arrested again within two years. This is the Positive Predictive Value (PPV) or Precision.
# PPV
TP / (TP + FP)
Of those who did not go on to be re-arrested, about 30% were classified as medium or high risk. This is the False Positive Rate (FPR).
# FPR
FP / (FP + TN)
It may help to understand many of these formulas if we define variables for the total number of true positive and negative cases:
P = TP + FN
N = TN + FP
# Equivalent definition of FPR that might be easier to understand, N in denominator
FP / N
We can also calculate the False Negative Rate (FNR): of those who were re-arrested, the fraction who had been classified as low risk.
# FNR
FN / (FN + TP)
# Alternate form with P in denominator
FN / P
To study the difference between races, let's define a few helper functions.
# cm is a confusion matrix. The rows are guessed, the columns are actual
def print_ppv_fpv(cm):
    # the indices here are [col][row] or [actual][guessed]
    TN = cm[False][False]
    TP = cm[True][True]
    FN = cm[True][False]
    FP = cm[False][True]
    print('Accuracy: ', (TN+TP)/(TN+TP+FN+FP))
    print('PPV: ', TP / (TP + FP))
    print('FPR: ', FP / (FP + TN))
    print('FNR: ', FN / (FN + TP))
    print()

def print_metrics(guessed, actual):
    cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
    print(cm)
    print()
    print_ppv_fpv(cm)
print('White')
subset = cv[cv.race == 'Caucasian']
print_metrics(subset.guessed_recid, subset.actual_recid)
print('Black')
subset = cv[cv.race == 'African-American']
print_metrics(subset.guessed_recid, subset.actual_recid)
And here is the statistical core of ProPublica's story: the False Positive Rate is substantially higher for black defendants.
However, also note that the PPV is similar between black and white defendants. In fact, the lower PPV for white defendants means the score has greater predictive accuracy for black defendants. Here "accuracy" means the PPV: of the people that COMPAS guessed would be re-arrested, the proportion that actually were.
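One way to see how a similar PPV and a very different FPR can coexist: the definitions of PPV, FPR and FNR together imply FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the group's base rate. The sketch below checks this identity on two hypothetical groups with the same PPV and FNR but different base rates. The function names and all counts are made up for illustration and are not the COMPAS results.

```python
def fpr_from_identity(base_rate, ppv, fnr):
    # FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), which follows from the definitions
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

def fpr_from_counts(TP, FP, TN, FN):
    # direct definition: false positives over all actual negatives
    return FP / (FP + TN)

# Hypothetical group 1: base rate 0.4, PPV 0.6, FNR 0.25
g1 = dict(TP=30, FP=20, TN=40, FN=10)
# Hypothetical group 2: base rate 0.6, same PPV 0.6 and FNR 0.25
g2 = dict(TP=45, FP=30, TN=10, FN=15)

fpr1 = fpr_from_counts(**g1)
fpr2 = fpr_from_counts(**g2)
print(fpr1, fpr_from_identity(0.4, 0.6, 0.25))
print(fpr2, fpr_from_identity(0.6, 0.6, 0.25))
```

In this toy setup the group with the higher base rate ends up with the higher False Positive Rate even though PPV and FNR are identical, so when base rates differ, holding PPV equal does not by itself equalize error rates.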
3. Logistic regression to build our own predictor#
We are going to use logistic regression to try to build our own predictor, just from the information we have. This is actually quite a lot:
- Age
- Sex
- Felony or Misdemeanor charge (c_charge_degree)
- Number of prior arrests (priors_count)
And we'll try this both with and without race as a predictive factor, too.
# build up dummy variables for age, sex, and charge degree
features = pd.concat(
    [pd.get_dummies(cv.age_cat, prefix='age'),
     pd.get_dummies(cv.sex, prefix='sex'),
     pd.get_dummies(cv.c_charge_degree, prefix='degree'), # felony or misdemeanor charge ('F' or 'M')
     cv.priors_count],
    axis=1)
# We should have one less dummy variable than the number of categories, to avoid the "dummy variable trap"
# See https://www.quora.com/When-do-I-fall-in-the-dummy-variable-trap
features.drop(['age_25 - 45', 'sex_Female', 'degree_M'], axis=1, inplace=True)
# Try to predict whether someone is re-arrested
target = cv.two_year_recid
x = features.values
y = target.values
lr = LogisticRegression()
lr.fit(x,y)
This is a logistic regression, so the coefficients are log odds ratios; exponentiating them gives odds ratios. Let's look at them to see what weights it used to make its predictions.
# Examine regression coefficients
coeffs = pd.DataFrame(np.exp(lr.coef_), columns=features.columns)
coeffs
The model thinks that (for the non-violent data set):
- being young (<25) more than doubles your odds of recidivism
- but being >45 years old roughly halves your odds
- being male increases your odds by 40%
- every prior arrest increases your odds by 18%
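Keep in mind that these are changes in odds, not in probability. As a rough sketch of the difference, the hypothetical helper `apply_odds_ratio` below converts a probability to odds, multiplies by an odds ratio, and converts back. The starting probability of 0.30 is an arbitrary illustration, not a number from the data.

```python
import numpy as np

def apply_odds_ratio(prob, odds_ratio, times=1):
    """Multiply the odds implied by `prob` by `odds_ratio`, `times` times, and convert back."""
    odds = prob / (1 - prob)
    new_odds = odds * odds_ratio ** times
    return new_odds / (1 + new_odds)

# A coefficient of log(1.18) corresponds to an odds ratio of 1.18 (the "18%" above)
coef = np.log(1.18)
ratio = np.exp(coef)

# Hypothetical starting probability of 0.30
one_prior = apply_odds_ratio(0.30, ratio)        # effect of one additional prior
five_priors = apply_odds_ratio(0.30, ratio, 5)   # effect of five additional priors
print(one_prior, five_priors)
```

An 18% increase in odds moves a probability of 0.30 by less than 18% in relative terms, and the effect of several priors compounds multiplicatively on the odds scale rather than adding up on the probability scale.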
Now let's put our model through the same tests as we used on the COMPAS score to see how well this predictor does.
# Crosstab for our predictive model
y_pred = lr.predict(x)
guessed=pd.Series(y_pred)==1
actual=cv.two_year_recid==1
cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
cm
print_ppv_fpv(cm)
Once again, we can compare between White and Black.
print('White')
subset = cv.race == 'Caucasian'
print_metrics(guessed[subset], actual[subset])
print('Black')
subset = cv.race == 'African-American'
print_metrics(guessed[subset], actual[subset])
4. The limits of prediction#
Both COMPAS and our logistic regression classifier only get about 65% accuracy overall. Would it be possible to do better with a more sophisticated classifier or feature encoding? We can take a look at two variables at a time to try to see what's happening here.
# Scatterplot of age vs. priors, colored by two_year_recid
colors = cv.two_year_recid.apply(lambda x: 'red' if x else 'blue')
plt.scatter(cv.age, cv.priors_count, c=colors, alpha=0.05)
# add a little random noise to the values in the array
def jitter(arr):
    # pick a standard deviation for the jitter of 2% of the data range
    stdev = .02*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev
# Scatterplot of age vs. sex, colored by two_year_recid
plt.scatter(cv.age, jitter(features.sex_Male), c=colors, alpha=0.05)
There is no way to draw a line (even a curved line) that cleanly separates the red (recidivated) and blue (did not recidivate) dots. We can do a little better by looking at more than two axes at a time, and might be able to imagine fitting a curved plane, but it's still not possible to separate red and blue enough to give us a very accurate classifier.
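To get a feel for how overlap alone caps accuracy, here is a small simulation on synthetic data. Everything in it is an assumption made for illustration: a single feature, two equally likely classes, each normally distributed with the same spread, and means one standard deviation apart. It is not a model of the COMPAS data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical one-dimensional feature: class 0 is centered at 0, class 1 at 1,
# both with standard deviation 1, so the two classes overlap heavily
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=y.astype(float), scale=1.0)

# Try every threshold on a grid and keep the most accurate one
thresholds = np.linspace(-3, 4, 141)
accuracies = [np.mean((x > t) == (y == 1)) for t in thresholds]
best = max(accuracies)
print(round(best, 3))
```

For this setup the theoretical optimum is about 69% (the normal CDF evaluated at 0.5), and the best threshold found by the search lands close to that. No cleverer classifier can beat it using this one feature, because the limit comes from the overlap in the data rather than from the choice of model.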