Logistic regression of jury rejections using statsmodels' formula method#
In this notebook we'll be looking for evidence of racial bias in the jury selection process. To this end we'll be working with the statsmodels package, and specifically its R-formula-like smf.logit method.
Import a lot#
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)
%matplotlib inline
Read in the data#
We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state column and converted true and false values into ones and zeroes.
df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
Add additional features#
While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document. For simplicity's sake, we're only calculating the ones that appear in the final regression.
df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df.head(2)
Since they're all trues and falses, we'll need to take a second to convert them to ones and zeroes so that our regression will work.
df = df.replace({
True: 1,
False: 0
})
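As an aside, if you only want to convert specific boolean columns instead of replacing every True and False in the entire dataframe, a cast with astype works too. A small sketch with made-up sample data standing in for the three feature columns above:

```python
import pandas as pd

# Sample data standing in for the boolean feature columns created above
df = pd.DataFrame({
    'is_black': [True, False],
    'race_unknown': [False, True],
    'same_race': [True, True],
})

# Cast just these columns from booleans to integers
cols = ['is_black', 'race_unknown', 'same_race']
df[cols] = df[cols].astype(int)
print(df.is_black.tolist())  # → [1, 0]
```

The replace-the-whole-dataframe version above is fine too, it just touches every column instead of only the ones you name.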
Performing our regression#
We're going to perform the simple regression from the end of their methodology. Not too many columns at all!
model = smf.logit(formula="""
struck_by_state ~
same_race + accused + fam_accused + fam_law_enforcement
+ know_def + death_hesitation
""", data=df)
results = model.fit()
results.summary()
The irritating thing about this, though, is we had to make new columns. Making columns is a pain, in that it takes time and effort and there's always the potential to screw things up.
An alternative technique#
When you're putting together your formula, you can actually do more than just add together columns! You can make the comparisons that say, "is this person's race black?" or "are they the same race as the defendant?"
model = smf.logit(formula="""
struck_by_state ~
(df.race == 'Black')
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
So exciting!!! We didn't need to make any columns at all!
One downside of this method is that statsmodels can pick either True or False as what's shown in the coefficients list. The above [T.True] means the coefficient is for when they are black, but you could easily end up in a situation where it's [T.False], meaning "this is the coefficient for when they are not black."
If you need to force statsmodels to use one or the other, you just need to explain which one you want as the reference category. You do this by changing your comparison to look like this:
C(df.race == 'Black', Treatment(False))
model = smf.logit(formula="""
struck_by_state ~
C(df.race == 'Black', Treatment(False))
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
This looks the same as before, so not very exciting. While it doesn't make much sense, we can change the reference category to be True, so our result will show us what happens when race is not black.
model = smf.logit(formula="""
struck_by_state ~
C(df.race == 'Black', Treatment(True))
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
# Calculate the odds ratio without making a big dataframe...
np.exp(-1.8972)
Which means non-black jurors have a 0.15x chance of getting rejected. Not as pleasant, is it? Just pay attention to your reference categories.
Another alternative#
Up above we're only checking to see if they're black or not. But what if there were multiple races, and we wanted to look at each one of them individually?
- If you know what I'm talking about: you could do a lot of fancy one-hot encoding and blah blah blah pandas/sklearn magic.
- If you don't know what I'm talking about: that sounds overly complex, doesn't it?
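For contrast, the manual one-hot route might look something like this (a sketch using pandas' get_dummies, with made-up sample data):

```python
import pandas as pd

# Sample data standing in for the race column
df = pd.DataFrame({'race': ['Black', 'White', 'Unknown', 'Black']})

# One-hot encode race, then drop the 'White' column so it
# acts as the reference category
dummies = pd.get_dummies(df.race, prefix='race', dtype=int)
manual = dummies.drop(columns='race_White')
print(manual.columns.tolist())  # → ['race_Black', 'race_Unknown']
```

Workable, but it's extra columns and extra bookkeeping. The formula approach below does the same thing in one line.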
Watch this.
model = smf.logit(formula="""
struck_by_state ~
C(df.race, Treatment('White'))
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
We changed the is_black variable into something slightly more complicated:
C(df.race, Treatment('White'))
This tells statsmodels to look at all of the options in the race column, and calculate all of the coefficients in relation to the White value. While before we just knew if someone is black or not, now we have more options!
- C(df.race, Treatment('White'))[T.Black] is when a juror is black
- C(df.race, Treatment('White'))[T.Unknown] is when a juror's race is unknown

Well, not a lot more options - the only values are "Black," "White," and "Unknown," but you get the idea. The Treatment('White') part lets you know that this is all in comparison to jurors listed as white.
Now if we take the coefficient for black jurors - 1.9027 - and turn it into an odds ratio - 6.7 - we need to remember this is all in reference to white jurors. When we did it before the comparison was "black vs non-black," but now our comparison is "black vs. white" and "unknown race vs white."
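That conversion is the same np.exp step as before:

```python
import numpy as np

# Coefficient for black jurors, relative to the white reference category
print(np.exp(1.9027))  # → roughly 6.7
```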
Taking advantage of this feature saves you a lot of time when you're trying to pick apart complicated categorical columns.
For example, we could look at the judges! In the methodology from APM Reports, they have a couple different columns:
- trial__judge_Loper: Judge for trial was Joseph Loper, reference category: Judge Morgan
- trial__judge_OTHER: The judge was neither Loper nor Morgan
We can do the same thing, but instead of creating multiple new columns we can just use C() and Treatment().
# Find the actual names of the judges
df.judge.value_counts()
# Run the regression
model = smf.logit(formula="""
struck_by_state ~
C(df.race, Treatment('White'))
+ C(df.judge, Treatment('C. Morgan, III'))
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
And there you go! Formulas make it all so easy.
Note: Remember that the coefficient isn't the odds ratio! We need to do an extra step to get that.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'pvalue': results.pvalues,
'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
Review#
We looked at the way statsmodels formulas work, allowing you to make comparisons and automatically split categories into separate features. Categories get assigned a reference, which is what your odds ratio will be compared with.
For example:
formula | meaning |
---|---|
C(df.race, Treatment('White')) | Comparing black vs. white |
df.race == 'Black' | Comparing black vs. non-black |
is_black | Same as above, just more typing to make the column! |
Discussion topics#
- What are the pluses and minuses of using C() compared to building new columns?
- How do you pick the reference category?
- The p-value for unknown race is uselessly high compared to the p-value for black jurors. What do you think you should do about it, if anything?
- Are you heartbroken that you learned some tricks from the p-value filtering notebook, but if you end up using these techniques those tricks totally won't work? Because I am.