Logistic regression of jury rejections using statsmodels' formula method#

In this notebook we'll be looking for evidence of racial bias in the jury selection process. To do that we'll be working with the statsmodels package, and specifically smf.logit, which lets us describe our regression with an R-style formula.

Import a lot#

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

%matplotlib inline

Read in the data#

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state column and converted true and false values into ones and zeroes.

df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
id_x juror_id juror_id__trial__id no_responses married children religious education leans_state leans_defense leans_ambi moral_hardship job_hardship caretaker communication medical employed social prior_jury crime_victim fam_crime_victim accused fam_accused eyewitness fam_eyewitness military law_enforcement fam_law_enforcement premature_verdict premature_guilt premature_innocence def_race vic_race def_gender vic_gender def_social vic_social def_age vic_age def_sexpref vic_sexpref def_incarcerated vic_incarcerated beliefs other_biases innocence take_stand arrest_is_guilt cant_decide cant_affirm cant_decide_evidence cant_follow know_def know_vic know_wit know_attny civil_plantiff civil_def civil_witness witness_defense witness_state prior_info death_hesitation no_death no_life no_cops yes_cops legally_disqualified witness_ambi notes id_y trial trial__id race gender race_source gender_source struck_by strike_eligibility id defendant_name cause_number state_strikes defense_strikes county defendant_race second_defendant_race third_defendant_race fourth_defendant_race more_than_four_defendants judge prosecutor_1 prosecutor_2 prosecutor_3 prosecutors_more_than_three def_attny_1 def_attny_2 def_attny_3 def_attnys_more_than_three offense_code_1 offense_title_1 offense_code_2 offense_title_2 offense_code_3 offense_title_3 offense_code_4 offense_title_4 offense_code_5 offense_title_5 offense_code_6 offense_title_6 more_than_six verdict case_appealed batson_claim_by_defense batson_claim_by_state voir_dire_present struck_by_state
0 1521 107.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN 107 2004-0257--Sparky Watson 3 White Male Jury strike sheet Jury strike sheet Struck by the defense Both State and Defense 3 Sparky Watson 2004-0257 1 1 Grenada Black NaN NaN nan 0 C. Morgan, III Susan Denley Ryan Berry NaN 0 M. Kevin Horan Elizabeth Davis NaN 0 41-29-139(a)(1)(b)(3) sale of marihuana (less than 30 grams) 41-29-139(a)(1)(b)(1) sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 0
1 1524 108.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN 108 2004-0257--Sparky Watson 3 Black Female Jury strike sheet Jury strike sheet Struck by the state State 3 Sparky Watson 2004-0257 1 1 Grenada Black NaN NaN nan 0 C. Morgan, III Susan Denley Ryan Berry NaN 0 M. Kevin Horan Elizabeth Davis NaN 0 41-29-139(a)(1)(b)(3) sale of marihuana (less than 30 grams) 41-29-139(a)(1)(b)(1) sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 1

Add additional features#

While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document. For simplicity's sake, we're only calculating the ones that appear in the final regression.

df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df.head(2)
id_x juror_id juror_id__trial__id no_responses married children religious education leans_state leans_defense leans_ambi moral_hardship job_hardship caretaker communication medical employed social prior_jury crime_victim fam_crime_victim accused fam_accused eyewitness fam_eyewitness military law_enforcement fam_law_enforcement premature_verdict premature_guilt premature_innocence def_race vic_race def_gender vic_gender def_social vic_social def_age vic_age def_sexpref vic_sexpref def_incarcerated vic_incarcerated beliefs other_biases innocence take_stand arrest_is_guilt cant_decide cant_affirm cant_decide_evidence cant_follow know_def know_vic know_wit know_attny civil_plantiff civil_def civil_witness witness_defense witness_state prior_info death_hesitation no_death no_life no_cops yes_cops legally_disqualified witness_ambi notes id_y trial trial__id race gender race_source gender_source struck_by strike_eligibility id defendant_name cause_number state_strikes defense_strikes county defendant_race second_defendant_race third_defendant_race fourth_defendant_race more_than_four_defendants judge prosecutor_1 prosecutor_2 prosecutor_3 prosecutors_more_than_three def_attny_1 def_attny_2 def_attny_3 def_attnys_more_than_three offense_code_1 offense_title_1 offense_code_2 offense_title_2 offense_code_3 offense_title_3 offense_code_4 offense_title_4 offense_code_5 offense_title_5 offense_code_6 offense_title_6 more_than_six verdict case_appealed batson_claim_by_defense batson_claim_by_state voir_dire_present struck_by_state is_black race_unknown same_race juror_id__gender_m juror_id__gender_unknown trial__defendant_race_asian trial__defendant_race_black trial__defendant_race_unknown trial__judge_Loper trial__judge_OTHER
0 1521 107.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN 107 2004-0257--Sparky Watson 3 White Male Jury strike sheet Jury strike sheet Struck by the defense Both State and Defense 3 Sparky Watson 2004-0257 1 1 Grenada Black NaN NaN nan 0 C. Morgan, III Susan Denley Ryan Berry NaN 0 M. Kevin Horan Elizabeth Davis NaN 0 41-29-139(a)(1)(b)(3) sale of marihuana (less than 30 grams) 41-29-139(a)(1)(b)(1) sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 0 False False False 1 0 0 1 0 0 0
1 1524 108.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN 108 2004-0257--Sparky Watson 3 Black Female Jury strike sheet Jury strike sheet Struck by the state State 3 Sparky Watson 2004-0257 1 1 Grenada Black NaN NaN nan 0 C. Morgan, III Susan Denley Ryan Berry NaN 0 M. Kevin Horan Elizabeth Davis NaN 0 41-29-139(a)(1)(b)(3) sale of marihuana (less than 30 grams) 41-29-139(a)(1)(b)(1) sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 1 True False True 0 0 0 1 0 0 0

Since these new columns are all trues and falses, we'll take a second to convert them to ones and zeroes so that our regression will work.

df = df.replace({
    True: 1,
    False: 0
})
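
If you'd rather not sweep the entire dataframe with replace, a narrower version of the same step is to convert only the columns we just created. This is just a sketch of an equivalent approach (new_cols is a throwaway name, not something from the APM Reports methodology):

# Convert just the new True/False columns into ones and zeroes, leaving everything else alone
new_cols = ['is_black', 'race_unknown', 'same_race']
df[new_cols] = df[new_cols].astype(int)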

Performing our regression#

We're going to perform the simple regression from the end of the APM Reports methodology. Not too many columns at all!

model = smf.logit(formula="""
    struck_by_state ~ 
        same_race + accused + fam_accused + fam_law_enforcement
        + know_def + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.453673
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2288
Method: MLE Df Model: 6
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.1927
Time: 15:19:04 Log-Likelihood: -1041.2
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.524e-104
coef std err z P>|z| [0.025 0.975]
Intercept -2.0663 0.090 -23.032 0.000 -2.242 -1.890
same_race 1.3847 0.111 12.490 0.000 1.167 1.602
accused 2.7632 0.522 5.298 0.000 1.741 3.785
fam_accused 1.7841 0.150 11.866 0.000 1.489 2.079
fam_law_enforcement -0.6989 0.156 -4.494 0.000 -1.004 -0.394
know_def 1.3989 0.207 6.766 0.000 0.994 1.804
death_hesitation 1.8131 0.550 3.295 0.001 0.734 2.892

The irritating thing about this, though, is that we had to make new columns. Making columns is a pain: it takes time and effort, and there's always the potential to screw something up.

An alternative technique#

When you're putting together your formula, you can do more than just add columns together! You can write comparisons directly into the formula - things like "is this person's race black?" or "are they the same race as the defendant?"

model = smf.logit(formula="""
    struck_by_state ~ 
        (df.race == 'Black')
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2287
Method: MLE Df Model: 7
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2682
Time: 15:30:38 Log-Likelihood: -943.78
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.815e-145
coef std err z P>|z| [0.025 0.975]
Intercept -2.4307 0.101 -24.017 0.000 -2.629 -2.232
df.race == 'Black'[T.True] 1.8972 0.141 13.443 0.000 1.621 2.174
df.defendant_race == df.race[T.True] 0.3603 0.140 2.575 0.010 0.086 0.635
accused 2.5128 0.545 4.606 0.000 1.444 3.582
fam_accused 1.8476 0.162 11.402 0.000 1.530 2.165
fam_law_enforcement -0.5627 0.162 -3.468 0.001 -0.881 -0.245
know_def 1.3257 0.223 5.937 0.000 0.888 1.763
death_hesitation 1.8243 0.592 3.084 0.002 0.665 2.984

So exciting!!! We didn't need to make any columns at all!

One downside of this method is that statsmodels can pick either True or False as the level shown in the coefficients list. The [T.True] above means the coefficient is for when the juror is black, but you could just as easily end up with [T.False], meaning "this is the coefficient for when they are not black."

If you need to force statsmodels to use one or the other, you just need to tell it which one you want as the reference category. You do this by rewriting your comparison to look like this:

C(df.race == 'Black', Treatment(False))
model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race == 'Black', Treatment(False))
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2287
Method: MLE Df Model: 7
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2682
Time: 15:35:15 Log-Likelihood: -943.78
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.815e-145
coef std err z P>|z| [0.025 0.975]
Intercept -2.4307 0.101 -24.017 0.000 -2.629 -2.232
C(df.race == 'Black', Treatment(False))[T.True] 1.8972 0.141 13.443 0.000 1.621 2.174
df.defendant_race == df.race[T.True] 0.3603 0.140 2.575 0.010 0.086 0.635
accused 2.5128 0.545 4.606 0.000 1.444 3.582
fam_accused 1.8476 0.162 11.402 0.000 1.530 2.165
fam_law_enforcement -0.5627 0.162 -3.468 0.001 -0.881 -0.245
know_def 1.3257 0.223 5.937 0.000 0.888 1.763
death_hesitation 1.8243 0.592 3.084 0.002 0.665 2.984

This looks the same as before, so not very exciting. While it doesn't make much practical sense, we can change the reference category to True, so our result will show us what happens when a juror is not black.

model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race == 'Black', Treatment(True))
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2287
Method: MLE Df Model: 7
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2682
Time: 15:35:59 Log-Likelihood: -943.78
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.815e-145
coef std err z P>|z| [0.025 0.975]
Intercept -0.5335 0.137 -3.897 0.000 -0.802 -0.265
C(df.race == 'Black', Treatment(True))[T.False] -1.8972 0.141 -13.443 0.000 -2.174 -1.621
df.defendant_race == df.race[T.True] 0.3603 0.140 2.575 0.010 0.086 0.635
accused 2.5128 0.545 4.606 0.000 1.444 3.582
fam_accused 1.8476 0.162 11.402 0.000 1.530 2.165
fam_law_enforcement -0.5627 0.162 -3.468 0.001 -0.881 -0.245
know_def 1.3257 0.223 5.937 0.000 0.888 1.763
death_hesitation 1.8243 0.592 3.084 0.002 0.665 2.984
# Calculate the odds ratio without making a big dataframe... 
np.exp(-1.8972)
0.14998799821305078

Which means non-black jurors have about 0.15 times the odds of being rejected, compared to black jurors. Not as pleasant to read, is it? Just pay attention to your reference categories.
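
If the flipped sign feels confusing, remember that swapping the reference category just inverts the odds ratio, since exp(-x) is 1/exp(x). A quick sanity check using the coefficient from the model above:

# Flipping the reference category flips the coefficient's sign, which inverts the odds ratio
np.exp(1.8972)      # about 6.7: odds ratio for black vs. non-black jurors
1 / np.exp(1.8972)  # about 0.15: the same comparison, seen from the non-black side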

Another alternative#

Up above we're only checking to see if they're black or not. But what if there were multiple races, and we wanted to look at each one of them individually?

  • If you know what I'm talking about: you could do a lot of fancy one-hot encoding and blah blah blah pandas/sklearn magic (sketched just below).
  • If you don't know what I'm talking about: that sounds overly complex, doesn't it?
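
For the record, that fancy route looks roughly like this - a quick pandas sketch, not what APM Reports actually ran, and df_manual is just a scratch copy so we don't clutter up df:

# One-hot encode race by hand: one column per race, dropping 'White' so it acts as the reference category
race_dummies = pd.get_dummies(df.race, prefix='race').drop(columns='race_White')
df_manual = df.join(race_dummies)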

Watch this.

model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race, Treatment('White'))
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411066
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2286
Method: MLE Df Model: 8
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2685
Time: 15:39:37 Log-Likelihood: -943.40
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 2.698e-144
coef std err z P>|z| [0.025 0.975]
Intercept -2.4406 0.102 -23.917 0.000 -2.641 -2.241
C(df.race, Treatment('White'))[T.Black] 1.9027 0.141 13.452 0.000 1.625 2.180
C(df.race, Treatment('White'))[T.Unknown] 0.7358 0.775 0.949 0.343 -0.784 2.256
df.defendant_race == df.race[T.True] 0.3642 0.140 2.599 0.009 0.090 0.639
accused 2.5173 0.546 4.611 0.000 1.447 3.587
fam_accused 1.8528 0.162 11.415 0.000 1.535 2.171
fam_law_enforcement -0.5590 0.162 -3.441 0.001 -0.877 -0.241
know_def 1.3282 0.224 5.942 0.000 0.890 1.766
death_hesitation 1.8283 0.592 3.088 0.002 0.668 2.989

We changed the is_black variable into something slightly more complicated:

C(df.race, Treatment('White'))

This tells statsmodels to look at all of the options in the race column, and calculate all of the coefficients in relation to the White value. While before we just knew whether someone was black or not, now we have more options!

  • C(df.race, Treatment('White'))[T.Black] is when a juror is black
  • C(df.race, Treatment('White'))[T.Unknown] is when a juror's race is unknown

Well, not a lot more - the only options are "Black," "White," and "Unknown" - but you get the idea. The Treatment('White') part tells you that this is all in comparison to jurors listed as white.

Now if we take the coefficient for black jurors - 1.9027 - and turn it into an odds ratio - 6.7 - we need to remember this is all in reference to white jurors. When we did it before the comparison was "black vs non-black," but now our comparison is "black vs. white" and "unknown race vs white."
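
It's the same np.exp trick as before to turn that coefficient into the odds ratio quoted above:

# Coefficient for black jurors, relative to the white reference category
np.exp(1.9027)  # about 6.7: black jurors have roughly 6.7 times the odds of being struck, compared to white jurors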

Taking advantage of this feature saves you a lot of time when you're trying to pick apart complicated categorical columns.

For example, we could look at the judges! In the methodology from APM Reports, they have a couple different columns:

  • trial__judge_Loper: Judge for trial was Joseph Loper, reference category: Judge Morgan
  • trial__judge_OTHER: The judge was neither Loper nor Morgan

We can do the same thing, but instead of creating multiple new columns we can just use C() and Treatment().

# Find the actual names of the judges
df.judge.value_counts()
Joseph Loper, Jr    1282
C. Morgan, III       966
Other                 47
Name: judge, dtype: int64
# Run the regression
model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race, Treatment('White'))
        + C(df.judge, Treatment('C. Morgan, III'))
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411939
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2285
Method: MLE Df Model: 9
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2670
Time: 15:47:05 Log-Likelihood: -945.40
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 1.885e-142
coef std err z P>|z| [0.025 0.975]
Intercept -2.4853 0.124 -19.995 0.000 -2.729 -2.242
C(df.race, Treatment('White'))[T.Black] 2.1134 0.119 17.788 0.000 1.881 2.346
C(df.race, Treatment('White'))[T.Unknown] 0.6073 0.776 0.782 0.434 -0.914 2.129
C(df.judge, Treatment('C. Morgan, III'))[T.Joseph Loper, Jr] 0.1899 0.120 1.586 0.113 -0.045 0.425
C(df.judge, Treatment('C. Morgan, III'))[T.Other] -0.0420 0.452 -0.093 0.926 -0.927 0.843
accused 2.4955 0.543 4.599 0.000 1.432 3.559
fam_accused 1.8845 0.162 11.615 0.000 1.566 2.202
fam_law_enforcement -0.5639 0.162 -3.488 0.000 -0.881 -0.247
know_def 1.4005 0.221 6.342 0.000 0.968 1.833
death_hesitation 1.9159 0.586 3.268 0.001 0.767 3.065

And there you go! Formulas make it all so easy.

Note: Remember that the coefficient isn't the odds ratio! We need to do an extra step to get that.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
coef odds ratio pvalue column
accused 2.49549 12.12764 0.00000 accused
C(df.race, Treatment('White'))[T.Black] 2.11344 8.27668 0.00000 C(df.race, Treatment('White'))[T.Black]
death_hesitation 1.91590 6.79303 0.00108 death_hesitation
fam_accused 1.88448 6.58293 0.00000 fam_accused
know_def 1.40049 4.05718 0.00000 know_def
C(df.race, Treatment('White'))[T.Unknown] 0.60727 1.83542 0.43411 C(df.race, Treatment('White'))[T.Unknown]
C(df.judge, Treatment('C. Morgan, III'))[T.Joseph Loper, Jr] 0.18986 1.20908 0.11278 C(df.judge, Treatment('C. Morgan, III'))[T.Jos...
C(df.judge, Treatment('C. Morgan, III'))[T.Other] -0.04205 0.95882 0.92583 C(df.judge, Treatment('C. Morgan, III'))[T.Other]
fam_law_enforcement -0.56392 0.56897 0.00049 fam_law_enforcement
Intercept -2.48534 0.08330 0.00000 Intercept

Review#

We looked at the way statsmodels formulas work, allowing you to write comparisons inline and automatically split categorical columns into separate features. Each categorical variable gets a reference category, which is what your odds ratios are compared against.

For example:

  • C(df.race, Treatment('White')): compares each race (here Black and Unknown) against white jurors
  • df.race == 'Black': compares black vs. non-black jurors
  • is_black: same as above, just more typing to make the column!
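
Here's the same recap in code, as single-variable sketches (the real regressions above include the other predictors too):

smf.logit("struck_by_state ~ is_black", data=df)                        # pre-built column: black vs. non-black
smf.logit("struck_by_state ~ (df.race == 'Black')", data=df)            # inline comparison: black vs. non-black
smf.logit("struck_by_state ~ C(df.race, Treatment('White'))", data=df)  # each race compared to white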

Discussion topics#

  • What are the pluses and minuses of using C() compared to building new columns?
  • How do you pick the reference category?
  • The p-value for unknown race is uselessly high compared to the p-value for black jurors. What do you think you should do about it, if anything?
  • Are you heartbroken that you learned some tricks from the p-value filtering notebook, but if you end up using these techniques those tricks totally won't work? Because I am.