Using regression to find bias in the jury strike process#

When someone is being considered for a jury, what factors play a strong role in whether they're struck? We'll track down the answer using logistic regression.

Import a lot#

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

%matplotlib inline

Read in the data#

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state column and converted true and false values into ones and zeroes.

df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)

Add additional features#

While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document.

df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df['juror_id__gender_m'] = df.gender == 'Male'
df['juror_id__gender_unknown'] = df.gender == 'Unknown'
df['trial__defendant_race_asian'] = df.defendant_race == 'Asian'
df['trial__defendant_race_black'] = df.defendant_race == 'Black'
df['trial__defendant_race_unknown'] = df.defendant_race == 'Unknown'
df['trial__judge_Loper'] = df.judge == 'Joseph Loper, Jr'
df['trial__judge_OTHER'] = df.judge == 'Other'
df.head(2)
[output: the first two rows of the dataframe, over a hundred columns wide — the original juror responses and trial details, plus the new True/False feature columns (is_black, same_race, juror_id__gender_m, …, race_unknown) at the end]

Since they're all trues and falses, we'll need to take a second to convert them to ones and zeroes so that our regression will work.

df = df.replace({
    True: 1,
    False: 0
})
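If you'd rather be explicit about the conversion, the same thing can be done by casting the boolean columns with `astype(int)`. Here's a minimal sketch on a made-up toy frame (the column name is just an example):

```python
import pandas as pd

# Hypothetical toy frame standing in for our juror data
toy = pd.DataFrame({'is_black': [True, False, True]})

# Casting booleans to int turns True/False into 1/0,
# the same result as the replace() call above
toy['is_black'] = toy['is_black'].astype(int)
print(toy['is_black'].tolist())  # → [1, 0, 1]
```

The `replace()` approach has the advantage of sweeping over every column at once, while `astype(int)` has to be applied column by column.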

What columns are we interested in?#

Using whether the juror was struck by the state or not as the dependent variable and the juror's responses during voir dire as the input data, APM Reports built a logistic regression model to test the importance of the different variables on the likelihood of being struck. Their model used all the variables they tracked that had more than 5 event and non-event occurrences.

We'll start with making a list of all of the variables that were tracked.

potential_columns = [
    # First, the ones we made
    'is_black', 'race_unknown', 'same_race', 'juror_id__gender_m', 'juror_id__gender_unknown',
    'trial__defendant_race_asian', 'trial__defendant_race_black', 'trial__defendant_race_unknown',
    'trial__judge_Loper', 'trial__judge_OTHER',

    # Then, the ones from the dataset
    # We'll remove 'race' because we have is_black and race_unknown already
    'no_responses', 'leans_defense', 'leans_ambi', 'moral_hardship', 'job_hardship', 
    'caretaker', 'communication', 'medical', 'employed', 'social', 'prior_jury', 
    'crime_victim', 'fam_crime_victim', 'accused', 'fam_accused', 
    'eyewitness', 'fam_eyewitness', 'military', 'law_enforcement', 'fam_law_enforcement', 
    'premature_verdict', 'premature_guilt', 'premature_innocence', 'def_race', 'vic_race', 
    'def_gender', 'vic_gender', 'def_social', 'vic_social', 'def_age', 'vic_age', 
    'def_sexpref', 'vic_sexpref', 'def_incarcerated', 'vic_incarcerated', 'beliefs', 
    'other_biases', 'innocence', 'take_stand', 'arrest_is_guilt', 
    'cant_decide', 'cant_affirm', 'cant_decide_evidence', 'cant_follow', 'know_def', 
    'know_vic', 'know_wit', 'know_attny', 'civil_plantiff', 'civil_def', 'civil_witness', 
    'witness_defense', 'witness_state', 'prior_info', 'death_hesitation', 'no_death', 
    'no_life', 'no_cops', 'yes_cops', 'legally_disqualified', 'witness_ambi',  
]

Remove anything without 5 events and non-events#

From the methodology:

Our logistic regression model used all the variables we tracked that had more than 5 event and non-event occurrences

What does this mean? Think about it like this: if everyone said they were in the military, military wouldn't be a very useful column. Or if every potential juror who said they were in the military was never accepted? Also useless.

What we're looking for is a good mix, where sometimes they were accepted and sometimes they were rejected, and where sometimes they answered yes and sometimes they answered no.

We'll start by seeing how we can count how many fall in each category, and when we'd accept or reject them.

For example, whether someone is black or not is a large mix of outcomes.

counted = df.groupby(['struck_by_state', 'is_black']).size().unstack(fill_value=0)
counted
is_black 0 1
struck_by_state
0 1377 345
1 177 396

On the other hand, only 5 people ever said they were in the military, and they were all accepted. Not very useful!

counted = df.groupby(['struck_by_state', 'military']).size().unstack(fill_value=0)
counted
military 0 1
struck_by_state
0 1717 5
1 573 0

No one said they can't follow instructions, so we won't want to use this feature.

counted = df.groupby(['struck_by_state', 'cant_follow']).size().unstack(fill_value=0)
counted
cant_follow 0
struck_by_state
0 1722
1 573

We'll need two techniques to filter these out. First, we can use this to see whether any of the cells are less than five.

(counted < 5).any(axis=None)
False

But remember how we sometimes only have one column? To remove those, we need to check and see if we have a full 2x2 square.

counted.count().sum()
2
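To see why that sum comes out to 2 instead of 4, here's the same check run on a hypothetical toy frame where everyone answered "no":

```python
import pandas as pd

# Hypothetical toy data: nobody answered yes to 'military',
# so the crosstab collapses to a single column
toy = pd.DataFrame({
    'struck_by_state': [0, 0, 1, 1],
    'military':        [0, 0, 0, 0],
})
counted = toy.groupby(['struck_by_state', 'military']).size().unstack(fill_value=0)

# count() tallies the non-null cells in each column;
# a full 2x2 square sums to 4, a single column only to 2
print(counted.count().sum())  # → 2
```

So checking `counted.count().sum() < 4` catches any column where one of the answers never appears at all.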

Filtering columns without 5 events and non-events#

Now that we have our techniques, let's filter!

useable_cols = []
for col in potential_columns:
    counted = df.groupby(['struck_by_state', col]).size().unstack(fill_value=0)
    if counted.count().sum() < 4 or (counted < 5).any(axis=None):
        # print("Skipping", col)
        pass
    else:
        useable_cols.append(col)
useable_cols
['is_black',
 'same_race',
 'juror_id__gender_m',
 'juror_id__gender_unknown',
 'trial__defendant_race_asian',
 'trial__defendant_race_black',
 'trial__defendant_race_unknown',
 'trial__judge_Loper',
 'trial__judge_OTHER',
 'no_responses',
 'leans_ambi',
 'prior_jury',
 'crime_victim',
 'fam_crime_victim',
 'accused',
 'fam_accused',
 'law_enforcement',
 'fam_law_enforcement',
 'know_def',
 'know_vic',
 'know_wit',
 'know_attny',
 'prior_info',
 'death_hesitation']

Perform the regression#

We'll start by importing the statsmodels package for doing formula-based regression

import statsmodels.formula.api as smf

APM Reports first ran every variable through a logistic regression model. They then removed all variables with a p-value > 0.1. Finally, they selected all factors with a p-value < 0.05 and ran the model a third time.

We're going to use all of our useable_cols to perform this regression. There's another notebook where we filter based on p-values, I recommend taking a look at it! The method we use here is readable, but kind of a pain.

# I want to cut and paste for my formula
print(" + ".join(useable_cols))
is_black + same_race + juror_id__gender_m + juror_id__gender_unknown + trial__defendant_race_asian + trial__defendant_race_black + trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER + no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim + accused + fam_accused + law_enforcement + fam_law_enforcement + know_def + know_vic + know_wit + know_attny + prior_info + death_hesitation
model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + juror_id__gender_m + juror_id__gender_unknown
        + trial__defendant_race_asian + trial__defendant_race_black
        + trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER
        + no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim
        + accused + fam_accused + law_enforcement + fam_law_enforcement + know_def
        + know_vic + know_wit + know_attny + prior_info + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.405530
         Iterations 7
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2270
Method: MLE Df Model: 24
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2784
Time: 15:08:46 Log-Likelihood: -930.69
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.878e-136
coef std err z P>|z| [0.025 0.975]
Intercept -2.3416 0.223 -10.489 0.000 -2.779 -1.904
is_black 1.9325 0.143 13.506 0.000 1.652 2.213
same_race 0.4585 0.142 3.228 0.001 0.180 0.737
juror_id__gender_m 0.0488 0.123 0.397 0.691 -0.192 0.290
juror_id__gender_unknown -0.0303 0.376 -0.081 0.936 -0.768 0.707
trial__defendant_race_asian 0.7465 0.546 1.368 0.171 -0.323 1.816
trial__defendant_race_black -0.1635 0.151 -1.079 0.280 -0.460 0.133
trial__defendant_race_unknown 0.5651 0.410 1.378 0.168 -0.239 1.369
trial__judge_Loper 0.1796 0.134 1.337 0.181 -0.084 0.443
trial__judge_OTHER 0.0056 0.466 0.012 0.990 -0.907 0.918
no_responses -0.2995 0.164 -1.822 0.068 -0.622 0.023
leans_ambi 0.3274 0.666 0.492 0.623 -0.977 1.632
prior_jury -0.2290 0.210 -1.089 0.276 -0.641 0.183
crime_victim -0.0287 0.315 -0.091 0.928 -0.647 0.589
fam_crime_victim 0.5037 0.281 1.792 0.073 -0.047 1.055
accused 2.4623 0.548 4.492 0.000 1.388 3.537
fam_accused 1.7964 0.175 10.275 0.000 1.454 2.139
law_enforcement -0.9703 0.503 -1.929 0.054 -1.957 0.016
fam_law_enforcement -0.6832 0.173 -3.957 0.000 -1.022 -0.345
know_def 1.3204 0.239 5.536 0.000 0.853 1.788
know_vic 0.2446 0.239 1.022 0.307 -0.224 0.714
know_wit -0.3940 0.236 -1.666 0.096 -0.857 0.069
know_attny 0.3438 0.237 1.451 0.147 -0.120 0.808
prior_info -0.2074 0.200 -1.039 0.299 -0.599 0.184
death_hesitation 1.8562 0.598 3.103 0.002 0.684 3.029

APM Reports first ran every variable through a logistic regression model. They then removed all variables with a p-value > 0.1. Finally, they selected all factors with a p-value < 0.05 and ran the model a third time.

Going through the p-value list above, we'll remove any features that are at or above the 0.1 p-value threshold (that's the P>|z| column). If you'd like more details on the how or why of this, check out the notebook on feature selection by p-value.
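Instead of reading the P>|z| column by hand, the same filtering could be sketched programmatically from `results.pvalues`. The values below are made-up stand-ins, just to show the pattern:

```python
import pandas as pd

# Made-up stand-in for results.pvalues from a fitted statsmodels model
pvalues = pd.Series({
    'Intercept': 0.000,
    'is_black': 0.000,
    'juror_id__gender_m': 0.691,
    'know_wit': 0.096,
})

# Keep everything below the 0.1 threshold, minus the intercept
keep = pvalues[pvalues < 0.1].drop('Intercept').index.tolist()
print(keep)  # → ['is_black', 'know_wit']
```

The list `keep` could then be joined with `" + ".join(keep)` to build the next formula, which is roughly what the feature-selection-by-p-value notebook automates.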

model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + no_responses + fam_crime_victim + accused
        + fam_accused + law_enforcement + fam_law_enforcement + know_def
        + know_wit + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.408840
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2283
Method: MLE Df Model: 11
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2725
Time: 15:10:35 Log-Likelihood: -938.29
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 1.293e-143
coef std err z P>|z| [0.025 0.975]
Intercept -2.3054 0.126 -18.238 0.000 -2.553 -2.058
is_black 1.9239 0.143 13.440 0.000 1.643 2.204
same_race 0.3776 0.140 2.691 0.007 0.103 0.653
no_responses -0.2466 0.144 -1.713 0.087 -0.529 0.036
fam_crime_victim 0.4834 0.277 1.743 0.081 -0.060 1.027
accused 2.4520 0.545 4.503 0.000 1.385 3.519
fam_accused 1.7888 0.171 10.485 0.000 1.454 2.123
law_enforcement -0.8932 0.499 -1.791 0.073 -1.871 0.084
fam_law_enforcement -0.6728 0.171 -3.935 0.000 -1.008 -0.338
know_def 1.2936 0.236 5.485 0.000 0.831 1.756
know_wit -0.3339 0.232 -1.437 0.151 -0.789 0.121
death_hesitation 1.7635 0.595 2.961 0.003 0.596 2.931

According to the methodology we need to filter one more time: this time for features with a p-value under 0.05.

APM Reports first ran every variable through a logistic regression model. They then removed all variables with a p-value > 0.1. Finally, they selected all factors with a p-value < 0.05 and ran the model a third time.

model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2287
Method: MLE Df Model: 7
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.2682
Time: 15:12:11 Log-Likelihood: -943.78
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.815e-145
coef std err z P>|z| [0.025 0.975]
Intercept -2.4307 0.101 -24.017 0.000 -2.629 -2.232
is_black 1.8972 0.141 13.443 0.000 1.621 2.174
same_race 0.3603 0.140 2.575 0.010 0.086 0.635
accused 2.5128 0.545 4.606 0.000 1.444 3.582
fam_accused 1.8476 0.162 11.402 0.000 1.530 2.165
fam_law_enforcement -0.5627 0.162 -3.468 0.001 -0.881 -0.245
know_def 1.3257 0.223 5.937 0.000 0.888 1.763
death_hesitation 1.8243 0.592 3.084 0.002 0.665 2.984

There we go! Now that we have a nice, noise-free set of results, we can build a dataframe that shows us the odds ratios.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
coef odds ratio pvalue column
accused 2.51278 12.33918 0.00000 accused
is_black 1.89716 6.66696 0.00000 is_black
fam_accused 1.84760 6.34456 0.00000 fam_accused
death_hesitation 1.82434 6.19873 0.00204 death_hesitation
know_def 1.32570 3.76481 0.00000 know_def
same_race 0.36026 1.43370 0.01004 same_race
fam_law_enforcement -0.56268 0.56968 0.00052 fam_law_enforcement
Intercept -2.43071 0.08797 0.00000 Intercept

And there we have it! When we take these seven statistically significant features into account, the odds of a black juror being struck by the state were over 6.5 times the odds for a non-black juror.
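One caveat worth keeping in mind: an odds ratio multiplies odds, not probabilities. As a rough sketch using the coefficients from the final summary above, we can convert log-odds into probabilities with the logistic function:

```python
import numpy as np

# Coefficients from the final model summary above
intercept = -2.4307   # baseline log-odds
is_black = 1.8972     # additional log-odds for a black juror

def prob(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1 / (1 + np.exp(-log_odds))

baseline = prob(intercept)            # all other features at zero
black = prob(intercept + is_black)    # same juror, but black
print(round(baseline, 2), round(black, 2))  # → 0.08 0.37
```

Under this sketch, a baseline juror has roughly an 8% chance of being struck while an otherwise-identical black juror is closer to 37% — so the odds ratio of ~6.7 shouldn't be read directly as "6.7 times the probability."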

Variations on our results#

race vs same_race#

We used the same_race variable to code jurors that were the same race as any of the defendants. In building the logistic regression model, we included and excluded certain variables to see how that impacted the model. When we left out the race of the juror from the model, same_race had a much higher odds ratio (odds ratio = 4.5). But the model with the race of the juror added back in lowers the same_race odds ratio to 1.4.

model = smf.logit(formula="""
    struck_by_state ~ 
        same_race + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.453673
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2288
Method: MLE Df Model: 6
Date: Mon, 04 Nov 2019 Pseudo R-squ.: 0.1927
Time: 15:15:19 Log-Likelihood: -1041.2
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.524e-104
coef std err z P>|z| [0.025 0.975]
Intercept -2.0663 0.090 -23.032 0.000 -2.242 -1.890
same_race 1.3847 0.111 12.490 0.000 1.167 1.602
accused 2.7632 0.522 5.298 0.000 1.741 3.785
fam_accused 1.7841 0.150 11.866 0.000 1.489 2.079
fam_law_enforcement -0.6989 0.156 -4.494 0.000 -1.004 -0.394
know_def 1.3989 0.207 6.766 0.000 0.994 1.804
death_hesitation 1.8131 0.550 3.295 0.001 0.734 2.892