Using regression to find bias in the jury strike process#

When someone is being selected for a jury, what factors play a strong role? We'll track down the answer using logistic regression.

Read online Download notebook Interactive version

Import a lot#

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

%matplotlib inline

Read in the data#

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state column and converted true and false values into ones and zeroes.

df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-ca4654de4f5c> in <module>
----> 1 df = pd.read_csv("data/jury-cleaned.csv")
      2 df.head(2)

NameError: name 'pd' is not defined

Add additional features#

While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document.

df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df['juror_id__gender_m'] = df.gender == 'Male'
df['juror_id__gender_unknown'] = df.gender == 'Unknown'
df['trial__defendant_race_asian'] = df.defendant_race == 'Asian'
df['trial__defendant_race_black'] = df.defendant_race == 'Black'
df['trial__defendant_race_unknown'] = df.defendant_race == 'Unknown'
df['trial__judge_Loper'] = df.judge == 'Joseph Loper, Jr'
df['trial__judge_OTHER'] = df.judge == 'Other'
df.head(2)

	id_x	juror_id	juror_id__trial__id	no_responses	married	children	religious	education	leans_state	leans_defense	leans_ambi	moral_hardship	job_hardship	caretaker	communication	medical	employed	social	prior_jury	crime_victim	fam_crime_victim	accused	fam_accused	eyewitness	fam_eyewitness	military	law_enforcement	fam_law_enforcement	premature_verdict	premature_guilt	premature_innocence	def_race	vic_race	def_gender	vic_gender	def_social	vic_social	def_age	vic_age	def_sexpref	vic_sexpref	def_incarcerated	vic_incarcerated	beliefs	other_biases	innocence	take_stand	arrest_is_guilt	cant_decide	cant_affirm	cant_decide_evidence	cant_follow	know_def	know_vic	know_wit	know_attny	civil_plantiff	civil_def	civil_witness	witness_defense	witness_state	prior_info	death_hesitation	no_death	no_life	no_cops	yes_cops	legally_disqualified	witness_ambi	notes	id_y	trial	trial__id	race	gender	race_source	gender_source	struck_by	strike_eligibility	id	defendant_name	cause_number	state_strikes	defense_strikes	county	defendant_race	second_defendant_race	third_defendant_race	fourth_defendant_race	more_than_four_defendants	judge	prosecutor_1	prosecutor_2	prosecutor_3	prosecutors_more_than_three	def_attny_1	def_attny_2	def_attny_3	def_attnys_more_than_three	offense_code_1	offense_title_1	offense_code_2	offense_title_2	offense_code_3	offense_title_3	offense_code_4	offense_title_4	offense_code_5	offense_title_5	offense_code_6	offense_title_6	more_than_six	verdict	case_appealed	batson_claim_by_defense	batson_claim_by_state	voir_dire_present	struck_by_state	is_black	same_race	juror_id__gender_m	juror_id__gender_unknown	trial__defendant_race_asian	trial__defendant_race_black	trial__defendant_race_unknown	trial__judge_Loper	trial__judge_OTHER	race_unknown
0	1521	107.00000	3.00000	0	unknown	unknown	unknown	unknown	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN	107	2004-0257--Sparky Watson	3	White	Male	Jury strike sheet	Jury strike sheet	Struck by the defense	Both State and Defense	3	Sparky Watson	2004-0257	1	1	Grenada	Black	NaN	NaN	nan	0	C. Morgan, III	Susan Denley	Ryan Berry	NaN	0	M. Kevin Horan	Elizabeth Davis	NaN	0	41-29-139(a)(1)(b)(3)	sale of marihuana (less than 30 grams)	41-29-139(a)(1)(b)(1)	sale of cocaine	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	Guilty on at least one offense	1	0	0	1	0	False	False	True	False	False	True	False	False	False	False
1	1524	108.00000	3.00000	0	unknown	unknown	unknown	unknown	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN	108	2004-0257--Sparky Watson	3	Black	Female	Jury strike sheet	Jury strike sheet	Struck by the state	State	3	Sparky Watson	2004-0257	1	1	Grenada	Black	NaN	NaN	nan	0	C. Morgan, III	Susan Denley	Ryan Berry	NaN	0	M. Kevin Horan	Elizabeth Davis	NaN	0	41-29-139(a)(1)(b)(3)	sale of marihuana (less than 30 grams)	41-29-139(a)(1)(b)(1)	sale of cocaine	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	Guilty on at least one offense	1	0	0	1	1	True	True	False	False	False	True	False	False	False	False

Since they're all trues and falses, we'll need to take a second to convert them to ones and zeroes so that our regression will work.

df = df.replace({
    True: 1,
    False: 0
})

What columns are we interested in?#

Using whether the juror was struck by the state or not as the dependent variable and the juror’s responses during voir dire as the input data, APM Reports built a logistic regression model to test the importance of the different variables on the likelihood of being struck. Our logistic regression model used all the variables we tracked that had more than 5 event and non-event occurrences.

We'll start with making a list of all of the variables that were tracked.

potential_columns = [
    # First, the ones we made
    'is_black', 'race_unknown', 'same_race', 'juror_id__gender_m', 'juror_id__gender_unknown',
    'trial__defendant_race_asian', 'trial__defendant_race_black', 'trial__defendant_race_unknown',
    'trial__judge_Loper', 'trial__judge_OTHER',

    # Then, the ones from the dataset
    # We'll remove 'race' because we have is_black and race_unknown already
    'no_responses', 'leans_defense', 'leans_ambi', 'moral_hardship', 'job_hardship', 
    'caretaker', 'communication', 'medical', 'employed', 'social', 'prior_jury', 
    'crime_victim', 'fam_crime_victim', 'accused', 'fam_accused', 
    'eyewitness', 'fam_eyewitness', 'military', 'law_enforcement', 'fam_law_enforcement', 
    'premature_verdict', 'premature_guilt', 'premature_innocence', 'def_race', 'vic_race', 
    'def_gender', 'vic_gender', 'def_social', 'vic_social', 'def_age', 'vic_age', 
    'def_sexpref', 'vic_sexpref', 'def_incarcerated', 'vic_incarcerated', 'beliefs', 
    'other_biases', 'innocence', 'take_stand', 'arrest_is_guilt', 
    'cant_decide', 'cant_affirm', 'cant_decide_evidence', 'cant_follow', 'know_def', 
    'know_vic', 'know_wit', 'know_attny', 'civil_plantiff', 'civil_def', 'civil_witness', 
    'witness_defense', 'witness_state', 'prior_info', 'death_hesitation', 'no_death', 
    'no_life', 'no_cops', 'yes_cops', 'legally_disqualified', 'witness_ambi',  
]

Remove anything without 5 events and non-events#

From the methodology:

Our logistic regression model used all the variables we tracked that had more than 5 event and non-event occurrences

What's this mean? Think about it like this: if everyone said they were in the military, military wouldn't be a very useful column. Or if all potential jurors that said they were in the military never got accepted? Also useless.

What we're looking for is a good mix, where sometimes they were accepted and sometimes they were rejected, and where sometimes they answered yes and sometimes they answered no.

We'll start by seeing how we can count how many fall in each category, and when we'd accept or reject them.

For example, whether someone is black or not is a large mix of outcomes.

counted = df.groupby(['struck_by_state', 'is_black']).size().unstack(fill_value=0)
counted

is_black	0	1
struck_by_state
0	1377	345
1	177	396

On the other hand, only 5 people ever said they were in the military, and they were all accepted. Not very useful!

counted = df.groupby(['struck_by_state', 'military']).size().unstack(fill_value=0)
counted

military	0	1
struck_by_state
0	1717	5
1	573	0

No one said they can't follow instructions, so we won't want to use this feature.

counted = df.groupby(['struck_by_state', 'cant_follow']).size().unstack(fill_value=0)
counted

cant_follow	0
struck_by_state
0	1722
1	573

We'll need two techniques to filter there. First, we can use this to see if any of the cells are less than five.

(counted < 5).any(axis=None)

False

But remember how we sometimes only have one column? To remove those, we need to check and see if we have a full 2x2 square.

counted.count().sum()

Filtering columns without 5 events and non-events#

Now that we have our techniques, let's filter!

useable_cols = []
for col in feature_columns:
    counted = df.groupby(['struck_by_state', col]).size().unstack(fill_value=0)
    if counted.count().sum() < 4 or (counted < 5).any(axis=None):
        # print("Skipping", col)
        pass
    else:
        useable_cols.append(col)

useable_cols

['is_black',
 'same_race',
 'juror_id__gender_m',
 'juror_id__gender_unknown',
 'trial__defendant_race_asian',
 'trial__defendant_race_black',
 'trial__defendant_race_unknown',
 'trial__judge_Loper',
 'trial__judge_OTHER',
 'no_responses',
 'leans_ambi',
 'prior_jury',
 'crime_victim',
 'fam_crime_victim',
 'accused',
 'fam_accused',
 'law_enforcement',
 'fam_law_enforcement',
 'know_def',
 'know_vic',
 'know_wit',
 'know_attny',
 'prior_info',
 'death_hesitation']

Perform the regression#

We'll start by importing the statsmodels package for doing formula-based regression

import statsmodels.formula.api as smf

APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1. Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.

We're going to use all of our useable_cols to perform this regression. There's another notebook where we filter based on p-values, I recommend taking a look at it! The method we use here is readable, but kind of a pain.

# I want to cut and paste for my formula
print(" + ".join(useable_cols))

is_black + same_race + juror_id__gender_m + juror_id__gender_unknown + trial__defendant_race_asian + trial__defendant_race_black + trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER + no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim + accused + fam_accused + law_enforcement + fam_law_enforcement + know_def + know_vic + know_wit + know_attny + prior_info + death_hesitation

model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + juror_id__gender_m + juror_id__gender_unknown
        + trial__defendant_race_asian + trial__defendant_race_black
        + trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER
        + no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim
        + accused + fam_accused + law_enforcement + fam_law_enforcement + know_def
        + know_vic + know_wit + know_attny + prior_info + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.405530
         Iterations 7

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2270
Method:	MLE	Df Model:	24
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2784
Time:	15:08:46	Log-Likelihood:	-930.69
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.878e-136

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.3416	0.223	-10.489	0.000	-2.779	-1.904
is_black	1.9325	0.143	13.506	0.000	1.652	2.213
same_race	0.4585	0.142	3.228	0.001	0.180	0.737
juror_id__gender_m	0.0488	0.123	0.397	0.691	-0.192	0.290
juror_id__gender_unknown	-0.0303	0.376	-0.081	0.936	-0.768	0.707
trial__defendant_race_asian	0.7465	0.546	1.368	0.171	-0.323	1.816
trial__defendant_race_black	-0.1635	0.151	-1.079	0.280	-0.460	0.133
trial__defendant_race_unknown	0.5651	0.410	1.378	0.168	-0.239	1.369
trial__judge_Loper	0.1796	0.134	1.337	0.181	-0.084	0.443
trial__judge_OTHER	0.0056	0.466	0.012	0.990	-0.907	0.918
no_responses	-0.2995	0.164	-1.822	0.068	-0.622	0.023
leans_ambi	0.3274	0.666	0.492	0.623	-0.977	1.632
prior_jury	-0.2290	0.210	-1.089	0.276	-0.641	0.183
crime_victim	-0.0287	0.315	-0.091	0.928	-0.647	0.589
fam_crime_victim	0.5037	0.281	1.792	0.073	-0.047	1.055
accused	2.4623	0.548	4.492	0.000	1.388	3.537
fam_accused	1.7964	0.175	10.275	0.000	1.454	2.139
law_enforcement	-0.9703	0.503	-1.929	0.054	-1.957	0.016
fam_law_enforcement	-0.6832	0.173	-3.957	0.000	-1.022	-0.345
know_def	1.3204	0.239	5.536	0.000	0.853	1.788
know_vic	0.2446	0.239	1.022	0.307	-0.224	0.714
know_wit	-0.3940	0.236	-1.666	0.096	-0.857	0.069
know_attny	0.3438	0.237	1.451	0.147	-0.120	0.808
prior_info	-0.2074	0.200	-1.039	0.299	-0.599	0.184
death_hesitation	1.8562	0.598	3.103	0.002	0.684	3.029

APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1. Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.

Going through the p-value list above, we'll remove any features that are at or above the 0.1 p-value threshold (that's the P>|z| column). If you'd like more details on the how or why of this, check out the notebook on feature selection by p-value.

model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + no_responses + fam_crime_victim + accused
        + fam_accused + law_enforcement + fam_law_enforcement + know_def
        + know_wit + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.408840
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2283
Method:	MLE	Df Model:	11
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2725
Time:	15:10:35	Log-Likelihood:	-938.29
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	1.293e-143

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.3054	0.126	-18.238	0.000	-2.553	-2.058
is_black	1.9239	0.143	13.440	0.000	1.643	2.204
same_race	0.3776	0.140	2.691	0.007	0.103	0.653
no_responses	-0.2466	0.144	-1.713	0.087	-0.529	0.036
fam_crime_victim	0.4834	0.277	1.743	0.081	-0.060	1.027
accused	2.4520	0.545	4.503	0.000	1.385	3.519
fam_accused	1.7888	0.171	10.485	0.000	1.454	2.123
law_enforcement	-0.8932	0.499	-1.791	0.073	-1.871	0.084
fam_law_enforcement	-0.6728	0.171	-3.935	0.000	-1.008	-0.338
know_def	1.2936	0.236	5.485	0.000	0.831	1.756
know_wit	-0.3339	0.232	-1.437	0.151	-0.789	0.121
death_hesitation	1.7635	0.595	2.961	0.003	0.596	2.931

According to the methodology we need to filter one more time: this time for features with a p-value under 0.5.

APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1. Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.

model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2287
Method:	MLE	Df Model:	7
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2682
Time:	15:12:11	Log-Likelihood:	-943.78
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.815e-145

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.4307	0.101	-24.017	0.000	-2.629	-2.232
is_black	1.8972	0.141	13.443	0.000	1.621	2.174
same_race	0.3603	0.140	2.575	0.010	0.086	0.635
accused	2.5128	0.545	4.606	0.000	1.444	3.582
fam_accused	1.8476	0.162	11.402	0.000	1.530	2.165
fam_law_enforcement	-0.5627	0.162	-3.468	0.001	-0.881	-0.245
know_def	1.3257	0.223	5.937	0.000	0.888	1.763
death_hesitation	1.8243	0.592	3.084	0.002	0.665	2.984

There we go! Now that we have a nice, noise-less set of results, we're free to plug this into a dataframe that can tell us odds ratios.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs

	coef	odds ratio	pvalue	column
accused	2.51278	12.33918	0.00000	accused
is_black	1.89716	6.66696	0.00000	is_black
fam_accused	1.84760	6.34456	0.00000	fam_accused
death_hesitation	1.82434	6.19873	0.00204	death_hesitation
know_def	1.32570	3.76481	0.00000	know_def
same_race	0.36026	1.43370	0.01004	same_race
fam_law_enforcement	-0.56268	0.56968	0.00052	fam_law_enforcement
Intercept	-2.43071	0.08797	0.00000	Intercept

And there we have it! When taking these seven statistically-significant features into account, black jurors were over 6.5x more likely to be struck from a jury.

Variations on our results#

race vs same_race#

We used the same_race variable to code jurors that were the same race as any of the defendants. In building the logistic regression model, we included and excluded certain variables to see how that impacted the model. When we left out the race of the juror from the model, same_race had a much higher odds ratio (odds ratio = 4.5). But the model with the race of the juror added back in lowers the same_race odds ratio to 1.4.

model = smf.logit(formula="""
    struck_by_state ~ 
        same_race + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.453673
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2288
Method:	MLE	Df Model:	6
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.1927
Time:	15:15:19	Log-Likelihood:	-1041.2
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.524e-104

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.0663	0.090	-23.032	0.000	-2.242	-1.890
same_race	1.3847	0.111	12.490	0.000	1.167	1.602
accused	2.7632	0.522	5.298	0.000	1.741	3.785
fam_accused	1.7841	0.150	11.866	0.000	1.489	2.079
fam_law_enforcement	-0.6989	0.156	-4.494	0.000	-1.004	-0.394
know_def	1.3989	0.207	6.766	0.000	0.994	1.804
death_hesitation	1.8131	0.550	3.295	0.001	0.734	2.892

Using regression to find bias in the jury strike process#

Import a lot#

Read in the data#

Add additional features#

What columns are we interested in?#

Remove anything without 5 events and non-events#

Filtering columns without 5 events and non-events#

Perform the regression#

Variations on our results#

race vs same_race#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects