Logistic regression of jury rejections using statsmodels' formula method#

In this notebook we'll be looking for evidence of racial bias in the jury selection process. To this end we'll be working with the statsmodels package, and specifically its R-formula-like smf.logit method.

Read online Download notebook Interactive version

Import a lot#

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

%matplotlib inline

Read in the data#

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state column and converted true and false values into ones and zeroes.

df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)

	id_x	juror_id	juror_id__trial__id	no_responses	married	children	religious	education	leans_state	leans_defense	leans_ambi	moral_hardship	job_hardship	caretaker	communication	medical	employed	social	prior_jury	crime_victim	fam_crime_victim	accused	fam_accused	eyewitness	fam_eyewitness	military	law_enforcement	fam_law_enforcement	premature_verdict	premature_guilt	premature_innocence	def_race	vic_race	def_gender	vic_gender	def_social	vic_social	def_age	vic_age	def_sexpref	vic_sexpref	def_incarcerated	vic_incarcerated	beliefs	other_biases	innocence	take_stand	arrest_is_guilt	cant_decide	cant_affirm	cant_decide_evidence	cant_follow	know_def	know_vic	know_wit	know_attny	civil_plantiff	civil_def	civil_witness	witness_defense	witness_state	prior_info	death_hesitation	no_death	no_life	no_cops	yes_cops	legally_disqualified	witness_ambi	notes	id_y	trial	trial__id	race	gender	race_source	gender_source	struck_by	strike_eligibility	id	defendant_name	cause_number	state_strikes	defense_strikes	county	defendant_race	second_defendant_race	third_defendant_race	fourth_defendant_race	more_than_four_defendants	judge	prosecutor_1	prosecutor_2	prosecutor_3	prosecutors_more_than_three	def_attny_1	def_attny_2	def_attny_3	def_attnys_more_than_three	offense_code_1	offense_title_1	offense_code_2	offense_title_2	offense_code_3	offense_title_3	offense_code_4	offense_title_4	offense_code_5	offense_title_5	offense_code_6	offense_title_6	more_than_six	verdict	case_appealed	batson_claim_by_defense	batson_claim_by_state	voir_dire_present	struck_by_state
0	1521	107.00000	3.00000	0	unknown	unknown	unknown	unknown	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN	107	2004-0257--Sparky Watson	3	White	Male	Jury strike sheet	Jury strike sheet	Struck by the defense	Both State and Defense	3	Sparky Watson	2004-0257	1	1	Grenada	Black	NaN	NaN	nan	0	C. Morgan, III	Susan Denley	Ryan Berry	NaN	0	M. Kevin Horan	Elizabeth Davis	NaN	0	41-29-139(a)(1)(b)(3)	sale of marihuana (less than 30 grams)	41-29-139(a)(1)(b)(1)	sale of cocaine	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	Guilty on at least one offense	1	0	0	1	0
1	1524	108.00000	3.00000	0	unknown	unknown	unknown	unknown	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN	108	2004-0257--Sparky Watson	3	Black	Female	Jury strike sheet	Jury strike sheet	Struck by the state	State	3	Sparky Watson	2004-0257	1	1	Grenada	Black	NaN	NaN	nan	0	C. Morgan, III	Susan Denley	Ryan Berry	NaN	0	M. Kevin Horan	Elizabeth Davis	NaN	0	41-29-139(a)(1)(b)(3)	sale of marihuana (less than 30 grams)	41-29-139(a)(1)(b)(1)	sale of cocaine	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	Guilty on at least one offense	1	0	0	1	1

Add additional features#

While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document. For simplicity's sake, we're only calculating the ones that appear in the final regression.

df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df.head(2)

	id_x	juror_id	juror_id__trial__id	no_responses	married	children	religious	education	leans_state	leans_defense	leans_ambi	moral_hardship	job_hardship	caretaker	communication	medical	employed	social	prior_jury	crime_victim	fam_crime_victim	accused	fam_accused	eyewitness	fam_eyewitness	military	law_enforcement	fam_law_enforcement	premature_verdict	premature_guilt	premature_innocence	def_race	vic_race	def_gender	vic_gender	def_social	vic_social	def_age	vic_age	def_sexpref	vic_sexpref	def_incarcerated	vic_incarcerated	beliefs	other_biases	innocence	take_stand	arrest_is_guilt	cant_decide	cant_affirm	cant_decide_evidence	cant_follow	know_def	know_vic	know_wit	know_attny	civil_plantiff	civil_def	civil_witness	witness_defense	witness_state	prior_info	death_hesitation	no_death	no_life	no_cops	yes_cops	legally_disqualified	witness_ambi	notes	id_y	trial	trial__id	race	gender	race_source	gender_source	struck_by	strike_eligibility	id	defendant_name	cause_number	state_strikes	defense_strikes	county	defendant_race	second_defendant_race	third_defendant_race	fourth_defendant_race	more_than_four_defendants	judge	prosecutor_1	prosecutor_2	prosecutor_3	prosecutors_more_than_three	def_attny_1	def_attny_2	def_attny_3	def_attnys_more_than_three	offense_code_1	offense_title_1	offense_code_2	offense_title_2	offense_code_3	offense_title_3	offense_code_4	offense_title_4	offense_code_5	offense_title_5	offense_code_6	offense_title_6	more_than_six	verdict	case_appealed	batson_claim_by_defense	batson_claim_by_state	voir_dire_present	struck_by_state	is_black	race_unknown	same_race	juror_id__gender_m	juror_id__gender_unknown	trial__defendant_race_asian	trial__defendant_race_black	trial__defendant_race_unknown	trial__judge_Loper	trial__judge_OTHER
0	1521	107.00000	3.00000	0	unknown	unknown	unknown	unknown	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN	107	2004-0257--Sparky Watson	3	White	Male	Jury strike sheet	Jury strike sheet	Struck by the defense	Both State and Defense	3	Sparky Watson	2004-0257	1	1	Grenada	Black	NaN	NaN	nan	0	C. Morgan, III	Susan Denley	Ryan Berry	NaN	0	M. Kevin Horan	Elizabeth Davis	NaN	0	41-29-139(a)(1)(b)(3)	sale of marihuana (less than 30 grams)	41-29-139(a)(1)(b)(1)	sale of cocaine	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	Guilty on at least one offense	1	0	0	1	0	False	False	False	1	0	0	1	0	0	0
1	1524	108.00000	3.00000	0	unknown	unknown	unknown	unknown	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN	108	2004-0257--Sparky Watson	3	Black	Female	Jury strike sheet	Jury strike sheet	Struck by the state	State	3	Sparky Watson	2004-0257	1	1	Grenada	Black	NaN	NaN	nan	0	C. Morgan, III	Susan Denley	Ryan Berry	NaN	0	M. Kevin Horan	Elizabeth Davis	NaN	0	41-29-139(a)(1)(b)(3)	sale of marihuana (less than 30 grams)	41-29-139(a)(1)(b)(1)	sale of cocaine	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	Guilty on at least one offense	1	0	0	1	1	True	False	True	0	0	0	1	0	0	0

Since they're all trues and falses, we'll need to take a second to convert them to ones and zeroes so that our regression will work.

df = df.replace({
    True: 1,
    False: 0
})

Performing our regression#

We're going to perform the simple regression from the end of their methodology. Not too many columns at all!

model = smf.logit(formula="""
    struck_by_state ~ 
        same_race + accused + fam_accused + fam_law_enforcement
        + know_def + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.453673
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2288
Method:	MLE	Df Model:	6
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.1927
Time:	15:19:04	Log-Likelihood:	-1041.2
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.524e-104

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.0663	0.090	-23.032	0.000	-2.242	-1.890
same_race	1.3847	0.111	12.490	0.000	1.167	1.602
accused	2.7632	0.522	5.298	0.000	1.741	3.785
fam_accused	1.7841	0.150	11.866	0.000	1.489	2.079
fam_law_enforcement	-0.6989	0.156	-4.494	0.000	-1.004	-0.394
know_def	1.3989	0.207	6.766	0.000	0.994	1.804
death_hesitation	1.8131	0.550	3.295	0.001	0.734	2.892

The irritating thing about this, though, is we had to make new columns. Making columns is a pain, in that it takes time and effort and there's always the potential to screw things up.

An alternative technique#

When you're putting together your formula, you can actually do more than just add together columns! Instead, you can actually make the comparisons that say, "is this person's race black?" or "are they the same race as the defendant?"

model = smf.logit(formula="""
    struck_by_state ~ 
        (df.race == 'Black')
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2287
Method:	MLE	Df Model:	7
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2682
Time:	15:30:38	Log-Likelihood:	-943.78
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.815e-145

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.4307	0.101	-24.017	0.000	-2.629	-2.232
df.race == 'Black'[T.True]	1.8972	0.141	13.443	0.000	1.621	2.174
df.defendant_race == df.race[T.True]	0.3603	0.140	2.575	0.010	0.086	0.635
accused	2.5128	0.545	4.606	0.000	1.444	3.582
fam_accused	1.8476	0.162	11.402	0.000	1.530	2.165
fam_law_enforcement	-0.5627	0.162	-3.468	0.001	-0.881	-0.245
know_def	1.3257	0.223	5.937	0.000	0.888	1.763
death_hesitation	1.8243	0.592	3.084	0.002	0.665	2.984

So exciting!!! We didn't need to make any columns at all!

One downside of this method is that statsmodels can pick either True or False as what's shown in the coefficients list. The above [T.True] means the coefficient is for when they are black, but you could easily show up in a position where it's [T.False], meaning "this is the coefficient for when they are not black."

If you need to force statsmodels to use one or the other, you just need to explain which one you want as the reference category. You do this by shaking your comparison to look like this:

C(df.race == 'Black', Treatment(False))

model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race == 'Black', Treatment(False))
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2287
Method:	MLE	Df Model:	7
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2682
Time:	15:35:15	Log-Likelihood:	-943.78
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.815e-145

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.4307	0.101	-24.017	0.000	-2.629	-2.232
C(df.race == 'Black', Treatment(False))[T.True]	1.8972	0.141	13.443	0.000	1.621	2.174
df.defendant_race == df.race[T.True]	0.3603	0.140	2.575	0.010	0.086	0.635
accused	2.5128	0.545	4.606	0.000	1.444	3.582
fam_accused	1.8476	0.162	11.402	0.000	1.530	2.165
fam_law_enforcement	-0.5627	0.162	-3.468	0.001	-0.881	-0.245
know_def	1.3257	0.223	5.937	0.000	0.888	1.763
death_hesitation	1.8243	0.592	3.084	0.002	0.665	2.984

This looks the same as before, so not very exciting. While it doesn't make much sense, we can change the reference category to be True, so our result will show us what happens when race is not black.

model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race == 'Black', Treatment(True))
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2287
Method:	MLE	Df Model:	7
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2682
Time:	15:35:59	Log-Likelihood:	-943.78
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	3.815e-145

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-0.5335	0.137	-3.897	0.000	-0.802	-0.265
C(df.race == 'Black', Treatment(True))[T.False]	-1.8972	0.141	-13.443	0.000	-2.174	-1.621
df.defendant_race == df.race[T.True]	0.3603	0.140	2.575	0.010	0.086	0.635
accused	2.5128	0.545	4.606	0.000	1.444	3.582
fam_accused	1.8476	0.162	11.402	0.000	1.530	2.165
fam_law_enforcement	-0.5627	0.162	-3.468	0.001	-0.881	-0.245
know_def	1.3257	0.223	5.937	0.000	0.888	1.763
death_hesitation	1.8243	0.592	3.084	0.002	0.665	2.984

# Calcualte the odds ratio without making a big dataframe... 
np.exp(-1.8972)

0.14998799821305078

Which means non-black jurors have a 0.15x chance of getting rejected. Not as pleasant, is it? Just pay attention to your reference categories.

Another alternative#

Up above we're only checking to see if they're black or not. But what if there were multiple races, and we wanted to look at each one of them individually?

If you know what I'm talking about: you could do a lot of fancy one-hot encoding and blah blah blah pandas/sklearn magic.
If you don't know what I'm talking about: that sounds overly complex, doesn't it?

Watch this.

model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race, Treatment('White'))
        + (df.defendant_race == df.race)
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.411066
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2286
Method:	MLE	Df Model:	8
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2685
Time:	15:39:37	Log-Likelihood:	-943.40
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	2.698e-144

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.4406	0.102	-23.917	0.000	-2.641	-2.241
C(df.race, Treatment('White'))[T.Black]	1.9027	0.141	13.452	0.000	1.625	2.180
C(df.race, Treatment('White'))[T.Unknown]	0.7358	0.775	0.949	0.343	-0.784	2.256
df.defendant_race == df.race[T.True]	0.3642	0.140	2.599	0.009	0.090	0.639
accused	2.5173	0.546	4.611	0.000	1.447	3.587
fam_accused	1.8528	0.162	11.415	0.000	1.535	2.171
fam_law_enforcement	-0.5590	0.162	-3.441	0.001	-0.877	-0.241
know_def	1.3282	0.224	5.942	0.000	0.890	1.766
death_hesitation	1.8283	0.592	3.088	0.002	0.668	2.989

We changed the is_black variable into something slightly more complicated

C(df.race, Treatment('White'))

This tells statsmodels to look at all of the options in the race column, and calculate all of the coefficients in relation to the White value. While before we just knew if someone is black or not, now we have more options!

C(df.race, Treatment('White'))[T.Black] is when a juror is black
C(df.race, Treatment('White'))[T.Unknown] is when a juror's race is unknown

Well, not a lot - the only options are "Black," "White," and "Unknown," but you get the idea. The Treatment('White') part lets you know that this is all in comparison to jurors listed as white.

Now if we take the coefficient for black jurors - 1.9027 - and turn it into an odds ratio - 6.7 - we need to remember this is all in reference to white jurors. When we did it before the comparison was "black vs non-black," but now our comparison is "black vs. white" and "unknown race vs white."

Taking advantage of this feature saves you a lot of time when you're trying to pick apart complicated categorical columns.

For example, we could look at the judges! In the methodology from APM Reports, they have a couple different columns:

trial__judge_Loper: Judge for trial was Joseph Loper, reference category: Judge Morgan
trial__judge_OTHER: The judge was neither Loper nor Morgan

We can do the same thing, but instead of creating multiple new categories we can just use C() and Treatment()

# Find the actual names of the judges
df.judge.value_counts()

Joseph Loper, Jr    1282
C. Morgan, III       966
Other                 47
Name: judge, dtype: int64

# Run the regression
model = smf.logit(formula="""
    struck_by_state ~ 
        C(df.race, Treatment('White'))
        + C(df.judge, Treatment('C. Morgan, III'))
        + accused
        + fam_accused + fam_law_enforcement + know_def
        + death_hesitation
""", data=df)

results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.411939
         Iterations 6

Logit Regression Results
Dep. Variable:	struck_by_state	No. Observations:	2295
Model:	Logit	Df Residuals:	2285
Method:	MLE	Df Model:	9
Date:	Mon, 04 Nov 2019	Pseudo R-squ.:	0.2670
Time:	15:47:05	Log-Likelihood:	-945.40
converged:	True	LL-Null:	-1289.7
Covariance Type:	nonrobust	LLR p-value:	1.885e-142

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-2.4853	0.124	-19.995	0.000	-2.729	-2.242
C(df.race, Treatment('White'))[T.Black]	2.1134	0.119	17.788	0.000	1.881	2.346
C(df.race, Treatment('White'))[T.Unknown]	0.6073	0.776	0.782	0.434	-0.914	2.129
C(df.judge, Treatment('C. Morgan, III'))[T.Joseph Loper, Jr]	0.1899	0.120	1.586	0.113	-0.045	0.425
C(df.judge, Treatment('C. Morgan, III'))[T.Other]	-0.0420	0.452	-0.093	0.926	-0.927	0.843
accused	2.4955	0.543	4.599	0.000	1.432	3.559
fam_accused	1.8845	0.162	11.615	0.000	1.566	2.202
fam_law_enforcement	-0.5639	0.162	-3.488	0.000	-0.881	-0.247
know_def	1.4005	0.221	6.342	0.000	0.968	1.833
death_hesitation	1.9159	0.586	3.268	0.001	0.767	3.065

And there you go! Formulas make it all so easy.

Note: Remember that the coefficient isn't the odds ratio! We need to do an extra step to get that.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs

	coef	odds ratio	pvalue	column
accused	2.49549	12.12764	0.00000	accused
C(df.race, Treatment('White'))[T.Black]	2.11344	8.27668	0.00000	C(df.race, Treatment('White'))[T.Black]
death_hesitation	1.91590	6.79303	0.00108	death_hesitation
fam_accused	1.88448	6.58293	0.00000	fam_accused
know_def	1.40049	4.05718	0.00000	know_def
C(df.race, Treatment('White'))[T.Unknown]	0.60727	1.83542	0.43411	C(df.race, Treatment('White'))[T.Unknown]
C(df.judge, Treatment('C. Morgan, III'))[T.Joseph Loper, Jr]	0.18986	1.20908	0.11278	C(df.judge, Treatment('C. Morgan, III'))[T.Jos...
C(df.judge, Treatment('C. Morgan, III'))[T.Other]	-0.04205	0.95882	0.92583	C(df.judge, Treatment('C. Morgan, III'))[T.Other]
fam_law_enforcement	-0.56392	0.56897	0.00049	fam_law_enforcement
Intercept	-2.48534	0.08330	0.00000	Intercept

Review#

We looked at the way statsmodels formulas work, allowing you to make comparisons and automatically split categories into separate features. Categories get assigned a reference, which is what your odds ratio will be compared with.

For example:

formula	meaning
`C(df.race, Treatment('White'))`	Comparing black vs. white
`df.race == 'Black'`	Comparing black vs. non-black
`is_black`	Same as above, just more typing to make the column!

Discussion topics#

What are the pluses and minuses of using C() compared to building new columns?
How do you pick the reference category?
The p-value for unknown race is uselessly high compared to the p-value for black jurors. What do you think you should do about it, if anything?
Are you heartbroken that you learned some tricks from the p-value filtering notebook, but if you end up using these techniques those tricks totally won't work? Because I am.

Logistic regression of jury rejections using statsmodels' formula method#

Import a lot#

Read in the data#

Add additional features#

Performing our regression#

An alternative technique#

Another alternative#

Review#

Discussion topics#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects