# Using regression to find bias in the jury strike process

When someone is being selected for a jury, what factors play a strong role? We'll track down the answer using logistic regression.

## Import a lot

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)
%matplotlib inline
```

## Read in the data

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the `struck_by_state` column and converted true and false values into ones and zeroes.

```
df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
```

## Add additional features

While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document.

```
df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df['juror_id__gender_m'] = df.gender == 'Male'
df['juror_id__gender_unknown'] = df.gender == 'Unknown'
df['trial__defendant_race_asian'] = df.defendant_race == 'Asian'
df['trial__defendant_race_black'] = df.defendant_race == 'Black'
df['trial__defendant_race_unknown'] = df.defendant_race == 'Unknown'
df['trial__judge_Loper'] = df.judge == 'Joseph Loper, Jr'
df['trial__judge_OTHER'] = df.judge == 'Other'
df.head(2)
```

Since they're all trues and falses, we'll need to take a second to convert them to ones and zeroes so that our regression will work.

```
df = df.replace({
    True: 1,
    False: 0
})
```
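If you'd rather not run `replace` across the whole dataframe, here's a minimal alternative sketch: select only the boolean columns and cast them to integers. The `toy` dataframe is made up for illustration and only borrows a few column names from our dataset.

```python
import pandas as pd

# Made-up rows standing in for the real jury data
toy = pd.DataFrame({
    'race': ['Black', 'White', 'Unknown'],
    'is_black': [True, False, False],
    'race_unknown': [False, False, True],
})

# Find the boolean columns and cast only those to integers
bool_cols = toy.select_dtypes(include='bool').columns
toy[bool_cols] = toy[bool_cols].astype(int)
print(toy.dtypes)
```

Text columns like `race` are left untouched, since only columns with a `bool` dtype are selected.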

## What columns are we interested in?

> Using whether the juror was struck by the state or not as the dependent variable and the juror's responses during voir dire as the input data, APM Reports built a logistic regression model to test the importance of the different variables on the likelihood of being struck. Our logistic regression model used **all the variables we tracked that had more than 5 event and non-event occurrences**.

We'll start with making a list of all of the variables that were tracked.

```
potential_columns = [
    # First, the ones we made
    'is_black', 'race_unknown', 'same_race', 'juror_id__gender_m', 'juror_id__gender_unknown',
    'trial__defendant_race_asian', 'trial__defendant_race_black', 'trial__defendant_race_unknown',
    'trial__judge_Loper', 'trial__judge_OTHER',
    # Then, the ones from the dataset
    # We'll remove 'race' because we have is_black and race_unknown already
    'no_responses', 'leans_defense', 'leans_ambi', 'moral_hardship', 'job_hardship',
    'caretaker', 'communication', 'medical', 'employed', 'social', 'prior_jury',
    'crime_victim', 'fam_crime_victim', 'accused', 'fam_accused',
    'eyewitness', 'fam_eyewitness', 'military', 'law_enforcement', 'fam_law_enforcement',
    'premature_verdict', 'premature_guilt', 'premature_innocence', 'def_race', 'vic_race',
    'def_gender', 'vic_gender', 'def_social', 'vic_social', 'def_age', 'vic_age',
    'def_sexpref', 'vic_sexpref', 'def_incarcerated', 'vic_incarcerated', 'beliefs',
    'other_biases', 'innocence', 'take_stand', 'arrest_is_guilt',
    'cant_decide', 'cant_affirm', 'cant_decide_evidence', 'cant_follow', 'know_def',
    'know_vic', 'know_wit', 'know_attny', 'civil_plantiff', 'civil_def', 'civil_witness',
    'witness_defense', 'witness_state', 'prior_info', 'death_hesitation', 'no_death',
    'no_life', 'no_cops', 'yes_cops', 'legally_disqualified', 'witness_ambi',
]
```

## Remove anything without 5 events and non-events

From the methodology:

> Our logistic regression model used all the variables we tracked that had **more than 5 event and non-event occurrences**

What does this mean? Think about it like this: if everyone said they were in the military, `military` wouldn't be a very useful column. Or if every potential juror who said they were in the military was rejected? Also useless.

What we're looking for is a good mix, where sometimes they were accepted and sometimes they were rejected, and where sometimes they answered yes and sometimes they answered no.

We'll start by counting how many potential jurors fall into each combination of answer and outcome, which will tell us whether to keep or drop a column.

For example, whether someone is black or not has a large mix of outcomes.

```
counted = df.groupby(['struck_by_state', 'is_black']).size().unstack(fill_value=0)
counted
```

On the other hand, only 5 people ever said they were in the military, and they were all accepted. Not very useful!

```
counted = df.groupby(['struck_by_state', 'military']).size().unstack(fill_value=0)
counted
```

No one said they can't follow instructions, so we won't want to use this feature.

```
counted = df.groupby(['struck_by_state', 'cant_follow']).size().unstack(fill_value=0)
counted
```

We'll need **two techniques** to filter these out. First, we can check whether any of the cells are less than five.

```
(counted < 5).any(axis=None)
```

But remember how we sometimes only have one column? To remove those, we need to check and see if we have a full 2x2 square.

```
counted.count().sum()
```
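The two checks above can be sketched as a single helper. This is an illustration under assumptions, not the notebook's own code: the `has_enough_events` name and the `toy` data are made up, and the sketch tests for a full square with `counted.shape == (2, 2)` rather than `counted.count().sum()`.

```python
import pandas as pd

# Made-up data: 'common' has a healthy mix, 'rare' has too few "yes" answers,
# and 'never' has no "yes" answers at all
toy = pd.DataFrame({
    'struck_by_state': [0, 1] * 12,
    'common': [0, 0, 1, 1] * 6,
    'rare': [0] * 22 + [1, 1],
    'never': [0] * 24,
})

def has_enough_events(df, col, outcome='struck_by_state', minimum=5):
    """Return True if the outcome-by-answer table is 2x2 and every cell has at least `minimum` rows."""
    counted = df.groupby([outcome, col]).size().unstack(fill_value=0)
    is_full_square = counted.shape == (2, 2)
    return bool(is_full_square and (counted >= minimum).all(axis=None))

for col in ['common', 'rare', 'never']:
    print(col, has_enough_events(toy, col))
```

Only `common` passes: `rare` has a full square but two cells with a single row each, and `never` is missing an entire column of the square.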

### Filtering columns without 5 events and non-events

Now that we have our techniques, let's filter!

```
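useable_cols = []
for col in potential_columns:
    # Count jurors by outcome (rows) and answer (columns)
    counted = df.groupby(['struck_by_state', col]).size().unstack(fill_value=0)
    # Skip columns without a full 2x2 square or with any cell under 5
    if counted.count().sum() < 4 or (counted < 5).any(axis=None):
        # print("Skipping", col)
        pass
    else:
        useable_cols.append(col)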
```

```
useable_cols
```

## Perform the regression

We'll start by importing the statsmodels package for doing formula-based regression.

```
import statsmodels.formula.api as smf
```

> **APM Reports first ran every variable through a logistic regression model.** We then removed all variables with a p-value > 0.1. Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.

We're going to use all of our `useable_cols` to perform this regression. There's another notebook where we filter based on p-values, and I recommend taking a look at it! The method we use here is readable, but kind of a pain.

```
# I want to cut and paste for my formula
print(" + ".join(useable_cols))
```
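If cutting and pasting gets tedious, the whole formula string can be assembled in code instead. A minimal sketch, using a hypothetical three-column list (`example_cols`) in place of the real `useable_cols`:

```python
# Hypothetical subset of columns, standing in for useable_cols
example_cols = ['is_black', 'same_race', 'accused']

# Join the predictors with " + " and prepend the dependent variable
formula = "struck_by_state ~ " + " + ".join(example_cols)
print(formula)
```

The resulting string can be passed to `smf.logit(formula=formula, data=df)` in place of a hand-typed formula.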

```
model = smf.logit(formula="""
struck_by_state ~
is_black + same_race + juror_id__gender_m + juror_id__gender_unknown
+ trial__defendant_race_asian + trial__defendant_race_black
+ trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER
+ no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim
+ accused + fam_accused + law_enforcement + fam_law_enforcement + know_def
+ know_vic + know_wit + know_attny + prior_info + death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

> APM Reports first ran every variable through a logistic regression model. **We then removed all variables with a p-value > 0.1.** Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.

Going through the p-value list above, we'll remove any features that are at or above the `0.1` p-value threshold (that's the `P>|z|` column). If you'd like more details on the how or why of this, check out the notebook on feature selection by p-value.

```
model = smf.logit(formula="""
struck_by_state ~
is_black + same_race + no_responses + fam_crime_victim + accused
+ fam_accused + law_enforcement + fam_law_enforcement + know_def
+ know_wit + death_hesitation
""", data=df)
results = model.fit()
results.summary()
```
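Picking features off the summary table by eye works, but the pruning can also be sketched in code. A fitted statsmodels result exposes its p-values as a pandas Series in `results.pvalues`. The numbers below are made up purely for illustration and are not our model's output; the `keep_below` helper is a hypothetical name.

```python
import pandas as pd

# Made-up p-values standing in for results.pvalues (NOT the real model output)
example_pvalues = pd.Series({
    'Intercept': 0.000,
    'is_black': 0.001,
    'same_race': 0.040,
    'prior_jury': 0.620,
    'know_vic': 0.100,
})

def keep_below(pvalues, threshold):
    """Return the feature names whose p-value is strictly below the threshold."""
    # The intercept isn't a feature, so leave it out of the formula
    features = pvalues.drop('Intercept', errors='ignore')
    return features[features < threshold].index.tolist()

kept = keep_below(example_pvalues, 0.1)
print(" + ".join(kept))
```

Because the comparison is strict, a feature sitting exactly at the threshold (like the made-up `know_vic` here) is dropped, which matches removing anything "at or above" `0.1`.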

According to the methodology, we need to filter one more time: this time we keep only the features with a p-value under `0.05`.

> APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1. **Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.**

```
model = smf.logit(formula="""
struck_by_state ~
is_black + same_race + accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

There we go! Now that we have a nice, noise-less set of results, we're free to plug this into a dataframe that can tell us odds ratios.

```
coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
```

And there we have it! When taking these seven statistically significant features into account, a black juror's odds of being struck from a jury were over 6.5x higher.
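One caveat when reading that number: an odds ratio multiplies *odds*, not probabilities, so its effect on the chance of being struck depends on the starting point. Here's a rough sketch of the conversion. The baseline probabilities are hypothetical and not taken from the data, and `apply_odds_ratio` is a made-up helper name.

```python
def apply_odds_ratio(baseline_probability, odds_ratio):
    """Convert a baseline probability to odds, scale by the odds ratio, and convert back."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)

# Hypothetical baseline probabilities of being struck
for baseline in [0.05, 0.10, 0.25]:
    adjusted = apply_odds_ratio(baseline, 6.5)
    print(f"baseline {baseline:.0%} -> {adjusted:.1%}")
```

With a 10% baseline, for example, an odds ratio of 6.5 gives a probability of roughly 42%, which is about four times the baseline rather than 6.5 times.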

## Variations on our results

### race vs same_race

We used the same_race variable to code jurors that were the same race as any of the defendants. In building the logistic regression model, we included and excluded certain variables to see how that impacted the model. When we left out the race of the juror from the model, same_race had a much higher odds ratio (odds ratio = 4.5). But the model with the race of the juror added back in lowers the same_race odds ratio to 1.4.

```
model = smf.logit(formula="""
struck_by_state ~
same_race + accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```