# Logistic regression of jury rejections using statsmodels' formula method

In this notebook we'll be looking for evidence of racial bias in the jury selection process. To this end we'll be working with the statsmodels package, and specifically its R-formula-style `smf.logit` method.

## Import a lot

```
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)
%matplotlib inline
```

## Read in the data

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the `struck_by_state` column and converted true and false values into ones and zeroes.

```
df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
```

### Add additional features

While our dataset is already pretty big, we also want to calculate a few new features to match what APM Reports has in their methodology document. For simplicity's sake, we're only calculating the ones that appear in the **final regression.**

```
df['is_black'] = df.race == 'Black'
df['race_unknown'] = df.race == 'Unknown'
df['same_race'] = df.race == df.defendant_race
df.head(2)
```

Since they're all trues and falses, we'll need to take a second to convert them to ones and zeroes so that our regression will work.

```
df = df.replace({
    True: 1,
    False: 0
})
```
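If you'd rather not run `replace` across the whole dataframe, a narrower sketch (assuming the three new columns are plain booleans, and using made-up data to stand in for the real columns) converts just those flags with `astype(int)`:

```python
import pandas as pd

# Hypothetical boolean flags standing in for is_black / race_unknown / same_race
df = pd.DataFrame({'is_black': [True, False], 'same_race': [False, True]})

# Convert only the flag columns, leaving everything else untouched
for col in ['is_black', 'same_race']:
    df[col] = df[col].astype(int)

print(df)
```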

## Performing our regression

We're going to perform the simple regression from the end of their methodology. Not too many columns at all!

```
model = smf.logit(formula="""
struck_by_state ~
same_race + accused + fam_accused + fam_law_enforcement
+ know_def + death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

The irritating thing about this, though, is **we had to make new columns.** Making columns is a pain, in that it takes time and effort and there's always the potential to screw things up.

## An alternative technique

When you're putting together your formula, you can do more than just add columns together! You can write the comparisons right into the formula, asking "is this person's race Black?" or "are they the same race as the defendant?"

```
model = smf.logit(formula="""
struck_by_state ~
(df.race == 'Black')
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

So exciting!!! We didn't need to make any columns at all!

One downside of this method is that statsmodels can pick either `True` or `False` as what's shown in the coefficients list. The above `[T.True]` means the coefficient is for when they *are* black, but you could easily end up in a situation where it's `[T.False]`, meaning "this is the coefficient for when they are *not* black."

If you need to force statsmodels to use one or the other, you just need to specify which one you want as the **reference category**. You do this by changing your comparison to look like this:

`C(df.race == 'Black', Treatment(False))`

```
model = smf.logit(formula="""
struck_by_state ~
C(df.race == 'Black', Treatment(False))
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

This looks the same as before, so not very exciting. While it doesn't make much sense, we can change the reference category to `True`, so our result will show us what happens when race is *not* black.

```
model = smf.logit(formula="""
struck_by_state ~
C(df.race == 'Black', Treatment(True))
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

```
# Calculate the odds ratio without making a big dataframe...
np.exp(-1.8972)
```

Which means non-black jurors have 0.15 times the odds of getting rejected. Not as pleasant, is it? **Just pay attention to your reference categories.**
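A quick sanity check on why the number flipped: swapping the reference category negates the coefficient, so the two odds ratios are reciprocals of each other.

```python
import numpy as np

# The coefficient from the summary above, once per reference category
black_vs_non_black = np.exp(1.8972)    # roughly 6.7
non_black_vs_black = np.exp(-1.8972)   # roughly 0.15

# exp(x) * exp(-x) == 1, so the two ratios multiply out to one
print(black_vs_non_black * non_black_vs_black)
```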

## Another alternative

Up above we're only checking to see if they're black or not. But what if there were multiple races, and we wanted to look at each one of them individually?

**If you know what I'm talking about:** you *could* do a lot of fancy one-hot encoding and blah blah blah pandas/sklearn magic. **If you don't know what I'm talking about:** that sounds overly complex, doesn't it?

Watch this.

```
model = smf.logit(formula="""
struck_by_state ~
C(df.race, Treatment('White'))
+ (df.defendant_race == df.race)
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

We changed the `is_black` variable into something slightly more complicated:

`C(df.race, Treatment('White'))`

This tells statsmodels to look at all of the options in the `race` column and calculate all of the coefficients in relation to the `White` value. While before we just knew whether someone was black or not, now we have more options!

- `C(df.race, Treatment('White'))[T.Black]` is the coefficient for when a juror is black
- `C(df.race, Treatment('White'))[T.Unknown]` is the coefficient for when a juror's race is unknown

Well, not a lot more - the only options are "Black," "White," and "Unknown," but you get the idea. The `Treatment('White')` part lets you know that this is all in comparison to jurors listed as white.
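Under the hood, this treatment coding behaves like one-hot encoding with the reference level dropped. A minimal sketch, using a made-up race column rather than the real data:

```python
import pandas as pd

# Hypothetical data mirroring the three values in the race column
races = pd.Series(['Black', 'White', 'Unknown', 'Black'], name='race')

# One-hot encode, then drop the reference category (White) -
# each remaining column gets its coefficient relative to it
dummies = pd.get_dummies(races).drop(columns='White')
print(dummies)
```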

Now if we take the coefficient for black jurors - 1.9027 - and turn it into an odds ratio - 6.7 - we need to remember **this is all in reference to white jurors.** When we did it before the comparison was "black vs non-black," but now our comparison is "black vs. white" and "unknown race vs white."

Taking advantage of this feature saves you a lot of time when you're trying to pick apart complicated categorical columns.

For example, **we could look at the judges!** In the methodology from APM Reports, they have a couple different columns:

- `trial__judge_Loper`: the judge for the trial was Joseph Loper (reference category: Judge Morgan)
- `trial__judge_OTHER`: the judge was neither Loper nor Morgan

We can do the same thing, but instead of creating multiple new columns we can just use `C()` and `Treatment()`.
```
# Find the actual names of the judges
df.judge.value_counts()
```

```
# Run the regression
model = smf.logit(formula="""
struck_by_state ~
C(df.race, Treatment('White'))
+ C(df.judge, Treatment('C. Morgan, III'))
+ accused
+ fam_accused + fam_law_enforcement + know_def
+ death_hesitation
""", data=df)
results = model.fit()
results.summary()
```

And there you go! Formulas make it all so easy.

**Note:** Remember that the coefficient isn't the odds ratio! We need to do an extra step to get that.

```
coefs = pd.DataFrame({
    'coef': results.params,
    'odds ratio': np.exp(results.params),
    'pvalue': results.pvalues,
    'column': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
```
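One way to read a table like this is to keep only the rows below the conventional 0.05 p-value threshold. A sketch with made-up numbers shaped like the table above (the column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient table shaped like the real one
coefs = pd.DataFrame({
    'coef': [1.90, -0.03],
    'odds ratio': np.exp([1.90, -0.03]),
    'pvalue': [0.0001, 0.92],
}, index=['race_black', 'race_unknown'])

# Keep only the estimates we have decent evidence for
significant = coefs[coefs.pvalue < 0.05]
print(significant)
```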

## Review

We looked at the way statsmodels **formulas** work, allowing you to **make comparisons** and **automatically split categories into separate features**. Categories get assigned a **reference**, which is what your odds ratio will be compared with.

For example:

| formula | meaning |
|---|---|
| `C(df.race, Treatment('White'))` | Comparing each race vs. white |
| `df.race == 'Black'` | Comparing black vs. non-black |
| `is_black` | Same as above, just more typing to make the column! |

## Discussion topics

- What are the pluses and minuses of using `C()` compared to building new columns?
- How do you pick the reference category?
- The p-value for unknown race is uselessly high compared to the p-value for black jurors. What do you think you should do about it, if anything?
- Are you heartbroken that you learned some tricks from the p-value filtering notebook, but if you end up using these techniques those tricks totally won't work? Because I am.
