Feature selection with p-values#

When you're performing a regression, not every column has a meaningful connection with the output. We'll use p-values to cull the less important columns and refine our model.

Imports#

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 100)
pd.set_option('display.float_format', '{:.5f}'.format)

Read in the data#

We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state column and converted true and false values into ones and zeroes.

df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
id_x juror_id juror_id__trial__id no_responses married children religious education leans_state leans_defense leans_ambi moral_hardship job_hardship caretaker communication medical employed social prior_jury crime_victim fam_crime_victim accused fam_accused eyewitness fam_eyewitness ... prosecutor_3 prosecutors_more_than_three def_attny_1 def_attny_2 def_attny_3 def_attnys_more_than_three offense_code_1 offense_title_1 offense_code_2 offense_title_2 offense_code_3 offense_title_3 offense_code_4 offense_title_4 offense_code_5 offense_title_5 offense_code_6 offense_title_6 more_than_six verdict case_appealed batson_claim_by_defense batson_claim_by_state voir_dire_present struck_by_state
0 1521 107.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... NaN 0 M. Kevin Horan Elizabeth Davis NaN 0 41-29-139(a)(1)(b)(3) sale of marihuana (less than 30 grams) 41-29-139(a)(1)(b)(1) sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 0
1 1524 108.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 ... NaN 0 M. Kevin Horan Elizabeth Davis NaN 0 41-29-139(a)(1)(b)(3) sale of marihuana (less than 30 grams) 41-29-139(a)(1)(b)(1) sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 1

2 rows × 118 columns

That is a lot of columns! But don't worry, we can add some more, too.

Adding additional features#

Since we're going to be selecting from a lot of columns, let's get crazy and add a few more.

df['is_black'] = df.race == 'Black'
df['same_race'] = df.race == df.defendant_race
df['juror_id__gender_m'] = df.gender == 'Male'
df['juror_id__gender_unknown'] = df.gender == 'Unknown'
df['trial__defendant_race_asian'] = df.defendant_race == 'Asian'
df['trial__defendant_race_black'] = df.defendant_race == 'Black'
df['trial__defendant_race_unknown'] = df.defendant_race == 'Unknown'
df['trial__judge_Loper'] = df.judge == 'Joseph Loper, Jr'
df['trial__judge_OTHER'] = df.judge == 'Other'
df.head(2)
id_x juror_id juror_id__trial__id no_responses married children religious education leans_state leans_defense leans_ambi moral_hardship job_hardship caretaker communication medical employed social prior_jury crime_victim fam_crime_victim accused fam_accused eyewitness fam_eyewitness ... offense_title_2 offense_code_3 offense_title_3 offense_code_4 offense_title_4 offense_code_5 offense_title_5 offense_code_6 offense_title_6 more_than_six verdict case_appealed batson_claim_by_defense batson_claim_by_state voir_dire_present struck_by_state is_black same_race juror_id__gender_m juror_id__gender_unknown trial__defendant_race_asian trial__defendant_race_black trial__defendant_race_unknown trial__judge_Loper trial__judge_OTHER
0 1521 107.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 0 False False True False False True False False False
1 1524 108.00000 3.00000 0 unknown unknown unknown unknown 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 ... sale of cocaine NaN NaN NaN NaN NaN NaN NaN NaN 0 Guilty on at least one offense 1 0 0 1 1 True True False False False True False False False

2 rows × 127 columns

Since these new columns are true and false, not ones and zeroes, we'll also need to convert those.

df = df.replace({
    False: 0,
    True: 1
})
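
If you'd rather not run a blanket replace across the entire dataframe, an alternative (just a sketch - new_cols is our own name for the columns we added above) is to convert only those columns:

# Convert just the new boolean columns to integers, leaving the rest of the dataframe alone
new_cols = ['is_black', 'same_race', 'juror_id__gender_m', 'juror_id__gender_unknown',
            'trial__defendant_race_asian', 'trial__defendant_race_black',
            'trial__defendant_race_unknown', 'trial__judge_Loper', 'trial__judge_OTHER']
df[new_cols] = df[new_cols].astype(int)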

Run a regression#

We'll start by running a simple regression: What's the relationship between a potential juror being black, and their chance of being struck by the state?

model = smf.logit(formula="struck_by_state ~ is_black", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.463138
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2293
Method: MLE Df Model: 1
Date: Tue, 14 Jan 2020 Pseudo R-squ.: 0.1759
Time: 11:40:33 Log-Likelihood: -1062.9
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 1.148e-100
coef std err z P>|z| [0.025 0.975]
Intercept -2.0515 0.080 -25.692 0.000 -2.208 -1.895
is_black 2.1894 0.109 20.155 0.000 1.976 2.402

Although this result might tell us how much more likely a black juror is to be rejected than a non-black juror, it's woefully incomplete. We know more about the situation than just their race, and it would be irresponsible to just ignore it.

We might start by adding one more feature - let's look at whether the juror is black, and whether the juror is male. Looking at two features instead of one allows us to dig a little bit deeper into the dataset.

model = smf.logit(formula="struck_by_state ~ is_black + juror_id__gender_m", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.463129
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2292
Method: MLE Df Model: 2
Date: Tue, 14 Jan 2020 Pseudo R-squ.: 0.1759
Time: 11:40:34 Log-Likelihood: -1062.9
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.012e-99
coef std err z P>|z| [0.025 0.975]
Intercept -2.0611 0.093 -22.074 0.000 -2.244 -1.878
is_black 2.1911 0.109 20.101 0.000 1.977 2.405
juror_id__gender_m 0.0221 0.111 0.199 0.842 -0.196 0.240

That's a slightly more complete model, but we have far, far, far more variables than just those two! How do we pick which features to use? Do we just throw all of them in?

The process of picking which features you're going to include in your model is called feature selection. One simple technique for picking features is to throw everything you know at the problem, then remove anything that (statistically) doesn't seem to make sense.

It might sound like an exceptionally boring, odd, or imprecise method, but let's take a look!

Add many, many more features#

We're going to start off by tacking a ton of features onto our regression! We're following the original story's methodology, starting about halfway through - the data is cleaned, we've kicked out some of the features that are always accepted/rejected, and now we're beginning our regressions.

From the report:

APM Reports first ran every variable through a logistic regression model.

Let's do that now - just like we did above, but with a million more variables. The formula is a little bit longer than the first couple, but it does the exact same thing.

model = smf.logit(formula="""
    struck_by_state ~ 
        is_black + same_race + juror_id__gender_m + juror_id__gender_unknown
        + trial__defendant_race_asian + trial__defendant_race_black
        + trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER
        + no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim
        + accused + fam_accused + law_enforcement + fam_law_enforcement
        + know_def + know_vic + know_wit + know_attny + prior_info + death_hesitation
""", data=df)

results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.405530
         Iterations 7
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2270
Method: MLE Df Model: 24
Date: Tue, 14 Jan 2020 Pseudo R-squ.: 0.2784
Time: 11:40:34 Log-Likelihood: -930.69
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.878e-136
coef std err z P>|z| [0.025 0.975]
Intercept -2.3416 0.223 -10.489 0.000 -2.779 -1.904
is_black 1.9325 0.143 13.506 0.000 1.652 2.213
same_race 0.4585 0.142 3.228 0.001 0.180 0.737
juror_id__gender_m 0.0488 0.123 0.397 0.691 -0.192 0.290
juror_id__gender_unknown -0.0303 0.376 -0.081 0.936 -0.768 0.707
trial__defendant_race_asian 0.7465 0.546 1.368 0.171 -0.323 1.816
trial__defendant_race_black -0.1635 0.151 -1.079 0.280 -0.460 0.133
trial__defendant_race_unknown 0.5651 0.410 1.378 0.168 -0.239 1.369
trial__judge_Loper 0.1796 0.134 1.337 0.181 -0.084 0.443
trial__judge_OTHER 0.0056 0.466 0.012 0.990 -0.907 0.918
no_responses -0.2995 0.164 -1.822 0.068 -0.622 0.023
leans_ambi 0.3274 0.666 0.492 0.623 -0.977 1.632
prior_jury -0.2290 0.210 -1.089 0.276 -0.641 0.183
crime_victim -0.0287 0.315 -0.091 0.928 -0.647 0.589
fam_crime_victim 0.5037 0.281 1.792 0.073 -0.047 1.055
accused 2.4623 0.548 4.492 0.000 1.388 3.537
fam_accused 1.7964 0.175 10.275 0.000 1.454 2.139
law_enforcement -0.9703 0.503 -1.929 0.054 -1.957 0.016
fam_law_enforcement -0.6832 0.173 -3.957 0.000 -1.022 -0.345
know_def 1.3204 0.239 5.536 0.000 0.853 1.788
know_vic 0.2446 0.239 1.022 0.307 -0.224 0.714
know_wit -0.3940 0.236 -1.666 0.096 -0.857 0.069
know_attny 0.3438 0.237 1.451 0.147 -0.120 0.808
prior_info -0.2074 0.200 -1.039 0.299 -0.599 0.184
death_hesitation 1.8562 0.598 3.103 0.002 0.684 3.029

No, wait! We're going to do this a different way#

I know that writing things out in formulas is nice and pretty and very, very readable! But we're going to stop doing that, right now.

Unfortunately, the writing-out-formulas method involves, well, writing out formulas. Manually! And if we're going to be automating this process, we don't want to write out a bunch of ... + ... + ... + ... statements manually as we filter out different columns.

Yes, you could technically automate the ... + ... + ... + ... process but the code is kinda ugly.
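
For the curious, the automated version looks something like this (a sketch, using a hypothetical short list of columns):

# Build 'struck_by_state ~ is_black + same_race + juror_id__gender_m' programmatically
cols = ['is_black', 'same_race', 'juror_id__gender_m']
formula = 'struck_by_state ~ ' + ' + '.join(cols)
model = smf.logit(formula=formula, data=df)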

Instead of writing out formulas, we're going to make dataframes with exactly the columns we're using, and feed that to our regression. To start, let's make a list of the columns we'll use in our regression.

feature_cols = [
    'is_black',
    'same_race',
    'juror_id__gender_m',
    'juror_id__gender_unknown',
    'trial__defendant_race_asian',
    'trial__defendant_race_black',
    'trial__defendant_race_unknown',
    'trial__judge_Loper',
    'trial__judge_OTHER',
    'no_responses',
    'leans_ambi',
    'prior_jury',
    'crime_victim',
    'fam_crime_victim',
    'accused',
    'fam_accused',
    'law_enforcement',
    'fam_law_enforcement',
    'know_def',
    'know_vic',
    'know_wit',
    'know_attny',
    'prior_info',
    'death_hesitation']

This list of columns will allow us to filter our original dataframe to keep only the columns we're interested in. Then we'll feed this to sm.Logit, which does logistic regressions with dataframes instead of written formulas. It's not what we did before, but it's the same basic idea and the exact same results.

If you're curious: We need to do sm.add_constant(X) below because by default sm.Logit doesn't add an intercept. If you use a written formula with smf.logit, though, it does add an intercept. If you aren't quite sure what that means, it's okay, just know we need the constant to make the two methods work the same. You might find this example useful if you'd like more information on the topic.
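
To see what sm.add_constant actually does, here's a quick illustration on a tiny throwaway dataframe (not part of our analysis - it just adds a column of ones named const):

# A toy dataframe to demonstrate add_constant
demo = pd.DataFrame({'x': [10, 20, 30]})
sm.add_constant(demo)
#    const   x
# 0 1.00000  10
# 1 1.00000  20
# 2 1.00000  30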

X = df[feature_cols]
y = df.struck_by_state

model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.405530
         Iterations 7
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2270
Method: MLE Df Model: 24
Date: Tue, 14 Jan 2020 Pseudo R-squ.: 0.2784
Time: 11:40:34 Log-Likelihood: -930.69
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.878e-136
coef std err z P>|z| [0.025 0.975]
const -2.3416 0.223 -10.489 0.000 -2.779 -1.904
is_black 1.9325 0.143 13.506 0.000 1.652 2.213
same_race 0.4585 0.142 3.228 0.001 0.180 0.737
juror_id__gender_m 0.0488 0.123 0.397 0.691 -0.192 0.290
juror_id__gender_unknown -0.0303 0.376 -0.081 0.936 -0.768 0.707
trial__defendant_race_asian 0.7465 0.546 1.368 0.171 -0.323 1.816
trial__defendant_race_black -0.1635 0.151 -1.079 0.280 -0.460 0.133
trial__defendant_race_unknown 0.5651 0.410 1.378 0.168 -0.239 1.369
trial__judge_Loper 0.1796 0.134 1.337 0.181 -0.084 0.443
trial__judge_OTHER 0.0056 0.466 0.012 0.990 -0.907 0.918
no_responses -0.2995 0.164 -1.822 0.068 -0.622 0.023
leans_ambi 0.3274 0.666 0.492 0.623 -0.977 1.632
prior_jury -0.2290 0.210 -1.089 0.276 -0.641 0.183
crime_victim -0.0287 0.315 -0.091 0.928 -0.647 0.589
fam_crime_victim 0.5037 0.281 1.792 0.073 -0.047 1.055
accused 2.4623 0.548 4.492 0.000 1.388 3.537
fam_accused 1.7964 0.175 10.275 0.000 1.454 2.139
law_enforcement -0.9703 0.503 -1.929 0.054 -1.957 0.016
fam_law_enforcement -0.6832 0.173 -3.957 0.000 -1.022 -0.345
know_def 1.3204 0.239 5.536 0.000 0.853 1.788
know_vic 0.2446 0.239 1.022 0.307 -0.224 0.714
know_wit -0.3940 0.236 -1.666 0.096 -0.857 0.069
know_attny 0.3438 0.237 1.451 0.147 -0.120 0.808
prior_info -0.2074 0.200 -1.039 0.299 -0.599 0.184
death_hesitation 1.8562 0.598 3.103 0.002 0.684 3.029

That's a lot lot lot of columns, and not all of them are useful!

P-values - the P>|z| column - are commonly referred to as "what's the chance that this result was just an accident?" For example, juror_id__gender_m has a p-value of 0.691, which (kind of, somewhat) means there's a 69.1% chance that this was just random fluctuations in the data, and nothing meaningful.

That isn't actually what p-values mean, but just work with me here! It's definitely close enough conceptually for what we're doing. You could also read something like this if you want more details about what p-values really mean.

Generally speaking, most people feel that keeping that chance at below 5% makes a result meaningful. From a technical perspective, that would mean only the variables with a p-value under 0.05 are meaningful.

If you look at these p-values, they're all over the place - some are basically zero, while one is as high as 99%!
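
As a quick aside: if you'd like to check a single feature programmatically, you can pull its p-value straight out of the results object.

# The p-value for a single feature, straight from the fitted results
results.pvalues['juror_id__gender_m']
# 0.69120 - far above any reasonable threshold, so this one looks like noise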

Feature selection based on p-values#

When you're trying to decide which features should go into your model and which shouldn't, p-values are one thing to pay attention to. If a feature is likely related to our outcome just by chance, it doesn't seem very useful, does it? We should probably remove those columns to simplify things and cut down on noise.

From APM Reports' whitepaper:

APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1.

We can do the same thing! It can be done pretty easily with some fun coding tricks, so let's give it a shot.

Filtering regression variables by p-values (the explain-y way)#

In other notebooks, we've taken results of a regression and put them into another dataframe for easy viewing. For example, we can do that now and sort it by p-values.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
}).sort_values(by='pvalue', ascending=False)
coefs
coef odds ratio pvalue name
trial__judge_OTHER 0.00565 1.00566 0.99032 trial__judge_OTHER
juror_id__gender_unknown -0.03033 0.97012 0.93575 juror_id__gender_unknown
crime_victim -0.02865 0.97175 0.92758 crime_victim
juror_id__gender_m 0.04885 1.05006 0.69120 juror_id__gender_m
leans_ambi 0.32740 1.38736 0.62276 leans_ambi
know_vic 0.24461 1.27712 0.30675 know_vic
prior_info -0.20741 0.81269 0.29879 prior_info
trial__defendant_race_black -0.16346 0.84920 0.28040 trial__defendant_race_black
prior_jury -0.22904 0.79530 0.27603 prior_jury
trial__judge_Loper 0.17956 1.19669 0.18116 trial__judge_Loper
trial__defendant_race_asian 0.74649 2.10958 0.17131 trial__defendant_race_asian
trial__defendant_race_unknown 0.56507 1.75957 0.16818 trial__defendant_race_unknown
know_attny 0.34379 1.41029 0.14665 know_attny
know_wit -0.39398 0.67437 0.09566 know_wit
fam_crime_victim 0.50367 1.65478 0.07320 fam_crime_victim
no_responses -0.29947 0.74121 0.06846 no_responses
law_enforcement -0.97034 0.37895 0.05379 law_enforcement
death_hesitation 1.85622 6.39948 0.00191 death_hesitation
same_race 0.45847 1.58165 0.00125 same_race
fam_law_enforcement -0.68325 0.50497 0.00008 fam_law_enforcement
accused 2.46227 11.73140 0.00001 accused
know_def 1.32043 3.74504 0.00000 know_def
fam_accused 1.79644 6.02814 0.00000 fam_accused
const -2.34159 0.09618 0.00000 const
is_black 1.93254 6.90702 0.00000 is_black

Even though we do have odds ratios for each of these columns, the ones with high p-values are just noise. It's no use talking about a feature if its relationship to the output is likely just accidental.

For example: trial__judge_OTHER has a 99% chance of being nothing but chance! That's definitely gotta go. This is what we'll be doing when we filter by p-values: getting rid of clutter that doesn't really mean anything. Noisy features also confuse our model and get in the way of the genuinely useful ones ("Pay attention to the things that matter!").

If we wanted to be very explicit about what's happening when we're picking our useful features and kicking out the chance-y ones, we could use this dataframe to find everything with a p-value of less than 0.1.

First we'll filter for features that meet the p-value threshold. Then we'll drop const because it isn't actually one of our columns, it's just some magic thrown in by the regression.

coefs[coefs.pvalue < 0.1].drop('const')
coef odds ratio pvalue name
know_wit -0.39398 0.67437 0.09566 know_wit
fam_crime_victim 0.50367 1.65478 0.07320 fam_crime_victim
no_responses -0.29947 0.74121 0.06846 no_responses
law_enforcement -0.97034 0.37895 0.05379 law_enforcement
death_hesitation 1.85622 6.39948 0.00191 death_hesitation
same_race 0.45847 1.58165 0.00125 same_race
fam_law_enforcement -0.68325 0.50497 0.00008 fam_law_enforcement
accused 2.46227 11.73140 0.00001 accused
know_def 1.32043 3.74504 0.00000 know_def
fam_accused 1.79644 6.02814 0.00000 fam_accused
is_black 1.93254 6.90702 0.00000 is_black

Looks pretty good! Now we can grab the names of those columns and we'll be all set.

feature_cols = coefs[coefs.pvalue < 0.1].drop('const').index
feature_cols
Index(['know_wit', 'fam_crime_victim', 'no_responses', 'law_enforcement', 'death_hesitation',
       'same_race', 'fam_law_enforcement', 'accused', 'know_def', 'fam_accused', 'is_black'],
      dtype='object')

We're using .index because it displays nicer than looking at .name.

Filtering regression variables by p-values (the short way)#

While this method works, and might be easier to read, I prefer a more compact technique. We can't always use it - sometimes you have calculated columns or weird categories or whatever - but it's great when it works.

Instead of dragging ourselves through a dataframe, we can just look at the p-values on the results. We'll also drop const again, since it isn't a real column.

results.pvalues.drop('const')
is_black                        0.00000
same_race                       0.00125
juror_id__gender_m              0.69120
juror_id__gender_unknown        0.93575
trial__defendant_race_asian     0.17131
trial__defendant_race_black     0.28040
trial__defendant_race_unknown   0.16818
trial__judge_Loper              0.18116
trial__judge_OTHER              0.99032
no_responses                    0.06846
leans_ambi                      0.62276
prior_jury                      0.27603
crime_victim                    0.92758
fam_crime_victim                0.07320
accused                         0.00001
fam_accused                     0.00000
law_enforcement                 0.05379
fam_law_enforcement             0.00008
know_def                        0.00000
know_vic                        0.30675
know_wit                        0.09566
know_attny                      0.14665
prior_info                      0.29879
death_hesitation                0.00191
dtype: float64

Then we can compare each one of those p-values to 0.1, just like we did before.

results.pvalues.drop('const') < 0.1
is_black                          True
same_race                         True
juror_id__gender_m               False
juror_id__gender_unknown         False
trial__defendant_race_asian      False
trial__defendant_race_black      False
trial__defendant_race_unknown    False
trial__judge_Loper               False
trial__judge_OTHER               False
no_responses                      True
leans_ambi                       False
prior_jury                       False
crime_victim                     False
fam_crime_victim                  True
accused                           True
fam_accused                       True
law_enforcement                   True
fam_law_enforcement               True
know_def                          True
know_vic                         False
know_wit                          True
know_attny                       False
prior_info                       False
death_hesitation                  True
dtype: bool

This gives us a set of trues and falses. Remember the original variable we fed into our logistic regression, X? We can ask X for a list of all the columns we used in our regression.

X.columns
Index(['is_black', 'same_race', 'juror_id__gender_m', 'juror_id__gender_unknown',
       'trial__defendant_race_asian', 'trial__defendant_race_black',
       'trial__defendant_race_unknown', 'trial__judge_Loper', 'trial__judge_OTHER', 'no_responses',
       'leans_ambi', 'prior_jury', 'crime_victim', 'fam_crime_victim', 'accused', 'fam_accused',
       'law_enforcement', 'fam_law_enforcement', 'know_def', 'know_vic', 'know_wit', 'know_attny',
       'prior_info', 'death_hesitation'],
      dtype='object')

And then filter that list for all the columns that had an acceptable p-value. We can then take those column names and use them as our new list of feature columns!

X.columns[results.pvalues.drop('const') < 0.1]
Index(['is_black', 'same_race', 'no_responses', 'fam_crime_victim', 'accused', 'fam_accused',
       'law_enforcement', 'fam_law_enforcement', 'know_def', 'know_wit', 'death_hesitation'],
      dtype='object')

And there we go!

Filtering regression variables by p-values (the short way, in action)#

We broke the p-values filter into a lot of steps, but we can actually bake it into a single line. If we want to re-run our regression with only the columns that have a p-value less than 0.1, it's a small tweak to our original regression.

# Filter our original features to only keep ones with a p-value less than 0.1
feature_cols = X.columns[results.pvalues.drop('const') < 0.1]

# Run our regression just like we normally do
X = df[feature_cols]
y = df.struck_by_state

model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.408840
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2283
Method: MLE Df Model: 11
Date: Tue, 14 Jan 2020 Pseudo R-squ.: 0.2725
Time: 11:40:37 Log-Likelihood: -938.29
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 1.293e-143
coef std err z P>|z| [0.025 0.975]
const -2.3054 0.126 -18.238 0.000 -2.553 -2.058
is_black 1.9239 0.143 13.440 0.000 1.643 2.204
same_race 0.3776 0.140 2.691 0.007 0.103 0.653
no_responses -0.2466 0.144 -1.713 0.087 -0.529 0.036
fam_crime_victim 0.4834 0.277 1.743 0.081 -0.060 1.027
accused 2.4520 0.545 4.503 0.000 1.385 3.519
fam_accused 1.7888 0.171 10.485 0.000 1.454 2.123
law_enforcement -0.8932 0.499 -1.791 0.073 -1.871 0.084
fam_law_enforcement -0.6728 0.171 -3.935 0.000 -1.008 -0.338
know_def 1.2936 0.236 5.485 0.000 0.831 1.756
know_wit -0.3339 0.232 -1.437 0.151 -0.789 0.121
death_hesitation 1.7635 0.595 2.961 0.003 0.596 2.931

Now that we've removed some variables and re-run our regression, our coefficients have changed! The noise and randomness from the dropped columns are gone, and things are looking a lot better.

But if we check out our p-value column, we see it's changed, as well. We have a good number of p-values around 0, a few around 0.07-0.09, and then one that jumped up to 0.151. APM Reports felt that another pass of filtering would help, this time removing anything with a p-value above 0.05. From their methodology:

Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.

We can actually just cut and paste the code from our last cell, and adjust the boundary from 0.1 to 0.05. Note that it will filter the results of our most recent regression, not our original regression.

# Filter our original features to only keep ones with a p-value less than 0.05
feature_cols = X.columns[results.pvalues.drop('const') < 0.05]

# Run our regression just like we normally do
X = df[feature_cols]
y = df.struck_by_state

model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.411232
         Iterations 6
Logit Regression Results
Dep. Variable: struck_by_state No. Observations: 2295
Model: Logit Df Residuals: 2287
Method: MLE Df Model: 7
Date: Tue, 14 Jan 2020 Pseudo R-squ.: 0.2682
Time: 11:40:38 Log-Likelihood: -943.78
converged: True LL-Null: -1289.7
Covariance Type: nonrobust LLR p-value: 3.815e-145
coef std err z P>|z| [0.025 0.975]
const -2.4307 0.101 -24.017 0.000 -2.629 -2.232
is_black 1.8972 0.141 13.443 0.000 1.621 2.174
same_race 0.3603 0.140 2.575 0.010 0.086 0.635
accused 2.5128 0.545 4.606 0.000 1.444 3.582
fam_accused 1.8476 0.162 11.402 0.000 1.530 2.165
fam_law_enforcement -0.5627 0.162 -3.468 0.001 -0.881 -0.245
know_def 1.3257 0.223 5.937 0.000 0.888 1.763
death_hesitation 1.8243 0.592 3.084 0.002 0.665 2.984

And there we go! Now that we have a solid selection of features with very low p-values, we can go ahead and examine our odds ratios!

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
coef odds ratio pvalue name
accused 2.51278 12.33918 0.00000 accused
is_black 1.89716 6.66696 0.00000 is_black
fam_accused 1.84760 6.34456 0.00000 fam_accused
death_hesitation 1.82434 6.19873 0.00204 death_hesitation
know_def 1.32570 3.76481 0.00000 know_def
same_race 0.36026 1.43370 0.01004 same_race
fam_law_enforcement -0.56268 0.56968 0.00052 fam_law_enforcement
const -2.43071 0.08797 0.00000 const
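
As a reminder, each odds ratio is just e raised to the coefficient, so we can recompute one ourselves. For example, for accused:

# e^2.51278 reproduces the odds ratio from the table above
np.exp(2.51278)
# 12.33918 - jurors who had themselves been accused of a crime had roughly
# twelve times the odds of being struck by the state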

Review#

We learned how p-values relate to features in a logistic regression model. The lower a feature's p-value, the less likely it is that the feature's relationship with the output is just chance. Typically p-values below 0.05 (aka 5%) are considered meaningful.

To make a simple and non-noisy model, we ran an initial regression and then filtered our features to only keep those with low p-values. We did this across a few rounds: first we used all of our variables, then the ones that met a 0.1 threshold, and then those that met a 0.05 threshold. This allowed us to drop from 24 features down to 7 features.
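
If you wanted to automate those rounds instead of cutting and pasting cells, a short loop would do it. This is just a sketch - it assumes df and y from above, uses a single cutoff instead of APM Reports' 0.1-then-0.05 sequence, and the variable names are our own.

# Keep re-fitting and filtering until every remaining feature clears the threshold
threshold = 0.05
cols = list(X.columns)  # start from whatever candidate features you like
while True:
    results = sm.Logit(y, sm.add_constant(df[cols])).fit(disp=0)
    pvals = results.pvalues.drop('const')
    if (pvals < threshold).all():
        break
    cols = list(pvals[pvals < threshold].index)
results.summary()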

Discussion topics#

  • Why are values below 0.05 thought of as a good p-value?
  • If you had a feature at 0.049 and one at 0.051, would you keep the first and discard the second?
  • By removing features from a model, we're ignoring data that we have and could be using. What are the arguments in favor of or against doing so, and how comfortable do you feel with it?
  • Do you prefer using the formula method or the dataframe method for a logistic regression? Is there a difference in readability or understandability between the two?