Feature selection with p-values#
When you're performing a regression, not all fields have a connection with the output. We'll use p-values to cull the less important columns to refine our model.
Imports#
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 100)
pd.set_option('display.float_format', '{:.5f}'.format)
Read in the data#
We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state
column and converted true and false values into ones and zeroes.
df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
That is a lot of columns! But don't worry, we can add some more, too.
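If you'd rather count than eyeball, a quick (and purely optional) peek at the dataframe's dimensions will tell us exactly how many rows and columns we're working with.
# How many rows and columns are in the cleaned dataset?
df.shape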
Adding additional features#
Since we're going to be selecting from a lot of columns, let's get crazy and add a few more.
df['is_black'] = df.race == 'Black'
df['same_race'] = df.race == df.defendant_race
df['juror_id__gender_m'] = df.gender == 'Male'
df['juror_id__gender_unknown'] = df.gender == 'Unknown'
df['trial__defendant_race_asian'] = df.defendant_race == 'Asian'
df['trial__defendant_race_black'] = df.defendant_race == 'Black'
df['trial__defendant_race_unknown'] = df.defendant_race == 'Unknown'
df['trial__judge_Loper'] = df.judge == 'Joseph Loper, Jr'
df['trial__judge_OTHER'] = df.judge == 'Other'
df.head(2)
Since these new columns are true and false, not ones and zeroes, we'll also need to convert those.
df = df.replace({
False: 0,
True: 1
})
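As an aside - and this is just a sketch of an alternative, not something the rest of the notebook depends on - you could also convert a single column by casting its booleans to integers with .astype(int). It produces the same ones and zeroes as the .replace above.
# Equivalent conversion for one column: cast True/False to 1/0
# (an alternative to the .replace above - same result, shown here for illustration)
df['is_black'] = df.is_black.astype(int)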
Run a regression#
We'll start by running a simple regression: What's the relationship between a potential juror being black, and their chance of being struck by the state?
model = smf.logit(formula="struck_by_state ~ is_black", data=df)
results = model.fit()
results.summary()
Although this result might tell us how much more likely a black juror is to be rejected than a non-black juror, it's woefully incomplete. We know more about the situation than just their race, and it would be irresponsible to just ignore it.
We might start by adding one more feature - let's look at whether the juror is black, and whether the juror is male. Looking at two features instead of one allows us to dig a little bit deeper into the dataset.
model = smf.logit(formula="struck_by_state ~ is_black + juror_id__gender_m", data=df)
results = model.fit()
results.summary()
That's a slightly more complete model, but we have far, far, far more variables than just those two! How do we pick which features to use? Do we just throw all of them in?
The process of picking which features you're going to include in your model is called feature selection. One simple technique for picking features is to throw everything you know at the problem, then remove anything that (statistically) doesn't seem to make sense.
It might sound like an exceptionally boring, odd, or imprecise method, but let's take a look!
Add many, many more features#
We're going to start off by tacking a ton of features onto our regression! We're following the original story's methodology, starting about halfway through - the data is cleaned, we've kicked out some of the features that are always accepted/rejected, and now we're beginning our regressions.
From the report:
APM Reports first ran every variable through a logistic regression model.
Let's do that now - just like we did above, but with a million more variables. The formula is a little bit longer than the first couple, but it does the exact same thing.
model = smf.logit(formula="""
struck_by_state ~
is_black + same_race + juror_id__gender_m + juror_id__gender_unknown
+ trial__defendant_race_asian + trial__defendant_race_black
+ trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER
+ no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim
+ accused + fam_accused + law_enforcement + fam_law_enforcement
+ know_def + know_vic + know_wit + know_attny + prior_info + death_hesitation
""", data=df)
results = model.fit()
results.summary()
No, wait! We're going to do this a different way#
I know that writing things out in formulas is nice and pretty and very, very readable! But we're going to stop doing that, right now.
Unfortunately, the writing-out-formulas method involves, well, writing out formulas. Manually! And if we're going to be automating this process, we don't want to write out a bunch of ... + ... + ... + ...
statements manually as we filter out different columns.
Yes, you could technically automate the
... + ... + ... + ...
process but the code is kinda ugly.
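For the record, the automated version would look something like the sketch below - it just glues a list of column names together with ' + '. The some_cols list here is a made-up, shortened stand-in so the output fits on one line.
# A sketch of automating the formula: glue column names together with ' + '
# some_cols is a hypothetical, shortened stand-in for a real list of feature columns
some_cols = ['is_black', 'same_race', 'juror_id__gender_m']
formula = 'struck_by_state ~ ' + ' + '.join(some_cols)
# formula is now 'struck_by_state ~ is_black + same_race + juror_id__gender_m',
# which you could hand to smf.logit(formula=formula, data=df) just like before
formula
It works, but it's exactly the kind of string-gluing that gets messy fast.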
Instead of writing out formulas, we're going to make dataframes with exactly the columns we're using, and feed that to our regression. To start, let's make a list of the columns we'll use in our regression.
feature_cols = [
'is_black',
'same_race',
'juror_id__gender_m',
'juror_id__gender_unknown',
'trial__defendant_race_asian',
'trial__defendant_race_black',
'trial__defendant_race_unknown',
'trial__judge_Loper',
'trial__judge_OTHER',
'no_responses',
'leans_ambi',
'prior_jury',
'crime_victim',
'fam_crime_victim',
'accused',
'fam_accused',
'law_enforcement',
'fam_law_enforcement',
'know_def',
'know_vic',
'know_wit',
'know_attny',
'prior_info',
'death_hesitation']
This list of columns will allow us to filter our original dataframe to keep only the columns we're interested in. Then we'll feed this to sm.Logit
, which does logistic regressions with dataframes instead of written formulas. It's not what we did before, but it's the same basic idea and the exact same results.
If you're curious: We need to do
sm.add_constant(X)
below because by default sm.Logit doesn't add an intercept. If you use a written formula with smf.logit, though, an intercept gets added automatically. If you aren't quite sure what that means, it's okay - just know we need the constant to make the two methods work the same. You might find this example useful if you'd like more information on the topic.
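If you'd like to see what sm.add_constant actually does before we use it, you can peek at its output - it just adds a const column full of ones, which is what acts as the intercept. (This peek is optional and not part of the original analysis.)
# add_constant tacks a 'const' column of 1s onto the dataframe - that's the intercept term
sm.add_constant(df[feature_cols]).head(2)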
X = df[feature_cols]
y = df.struck_by_state
model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
That's a lot lot lot of columns, and not all of them are useful!
P-values - the P>|z|
column - are commonly referred to as "what's the chance that this result was just an accident?" For example, juror_id__gender_m has a p-value of 0.691, which (kind of, somewhat) means there's a 69.1% chance that this was just random fluctuations in the data, and nothing meaningful.
That isn't actually what p-values mean, but just work with me here! It's definitely close enough conceptually for what we're doing. You could also read something like this if you want more details about what p-values really mean.
Generally speaking, most people feel that keeping that chance at below 5% makes a result meaningful. From a technical perspective, that would mean only the variables with a p-value under 0.05 are meaningful.
If you look at these p-values, they're all over the place - some are basically zero, while one is as high as 98%!
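If you want to pull a single one of those numbers out instead of squinting at the summary table, results.pvalues works like a normal pandas Series indexed by variable name - for example, the juror_id__gender_m value mentioned above.
# The fitted results expose every p-value as a pandas Series, indexed by variable name
results.pvalues['juror_id__gender_m']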
Feature selection based on p values#
When you're trying to decide which features should go into your model and which shouldn't, p-values are one thing to pay attention to. If a feature is likely only related to our outcome by chance, it doesn't seem very useful, does it? We should probably remove those columns to simplify things and cut down on noise.
From APM Reports' whitepaper:
APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1.
We can do the same thing! It can be done pretty easily with some fun coding tricks, so let's give it a shot.
Filtering regression variables by p values (the explain-y way)#
In other notebooks, we've taken the results of a regression and put them into another dataframe for easy viewing. For example, we can do that now and sort it by p-values.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'pvalue': results.pvalues,
'name': results.params.index
}).sort_values(by='pvalue', ascending=False)
coefs
Even though we do have odds ratios for each of these columns, the ones with high p-values are just noise. It's no use talking about a feature if its relationship to the output is likely just accidental.
For example: trial__judge_OTHER
has a 99% chance of being nothing but chance! That's definitely gotta go. That's exactly what filtering by p-values does for us: it gets rid of clutter that doesn't really mean anything. Those noisy columns also confuse our model and get in the way of our other, useful features ("Pay attention to the things that matter!").
If we wanted to be very explicit about what's happening when we're picking our useful features and kicking out the chance-y ones, we could use this dataframe to find everything with a p value of less than 0.1.
First we'll filter for features that meet the p-value threshold. Then we'll drop const
because it isn't actually one of our columns, it's just some magic thrown in by the regression.
coefs[coefs.pvalue < 0.1].drop('const')
Looks pretty good! Now we can grab the name of the column and we'll be all set.
feature_cols = coefs[coefs.pvalue < 0.1].drop('const').index
feature_cols
We're using .index because it displays nicer than looking at .name.
Filtering regression variables by p values (the short way)#
While this method works, and might be easier to read, I prefer a more compact technique. We can't always use it - sometimes you have calculated columns or weird categories or whatever - but it's great when it works.
Instead of dragging ourselves through a dataframe, we can just look at the p values for the results. We'll also drop const
again, since it isn't a real column.
results.pvalues.drop('const')
Then we can compare each one of those p-values to 0.1
, just like we did before.
results.pvalues.drop('const') < 0.1
This gives us a series of Trues and Falses. Remember X, the variable we fed into our logistic regression? We can ask X for a list of all the columns we used in our regression.
X.columns
And then filter that list for all the columns that had an acceptable p-value. We can then take those column names and use them as our new list of feature columns!
X.columns[results.pvalues.drop('const') < 0.1]
And there we go!
Filtering regression variables by p values (the short way, in action)#
We broke the p-values filter into a lot of steps, but we can actually bake it into a single line. If we want to re-run our regression with only the columns that have a p-value less than 0.1
, it's a small tweak to our original regression.
# Filter our original features to only keep ones with a p-value less than 0.1
feature_cols = X.columns[results.pvalues.drop('const') < 0.1]
# Run our regression just like we normally do
X = df[feature_cols]
y = df.struck_by_state
model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
Now that we've removed some variables and re-run our regression, our coefficients have changed! The noise and randomness from the other columns have decreased, and things are looking a lot better.
But if we check out our p-value column, we see it's changed, as well. We have a good number of p-values around 0, a couple around 0.7-0.8, and then one that jumped up to 0.151. APM Reports felt that another pass of filtering would help, this time keeping only the features with a p-value under 0.05. From their methodology:
Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.
We can actually just cut and paste the code from our last cell, and adjust the boundary from 0.1
to 0.05
. Note that it will filter the results of our most recent regression, not our original regression.
# Filter our original features to only keep ones with a p-value less than 0.05
feature_cols = X.columns[results.pvalues.drop('const') < 0.05]
# Run our regression just like we normally do
X = df[feature_cols]
y = df.struck_by_state
model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
And there we go! Now that we have a solid selection of features with very low p-values, we can go ahead and examine our odds ratios!
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'pvalue': results.pvalues,
'name': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
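To turn one of those odds ratios into a sentence, you can look a single feature up in that dataframe. The lookup below assumes is_black survived both rounds of filtering (it should have, since it's the variable the whole story is about) - it's a sketch of the lookup, not a claim about the exact number.
# The coefs dataframe inherits its index (the variable names) from results.pvalues,
# so we can pull out a single odds ratio directly
coefs.loc['is_black', 'odds ratio']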
Review#
We learned how p-values relate to features in a logistic regression model. The lower a feature's p-value, the more confident we can be that its relationship with the output isn't just chance. Typically p-values below 0.05 (aka 5%) are thought of as good p-values.
To make a simple and non-noisy model, we ran an initial regression and then filtered our features to only keep those with low p-values. We did this across a few rounds: first we used all of our variables, then the ones that met a 0.1
threshold, and then those that met a 0.05
threshold. This allowed us to drop from 24 features down to 7 features.
Discussion topics#
- Why are values below 0.05 thought of as a good p-value?
- If you had a feature at 0.049 and one at 0.051, would you keep the first and discard the second?
- By removing features from a model, we're ignoring data that we have and could be using. What are the arguments for and against doing that, and how comfortable do you feel with it?
- Do you prefer using the formula method or the dataframe method for a logistic regression? Is there a difference in readability or understandability between the two?
- Do you prefer using the formula method or the dataframe method for a logistic regression? Is there a difference in readability or understandability between the two?