Feature selection with p-values#
When you're performing a regression, not all fields have a connection with the output. We'll use p-values to cull the less important columns to refine our model.
Imports#
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 100)
pd.set_option('display.float_format', '{:.5f}'.format)
Read in the data#
We'll start by reading in the pre-cleaned dataset. We've already joined the potential jurors, the trial information, and the judge information. We've also added the struck_by_state
column and converted true and false values into ones and zeroes.
df = pd.read_csv("data/jury-cleaned.csv")
df.head(2)
That is a lot of columns! But don't worry, we can add some more, too.
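If you'd rather count than eyeball, a quick (and purely optional) peek at the dataframe's dimensions will tell us exactly how many rows and columns we're working with.
# How many rows and columns are in the cleaned dataset?
df.shape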
Adding additional features#
Since we're going to be selecting from a lot of columns, let's get crazy and add a few more.
df['is_black'] = df.race == 'Black'
df['same_race'] = df.race == df.defendant_race
df['juror_id__gender_m'] = df.gender == 'Male'
df['juror_id__gender_unknown'] = df.gender == 'Unknown'
df['trial__defendant_race_asian'] = df.defendant_race == 'Asian'
df['trial__defendant_race_black'] = df.defendant_race == 'Black'
df['trial__defendant_race_unknown'] = df.defendant_race == 'Unknown'
df['trial__judge_Loper'] = df.judge == 'Joseph Loper, Jr'
df['trial__judge_OTHER'] = df.judge == 'Other'
df.head(2)
Since these new columns are true and false, not ones and zeroes, we'll also need to convert those.
df = df.replace({
False: 0,
True: 1
})
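As an aside - and this is just a sketch of an alternative, not something the rest of the notebook depends on - you could also convert a single column by casting its booleans to integers with .astype(int). It produces the same ones and zeroes as the .replace above.
# Equivalent conversion for one column: cast True/False to 1/0
# (an alternative to the .replace above - same result, shown here for illustration)
df['is_black'] = df.is_black.astype(int)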
Run a regression#
We'll start by running a simple regression: What's the relationship between a potential juror being black, and their chance of being struck by the state?
model = smf.logit(formula="struck_by_state ~ is_black", data=df)
results = model.fit()
results.summary()
Although this result might tell us how much more likely a black juror is to be rejected than a non-black juror, it's woefully incomplete. We know more about the situation than just their race, and it would be irresponsible to just ignore it.
We might start by adding one more feature - let's look at whether the juror is black, and whether the juror is male. Looking at two features instead of one allows us to dig a little bit deeper into the dataset.
model = smf.logit(formula="struck_by_state ~ is_black + juror_id__gender_m", data=df)
results = model.fit()
results.summary()
That's a slightly more complete model, but we have far, far, far more variables than just those two! How do we pick which features to use? Do we just throw all of them in?
The process of picking which features you're going to include in your model is called feature selection. One simple technique for picking features is to throw everything you know at the problem, then remove anything that (statistically) doesn't seem to make sense.
It might sound like an exceptionally boring, odd, or imprecise method, but let's take a look!
Add many, many more features#
We're going to start off by tacking a ton of features onto our regression! We're following the original story's methodology, starting about halfway through - the data is cleaned, we've kicked out some of the features that are always accepted/rejected, and now we're beginning our regressions.
From the report:
APM Reports first ran every variable through a logistic regression model.
Let's do that now - just like we did above, but with a million more variables. The formula is a little bit longer than the first couple, but it does the exact same thing.
model = smf.logit(formula="""
struck_by_state ~
is_black + same_race + juror_id__gender_m + juror_id__gender_unknown
+ trial__defendant_race_asian + trial__defendant_race_black
+ trial__defendant_race_unknown + trial__judge_Loper + trial__judge_OTHER
+ no_responses + leans_ambi + prior_jury + crime_victim + fam_crime_victim
+ accused + fam_accused + law_enforcement + fam_law_enforcement
+ know_def + know_vic + know_wit + know_attny + prior_info + death_hesitation
""", data=df)
results = model.fit()
results.summary()
No, wait! We're going to do this a different way#
I know that writing things out in formulas is nice and pretty and very, very readable! But we're going to stop doing that, right now.
Unfortunately, the writing-out-formulas method involves, well, writing out formulas. Manually! And if we're going to be automating this process, we don't want to write out a bunch of ... + ... + ... + ...
statements manually as we filter out different columns.
Yes, you could technically automate the
... + ... + ... + ...
process but the code is kinda ugly.
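For the record, the automated version would look something like the sketch below - it just glues a list of column names together with ' + '. The some_cols list here is a made-up, shortened stand-in so the output fits on one line.
# A sketch of automating the formula: glue column names together with ' + '
# some_cols is a hypothetical, shortened stand-in for a real list of feature columns
some_cols = ['is_black', 'same_race', 'juror_id__gender_m']
formula = 'struck_by_state ~ ' + ' + '.join(some_cols)
# formula is now 'struck_by_state ~ is_black + same_race + juror_id__gender_m',
# which you could hand to smf.logit(formula=formula, data=df) just like before
formula
It works, but it's exactly the kind of string-gluing that gets messy fast.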
Instead of writing out formulas, we're going to make dataframes with exactly the columns we're using, and feed that to our regression. To start, let's make a list of the columns we'll use in our regression.
feature_cols = [
'is_black',
'same_race',
'juror_id__gender_m',
'juror_id__gender_unknown',
'trial__defendant_race_asian',
'trial__defendant_race_black',
'trial__defendant_race_unknown',
'trial__judge_Loper',
'trial__judge_OTHER',
'no_responses',
'leans_ambi',
'prior_jury',
'crime_victim',
'fam_crime_victim',
'accused',
'fam_accused',
'law_enforcement',
'fam_law_enforcement',
'know_def',
'know_vic',
'know_wit',
'know_attny',
'prior_info',
'death_hesitation']
This list of columns will allow us to filter our original dataframe to keep only the columns we're interested in. Then we'll feed this to sm.Logit
, which does logistic regressions with dataframes instead of written formulas. It's not what we did before, but it's the same basic idea and the exact same results.
If you're curious: We need to do
sm.add_constant(X)
below because by default sm.Logit doesn't add an intercept. If you use a written formula with smf.logit, though, an intercept gets added automatically. If you aren't quite sure what that means, it's okay - just know we need the constant to make the two methods work the same. You might find this example useful if you'd like more information on the topic.
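If you'd like to see what sm.add_constant actually does before we use it, you can peek at its output - it just adds a const column full of ones, which is what acts as the intercept. (This peek is optional and not part of the original analysis.)
# add_constant tacks a 'const' column of 1s onto the dataframe - that's the intercept term
sm.add_constant(df[feature_cols]).head(2)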
X = df[feature_cols]
y = df.struck_by_state
model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
That's a lot lot lot of columns, and not all of them are useful!
P-values - the P>|z|
column - are commonly referred to as "what's the chance that this result was just an accident?" For example, juror_id__gender_m has a p-value of 0.691, which (kind of, somewhat) means there's a 69.1% chance that this was just random fluctuations in the data, and nothing meaningful.
That isn't actually what p-values mean, but just work with me here! It's definitely close enough conceptually for what we're doing. You could also read something like this if you want more details about what p-values really mean.
Generally speaking, most people feel that keeping that chance at below 5% makes a result meaningful. From a technical perspective, that would mean only the variables with a p-value under 0.05 are meaningful.
If you look at these p-values, they're all over the place - some are basically zero, while one is as high as 98%!
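If you want to pull a single one of those numbers out instead of squinting at the summary table, results.pvalues works like a normal pandas Series indexed by variable name - for example, the juror_id__gender_m value mentioned above.
# The fitted results expose every p-value as a pandas Series, indexed by variable name
results.pvalues['juror_id__gender_m']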
Feature selection based on p values#
When you're trying to decide which features should go into your model and which shouldn't, p-values are one thing to pay attention to. If a feature is likely only related to our outcome by chance, it doesn't seem very useful, does it? We should probably remove those columns to simplify things and cut down on noise.
From APM Reports' whitepaper:
APM Reports first ran every variable through a logistic regression model. We then removed all variables with a p-value > 0.1.
We can do the same thing! It can be done pretty easily with some fun coding tricks, so let's give it a shot.
Filtering regression variables by p values (the explain-y way)#
In other notebooks, we've taken the results of a regression and put them into another dataframe for easy viewing. For example, we can do that now and sort it by p-values.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'pvalue': results.pvalues,
'name': results.params.index
}).sort_values(by='pvalue', ascending=False)
coefs
Even though we do have odds ratios for each of these columns, the ones with high p-values are just noise. It's no use talking about a feature if its relationship to the output is likely just accidental.
For example: trial__judge_OTHER
has a 99% chance of being nothing but chance! That's definitely gotta go. That's exactly what filtering by p-values does for us: it gets rid of clutter that doesn't really mean anything. Those noisy columns also confuse our model and get in the way of our other, useful features ("Pay attention to the things that matter!").
If we wanted to be very explicit about what's happening when we're picking our useful features and kicking out the chance-y ones, we could use this dataframe to find everything with a p value of less than 0.1.
First we'll filter for features that meet the p-value threshold. Then we'll drop const
because it isn't actually one of our columns, it's just some magic thrown in by the regression.
coefs[coefs.pvalue < 0.1].drop('const')
Looks pretty good! Now we can grab the name of the column and we'll be all set.
feature_cols = coefs[coefs.pvalue < 0.1].drop('const').index
feature_cols
We're using .index because it displays nicer than looking at .name.
Filtering regression variables by p values (the short way)#
While this method works, and might be easier to read, I prefer a more compact technique. We can't always use it - sometimes you have calculated columns or weird categories or whatever - but it's great when it works.
Instead of dragging ourselves through a dataframe, we can just look at the p values for the results. We'll also drop const
again, since it isn't a real column.
results.pvalues.drop('const')
Then we can compare each one of those p-values to 0.1
, just like we did before.
results.pvalues.drop('const') < 0.1
This gives us a series of Trues and Falses. Remember X, the variable we fed into our logistic regression? We can ask X for a list of all the columns we used in our regression.
X.columns
And then filter that list for all the columns that had an acceptable p-value. We can then take those column names and use them as our new list of feature columns!
X.columns[results.pvalues.drop('const') < 0.1]
And there we go!
Filtering regression variables by p values (the short way, in action)#
We broke the p-values filter into a lot of steps, but we can actually bake it into a single line. If we want to re-run our regression with only the columns that have a p-value less than 0.1
, it's a small tweak to our original regression.
# Filter our original features to only keep ones with a p-value less than 0.1
feature_cols = X.columns[results.pvalues.drop('const') < 0.1]
# Run our regression just like we normally do
X = df[feature_cols]
y = df.struck_by_state
model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
Now that we've removed some variables and re-run our regression, our coefficients have changed! The noise and randomness from the other columns have decreased, and things are looking a lot better.
But if we check out our p-value column, we see it's changed, as well. We have a good number of p-values around 0, a couple around 0.7-0.8, and then one that jumped up to 0.151. APM Reports felt that another pass of filtering would help, this time keeping only the features with a p-value under 0.05. From their methodology:
Finally, we selected all factors with a p-value < 0.05 and ran the model a third time.
We can actually just cut and paste the code from our last cell, and adjust the boundary from 0.1
to 0.05
. Note that it will filter the results of our most recent regression, not our original regression.
# Filter our original features to only keep ones with a p-value less than 0.05
feature_cols = X.columns[results.pvalues.drop('const') < 0.05]
# Run our regression just like we normally do
X = df[feature_cols]
y = df.struck_by_state
model = sm.Logit(y, sm.add_constant(X))
results = model.fit()
results.summary()
And there we go! Now that we have a solid selection of features with very low p-values, we can go ahead and examine our odds ratios!
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'pvalue': results.pvalues,
'name': results.params.index
}).sort_values(by='odds ratio', ascending=False)
coefs
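To turn one of those odds ratios into a sentence, you can look a single feature up in that dataframe. The lookup below assumes is_black survived both rounds of filtering (it should have, since it's the variable the whole story is about) - it's a sketch of the lookup, not a claim about the exact number.
# The coefs dataframe inherits its index (the variable names) from results.pvalues,
# so we can pull out a single odds ratio directly
coefs.loc['is_black', 'odds ratio']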
Review#
We learned how p-values relate to features in a logistic regression model. The lower a feature's p-value, the more confident we can be that its relationship with the output isn't just chance. Typically p-values below 0.05 (aka 5%) are thought of as good p-values.
To make a simple and non-noisy model, we ran an initial regression and then filtered our features to only keep those with low p-values. We did this across a few rounds: first we used all of our variables, then the ones that met a 0.1
threshold, and then those that met a 0.05
threshold. This allowed us to drop from 24 features down to 7 features.
Discussion topics#
- Why are values below 0.05 thought of as a good p-value?
- If you had a feature at 0.049 and one at 0.051, would you keep the first and discard the second?
- By removing features from a model, we're ignoring data that we have and could be using. What are the arguments for and against doing that, and how comfortable do you feel with it?
- Do you prefer using the formula method or the dataframe method for a logistic regression? Is there a difference in readability or understandability between the two?
- Do you prefer using the formula method or the dataframe method for a logistic regression? Is there a difference in readability or understandability between the two?