Lending disparities using Logistic Regression#
The story: https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/
Setup#
Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.
Some of our datasets have a lot of columns, so you'll also want to use pd.set_option
to display up to 100 columns or so.
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)
Read in your data#
We're using pre-cleaned data this time, with the mortgage and census data joined together and the unwanted rows removed.
# We're just looking at Philly
merged = pd.read_csv("data/mortgage-census-cleaned-merged.csv")
merged.head(5)
Calculations in statsmodels formulas#
Instead of building new columns in pandas, we're just going to tell statsmodels to do it for us. Formulas are handled by a library called Patsy, which imitates the formula syntax of the programming language R.
description | pandas style | formula style |
---|---|---|
Multiply column | df.colname * 100 | np.multiply(colname, 100) |
Divide columns | df.loan_amount / df.income | np.divide(loan_amount, income) |
Percentage | df.pop_black / df.pop_total * 100 | np.multiply(pop_black / pop_total, 100) |
Calculate log | np.log(df.income) | np.log(income) |
One-hot encoding | pd.get_dummies(df.agency_code).drop('FDIC', axis=1) | C(agency_code, Treatment('FDIC')) |
If you haven't heard of one-hot encoding before, I recommend reading the longer version of this notebook, or looking at what happens down below and thinking it through.
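If one-hot encoding is new to you, a toy example may help. The sketch below uses made-up agency codes (not the real dataset) to show what dropping a reference category does: each remaining column becomes a yes/no flag, and a row of all zeros means the reference category, which here is FDIC.

```python
import pandas as pd

# Hypothetical toy data: these rows are invented for illustration
toy = pd.DataFrame({"agency_code": ["FDIC", "FRS", "HUD", "FDIC", "HUD"]})

# One column per category, then drop the reference category (FDIC)
dummies = pd.get_dummies(toy.agency_code).drop("FDIC", axis=1).astype(int)

# Rows that were FDIC are now all zeros; the others have a single 1
print(dummies)
```

In the formula style, C(agency_code, Treatment('FDIC')) plays the same role: every agency coefficient is then read as a comparison against FDIC.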
If we follow Reveal's methodology, we have a nice long list of features to include in our formula. Turning them all into a statsmodels/Patsy formula, the result looks like this:
import statsmodels.formula.api as smf
model = smf.logit("""
loan_denied ~
tract_to_msa_income_percent
+ np.log(income)
+ np.log(loan_amount)
+ np.divide(loan_amount, income)
+ C(co_applicant, Treatment('no'))
+ C(applicant_sex, Treatment('female'))
+ C(applicant_race, Treatment('white'))
+ C(agency_code, Treatment('FDIC'))
+ np.multiply(pop_hispanic / pop_total, 100)
+ np.multiply(pop_black / pop_total, 100)
+ np.multiply(pop_amer_indian / pop_total, 100)
+ np.multiply(pop_asian / pop_total, 100)
+ np.multiply(pop_pac_islander / pop_total, 100)
""", data=merged)
result = model.fit()
result.summary()
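To see what a formula actually hands to the model, you can ask Patsy to build the columns on their own. This is a minimal sketch on invented rows, assuming patsy is importable (it is installed alongside the statsmodels versions that use Patsy for formulas); the toy values are not from the mortgage data.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical toy applicants, invented for illustration
toy = pd.DataFrame({
    "income": [50, 80, 120, 65],
    "loan_amount": [100, 240, 120, 130],
    "agency_code": ["FDIC", "FRS", "FDIC", "FRS"],
})

# Build just the right-hand side of a formula as a DataFrame
design = dmatrix(
    "np.log(income) + np.divide(loan_amount, income)"
    " + C(agency_code, Treatment('FDIC'))",
    data=toy,
    return_type="dataframe",
)

# The column names are the formula terms themselves
print(design.columns.tolist())
print(design)
```

The column names are copied straight from the formula text, which is why the summary table ends up with such long feature names.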
Renaming our output fields#
If we love the formula method but hate the feature names, we can rename them. It isn't the easiest thing that's ever happened, but it isn't so bad.
# Copy the names to a pd.Series for easy search/replace
# We'll also keep a safe copy to make double-checking easy later
names = pd.Series(model.data.xnames)
originals = list(names.copy())
# Reformat 'C(agency_code, Treatment('FDIC'))[T.FRS]' as 'agency_code_FRS'
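# regex=True is required: recent pandas versions treat the pattern as plain text by default
names = names.str.replace(r", ?Treatment\(.*\)", r"", regex=True)
names = names.str.replace(r"C\(([\w]+)", r"\1_", regex=True)
names = names.str.replace(r"\[T\.(.*)\]", r"\1", regex=True)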
# Manually replace other ones
names = names.replace({
'np.multiply(pop_hispanic / pop_total, 100)': 'pop_hispanic',
'np.multiply(pop_black / pop_total, 100)': 'pop_black',
'np.multiply(pop_amer_indian / pop_total, 100)': 'pop_amer_indian',
'np.multiply(pop_asian / pop_total, 100)': 'pop_asian',
'np.multiply(pop_pac_islander / pop_total, 100)': 'pop_pac_islander',
'np.log(income)': 'log_income',
'np.log(loan_amount)': 'log_loan',
'np.divide(loan_amount, income)': 'loan_income_ratio',
})
original_names = model.data.xnames
# Assign back into the model for display
model.data.xnames = list(names)
# Redo our summary, and we get nice output!
result.summary()
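If you want to check the three patterns before touching the model, you can run them on a few sample names with Python's built-in re module. In this sketch clean_name is a hypothetical helper written for illustration, and the sample names are typed by hand in the shape Patsy produces.

```python
import re

def clean_name(name):
    """Apply the three renaming patterns to a single feature name."""
    # Drop the ", Treatment('...')" part, including its closing parenthesis
    name = re.sub(r", ?Treatment\(.*\)", "", name)
    # Turn "C(agency_code" into "agency_code_"
    name = re.sub(r"C\((\w+)", r"\1_", name)
    # Turn "[T.FRS]" into "FRS"
    name = re.sub(r"\[T\.(.*)\]", r"\1", name)
    return name

samples = [
    "Intercept",
    "C(agency_code, Treatment('FDIC'))[T.FRS]",
    "C(co_applicant, Treatment('no'))[T.yes]",
    "np.log(income)",
]

for sample in samples:
    print(sample, "->", clean_name(sample))
```

Names without a C(...) wrapper pass through untouched, which is why the numeric features still need the manual replacement dictionary.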
Everything about our model still works great!
We can build a coefficient/odds ratio/p-value table without any trouble at all.
feature_names = result.params.index
coefficients = result.params.values
coefs = pd.DataFrame({
'coef': coefficients,
'odds ratio': np.exp(result.params.values),
'pvalue': result.pvalues,
'original': originals
}).sort_values(by='odds ratio', ascending=False)
coefs
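As a reminder of what the odds ratio column means, here is a small worked example. The coefficient of 0.5 is made up for illustration and is not a result from this model.

```python
import numpy as np

# Hypothetical logistic regression coefficient (log-odds), invented for illustration
coef = 0.5

# Exponentiating a log-odds coefficient gives the odds ratio
odds_ratio = np.exp(coef)

# An odds ratio above 1 means higher odds of denial; below 1 means lower odds
pct_change = (odds_ratio - 1) * 100

print(f"odds ratio: {odds_ratio:.3f}")
print(f"change in odds: {pct_change:+.1f}%")
```

For the one-hot encoded features, that change is relative to the reference category named in Treatment(...), holding the other features constant.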
And then you're all set!