Setup#

Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.

Some of our datasets have a lot of columns, so you'll also want to use pd.set_option to display up to 100 columns or so.

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)

Read in your data#

We're using pre-cleaned data this time, with the mortgage and census data joined together and the unwanted rows removed.

# We're just looking at Philly
merged = pd.read_csv("data/mortgage-census-cleaned-merged.csv")
merged.head(5)
|  | census_tract | county_code | state_code | applicant_sex | income | loan_amount | loan_type | property_type | occupancy | action_type | loan_purpose | agency_code | tract_to_msa_income_percent | applicant_race | co_applicant_sex | loan_denied | co_applicant | STATEA | COUNTYA | TRACTA | pop_total | pop_white | pop_black | pop_amer_indian | pop_asian | pop_pac_islander | pop_hispanic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 401900 | 45 | 42 | female | 59 | 112 | 1 | 1 | 1 | 1 | 1 | OCC | 133.09000 | white | 5 | 0 | no | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 1 | 401900 | 45 | 42 | male | 63 | 192 | 1 | 1 | 1 | 1 | 1 | NCUA | 133.09000 | na | 5 | 0 | no | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 2 | 401900 | 45 | 42 | female | 80 | 105 | 1 | 1 | 1 | 1 | 1 | FDIC | 133.09000 | black | 1 | 0 | yes | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 3 | 401900 | 45 | 42 | female | 84 | 128 | 1 | 1 | 1 | 1 | 1 | NCUA | 133.09000 | white | 1 | 0 | yes | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 4 | 401900 | 45 | 42 | female | 26 | 168 | 1 | 1 | 1 | 3 | 1 | NCUA | 133.09000 | black | 5 | 1 | no | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |

Calculations inside statsmodels formulas#

Instead of building new columns in pandas, we're just going to tell statsmodels to do the math for us inside the formula. Under the hood this uses a library called Patsy, which borrows its formula syntax from the R programming language.

| description | pandas style | formula style |
| --- | --- | --- |
| Multiply a column | df.colname * 100 | np.multiply(colname, 100) |
| Divide columns | df.loan_amount / df.income | np.divide(loan_amount, income) |
| Percentage | df.pop_black / df.pop_total * 100 | np.multiply(pop_black / pop_total, 100) |
| Calculate log | np.log(df.income) | np.log(income) |
| One-hot encoding | pd.get_dummies(df.agency_code).drop('FDIC', axis=1) | C(agency_code, Treatment('FDIC')) |
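If you'd like to convince yourself the two styles really match, here's a quick sketch (not part of the analysis) that asks Patsy to compute one of the formula-style transforms directly and compares it to the pandas version. It assumes income has no missing values, and that Patsy is installed alongside statsmodels.

from patsy import dmatrix

# Formula style: have Patsy build the log-income column for us
# ("- 1" just drops the intercept column so we only get back the transform)
design = dmatrix("np.log(income) - 1", data=merged, return_type='dataframe')

# pandas style: build the same column ourselves
manual = np.log(merged.income)

# The two columns should be identical
np.allclose(design["np.log(income)"], manual)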

If you haven't heard of one-hot encoding before, I recommend reading the longer version of this notebook! Or looking at what happens down below and thinking it through.
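Here's a rough sketch of what that encoding looks like for our agency_code column, both the pandas way and the Patsy way. It isn't needed for the regression, it's just for poking around.

from patsy import dmatrix

# pandas style: one 0/1 column per agency, then drop FDIC so it becomes the baseline
pandas_version = pd.get_dummies(merged.agency_code).drop('FDIC', axis=1)

# Patsy style: C(...) with Treatment('FDIC') builds a 0/1 column for every agency
# *except* FDIC (plus an intercept), so each coefficient is a comparison to FDIC
patsy_version = dmatrix("C(agency_code, Treatment('FDIC'))",
                        data=merged, return_type='dataframe')

# Peek at both: the columns should line up, with FDIC missing from each
pandas_version.head()
patsy_version.head()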

If we follow Reveal's methodology, we have a nice long list of features to include in our formula. Turning them all into a statsmodels/Patsy formula, the result looks like this:

import statsmodels.formula.api as smf

model = smf.logit("""
    loan_denied ~ 
        tract_to_msa_income_percent
        + np.log(income)
        + np.log(loan_amount)
        + np.divide(loan_amount, income)
        + C(co_applicant, Treatment('no'))
        + C(applicant_sex, Treatment('female'))
        + C(applicant_race, Treatment('white'))
        + C(agency_code, Treatment('FDIC'))
        + np.multiply(pop_hispanic / pop_total, 100)
        + np.multiply(pop_black / pop_total, 100)
        + np.multiply(pop_amer_indian / pop_total, 100)
        + np.multiply(pop_asian / pop_total, 100)
        + np.multiply(pop_pac_islander / pop_total, 100)
""", data=merged)

result = model.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.334016
         Iterations 7
Logit Regression Results

|  |  |  |  |
| --- | --- | --- | --- |
| Dep. Variable: | loan_denied | No. Observations: | 10107 |
| Model: | Logit | Df Residuals: | 10082 |
| Method: | MLE | Df Model: | 24 |
| Date: | Tue, 05 Nov 2019 | Pseudo R-squ.: | 0.09749 |
| Time: | 12:57:59 | Log-Likelihood: | -3375.9 |
| converged: | True | LL-Null: | -3740.6 |
| Covariance Type: | nonrobust | LLR p-value: | 1.629e-138 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -0.6024 | 0.306 | -1.971 | 0.049 | -1.202 | -0.003 |
| C(co_applicant, Treatment('no'))[T.unknown] | 0.3957 | 0.207 | 1.914 | 0.056 | -0.010 | 0.801 |
| C(co_applicant, Treatment('no'))[T.yes] | -0.0958 | 0.078 | -1.222 | 0.222 | -0.249 | 0.058 |
| C(applicant_sex, Treatment('female'))[T.male] | 0.1213 | 0.070 | 1.722 | 0.085 | -0.017 | 0.259 |
| C(applicant_sex, Treatment('female'))[T.na] | -0.1091 | 0.176 | -0.618 | 0.537 | -0.455 | 0.237 |
| C(applicant_race, Treatment('white'))[T.asian] | 0.3783 | 0.105 | 3.595 | 0.000 | 0.172 | 0.584 |
| C(applicant_race, Treatment('white'))[T.black] | 0.7496 | 0.115 | 6.506 | 0.000 | 0.524 | 0.975 |
| C(applicant_race, Treatment('white'))[T.hawaiian] | 1.0989 | 0.464 | 2.368 | 0.018 | 0.189 | 2.008 |
| C(applicant_race, Treatment('white'))[T.latino] | 0.3232 | 0.164 | 1.970 | 0.049 | 0.002 | 0.645 |
| C(applicant_race, Treatment('white'))[T.na] | 0.4511 | 0.119 | 3.798 | 0.000 | 0.218 | 0.684 |
| C(applicant_race, Treatment('white'))[T.native_amer] | 1.0546 | 0.591 | 1.785 | 0.074 | -0.103 | 2.213 |
| C(agency_code, Treatment('FDIC'))[T.CFPB] | 1.1066 | 0.136 | 8.158 | 0.000 | 0.841 | 1.372 |
| C(agency_code, Treatment('FDIC'))[T.FRS] | -0.1133 | 0.221 | -0.513 | 0.608 | -0.546 | 0.319 |
| C(agency_code, Treatment('FDIC'))[T.HUD] | 0.1170 | 0.140 | 0.837 | 0.402 | -0.157 | 0.391 |
| C(agency_code, Treatment('FDIC'))[T.NCUA] | 1.3009 | 0.153 | 8.507 | 0.000 | 1.001 | 1.601 |
| C(agency_code, Treatment('FDIC'))[T.OCC] | 0.3160 | 0.201 | 1.575 | 0.115 | -0.077 | 0.709 |
| tract_to_msa_income_percent | 0.0017 | 0.001 | 2.627 | 0.009 | 0.000 | 0.003 |
| np.log(income) | -0.3555 | 0.070 | -5.104 | 0.000 | -0.492 | -0.219 |
| np.log(loan_amount) | -0.2283 | 0.056 | -4.060 | 0.000 | -0.338 | -0.118 |
| np.divide(loan_amount, income) | 0.0110 | 0.007 | 1.476 | 0.140 | -0.004 | 0.026 |
| np.multiply(pop_hispanic / pop_total, 100) | 0.0072 | 0.004 | 2.050 | 0.040 | 0.000 | 0.014 |
| np.multiply(pop_black / pop_total, 100) | 0.0062 | 0.002 | 3.884 | 0.000 | 0.003 | 0.009 |
| np.multiply(pop_amer_indian / pop_total, 100) | -0.2571 | 0.097 | -2.652 | 0.008 | -0.447 | -0.067 |
| np.multiply(pop_asian / pop_total, 100) | 0.0107 | 0.004 | 2.427 | 0.015 | 0.002 | 0.019 |
| np.multiply(pop_pac_islander / pop_total, 100) | 0.0904 | 0.159 | 0.568 | 0.570 | -0.221 | 0.402 |

Renaming our output fields#

If we love the formula method but hate the feature names, we can rename them. It isn't the easiest thing that's ever happened, but it isn't so bad.

# Copy the names to a pd.Series for easy search/replace
# We'll also keep a safe copy to make double-checking easy later
names = pd.Series(model.data.xnames)
originals = list(names.copy())

# Reformat 'C(agency_code, Treatment('FDIC'))[T.FRS]' as 'agency_code_FRS'
names = names.str.replace(r", ?Treatment\(.*\)", r"")
names = names.str.replace(r"C\(([\w]+)", r"\1_")
names = names.str.replace(r"\[T.(.*)\]", r"\1")

# Manually replace other ones
names = names.replace({
    'np.multiply(pop_hispanic / pop_total, 100)': 'pop_hispanic',
    'np.multiply(pop_black / pop_total, 100)': 'pop_black',
    'np.multiply(pop_amer_indian / pop_total, 100)': 'pop_amer_indian',
    'np.multiply(pop_asian / pop_total, 100)': 'pop_asian',
    'np.multiply(pop_pac_islander / pop_total, 100)': 'pop_pac_islander',
    'np.log(income)': 'log_income',
    'np.log(loan_amount)': 'log_loan',
    'np.divide(loan_amount, income)': 'loan_income_ratio',    
})

original_names = model.data.xnames
# Assign back into the model for display
model.data.xnames = list(names)

# Redo our summary, and we get nice output!
result.summary()
Logit Regression Results

|  |  |  |  |
| --- | --- | --- | --- |
| Dep. Variable: | loan_denied | No. Observations: | 10107 |
| Model: | Logit | Df Residuals: | 10082 |
| Method: | MLE | Df Model: | 24 |
| Date: | Tue, 05 Nov 2019 | Pseudo R-squ.: | 0.09749 |
| Time: | 12:58:11 | Log-Likelihood: | -3375.9 |
| converged: | True | LL-Null: | -3740.6 |
| Covariance Type: | nonrobust | LLR p-value: | 1.629e-138 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -0.6024 | 0.306 | -1.971 | 0.049 | -1.202 | -0.003 |
| co_applicant_unknown | 0.3957 | 0.207 | 1.914 | 0.056 | -0.010 | 0.801 |
| co_applicant_yes | -0.0958 | 0.078 | -1.222 | 0.222 | -0.249 | 0.058 |
| applicant_sex_male | 0.1213 | 0.070 | 1.722 | 0.085 | -0.017 | 0.259 |
| applicant_sex_na | -0.1091 | 0.176 | -0.618 | 0.537 | -0.455 | 0.237 |
| applicant_race_asian | 0.3783 | 0.105 | 3.595 | 0.000 | 0.172 | 0.584 |
| applicant_race_black | 0.7496 | 0.115 | 6.506 | 0.000 | 0.524 | 0.975 |
| applicant_race_hawaiian | 1.0989 | 0.464 | 2.368 | 0.018 | 0.189 | 2.008 |
| applicant_race_latino | 0.3232 | 0.164 | 1.970 | 0.049 | 0.002 | 0.645 |
| applicant_race_na | 0.4511 | 0.119 | 3.798 | 0.000 | 0.218 | 0.684 |
| applicant_race_native_amer | 1.0546 | 0.591 | 1.785 | 0.074 | -0.103 | 2.213 |
| agency_code_CFPB | 1.1066 | 0.136 | 8.158 | 0.000 | 0.841 | 1.372 |
| agency_code_FRS | -0.1133 | 0.221 | -0.513 | 0.608 | -0.546 | 0.319 |
| agency_code_HUD | 0.1170 | 0.140 | 0.837 | 0.402 | -0.157 | 0.391 |
| agency_code_NCUA | 1.3009 | 0.153 | 8.507 | 0.000 | 1.001 | 1.601 |
| agency_code_OCC | 0.3160 | 0.201 | 1.575 | 0.115 | -0.077 | 0.709 |
| tract_to_msa_income_percent | 0.0017 | 0.001 | 2.627 | 0.009 | 0.000 | 0.003 |
| log_income | -0.3555 | 0.070 | -5.104 | 0.000 | -0.492 | -0.219 |
| log_loan | -0.2283 | 0.056 | -4.060 | 0.000 | -0.338 | -0.118 |
| loan_income_ratio | 0.0110 | 0.007 | 1.476 | 0.140 | -0.004 | 0.026 |
| pop_hispanic | 0.0072 | 0.004 | 2.050 | 0.040 | 0.000 | 0.014 |
| pop_black | 0.0062 | 0.002 | 3.884 | 0.000 | 0.003 | 0.009 |
| pop_amer_indian | -0.2571 | 0.097 | -2.652 | 0.008 | -0.447 | -0.067 |
| pop_asian | 0.0107 | 0.004 | 2.427 | 0.015 | 0.002 | 0.019 |
| pop_pac_islander | 0.0904 | 0.159 | 0.568 | 0.570 | -0.221 | 0.402 |
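Since we kept that safe copy of the names earlier, it only takes a couple of lines to double-check that every rename landed where we expected. Here's a quick sketch using the originals list and the names Series from above.

# Line up the original Patsy names next to the renamed versions
# so we can eyeball that nothing got scrambled
pd.DataFrame({
    'original': originals,
    'renamed': list(names)
})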

Everything about our model still works great!

We can build a coefficient/odds ratio/p-value table without any trouble at all.

# Pull the (renamed) feature names and coefficients out of the results
feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(coefficients),
    'pvalue': result.pvalues,
    'original': originals
}, index=feature_names).sort_values(by='odds ratio', ascending=False)
coefs
coefs
|  | coef | odds ratio | pvalue | original |
| --- | --- | --- | --- | --- |
| agency_code_NCUA | 1.30091 | 3.67262 | 0.00000 | C(agency_code, Treatment('FDIC'))[T.NCUA] |
| agency_code_CFPB | 1.10661 | 3.02410 | 0.00000 | C(agency_code, Treatment('FDIC'))[T.CFPB] |
| applicant_race_hawaiian | 1.09891 | 3.00090 | 0.01787 | C(applicant_race, Treatment('white'))[T.hawaiian] |
| applicant_race_native_amer | 1.05461 | 2.87087 | 0.07426 | C(applicant_race, Treatment('white'))[T.native... |
| applicant_race_black | 0.74963 | 2.11622 | 0.00000 | C(applicant_race, Treatment('white'))[T.black] |
| applicant_race_na | 0.45112 | 1.57007 | 0.00015 | C(applicant_race, Treatment('white'))[T.na] |
| co_applicant_unknown | 0.39570 | 1.48543 | 0.05568 | C(co_applicant, Treatment('no'))[T.unknown] |
| applicant_race_asian | 0.37826 | 1.45974 | 0.00032 | C(applicant_race, Treatment('white'))[T.asian] |
| applicant_race_latino | 0.32318 | 1.38151 | 0.04881 | C(applicant_race, Treatment('white'))[T.latino] |
| agency_code_OCC | 0.31598 | 1.37160 | 0.11521 | C(agency_code, Treatment('FDIC'))[T.OCC] |
| applicant_sex_male | 0.12133 | 1.12900 | 0.08514 | C(applicant_sex, Treatment('female'))[T.male] |
| agency_code_HUD | 0.11701 | 1.12413 | 0.40245 | C(agency_code, Treatment('FDIC'))[T.HUD] |
| pop_pac_islander | 0.09038 | 1.09459 | 0.56988 | np.multiply(pop_pac_islander / pop_total, 100) |
| loan_income_ratio | 0.01096 | 1.01102 | 0.13984 | np.divide(loan_amount, income) |
| pop_asian | 0.01069 | 1.01074 | 0.01522 | np.multiply(pop_asian / pop_total, 100) |
| pop_hispanic | 0.00719 | 1.00721 | 0.04041 | np.multiply(pop_hispanic / pop_total, 100) |
| pop_black | 0.00623 | 1.00625 | 0.00010 | np.multiply(pop_black / pop_total, 100) |
| tract_to_msa_income_percent | 0.00175 | 1.00175 | 0.00862 | tract_to_msa_income_percent |
| co_applicant_yes | -0.09583 | 0.90862 | 0.22154 | C(co_applicant, Treatment('no'))[T.yes] |
| applicant_sex_na | -0.10905 | 0.89668 | 0.53657 | C(applicant_sex, Treatment('female'))[T.na] |
| agency_code_FRS | -0.11333 | 0.89285 | 0.60766 | C(agency_code, Treatment('FDIC'))[T.FRS] |
| log_loan | -0.22825 | 0.79592 | 0.00005 | np.log(loan_amount) |
| pop_amer_indian | -0.25706 | 0.77332 | 0.00801 | np.multiply(pop_amer_indian / pop_total, 100) |
| log_income | -0.35546 | 0.70085 | 0.00000 | np.log(income) |
| Intercept | -0.60242 | 0.54748 | 0.04874 | Intercept |
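Since Reveal's question is about race, those are the rows worth staring at. As an optional last step, here's a sketch that pulls just the applicant-race rows out of coefs, using the renamed index we built above.

# Pull out just the applicant-race rows from the coefficient table
race_rows = coefs[coefs.index.str.startswith('applicant_race')]
race_rows[['odds ratio', 'pvalue']]

Reading the odds ratios: holding everything else in the model constant, Black applicants have roughly twice the odds of being denied compared to white applicants, the reference category we set with Treatment('white').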

And then you're all set!