Setup#

Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.

Some of our datasets have a lot of columns, so you'll also want to use pd.set_option to display up to 100 columns or so.

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)

Read in your data#

We're using pre-cleaned data this time, with the mortgage and census data joined together and the unwanted rows removed.

# We're just looking at Philly
merged = pd.read_csv("data/mortgage-census-cleaned-merged.csv")
merged.head(5)
|  | census_tract | county_code | state_code | applicant_sex | income | loan_amount | loan_type | property_type | occupancy | action_type | loan_purpose | agency_code | tract_to_msa_income_percent | applicant_race | co_applicant_sex | loan_denied | co_applicant | STATEA | COUNTYA | TRACTA | pop_total | pop_white | pop_black | pop_amer_indian | pop_asian | pop_pac_islander | pop_hispanic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 401900 | 45 | 42 | female | 59 | 112 | 1 | 1 | 1 | 1 | 1 | OCC | 133.09000 | white | 5 | 0 | no | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 1 | 401900 | 45 | 42 | male | 63 | 192 | 1 | 1 | 1 | 1 | 1 | NCUA | 133.09000 | na | 5 | 0 | no | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 2 | 401900 | 45 | 42 | female | 80 | 105 | 1 | 1 | 1 | 1 | 1 | FDIC | 133.09000 | black | 1 | 0 | yes | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 3 | 401900 | 45 | 42 | female | 84 | 128 | 1 | 1 | 1 | 1 | 1 | NCUA | 133.09000 | white | 1 | 0 | yes | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |
| 4 | 401900 | 45 | 42 | female | 26 | 168 | 1 | 1 | 1 | 3 | 1 | NCUA | 133.09000 | black | 5 | 1 | no | 42 | 45 | 401900 | 4219 | 2064 | 1719 | 0 | 224 | 0 | 38 |

Calculations inside statsmodels formulas#

Instead of building new columns in pandas, we're just going to tell statsmodels to do the math for us inside the formula. Under the hood this uses a library called Patsy, which borrows its formula syntax from the R programming language.

| description | pandas style | formula style |
| --- | --- | --- |
| Multiply a column | df.colname * 100 | np.multiply(colname, 100) |
| Divide columns | df.loan_amount / df.income | np.divide(loan_amount, income) |
| Percentage | df.pop_black / df.pop_total * 100 | np.multiply(pop_black / pop_total, 100) |
| Calculate log | np.log(df.income) | np.log(income) |
| One-hot encoding | pd.get_dummies(df.agency_code).drop('FDIC', axis=1) | C(agency_code, Treatment('FDIC')) |
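If you'd like to convince yourself the two styles really match, here's a quick sketch (not part of the analysis) that asks Patsy to compute one of the formula-style transforms directly and compares it to the pandas version. It assumes income has no missing values, and that Patsy is installed alongside statsmodels.

from patsy import dmatrix

# Formula style: have Patsy build the log-income column for us
# ("- 1" just drops the intercept column so we only get back the transform)
design = dmatrix("np.log(income) - 1", data=merged, return_type='dataframe')

# pandas style: build the same column ourselves
manual = np.log(merged.income)

# The two columns should be identical
np.allclose(design["np.log(income)"], manual)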

If you haven't heard of one-hot encoding before, I recommend reading the longer version of this notebook! Or looking at what happens down below and thinking it through.
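Here's a rough sketch of what that encoding looks like for our agency_code column, both the pandas way and the Patsy way. It isn't needed for the regression, it's just for poking around.

from patsy import dmatrix

# pandas style: one 0/1 column per agency, then drop FDIC so it becomes the baseline
pandas_version = pd.get_dummies(merged.agency_code).drop('FDIC', axis=1)

# Patsy style: C(...) with Treatment('FDIC') builds a 0/1 column for every agency
# *except* FDIC (plus an intercept), so each coefficient is a comparison to FDIC
patsy_version = dmatrix("C(agency_code, Treatment('FDIC'))",
                        data=merged, return_type='dataframe')

# Peek at both: the columns should line up, with FDIC missing from each
pandas_version.head()
patsy_version.head()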

If we follow Reveal's methodology, we have a nice long list of features to include in our formula. Turning them all into a statsmodels/Patsy formula, the result looks like this:

import statsmodels.formula.api as smf

model = smf.logit("""
    loan_denied ~ 
        tract_to_msa_income_percent
        + np.log(income)
        + np.log(loan_amount)
        + np.divide(loan_amount, income)
        + C(co_applicant, Treatment('no'))
        + C(applicant_sex, Treatment('female'))
        + C(applicant_race, Treatment('white'))
        + C(agency_code, Treatment('FDIC'))
        + np.multiply(pop_hispanic / pop_total, 100)
        + np.multiply(pop_black / pop_total, 100)
        + np.multiply(pop_amer_indian / pop_total, 100)
        + np.multiply(pop_asian / pop_total, 100)
        + np.multiply(pop_pac_islander / pop_total, 100)
""", data=merged)

result = model.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.334016
         Iterations 7
Logit Regression Results

|  |  |  |  |
| --- | --- | --- | --- |
| Dep. Variable: | loan_denied | No. Observations: | 10107 |
| Model: | Logit | Df Residuals: | 10082 |
| Method: | MLE | Df Model: | 24 |
| Date: | Tue, 05 Nov 2019 | Pseudo R-squ.: | 0.09749 |
| Time: | 12:57:59 | Log-Likelihood: | -3375.9 |
| converged: | True | LL-Null: | -3740.6 |
| Covariance Type: | nonrobust | LLR p-value: | 1.629e-138 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -0.6024 | 0.306 | -1.971 | 0.049 | -1.202 | -0.003 |
| C(co_applicant, Treatment('no'))[T.unknown] | 0.3957 | 0.207 | 1.914 | 0.056 | -0.010 | 0.801 |
| C(co_applicant, Treatment('no'))[T.yes] | -0.0958 | 0.078 | -1.222 | 0.222 | -0.249 | 0.058 |
| C(applicant_sex, Treatment('female'))[T.male] | 0.1213 | 0.070 | 1.722 | 0.085 | -0.017 | 0.259 |
| C(applicant_sex, Treatment('female'))[T.na] | -0.1091 | 0.176 | -0.618 | 0.537 | -0.455 | 0.237 |
| C(applicant_race, Treatment('white'))[T.asian] | 0.3783 | 0.105 | 3.595 | 0.000 | 0.172 | 0.584 |
| C(applicant_race, Treatment('white'))[T.black] | 0.7496 | 0.115 | 6.506 | 0.000 | 0.524 | 0.975 |
| C(applicant_race, Treatment('white'))[T.hawaiian] | 1.0989 | 0.464 | 2.368 | 0.018 | 0.189 | 2.008 |
| C(applicant_race, Treatment('white'))[T.latino] | 0.3232 | 0.164 | 1.970 | 0.049 | 0.002 | 0.645 |
| C(applicant_race, Treatment('white'))[T.na] | 0.4511 | 0.119 | 3.798 | 0.000 | 0.218 | 0.684 |
| C(applicant_race, Treatment('white'))[T.native_amer] | 1.0546 | 0.591 | 1.785 | 0.074 | -0.103 | 2.213 |
| C(agency_code, Treatment('FDIC'))[T.CFPB] | 1.1066 | 0.136 | 8.158 | 0.000 | 0.841 | 1.372 |
| C(agency_code, Treatment('FDIC'))[T.FRS] | -0.1133 | 0.221 | -0.513 | 0.608 | -0.546 | 0.319 |
| C(agency_code, Treatment('FDIC'))[T.HUD] | 0.1170 | 0.140 | 0.837 | 0.402 | -0.157 | 0.391 |
| C(agency_code, Treatment('FDIC'))[T.NCUA] | 1.3009 | 0.153 | 8.507 | 0.000 | 1.001 | 1.601 |
| C(agency_code, Treatment('FDIC'))[T.OCC] | 0.3160 | 0.201 | 1.575 | 0.115 | -0.077 | 0.709 |
| tract_to_msa_income_percent | 0.0017 | 0.001 | 2.627 | 0.009 | 0.000 | 0.003 |
| np.log(income) | -0.3555 | 0.070 | -5.104 | 0.000 | -0.492 | -0.219 |
| np.log(loan_amount) | -0.2283 | 0.056 | -4.060 | 0.000 | -0.338 | -0.118 |
| np.divide(loan_amount, income) | 0.0110 | 0.007 | 1.476 | 0.140 | -0.004 | 0.026 |
| np.multiply(pop_hispanic / pop_total, 100) | 0.0072 | 0.004 | 2.050 | 0.040 | 0.000 | 0.014 |
| np.multiply(pop_black / pop_total, 100) | 0.0062 | 0.002 | 3.884 | 0.000 | 0.003 | 0.009 |
| np.multiply(pop_amer_indian / pop_total, 100) | -0.2571 | 0.097 | -2.652 | 0.008 | -0.447 | -0.067 |
| np.multiply(pop_asian / pop_total, 100) | 0.0107 | 0.004 | 2.427 | 0.015 | 0.002 | 0.019 |
| np.multiply(pop_pac_islander / pop_total, 100) | 0.0904 | 0.159 | 0.568 | 0.570 | -0.221 | 0.402 |

Renaming our output fields#

If we love the formula method but hate the feature names, we can rename them. It isn't the easiest thing that's ever happened, but it isn't so bad.

# Copy the names to a pd.Series for easy search/replace
# We'll also keep a safe copy to make double-checking easy later
names = pd.Series(model.data.xnames)
originals = list(names.copy())

# Reformat 'C(agency_code, Treatment('FDIC'))[T.FRS]' as 'agency_code_FRS'
names = names.str.replace(r", ?Treatment\(.*\)", r"")
names = names.str.replace(r"C\(([\w]+)", r"\1_")
names = names.str.replace(r"\[T.(.*)\]", r"\1")

# Manually replace other ones
names = names.replace({
    'np.multiply(pop_hispanic / pop_total, 100)': 'pop_hispanic',
    'np.multiply(pop_black / pop_total, 100)': 'pop_black',
    'np.multiply(pop_amer_indian / pop_total, 100)': 'pop_amer_indian',
    'np.multiply(pop_asian / pop_total, 100)': 'pop_asian',
    'np.multiply(pop_pac_islander / pop_total, 100)': 'pop_pac_islander',
    'np.log(income)': 'log_income',
    'np.log(loan_amount)': 'log_loan',
    'np.divide(loan_amount, income)': 'loan_income_ratio',    
})

original_names = model.data.xnames
# Assign back into the model for display
model.data.xnames = list(names)

# Redo our summary, and we get nice output!
result.summary()
Logit Regression Results

|  |  |  |  |
| --- | --- | --- | --- |
| Dep. Variable: | loan_denied | No. Observations: | 10107 |
| Model: | Logit | Df Residuals: | 10082 |
| Method: | MLE | Df Model: | 24 |
| Date: | Tue, 05 Nov 2019 | Pseudo R-squ.: | 0.09749 |
| Time: | 12:58:11 | Log-Likelihood: | -3375.9 |
| converged: | True | LL-Null: | -3740.6 |
| Covariance Type: | nonrobust | LLR p-value: | 1.629e-138 |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -0.6024 | 0.306 | -1.971 | 0.049 | -1.202 | -0.003 |
| co_applicant_unknown | 0.3957 | 0.207 | 1.914 | 0.056 | -0.010 | 0.801 |
| co_applicant_yes | -0.0958 | 0.078 | -1.222 | 0.222 | -0.249 | 0.058 |
| applicant_sex_male | 0.1213 | 0.070 | 1.722 | 0.085 | -0.017 | 0.259 |
| applicant_sex_na | -0.1091 | 0.176 | -0.618 | 0.537 | -0.455 | 0.237 |
| applicant_race_asian | 0.3783 | 0.105 | 3.595 | 0.000 | 0.172 | 0.584 |
| applicant_race_black | 0.7496 | 0.115 | 6.506 | 0.000 | 0.524 | 0.975 |
| applicant_race_hawaiian | 1.0989 | 0.464 | 2.368 | 0.018 | 0.189 | 2.008 |
| applicant_race_latino | 0.3232 | 0.164 | 1.970 | 0.049 | 0.002 | 0.645 |
| applicant_race_na | 0.4511 | 0.119 | 3.798 | 0.000 | 0.218 | 0.684 |
| applicant_race_native_amer | 1.0546 | 0.591 | 1.785 | 0.074 | -0.103 | 2.213 |
| agency_code_CFPB | 1.1066 | 0.136 | 8.158 | 0.000 | 0.841 | 1.372 |
| agency_code_FRS | -0.1133 | 0.221 | -0.513 | 0.608 | -0.546 | 0.319 |
| agency_code_HUD | 0.1170 | 0.140 | 0.837 | 0.402 | -0.157 | 0.391 |
| agency_code_NCUA | 1.3009 | 0.153 | 8.507 | 0.000 | 1.001 | 1.601 |
| agency_code_OCC | 0.3160 | 0.201 | 1.575 | 0.115 | -0.077 | 0.709 |
| tract_to_msa_income_percent | 0.0017 | 0.001 | 2.627 | 0.009 | 0.000 | 0.003 |
| log_income | -0.3555 | 0.070 | -5.104 | 0.000 | -0.492 | -0.219 |
| log_loan | -0.2283 | 0.056 | -4.060 | 0.000 | -0.338 | -0.118 |
| loan_income_ratio | 0.0110 | 0.007 | 1.476 | 0.140 | -0.004 | 0.026 |
| pop_hispanic | 0.0072 | 0.004 | 2.050 | 0.040 | 0.000 | 0.014 |
| pop_black | 0.0062 | 0.002 | 3.884 | 0.000 | 0.003 | 0.009 |
| pop_amer_indian | -0.2571 | 0.097 | -2.652 | 0.008 | -0.447 | -0.067 |
| pop_asian | 0.0107 | 0.004 | 2.427 | 0.015 | 0.002 | 0.019 |
| pop_pac_islander | 0.0904 | 0.159 | 0.568 | 0.570 | -0.221 | 0.402 |
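Since we kept that safe copy of the names earlier, it only takes a couple of lines to double-check that every rename landed where we expected. Here's a quick sketch using the originals list and the names Series from above.

# Line up the original Patsy names next to the renamed versions
# so we can eyeball that nothing got scrambled
pd.DataFrame({
    'original': originals,
    'renamed': list(names)
})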

Everything about our model still works great!

We can build a coefficient/odds ratio/p-value table without any trouble at all.

# Pull the (renamed) feature names and coefficients out of the results
feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(coefficients),
    'pvalue': result.pvalues,
    'original': originals
}, index=feature_names).sort_values(by='odds ratio', ascending=False)
coefs
coefs
|  | coef | odds ratio | pvalue | original |
| --- | --- | --- | --- | --- |
| agency_code_NCUA | 1.30091 | 3.67262 | 0.00000 | C(agency_code, Treatment('FDIC'))[T.NCUA] |
| agency_code_CFPB | 1.10661 | 3.02410 | 0.00000 | C(agency_code, Treatment('FDIC'))[T.CFPB] |
| applicant_race_hawaiian | 1.09891 | 3.00090 | 0.01787 | C(applicant_race, Treatment('white'))[T.hawaiian] |
| applicant_race_native_amer | 1.05461 | 2.87087 | 0.07426 | C(applicant_race, Treatment('white'))[T.native... |
| applicant_race_black | 0.74963 | 2.11622 | 0.00000 | C(applicant_race, Treatment('white'))[T.black] |
| applicant_race_na | 0.45112 | 1.57007 | 0.00015 | C(applicant_race, Treatment('white'))[T.na] |
| co_applicant_unknown | 0.39570 | 1.48543 | 0.05568 | C(co_applicant, Treatment('no'))[T.unknown] |
| applicant_race_asian | 0.37826 | 1.45974 | 0.00032 | C(applicant_race, Treatment('white'))[T.asian] |
| applicant_race_latino | 0.32318 | 1.38151 | 0.04881 | C(applicant_race, Treatment('white'))[T.latino] |
| agency_code_OCC | 0.31598 | 1.37160 | 0.11521 | C(agency_code, Treatment('FDIC'))[T.OCC] |
| applicant_sex_male | 0.12133 | 1.12900 | 0.08514 | C(applicant_sex, Treatment('female'))[T.male] |
| agency_code_HUD | 0.11701 | 1.12413 | 0.40245 | C(agency_code, Treatment('FDIC'))[T.HUD] |
| pop_pac_islander | 0.09038 | 1.09459 | 0.56988 | np.multiply(pop_pac_islander / pop_total, 100) |
| loan_income_ratio | 0.01096 | 1.01102 | 0.13984 | np.divide(loan_amount, income) |
| pop_asian | 0.01069 | 1.01074 | 0.01522 | np.multiply(pop_asian / pop_total, 100) |
| pop_hispanic | 0.00719 | 1.00721 | 0.04041 | np.multiply(pop_hispanic / pop_total, 100) |
| pop_black | 0.00623 | 1.00625 | 0.00010 | np.multiply(pop_black / pop_total, 100) |
| tract_to_msa_income_percent | 0.00175 | 1.00175 | 0.00862 | tract_to_msa_income_percent |
| co_applicant_yes | -0.09583 | 0.90862 | 0.22154 | C(co_applicant, Treatment('no'))[T.yes] |
| applicant_sex_na | -0.10905 | 0.89668 | 0.53657 | C(applicant_sex, Treatment('female'))[T.na] |
| agency_code_FRS | -0.11333 | 0.89285 | 0.60766 | C(agency_code, Treatment('FDIC'))[T.FRS] |
| log_loan | -0.22825 | 0.79592 | 0.00005 | np.log(loan_amount) |
| pop_amer_indian | -0.25706 | 0.77332 | 0.00801 | np.multiply(pop_amer_indian / pop_total, 100) |
| log_income | -0.35546 | 0.70085 | 0.00000 | np.log(income) |
| Intercept | -0.60242 | 0.54748 | 0.04874 | Intercept |
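Since Reveal's question is about race, those are the rows worth staring at. As an optional last step, here's a sketch that pulls just the applicant-race rows out of coefs, using the renamed index we built above.

# Pull out just the applicant-race rows from the coefficient table
race_rows = coefs[coefs.index.str.startswith('applicant_race')]
race_rows[['odds ratio', 'pvalue']]

Reading the odds ratios: holding everything else in the model constant, Black applicants have roughly twice the odds of being denied compared to white applicants, the reference category we set with Treatment('white').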

And then you're all set!