Lending disparities using Logistic Regression#

The story: https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/

Authors: Aaron Glantz and Emmanuel Martinez

Topics: Logistic regression, odds ratios

Datasets

  • philadelphia-mortgages.csv: Philadelphia mortgage data for 2015
    • A subset of HMDA LAR data from FFEIC
    • Codebook is 2015HMDACodeSheet.pdf
    • A guide to HMDA reporting
    • I've massaged it slightly to make processing a bit easier
  • nhgis0006_ds233_20175_2017_tract.csv:
    • Table B03002: Hispanic or Latino Origin by Race
    • 2013-2017 American Community Survey data US Census Bureau, from NHGIS
    • Codebook is nhgis0006_ds233_20175_2017_tract_codebook.txt
  • lending_disparities_whitepaper_180214.pdf: the whitepaper outlining Reveal's methodology

What's the goal?#

Do banks provide mortgages at disparate rates between white applicants and people of color? We're going to look at the following variables to find out:

  • Race/Ethnicity
    • Native American
    • Asian
    • Black
    • Native Hawaiian
    • Hispanic/Latino
    • Race and ethnicity were not reported
  • Sex
  • Whether there was a co-applicant
  • Applicant’s annual income (includes co-applicant income)
  • Loan amount
  • Ratio between the loan amount and the applicant’s income
  • Ratio between the median income of the census tract and the median income of the Metro area
  • Racial and ethnic breakdown by percentage for each census tract
  • Regulating agency of the lending institution

Setup#

Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.

Some of our datasets have a lot of columns, so you'll also want to use pd.set_option to display up to 100 columns or so.

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)

What is each row of our data?#

If you aren't sure, you might need to look at either the whitepaper or the codebook. You'll need to look at them both eventually, so might as well get started now.

 

Read in your data#

Read in our Philadelphia mortgage data and take a peek at the first few rows.

  • Tip: As always, census tract columns like to cause problems if they're read in as numbers. Make sure pandas reads it in as a string.
# We're just looking at Philly
mortgage = pd.read_csv("data/philadelphia-mortgages.csv", dtype={ 'census_tract': 'str'})
mortgage.head(5)
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex
0 0101.00 101 42 3 26 5 1 1 1 4 2 5 97.09000 6 5
1 0264.00 101 42 2 26 40 1 1 1 4 2 5 98.27000 3 5
2 0281.00 101 42 2 22 20 1 1 1 5 2 5 72.28000 6 5
3 0158.00 101 42 2 57 36 1 1 1 5 3 5 105.87000 6 5
4 0358.00 101 42 1 80 34 1 1 1 1 3 5 139.62000 5 2

Check your column types#

I mentioned it above, but make sure census_tract is an object (a string) or merging isn't going to be any fun later on.

mortgage.dtypes
census_tract                    object
county_code                      int64
state_code                       int64
applicant_sex                    int64
income                           int64
loan_amount                      int64
loan_type                        int64
property_type                    int64
occupancy                        int64
action_type                      int64
loan_purpose                     int64
agency_code                      int64
tract_to_msa_income_percent    float64
applicant_race                   int64
co_applicant_sex                 int64
dtype: object

Engineering and cleaning up features#

When we plotted the number of applicants, how much money they made and the size of the loan, we found that the distributions were right-skewed: the majority of applicants were clustered on the lower end of the income and loan amount scales, with a long tail of high values. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

Traditionally we would calculate these values and save them in new columns. Since it's just simple math, though, we'll be able to do it when we run the regression.
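For intuition, here's a minimal sketch (with made-up numbers, not our mortgage data) of what the log transform does to a skewed variable:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed incomes: most values are low,
# a few extreme outliers pull the tail out
income = pd.Series([28, 35, 41, 52, 60, 75, 90, 140, 600, 2400])

# After a log transform the distribution is far more symmetric,
# and the outliers no longer dominate
log_income = np.log(income)
print(income.skew(), log_income.skew())
```

The regression will see np.log(income) instead of income, so doubling an income always moves the predictor by the same amount, no matter where on the scale it happens.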

Co-applicants#

Right now we have a column about the co-applicant's sex (see the codebook for column details). We don't want the sex, though; we're interested in whether there is a co-applicant at all. Use the co-applicant's sex to create a new column called co_applicant that is either 'yes', 'no', or 'unknown'.

  • Hint: If the co-applicant's sex was not provided or is not applicable, count it as unknown.
  • Hint: The easiest way is to use .replace on the co-applicant sex column, but store the result in your new column
mortgage['co_applicant'] = mortgage.co_applicant_sex.replace({
    1: 'yes',
    2: 'yes',
    3: 'unknown',
    4: 'unknown',
    5: 'no'
})
mortgage.head()
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex co_applicant
0 0101.00 101 42 3 26 5 1 1 1 4 2 5 97.09000 6 5 no
1 0264.00 101 42 2 26 40 1 1 1 4 2 5 98.27000 3 5 no
2 0281.00 101 42 2 22 20 1 1 1 5 2 5 72.28000 6 5 no
3 0158.00 101 42 2 57 36 1 1 1 5 3 5 105.87000 6 5 no
4 0358.00 101 42 1 80 34 1 1 1 1 3 5 139.62000 5 2 yes

Filter loan applicants#

If you read the whitepaper - lending_disparities_whitepaper_180214.pdf - many filters are used to get to the target dataset for analysis.

Loan type

While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.

Property type

Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.

Occupancy

We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.

Action Type

We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3)

Income

An applicant’s income isn’t always reported in the data. In other cases, the data cuts off any incomes over $9.9 million and any loan amounts over $99.9 million, meaning there’s a value in the database, but it’s not precise. We focused only on those records where income and loan amount have an accurate estimation. This meant discarding about 1 percent of all conventional home loans in the country for 2016. [Note: I already edited this]

When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

Lien status

We included all cases in our analysis regardless of lien status.

Race and ethnicity

At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which​ ​either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc. [Note: This has already been taken care of]

Loan purpose

We decided to look at home purchase, home improvement and refinance loans separately from each other. [Note: please look at home purchase loans.]

Use the text above (it's from the whitepaper) and the 2015HMDACodeSheet.pdf code book to filter the dataset.

  • Tip: there should be between 5-8 filters, depending on how you write them.
mortgage = mortgage[
    (mortgage.loan_type == 1)
    & (mortgage.property_type == 1)
    & (mortgage.occupancy == 1)
    & mortgage.action_type.isin([1, 3])
    & (mortgage.income != 9999)
    & (mortgage.loan_amount != 99999)
    & (mortgage.loan_purpose == 1)
]
mortgage = mortgage.copy()
mortgage.shape
(10107, 16)

When you're done filtering, save your dataframe as a "copy" with df = df.copy() (if it's called df, of course). This will prevent irritating warnings when you're trying to create new columns.

Confirm that you have 10,107 loans with 16 columns#

mortgage.shape
(10107, 16)

Create a "loan denied" column#

Right now the action_type column reflects whether the loan was granted, with a value of either 1 (originated) or 3 (denied). We'll need to create a new column called loan_denied, where the value is 0 if the loan was accepted and 1 if the loan was denied.

While we're eventually going to do a bunch of crazy comparisons and math inside of our statsmodels formula, we do need the target of our regression to be a number. You'll see what I mean later on!

mortgage['loan_denied'] = (mortgage.action_type == 3).astype(int)

Deal with categorical variables#

Let's go ahead and take a look at our categorical variables:

  • Applicant sex (male, female, na)
  • Applicant race
  • Mortgage agency
  • Co-applicant (yes, no, unknown)

Before we do anything crazy, let's use the codebook to turn them into strings.

  • Tip: We already did this with the co_applicant column, you only need to do the rest
  • Tip: Just use .replace
mortgage.applicant_sex = mortgage.applicant_sex.replace({
    1: 'male',
    2: 'female',
    3: 'na'
})
mortgage.applicant_race = mortgage.applicant_race.replace({
    1: 'native_amer',
    2: 'asian',
    3: 'black',
    4: 'hawaiian',
    5: 'white',
    6: 'na',
    7: 'na',
    8: 'na',
    99: 'latino'
})
mortgage.agency_code = mortgage.agency_code.replace({
    1: 'OCC',
    2: 'FRS',
    3: 'FDIC',
    5: 'NCUA',
    7: 'HUD',
    9: 'CFPB'
})
mortgage.head(3)
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex co_applicant loan_denied
42 4019.00 45 42 female 59 112 1 1 1 1 1 OCC 133.09000 white 5 no 0
43 4099.02 45 42 na 177 375 1 1 1 1 1 OCC 208.56000 na 3 unknown 0
46 4102.00 45 42 male 150 381 1 1 1 1 1 OCC 215.35000 white 5 no 0

Double-check these columns match these values in the first three rows (and yes, you should have a lot of other columns, too).

applicant_sex agency_code applicant_race co_applicant
female OCC white no
na OCC na unknown
male OCC white no

Double-check our mortgage data#

mortgage.head()
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex co_applicant loan_denied
42 4019.00 45 42 female 59 112 1 1 1 1 1 OCC 133.09000 white 5 no 0
43 4099.02 45 42 na 177 375 1 1 1 1 1 OCC 208.56000 na 3 unknown 0
46 4102.00 45 42 male 150 381 1 1 1 1 1 OCC 215.35000 white 5 no 0
48 0312.00 101 42 female 65 136 1 1 1 1 1 OCC 93.11000 asian 5 no 0
51 4036.01 45 42 female 55 196 1 1 1 1 1 OCC 141.83000 asian 5 no 0
mortgage.shape
(10107, 17)

Census data#

Now we just need the final piece to the puzzle, the census data. Read in the census data file, calling the dataframe census.

Tip: As always, be sure to read the tract column in as a string. Interestingly, this time we don't need to worry about the state or county codes in the same way.

Tip: You're going to encounter a problem that you find every time you read in a file from the US government!

census = pd.read_csv("data/nhgis0007_ds215_20155_2015_tract.csv", encoding='latin-1', dtype={'TRACTA': 'str'})
census.head(2)
GISJOIN YEAR REGIONA DIVISIONA STATE STATEA COUNTY COUNTYA COUSUBA PLACEA TRACTA BLKGRPA CONCITA AIANHHA RES_ONLYA TRUSTA AITSCEA ANRCA CBSAA CSAA METDIVA NECTAA CNECTAA NECTADIVA UAA CDCURRA SLDUA SLDLA ZCTA5A SUBMCDA SDELMA SDSECA SDUNIA PUMA5A BTTRA BTBGA NAME_E ADK5E001 ADK5E002 ADK5E003 ADK5E004 ADK5E005 ADK5E006 ADK5E007 ADK5E008 ADK5E009 ADK5E010 ADK5E011 ADK5E012 ADK5E013 ADK5E014 ADK5E015 ADK5E016 ADK5E017 ADK5E018 ADK5E019 ADK5E020 ADK5E021 NAME_M ADK5M001 ADK5M002 ADK5M003 ADK5M004 ADK5M005 ADK5M006 ADK5M007 ADK5M008 ADK5M009 ADK5M010 ADK5M011 ADK5M012 ADK5M013 ADK5M014 ADK5M015 ADK5M016 ADK5M017 ADK5M018 ADK5M019 ADK5M020 ADK5M021
0 G0100010020100 2011-2015 nan nan Alabama 1 Autauga County 1 nan nan 020100 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 201, Autauga County, Alabama 1948 1931 1703 150 6 12 0 0 60 0 60 17 17 0 0 0 0 0 0 0 0 Census Tract 201, Autauga County, Alabama 203 212 229 126 8 16 11 11 44 11 44 21 21 11 11 11 11 11 11 11 11
1 G0100010020200 2011-2015 nan nan Alabama 1 Autauga County 1 nan nan 020200 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 202, Autauga County, Alabama 2156 2139 872 1149 0 50 0 0 68 0 68 17 14 0 0 0 0 3 0 0 0 Census Tract 202, Autauga County, Alabama 268 268 207 250 11 61 11 11 62 11 62 25 23 11 11 11 11 7 11 11 11

Rename some columns#

If you like to keep your data extra clean, feel free to rename the columns you're interested in. If not, feel free to skip it!

Tip: Make sure you're using the estimates columns, not the margin of error columns

# join on STATEA-state code, COUNTYA-county code, TRACTA-census tract (cleaned)
census = census.rename(columns={
    'ADK5E001': 'pop_total',
    'ADK5E003': 'pop_white',
    'ADK5E004': 'pop_black',
    'ADK5E005': 'pop_amer_indian',
    'ADK5E006': 'pop_asian',
    'ADK5E007': 'pop_pac_islander',
    'ADK5E012': 'pop_hispanic'
})
census.head(2).T
0 1
GISJOIN G0100010020100 G0100010020200
YEAR 2011-2015 2011-2015
REGIONA NaN NaN
DIVISIONA NaN NaN
STATE Alabama Alabama
STATEA 1 1
COUNTY Autauga County Autauga County
COUNTYA 1 1
COUSUBA NaN NaN
PLACEA NaN NaN
TRACTA 020100 020200
BLKGRPA NaN NaN
CONCITA NaN NaN
AIANHHA NaN NaN
RES_ONLYA NaN NaN
TRUSTA NaN NaN
AITSCEA NaN NaN
ANRCA NaN NaN
CBSAA NaN NaN
CSAA NaN NaN
METDIVA NaN NaN
NECTAA NaN NaN
CNECTAA NaN NaN
NECTADIVA NaN NaN
UAA NaN NaN
CDCURRA NaN NaN
SLDUA NaN NaN
SLDLA NaN NaN
ZCTA5A NaN NaN
SUBMCDA NaN NaN
SDELMA NaN NaN
SDSECA NaN NaN
SDUNIA NaN NaN
PUMA5A NaN NaN
BTTRA NaN NaN
BTBGA NaN NaN
NAME_E Census Tract 201, Autauga County, Alabama Census Tract 202, Autauga County, Alabama
pop_total 1948 2156
ADK5E002 1931 2139
pop_white 1703 872
pop_black 150 1149
pop_amer_indian 6 0
pop_asian 12 50
pop_pac_islander 0 0
ADK5E008 0 0
ADK5E009 60 68
ADK5E010 0 0
ADK5E011 60 68
pop_hispanic 17 17
ADK5E013 17 14
ADK5E014 0 0
ADK5E015 0 0
ADK5E016 0 0
ADK5E017 0 0
ADK5E018 0 3
ADK5E019 0 0
ADK5E020 0 0
ADK5E021 0 0
NAME_M Census Tract 201, Autauga County, Alabama Census Tract 202, Autauga County, Alabama
ADK5M001 203 268
ADK5M002 212 268
ADK5M003 229 207
ADK5M004 126 250
ADK5M005 8 11
ADK5M006 16 61
ADK5M007 11 11
ADK5M008 11 11
ADK5M009 44 62
ADK5M010 11 11
ADK5M011 44 62
ADK5M012 21 25
ADK5M013 21 23
ADK5M014 11 11
ADK5M015 11 11
ADK5M016 11 11
ADK5M017 11 11
ADK5M018 11 7
ADK5M019 11 11
ADK5M020 11 11
ADK5M021 11 11

Computed columns#

According to Reveal's regression output, you'll want to create the following columns:

  • Percent Black in tract
  • Percent Hispanic/Latino in tract (I hope you know how Hispanic/Latino + census data works by now)
  • Percent Asian in tract
  • Percent Native American in tract
  • Percent Native Hawaiian in tract

Notice that we don't include percent white. Because the group percentages in a tract (plus percent white) sum to roughly 100 percent, percent white is almost completely determined by the other columns, so we leave it out. It works like a reference category.

If we want to use buzzwords here, the technical reason we're not using percent white is called collinearity. We'll talk more about it on Friday.
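If you'd rather build actual columns instead of doing the math inside the formula later, a sketch might look like this (using the renamed pop_ columns, with numbers borrowed from the census output above):

```python
import pandas as pd

# Two example tracts with the renamed census columns
census = pd.DataFrame({
    'pop_total': [4219, 2156],
    'pop_black': [1719, 1149],
    'pop_hispanic': [38, 17],
    'pop_amer_indian': [0, 0],
    'pop_asian': [224, 50],
    'pop_pac_islander': [0, 0],
})

# One percent column per group, and no percent white, on purpose
groups = ['black', 'hispanic', 'amer_indian', 'asian', 'pac_islander']
for group in groups:
    census['pct_' + group] = census['pop_' + group] / census.pop_total * 100

print(census[['pct_black', 'pct_hispanic']].round(2))
```

The pct_ column names here are just one possible convention; use whatever matches the rest of your notebook.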

Merge datasets#

Merge mortgage and census into a new dataframe called merged.

Unfortunately something is a little different between our mortgage and census census tract columns. You'll need to fix it before you can merge.

Cleaning#

mortgage.head(2)
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex co_applicant loan_denied
42 4019.00 45 42 female 59 112 1 1 1 1 1 OCC 133.09000 white 5 no 0
43 4099.02 45 42 na 177 375 1 1 1 1 1 OCC 208.56000 na 3 unknown 0
census.head(2)
GISJOIN YEAR REGIONA DIVISIONA STATE STATEA COUNTY COUNTYA COUSUBA PLACEA TRACTA BLKGRPA CONCITA AIANHHA RES_ONLYA TRUSTA AITSCEA ANRCA CBSAA CSAA METDIVA NECTAA CNECTAA NECTADIVA UAA CDCURRA SLDUA SLDLA ZCTA5A SUBMCDA SDELMA SDSECA SDUNIA PUMA5A BTTRA BTBGA NAME_E pop_total ADK5E002 pop_white pop_black pop_amer_indian pop_asian pop_pac_islander ADK5E008 ADK5E009 ADK5E010 ADK5E011 pop_hispanic ADK5E013 ADK5E014 ADK5E015 ADK5E016 ADK5E017 ADK5E018 ADK5E019 ADK5E020 ADK5E021 NAME_M ADK5M001 ADK5M002 ADK5M003 ADK5M004 ADK5M005 ADK5M006 ADK5M007 ADK5M008 ADK5M009 ADK5M010 ADK5M011 ADK5M012 ADK5M013 ADK5M014 ADK5M015 ADK5M016 ADK5M017 ADK5M018 ADK5M019 ADK5M020 ADK5M021
0 G0100010020100 2011-2015 nan nan Alabama 1 Autauga County 1 nan nan 020100 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 201, Autauga County, Alabama 1948 1931 1703 150 6 12 0 0 60 0 60 17 17 0 0 0 0 0 0 0 0 Census Tract 201, Autauga County, Alabama 203 212 229 126 8 16 11 11 44 11 44 21 21 11 11 11 11 11 11 11 11
1 G0100010020200 2011-2015 nan nan Alabama 1 Autauga County 1 nan nan 020200 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 202, Autauga County, Alabama 2156 2139 872 1149 0 50 0 0 68 0 68 17 14 0 0 0 0 3 0 0 0 Census Tract 202, Autauga County, Alabama 268 268 207 250 11 61 11 11 62 11 62 25 23 11 11 11 11 7 11 11 11
# "." is a regex wildcard by default, so be explicit that we mean a literal period
mortgage['census_tract'] = mortgage['census_tract'].str.replace(".", "", regex=False)
mortgage.head(2)
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex co_applicant loan_denied
42 401900 45 42 female 59 112 1 1 1 1 1 OCC 133.09000 white 5 no 0
43 409902 45 42 na 177 375 1 1 1 1 1 OCC 208.56000 na 3 unknown 0

Do the merge#

merged = mortgage.merge(census,
                         left_on=['state_code', 'county_code', 'census_tract'],
                         right_on=['STATEA', 'COUNTYA', 'TRACTA'])
merged.head()
census_tract county_code state_code applicant_sex income loan_amount loan_type property_type occupancy action_type loan_purpose agency_code tract_to_msa_income_percent applicant_race co_applicant_sex co_applicant loan_denied GISJOIN YEAR REGIONA DIVISIONA STATE STATEA COUNTY COUNTYA COUSUBA PLACEA TRACTA BLKGRPA CONCITA AIANHHA RES_ONLYA TRUSTA AITSCEA ANRCA CBSAA CSAA METDIVA NECTAA CNECTAA NECTADIVA UAA CDCURRA SLDUA SLDLA ZCTA5A SUBMCDA SDELMA SDSECA SDUNIA PUMA5A BTTRA BTBGA NAME_E pop_total ADK5E002 pop_white pop_black pop_amer_indian pop_asian pop_pac_islander ADK5E008 ADK5E009 ADK5E010 ADK5E011 pop_hispanic ADK5E013 ADK5E014 ADK5E015 ADK5E016 ADK5E017 ADK5E018 ADK5E019 ADK5E020 ADK5E021 NAME_M ADK5M001 ADK5M002 ADK5M003 ADK5M004 ADK5M005 ADK5M006 ADK5M007 ADK5M008 ADK5M009 ADK5M010 ADK5M011 ADK5M012 ADK5M013 ADK5M014 ADK5M015 ADK5M016 ADK5M017 ADK5M018 ADK5M019 ADK5M020 ADK5M021
0 401900 45 42 female 59 112 1 1 1 1 1 OCC 133.09000 white 5 no 0 G4200450401900 2011-2015 nan nan Pennsylvania 42 Delaware County 45 nan nan 401900 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 4019, Delaware County, Pennsylvania 4219 4181 2064 1719 0 224 0 0 174 21 153 38 21 0 17 0 0 0 0 0 0 Census Tract 4019, Delaware County, Pennsylvania 210 210 210 224 10 146 10 10 139 34 121 36 24 10 26 10 10 10 10 10 10
1 401900 45 42 male 63 192 1 1 1 1 1 NCUA 133.09000 na 5 no 0 G4200450401900 2011-2015 nan nan Pennsylvania 42 Delaware County 45 nan nan 401900 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 4019, Delaware County, Pennsylvania 4219 4181 2064 1719 0 224 0 0 174 21 153 38 21 0 17 0 0 0 0 0 0 Census Tract 4019, Delaware County, Pennsylvania 210 210 210 224 10 146 10 10 139 34 121 36 24 10 26 10 10 10 10 10 10
2 401900 45 42 female 80 105 1 1 1 1 1 FDIC 133.09000 black 1 yes 0 G4200450401900 2011-2015 nan nan Pennsylvania 42 Delaware County 45 nan nan 401900 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 4019, Delaware County, Pennsylvania 4219 4181 2064 1719 0 224 0 0 174 21 153 38 21 0 17 0 0 0 0 0 0 Census Tract 4019, Delaware County, Pennsylvania 210 210 210 224 10 146 10 10 139 34 121 36 24 10 26 10 10 10 10 10 10
3 401900 45 42 female 84 128 1 1 1 1 1 NCUA 133.09000 white 1 yes 0 G4200450401900 2011-2015 nan nan Pennsylvania 42 Delaware County 45 nan nan 401900 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 4019, Delaware County, Pennsylvania 4219 4181 2064 1719 0 224 0 0 174 21 153 38 21 0 17 0 0 0 0 0 0 Census Tract 4019, Delaware County, Pennsylvania 210 210 210 224 10 146 10 10 139 34 121 36 24 10 26 10 10 10 10 10 10
4 401900 45 42 female 26 168 1 1 1 3 1 NCUA 133.09000 black 5 no 1 G4200450401900 2011-2015 nan nan Pennsylvania 42 Delaware County 45 nan nan 401900 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Census Tract 4019, Delaware County, Pennsylvania 4219 4181 2064 1719 0 224 0 0 174 21 153 38 21 0 17 0 0 0 0 0 0 Census Tract 4019, Delaware County, Pennsylvania 210 210 210 224 10 146 10 10 139 34 121 36 24 10 26 10 10 10 10 10 10

Confirm you have 10107 rows and 97 columns in the merged dataframe.

merged.shape
(10107, 97)

WE AREN'T DROPPING NA DATA#


merged.head(2).T
0 1
census_tract 401900 401900
county_code 45 45
state_code 42 42
applicant_sex female male
income 59 63
loan_amount 112 192
loan_type 1 1
property_type 1 1
occupancy 1 1
action_type 1 1
loan_purpose 1 1
agency_code OCC NCUA
tract_to_msa_income_percent 133.09000 133.09000
applicant_race white na
co_applicant_sex 5 5
co_applicant no no
loan_denied 0 0
GISJOIN G4200450401900 G4200450401900
YEAR 2011-2015 2011-2015
REGIONA NaN NaN
DIVISIONA NaN NaN
STATE Pennsylvania Pennsylvania
STATEA 42 42
COUNTY Delaware County Delaware County
COUNTYA 45 45
COUSUBA NaN NaN
PLACEA NaN NaN
TRACTA 401900 401900
BLKGRPA NaN NaN
CONCITA NaN NaN
AIANHHA NaN NaN
RES_ONLYA NaN NaN
TRUSTA NaN NaN
AITSCEA NaN NaN
ANRCA NaN NaN
CBSAA NaN NaN
CSAA NaN NaN
METDIVA NaN NaN
NECTAA NaN NaN
CNECTAA NaN NaN
NECTADIVA NaN NaN
UAA NaN NaN
CDCURRA NaN NaN
SLDUA NaN NaN
SLDLA NaN NaN
ZCTA5A NaN NaN
SUBMCDA NaN NaN
SDELMA NaN NaN
SDSECA NaN NaN
SDUNIA NaN NaN
PUMA5A NaN NaN
BTTRA NaN NaN
BTBGA NaN NaN
NAME_E Census Tract 4019, Delaware County, Pennsylvania Census Tract 4019, Delaware County, Pennsylvania
pop_total 4219 4219
ADK5E002 4181 4181
pop_white 2064 2064
pop_black 1719 1719
pop_amer_indian 0 0
pop_asian 224 224
pop_pac_islander 0 0
ADK5E008 0 0
ADK5E009 174 174
ADK5E010 21 21
ADK5E011 153 153
pop_hispanic 38 38
ADK5E013 21 21
ADK5E014 0 0
ADK5E015 17 17
ADK5E016 0 0
ADK5E017 0 0
ADK5E018 0 0
ADK5E019 0 0
ADK5E020 0 0
ADK5E021 0 0
NAME_M Census Tract 4019, Delaware County, Pennsylvania Census Tract 4019, Delaware County, Pennsylvania
ADK5M001 210 210
ADK5M002 210 210
ADK5M003 210 210
ADK5M004 224 224
ADK5M005 10 10
ADK5M006 146 146
ADK5M007 10 10
ADK5M008 10 10
ADK5M009 139 139
ADK5M010 34 34
ADK5M011 121 121
ADK5M012 36 36
ADK5M013 24 24
ADK5M014 10 10
ADK5M015 26 26
ADK5M016 10 10
ADK5M017 10 10
ADK5M018 10 10
ADK5M019 10 10
ADK5M020 10 10
ADK5M021 10 10

Performing our regression#

Note: When working with statsmodels formulas, you don't need to drop missing data yourself. Rows with missing values are dropped automatically when the formula builds the design matrix.

Working with statsmodels formulas#

Statsmodels formulas are a fun (yes, fun! exciting! amazing!) way to write regressions. For example, I can write a formula to say "calculate the relationship between a loan being denied in relation to the loan amount and the applicant's income."

import statsmodels.formula.api as smf

model = smf.logit("""
    loan_denied ~ loan_amount + income
""", data=merged)

result = model.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.365688
         Iterations 7
Logit Regression Results
Dep. Variable: loan_denied No. Observations: 10107
Model: Logit Df Residuals: 10104
Method: MLE Df Model: 2
Date: Thu, 07 Nov 2019 Pseudo R-squ.: 0.01192
Time: 13:13:01 Log-Likelihood: -3696.0
converged: True LL-Null: -3740.6
Covariance Type: nonrobust LLR p-value: 4.380e-20
coef std err z P>|z| [0.025 0.975]
Intercept -1.5269 0.058 -26.406 0.000 -1.640 -1.414
loan_amount -0.0015 0.000 -5.093 0.000 -0.002 -0.001
income -0.0010 0.000 -1.952 0.051 -0.002 4.04e-06

But that's child's play: we're here to get a little bit crazy.

Formulas and calculations in statsmodels formulas#

Let's do this! Instead of building new columns in pandas, we're just going to tell statsmodels to do it for us. This uses a library called Patsy, which imitates formulas from the R programming language.

| description | pandas style | formula style |
| --- | --- | --- |
| Multiply column | df.colname * 100 | np.multiply(colname, 100) |
| Divide columns | df.loan_amount / df.income | np.divide(loan_amount, income) |
| Percentage | df.pop_black / df.pop_total * 100 | np.multiply(pop_black / pop_total, 100) |
| Calculate log | np.log(df.income) | np.log(income) |
| One-hot encoding | pd.get_dummies(df.agency_code).drop('FDIC', axis=1) | C(agency_code, Treatment('FDIC')) |

If you haven't heard of one-hot encoding before, I recommend reading the longer version of this notebook! Or looking at what happens down below and thinking it through.
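If you want to see it with your own eyes, here's a tiny sketch (made-up agency values) of what the one-hot side of that table produces:

```python
import pandas as pd

# A made-up sample of the agency column
agencies = pd.Series(['FDIC', 'OCC', 'NCUA', 'FDIC', 'HUD'])

# One-hot encoding: one 0/1 column per category...
dummies = pd.get_dummies(agencies)

# ...then drop the reference category. An FDIC row is simply
# "all zeros everywhere else," which is what Treatment('FDIC')
# arranges inside the formula
dummies = dummies.drop('FDIC', axis=1)
print(dummies)
```

Every coefficient on one of these columns is then read relative to the dropped reference category.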

If we follow Reveal's methodology, we have a nice long list of features to include in our formula. Turning them all into a statsmodels/Patsy formula, the result looks like this:

import statsmodels.formula.api as smf

model = smf.logit("""
    loan_denied ~ 
        tract_to_msa_income_percent
        + np.log(income)
        + np.log(loan_amount)
        + np.divide(loan_amount, income)
        + C(co_applicant, Treatment('no'))
        + C(applicant_sex, Treatment('female'))
        + C(applicant_race, Treatment('white'))
        + C(agency_code, Treatment('FDIC'))
        + np.multiply(pop_hispanic / pop_total, 100)
        + np.multiply(pop_black / pop_total, 100)
        + np.multiply(pop_amer_indian / pop_total, 100)
        + np.multiply(pop_asian / pop_total, 100)
        + np.multiply(pop_pac_islander / pop_total, 100)
""", data=merged)

result = model.fit()
result.summary()
Optimization terminated successfully.
         Current function value: 0.334016
         Iterations 7
Logit Regression Results
Dep. Variable: loan_denied No. Observations: 10107
Model: Logit Df Residuals: 10082
Method: MLE Df Model: 24
Date: Tue, 05 Nov 2019 Pseudo R-squ.: 0.09749
Time: 12:37:44 Log-Likelihood: -3375.9
converged: True LL-Null: -3740.6
Covariance Type: nonrobust LLR p-value: 1.629e-138
coef std err z P>|z| [0.025 0.975]
Intercept -0.6024 0.306 -1.971 0.049 -1.202 -0.003
C(co_applicant, Treatment('no'))[T.unknown] 0.3957 0.207 1.914 0.056 -0.010 0.801
C(co_applicant, Treatment('no'))[T.yes] -0.0958 0.078 -1.222 0.222 -0.249 0.058
C(applicant_sex, Treatment('female'))[T.male] 0.1213 0.070 1.722 0.085 -0.017 0.259
C(applicant_sex, Treatment('female'))[T.na] -0.1091 0.176 -0.618 0.537 -0.455 0.237
C(applicant_race, Treatment('white'))[T.asian] 0.3783 0.105 3.595 0.000 0.172 0.584
C(applicant_race, Treatment('white'))[T.black] 0.7496 0.115 6.506 0.000 0.524 0.975
C(applicant_race, Treatment('white'))[T.hawaiian] 1.0989 0.464 2.368 0.018 0.189 2.008
C(applicant_race, Treatment('white'))[T.latino] 0.3232 0.164 1.970 0.049 0.002 0.645
C(applicant_race, Treatment('white'))[T.na] 0.4511 0.119 3.798 0.000 0.218 0.684
C(applicant_race, Treatment('white'))[T.native_amer] 1.0546 0.591 1.785 0.074 -0.103 2.213
C(agency_code, Treatment('FDIC'))[T.CFPB] 1.1066 0.136 8.158 0.000 0.841 1.372
C(agency_code, Treatment('FDIC'))[T.FRS] -0.1133 0.221 -0.513 0.608 -0.546 0.319
C(agency_code, Treatment('FDIC'))[T.HUD] 0.1170 0.140 0.837 0.402 -0.157 0.391
C(agency_code, Treatment('FDIC'))[T.NCUA] 1.3009 0.153 8.507 0.000 1.001 1.601
C(agency_code, Treatment('FDIC'))[T.OCC] 0.3160 0.201 1.575 0.115 -0.077 0.709
tract_to_msa_income_percent 0.0017 0.001 2.627 0.009 0.000 0.003
np.log(income) -0.3555 0.070 -5.104 0.000 -0.492 -0.219
np.log(loan_amount) -0.2283 0.056 -4.060 0.000 -0.338 -0.118
np.divide(loan_amount, income) 0.0110 0.007 1.476 0.140 -0.004 0.026
np.multiply(pop_hispanic / pop_total, 100) 0.0072 0.004 2.050 0.040 0.000 0.014
np.multiply(pop_black / pop_total, 100) 0.0062 0.002 3.884 0.000 0.003 0.009
np.multiply(pop_amer_indian / pop_total, 100) -0.2571 0.097 -2.652 0.008 -0.447 -0.067
np.multiply(pop_asian / pop_total, 100) 0.0107 0.004 2.427 0.015 0.002 0.019
np.multiply(pop_pac_islander / pop_total, 100) 0.0904 0.159 0.568 0.570 -0.221 0.402

It's beautiful! It's amazing! We just wrote our formulas inside of the statsmodels formula, and it did all the work for us.

I'm not crying, you're crying.
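Before we sort anything, note that the only math involved in an odds ratio is exponentiating the coefficient. As a quick check, take the Black-applicant coefficient from the summary above:

```python
import numpy as np

# Logit coefficients are changes in log-odds; exponentiating
# converts them to odds ratios
coef = 0.7496  # applicant_race black, from the summary above
odds_ratio = np.exp(coef)
print(round(odds_ratio, 2))
```

An odds ratio around 2.1 means that, holding every other variable in the model constant, Black applicants in this data had roughly twice the odds of being denied compared with the white reference group.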

Sorting by odds ratio#

Since we're interested in the odds ratio, it makes sense to reformat our results, add an odds ratio column, and sort the output.

feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(result.params.values),
    'pvalue': result.pvalues
}).sort_values(by='odds ratio', ascending=False)
coefs
coef odds ratio pvalue
C(agency_code, Treatment('FDIC'))[T.NCUA] 1.30091 3.67262 0.00000
C(agency_code, Treatment('FDIC'))[T.CFPB] 1.10661 3.02410 0.00000
C(applicant_race, Treatment('white'))[T.hawaiian] 1.09891 3.00090 0.01787
C(applicant_race, Treatment('white'))[T.native_amer] 1.05461 2.87087 0.07426
C(applicant_race, Treatment('white'))[T.black] 0.74963 2.11622 0.00000
C(applicant_race, Treatment('white'))[T.na] 0.45112 1.57007 0.00015
C(co_applicant, Treatment('no'))[T.unknown] 0.39570 1.48543 0.05568
C(applicant_race, Treatment('white'))[T.asian] 0.37826 1.45974 0.00032
C(applicant_race, Treatment('white'))[T.latino] 0.32318 1.38151 0.04881
C(agency_code, Treatment('FDIC'))[T.OCC] 0.31598 1.37160 0.11521
C(applicant_sex, Treatment('female'))[T.male] 0.12133 1.12900 0.08514
C(agency_code, Treatment('FDIC'))[T.HUD] 0.11701 1.12413 0.40245
np.multiply(pop_pac_islander / pop_total, 100) 0.09038 1.09459 0.56988
np.divide(loan_amount, income) 0.01096 1.01102 0.13984
np.multiply(pop_asian / pop_total, 100) 0.01069 1.01074 0.01522
np.multiply(pop_hispanic / pop_total, 100) 0.00719 1.00721 0.04041
np.multiply(pop_black / pop_total, 100) 0.00623 1.00625 0.00010
tract_to_msa_income_percent 0.00175 1.00175 0.00862
C(co_applicant, Treatment('no'))[T.yes] -0.09583 0.90862 0.22154
C(applicant_sex, Treatment('female'))[T.na] -0.10905 0.89668 0.53657
C(agency_code, Treatment('FDIC'))[T.FRS] -0.11333 0.89285 0.60766
np.log(loan_amount) -0.22825 0.79592 0.00005
np.multiply(pop_amer_indian / pop_total, 100) -0.25706 0.77332 0.00801
np.log(income) -0.35546 0.70085 0.00000
Intercept -0.60242 0.54748 0.04874
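Before we clean this up, it's worth remembering where that odds ratio column comes from: each logistic regression coefficient is a change in log-odds, and exponentiating it gives the odds ratio. You can verify a row by hand, using the Black-applicant coefficient from the table above:

```python
import numpy as np

# Each logistic regression coefficient is a change in log-odds;
# exponentiating converts it to an odds ratio
coef_black = 0.74963  # applicant_race black coefficient from the table above
print(round(np.exp(coef_black), 5))  # matches the ~2.116 odds ratio in the table
```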

And there we go! Except, of course, it's very ugly.

While writing the formula was a lot nicer and less error-prone (I think) than building a dataframe of our features, the output is horrendous. Those feature names are terrible! It looks so so so bad.

We thought we had a beautiful world ahead of us in terms of writing nice precise formulas instead of wrangling new columns, but it looks like it won't be quite so simple.

Renaming our output fields#

But don't worry: even if we love the formula method, we don't have to suffer through those results.

Yes, we can understand the columns, but it's a lot of work to really read what's going on. Life would be much nicer if C(applicant_race, Treatment('white'))[T.latino] were just applicant_race_latino. You'd have to remember what the reference category is, but if you think you can handle that, let's head onward.

While statsmodels doesn't make it easy, it's definitely possible to reach in and rename our features. We do it like this:

# Copy the names to a pd.Series for easy search/replace
# We'll also keep a safe copy to make double-checking easy later
names = pd.Series(model.data.xnames)
originals = list(names.copy())

# Reformat 'C(agency_code, Treatment('FDIC'))[T.FRS]' as 'agency_code_FRS'
names = names.str.replace(r", ?Treatment\(.*\)", r"", regex=True)
names = names.str.replace(r"C\(([\w]+)", r"\1_", regex=True)
names = names.str.replace(r"\[T\.(.*)\]", r"\1", regex=True)

# Manually replace other ones
names = names.replace({
    'np.multiply(pop_hispanic / pop_total, 100)': 'pop_hispanic',
    'np.multiply(pop_black / pop_total, 100)': 'pop_black',
    'np.multiply(pop_amer_indian / pop_total, 100)': 'pop_amer_indian',
    'np.multiply(pop_asian / pop_total, 100)': 'pct_asian',
    'np.multiply(pop_pac_islander / pop_total, 100)': 'pop_pac_islander',
    'np.log(income)': 'log_income',
    'np.log(loan_amount)': 'log_loan',
    'np.divide(loan_amount, income)': 'loan_income_ratio',    
})

# Assign back into the model for display
model.data.xnames = list(names)

# Redo our summary, and we get nice output!
result.summary()
Logit Regression Results
Dep. Variable: loan_denied No. Observations: 10107
Model: Logit Df Residuals: 10082
Method: MLE Df Model: 24
Date: Tue, 05 Nov 2019 Pseudo R-squ.: 0.09749
Time: 12:39:26 Log-Likelihood: -3375.9
converged: True LL-Null: -3740.6
Covariance Type: nonrobust LLR p-value: 1.629e-138
coef std err z P>|z| [0.025 0.975]
Intercept -0.6024 0.306 -1.971 0.049 -1.202 -0.003
co_applicant_unknown 0.3957 0.207 1.914 0.056 -0.010 0.801
co_applicant_yes -0.0958 0.078 -1.222 0.222 -0.249 0.058
applicant_sex_male 0.1213 0.070 1.722 0.085 -0.017 0.259
applicant_sex_na -0.1091 0.176 -0.618 0.537 -0.455 0.237
applicant_race_asian 0.3783 0.105 3.595 0.000 0.172 0.584
applicant_race_black 0.7496 0.115 6.506 0.000 0.524 0.975
applicant_race_hawaiian 1.0989 0.464 2.368 0.018 0.189 2.008
applicant_race_latino 0.3232 0.164 1.970 0.049 0.002 0.645
applicant_race_na 0.4511 0.119 3.798 0.000 0.218 0.684
applicant_race_native_amer 1.0546 0.591 1.785 0.074 -0.103 2.213
agency_code_CFPB 1.1066 0.136 8.158 0.000 0.841 1.372
agency_code_FRS -0.1133 0.221 -0.513 0.608 -0.546 0.319
agency_code_HUD 0.1170 0.140 0.837 0.402 -0.157 0.391
agency_code_NCUA 1.3009 0.153 8.507 0.000 1.001 1.601
agency_code_OCC 0.3160 0.201 1.575 0.115 -0.077 0.709
tract_to_msa_income_percent 0.0017 0.001 2.627 0.009 0.000 0.003
log_income -0.3555 0.070 -5.104 0.000 -0.492 -0.219
log_loan -0.2283 0.056 -4.060 0.000 -0.338 -0.118
loan_income_ratio 0.0110 0.007 1.476 0.140 -0.004 0.026
pop_hispanic 0.0072 0.004 2.050 0.040 0.000 0.014
pop_black 0.0062 0.002 3.884 0.000 0.003 0.009
pop_amer_indian -0.2571 0.097 -2.652 0.008 -0.447 -0.067
pct_asian 0.0107 0.004 2.427 0.015 0.002 0.019
pop_pac_islander 0.0904 0.159 0.568 0.570 -0.221 0.402
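As a side note: if those chained regexes make you nervous, you can sanity-check them on a single raw name. A standalone sketch (note that newer pandas versions require an explicit regex=True for pattern replacements):

```python
import pandas as pd

# Run the same three replacements on one raw patsy-style name
s = pd.Series(["C(agency_code, Treatment('FDIC'))[T.FRS]"])
s = s.str.replace(r", ?Treatment\(.*\)", r"", regex=True)
s = s.str.replace(r"C\(([\w]+)", r"\1_", regex=True)
s = s.str.replace(r"\[T\.(.*)\]", r"\1", regex=True)
print(s[0])  # agency_code_FRS
```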

Everything still works great! We can rebuild our coefficient/odds ratio/p-value situation without any trouble at all.

feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(result.params.values),
    'pvalue': result.pvalues,
    'original': originals
}).sort_values(by='odds ratio', ascending=False)
coefs
coef odds ratio pvalue original
agency_code_NCUA 1.30091 3.67262 0.00000 C(agency_code, Treatment('FDIC'))[T.NCUA]
agency_code_CFPB 1.10661 3.02410 0.00000 C(agency_code, Treatment('FDIC'))[T.CFPB]
applicant_race_hawaiian 1.09891 3.00090 0.01787 C(applicant_race, Treatment('white'))[T.hawaiian]
applicant_race_native_amer 1.05461 2.87087 0.07426 C(applicant_race, Treatment('white'))[T.native...
applicant_race_black 0.74963 2.11622 0.00000 C(applicant_race, Treatment('white'))[T.black]
applicant_race_na 0.45112 1.57007 0.00015 C(applicant_race, Treatment('white'))[T.na]
co_applicant_unknown 0.39570 1.48543 0.05568 C(co_applicant, Treatment('no'))[T.unknown]
applicant_race_asian 0.37826 1.45974 0.00032 C(applicant_race, Treatment('white'))[T.asian]
applicant_race_latino 0.32318 1.38151 0.04881 C(applicant_race, Treatment('white'))[T.latino]
agency_code_OCC 0.31598 1.37160 0.11521 C(agency_code, Treatment('FDIC'))[T.OCC]
applicant_sex_male 0.12133 1.12900 0.08514 C(applicant_sex, Treatment('female'))[T.male]
agency_code_HUD 0.11701 1.12413 0.40245 C(agency_code, Treatment('FDIC'))[T.HUD]
pop_pac_islander 0.09038 1.09459 0.56988 np.multiply(pop_pac_islander / pop_total, 100)
loan_income_ratio 0.01096 1.01102 0.13984 np.divide(loan_amount, income)
pct_asian 0.01069 1.01074 0.01522 np.multiply(pop_asian / pop_total, 100)
pop_hispanic 0.00719 1.00721 0.04041 np.multiply(pop_hispanic / pop_total, 100)
pop_black 0.00623 1.00625 0.00010 np.multiply(pop_black / pop_total, 100)
tract_to_msa_income_percent 0.00175 1.00175 0.00862 tract_to_msa_income_percent
co_applicant_yes -0.09583 0.90862 0.22154 C(co_applicant, Treatment('no'))[T.yes]
applicant_sex_na -0.10905 0.89668 0.53657 C(applicant_sex, Treatment('female'))[T.na]
agency_code_FRS -0.11333 0.89285 0.60766 C(agency_code, Treatment('FDIC'))[T.FRS]
log_loan -0.22825 0.79592 0.00005 np.log(loan_amount)
pop_amer_indian -0.25706 0.77332 0.00801 np.multiply(pop_amer_indian / pop_total, 100)
log_income -0.35546 0.70085 0.00000 np.log(income)
Intercept -0.60242 0.54748 0.04874 Intercept
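One extra step you might take before interpreting (not part of Reveal's published methodology, just a common habit): filter the table to coefficients that are significant at the 0.05 level, so features like agency_code_OCC don't distract you. A sketch using a few rows copied from the table above:

```python
import pandas as pd

# A few rows copied from the results above, to illustrate the filter
coefs = pd.DataFrame({
    'coef': [1.30091, 0.31598, 0.74963],
    'odds ratio': [3.67262, 1.37160, 2.11622],
    'pvalue': [0.00000, 0.11521, 0.00000],
}, index=['agency_code_NCUA', 'agency_code_OCC', 'applicant_race_black'])

# Keep only coefficients significant at the 0.05 level
significant = coefs[coefs.pvalue < 0.05]
print(significant.index.tolist())  # OCC drops out
```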

Interpreting and thinking about the analysis#

Question 1#

Our results aren't exactly the same as Reveal's, as I pulled a slightly different number of rows from the database and I'm not sure what exact dataset they used for census information. How are we feeling about this reproduction? You might want to check their 2015 results in the whitepaper.

# I mean come on it's pretty close

Question 2#

In the opening paragraph to the flagship piece, Aaron and Emmanuel write:

Fifty years after the federal Fair Housing Act banned racial discrimination in lending, African Americans and Latinos continue to be routinely denied conventional mortgage loans at rates far higher than their white counterparts.

If you look at the results, Hawaiian/Pacific Islanders (and maybe Native Americans) have an even higher odds ratio. Why do they choose to talk about African Americans and Latinos instead?

# Not nearly as many of those two groups
# And I mean like REALLY not that many
train_df.loc[:,"pct_hispanic":"pct_pac_islander"].median()
pct_hispanic        3.831509
pct_black           7.453754
pct_amer_indian     0.000000
pct_asian           5.071502
pct_pac_islander    0.000000
dtype: float64

Question 3#

Write a sentence expressing the meaning of the odds ratio statistic for Black mortgage applicants. Find a line in the Reveal piece where they use the odds ratio.

# “I had a fair amount of savings and still had so much trouble just left and
# right,” said Rachelle Faroul, a 33-year-old black woman who was rejected twice
# by lenders when she tried to buy a brick row house close to Malcolm X Park in
# Philadelphia, where Reveal found African Americans were 2.7 times as likely as
# whites to be denied a conventional mortgage.

Question 4#

Write a similar sentence about men.

# Men's odds of being denied a conventional mortgage are about 13% higher than
# women's (odds ratio 1.129 - though with p=0.085 it isn't significant at the 0.05 level)
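The percentage interpretation comes straight from the odds ratio: exponentiate the coefficient, subtract one, multiply by a hundred. The arithmetic, using the applicant_sex_male coefficient from the summary above:

```python
import numpy as np

coef_male = 0.12133  # applicant_sex_male, from the summary above
pct_higher_odds = (np.exp(coef_male) - 1) * 100
print(round(pct_higher_odds, 1))  # ~12.9
```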

Question 5#

Why did Aaron and Emmanuel choose to include the loan-to-income ratio statistic? You might want to read the whitepaper.

# Loan-to-income ratio: Looking at the raw numbers of an applicant’s income and
# loan amount doesn’t tell the whole picture. We needed to look at how much money
# applicants wanted to take out in relation to their income. This provides a proxy
# for whether or not the loan amount was manageable compared with the applicant’s income.
# Experts agreed that this variable should be included.

Question 6#

Credit score is a common reason why loans are denied. Why are credit scores not included in our analysis? You might want to read the whitepaper.

# They aren't available!

Question 7#

This data was just sitting out there for anyone to look at; nobody even needed to FOIA it. Why do you think this issue had not come up before Reveal's analysis?

# This is just an opinion! but I asked them:
# emmanuel - the data is just so big
# aaron - it's a blind spot of journalism, business press only writes about profit

Question 8#

As a result of this series, a lot has happened, although recent changes don't look so good. If you were reporting this story, what groups of people would you want to talk to in order to make sure you're getting the story right?

# aaron and emmanuel talked to experts in research, along with enforcement officers and others at CFPB

Question 9#

When they were consulting experts, Aaron and Emmanuel received a lot of conflicting accounts about whether they should include the "N/A" values for race (they ended up including it). If the experts disagreed about something like that, why do you think they went forward with their analysis?

# Experts will rarely agree, it eventually comes down to editorial judgment. There's not always
# a very clear delineation of right/wrong

Question 10#

What if we were working on this story, and our logistic regression or input dataset were flawed? What would be the repercussions?

# We would be making pretty big claims that weren't backed up - it wouldn't just be us having
# to research more, it would be us actually staking our credibility on the line