Lending disparities using Logistic Regression#
The story: https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/
Authors: Aaron Glantz and Emmanuel Martinez
Topics: Logistic regression, odds ratios
Datasets
- philadelphia-mortgages.csv: Philadelphia mortgage data for 2015
  - A subset of HMDA LAR data from the FFIEC
  - Codebook is 2015HMDACodeSheet.pdf
  - A guide to HMDA reporting
  - I've massaged it slightly to make processing a bit easier
- nhgis0006_ds233_20175_2017_tract.csv:
  - Table B03002: Hispanic or Latino Origin by Race
  - 2013-2017 American Community Survey data from the US Census Bureau, via NHGIS
  - Codebook is nhgis0006_ds233_20175_2017_tract_codebook.txt
- lending_disparities_whitepaper_180214.pdf: the whitepaper outlining Reveal's methodology
What's the goal?#
Do banks provide mortgages at disparate rates between white applicants and people of color? We're going to look at the following variables to find out:
- Race/Ethnicity
- Native American
- Asian
- Black
- Native Hawaiian
- Hispanic/Latino
- Race and ethnicity were not reported
- Sex
- Whether there was a co-applicant
- Applicant’s annual income (includes co-applicant income)
- Loan amount
- Ratio between the loan amount and the applicant’s income
- Ratio between the median income of the census tract and the median income of the Metro area
- Racial and ethnic breakdown by percentage for each census tract
- Regulating agency of the lending institution
Setup#
Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.
Some of our datasets have a lot of columns, so you'll also want to use pd.set_option
to display up to 100 columns or so.
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)
What is each row of our data?#
If you aren't sure, you might need to look at either the whitepaper or the codebook. You'll need to look at them both eventually, so might as well get started now.
Read in your data#
Read in our Philadelphia mortgage data and take a peek at the first few rows.
- Tip: As always, census tract columns like to cause problems if they're read in as numbers. Make sure pandas reads them in as strings.
# We're just looking at Philly
mortgage = pd.read_csv("data/philadelphia-mortgages.csv", dtype={ 'census_tract': 'str'})
mortgage.head(5)
Check your column types#
I mentioned it above, but make sure census_tract is an object (a string), or merging isn't going to be any fun later on.
mortgage.dtypes
Engineering and cleaning up features#
Income-related columns#
When we plotted the number of applicants, how much money they made and the size of the loan, we found that the distributions were heavily skewed: the majority of applicants were clustered on the lower end of the income and loan amount scales, with a long tail of much larger values. This was especially true for applicants of color. We took the logarithm of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.
Traditionally we would calculate these values and save them in new columns. Since it's just simple math, though, we'll be able to do it when we run the regression.
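Here's a minimal sketch of what that looks like: np.log() works element-wise on a column, and (as you'll see later) the very same expression can be written inline inside the statsmodels formula, so we don't need to store separate log-income or log-loan-amount columns.

# A quick peek, nothing is saved: np.log() applied to the income column.
# Later on the same math can go straight into the regression formula,
# e.g. "np.log(income)", instead of living in its own column.
np.log(mortgage.income).head()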
Co-applicants#
Right now we have a column about the co-applicant's sex (see the codebook for column details). We don't want the sex itself, though; we're interested in whether there is a co-applicant or not. Use the co-applicant's sex to create a new column called co_applicant that is either 'yes', 'no', or 'unknown'.
- Hint: If the co-applicant's sex was not provided or is not applicable, count it as unknown.
- Hint: The easiest way is to use .replace on the co-applicant sex column, but store the result in your new column.
# HMDA co-applicant sex codes: 1 = male, 2 = female,
# 3 = information not provided, 4 = not applicable, 5 = no co-applicant
mortgage['co_applicant'] = mortgage.co_applicant_sex.replace({
    1: 'yes',
    2: 'yes',
    3: 'unknown',
    4: 'unknown',
    5: 'no'
})
mortgage.head()
Filter loan applicants#
If you read the whitepaper (lending_disparities_whitepaper_180214.pdf), you'll see that many filters are used to get to the target dataset for analysis.
Loan type
While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.
Property type
Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.
Occupancy
We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.
Action Type
We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3).
Income
An applicant’s income isn’t always reported in the data. In other cases, the data cuts off any incomes over \$9.9 million and any loan amounts over \$99.9 million, meaning there’s a value in the database, but it’s not precise. We focused only on those records where income and loan amount have an accurate estimation. This meant discarding about 1 percent of all conventional home loans in the country for 2016. [Note: I already edited this]
When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.
Lien status
We included all cases in our analysis regardless of lien status.
Race and ethnicity
At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc. [Note: This has already been taken care of]
Loan purpose
We decided to look at home purchase, home improvement and refinance loans separately from each other. [Note: please look at home purchase loans.]
Use the text above (it's from the whitepaper) and the 2015HMDACodeSheet.pdf code book to filter the dataset.
- Tip: there should be between 5 and 8 filters, depending on how you write them.
# Filters from the whitepaper (codes are in 2015HMDACodeSheet.pdf)
mortgage = mortgage[
    (mortgage.loan_type == 1) &            # conventional loans only
    (mortgage.property_type == 1) &        # one- to four-unit properties
    (mortgage.occupancy == 1) &            # owner-occupied
    (mortgage.action_type.isin([1, 3])) &  # originated (1) or denied (3)
    (mortgage.income != 9999) &            # income isn't top-coded
    (mortgage.loan_amount != 99999) &      # loan amount isn't top-coded
    (mortgage.loan_purpose == 1)           # home purchase loans
]
mortgage = mortgage.copy()
mortgage.shape
When you're done filtering, save your dataframe as a "copy" with df = df.copy() (if it's called df, of course). This will prevent irritating warnings when you're trying to create new columns.
Confirm that you have 10,107 loans with 19 columns#
mortgage.shape
Create a "loan denied" column#
Right now the action_type column reflects whether the loan was granted or not, and has a value of either 1 or 3. We'll need to create a new column called loan_denied, where the value is 0 if the loan was accepted and 1 if the loan was denied.
While we're eventually going to do a bunch of crazy comparisons and math inside of our statsmodels formula, we do need the target of our regression to be a number. You'll see what I mean later on!
# action_type 3 means the application was denied; the comparison gives
# True/False, and .astype(int) turns that into 1/0
mortgage['loan_denied'] = (mortgage.action_type == 3).astype(int)
Deal with categorical variables#
Let's go ahead and take a look at our categorical variables:
- Applicant sex (male, female, na)
- Applicant race
- Mortgage agency
- Co-applicant (yes, no, unknown)
Before we do anything crazy, let's use the codebook to turn them into strings.
- Tip: We already did this with the co_applicant column, so you only need to do the rest
- Tip: Just use .replace
mortgage.applicant_sex = mortgage.applicant_sex.replace({
1: 'male',
2: 'female',
3: 'na'
})
mortgage.applicant_race = mortgage.applicant_race.replace({
1: 'native_amer',
2: 'asian',
3: 'black',
4: 'hawaiian',
5: 'white',
6: 'na',
7: 'na',
8: 'na',
99: 'latino'
})
mortgage.agency_code = mortgage.agency_code.replace({
1: 'OCC',
2: 'FRS',
3: 'FDIC',
5: 'NCUA',
7: 'HUD',
9: 'CFPB'
})
mortgage.head(3)
Double-check these columns match these values in the first three rows (and yes, you should have a lot of other columns, too).
| applicant_sex | agency_code | applicant_race | co_applicant |
|---|---|---|---|
| female | OCC | white | no |
| na | OCC | na | unknown |
| male | OCC | white | no |
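If you'd rather not scan the full .head() output, an optional check of just those four columns looks like this:

# Optional double-check: only the recoded columns, first three rows
mortgage[['applicant_sex', 'agency_code', 'applicant_race', 'co_applicant']].head(3)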
Double-check our mortgage data#
mortgage.head()
mortgage.shape
Census data#
Now we just need the final piece of the puzzle: the census data. Read in the census data file, calling the dataframe census.
Tip: As always, be sure to read the tract column in as a string. Interestingly, this time we don't need to worry about the state or county codes in the same way.
Tip: You're going to encounter a problem that you find every time you read in a file from the US government!
census = pd.read_csv("data/nhgis0007_ds215_20155_2015_tract.csv", encoding='latin-1', dtype={'TRACTA': 'str'})
census.head(2)
Rename some columns#
If you like to keep your data extra clean, feel free to rename the columns you're interested in. If not, feel free to skip it!
Tip: Make sure you're using the estimates columns, not the margin of error columns
# join on STATEA-state code, COUNTYA-county code, TRACTA-census tract (cleaned)
census = census.rename(columns={
'ADK5E001': 'pop_total',
'ADK5E003': 'pop_white',
'ADK5E004': 'pop_black',
'ADK5E005': 'pop_amer_indian',
'ADK5E006': 'pop_asian',
'ADK5E007': 'pop_pac_islander',
'ADK5E012': 'pop_hispanic'
})
census.head(2).T
Computed columns#
According to Reveal's regression output, you'll want to create the following columns:
- Percent Black in tract
- Percent Hispanic/Latino in tract (I hope you know how Hispanic/Latino + census data works by now)
- Percent Asian in tract
- Percent Native American in tract
- Percent Native Hawaiian in tract
Notice that we don't include percent white: since the group percentages in a tract add up to (roughly) 100 percent, percent white is already determined by the other columns, so we leave it out! It works like a reference category.
If we want to use buzzwords here, the technical reason we're not using percent white is called collinearity. We'll talk more about it on Friday.
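The regression itself comes later, but here's a rough sketch of how those tract-level percentages could be computed from the renamed census columns. The pct_ column names are placeholders I'm inventing here; name them however you like (and note that tracts with zero population will come out as NaN).

# Sketch with made-up pct_* names: share of each group in the tract,
# leaving percent white out as the implicit reference category
census['pct_black'] = census.pop_black / census.pop_total * 100
census['pct_hispanic'] = census.pop_hispanic / census.pop_total * 100
census['pct_asian'] = census.pop_asian / census.pop_total * 100
census['pct_amer_indian'] = census.pop_amer_indian / census.pop_total * 100
census['pct_pac_islander'] = census.pop_pac_islander / census.pop_total * 100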
mortgage.head(2)
census.head(2)
# The mortgage data's tract numbers contain a period while NHGIS's TRACTA doesn't,
# so strip it out. regex=False makes "." a literal character, not a regex wildcard.
mortgage['census_tract'] = mortgage['census_tract'].str.replace(".", "", regex=False)
mortgage.head(2)
Do the merge#
merged = mortgage.merge(census,
left_on=['state_code', 'county_code', 'census_tract'],
right_on=['STATEA', 'COUNTYA', 'TRACTA'])
merged.head()
Confirm you have 10,107 rows and 96 columns in the merged dataframe.
merged.shape
merged.to_csv("data/mortgage-census-cleaned-merged.csv", index=False)