Lending disparities using Logistic Regression#

The story: https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/

Author: Aaron Glantz and Emmanuel Martinez

Topics: Logistic regression, odds ratios

Read online Download notebook Interactive version

Our data#

philadelphia-mortgages.csv: Philadelphia mortgage data for 2015
- A subset of HMDA LAR data from FFEIC
- Codebook is 2015HMDACodeSheet.pdf
- A guide to HMDA reporting
- I've massaged it slightly to make processing a bit easier
nhgis0006_ds233_20175_2017_tract.csv:
- Table B03002: Hispanic or Latino Origin by Race
- 2013-2017 American Community Survey data US Census Bureau, from NHGIS
- Codebook is nhgis0006_ds233_20175_2017_tract_codebook.txt
lending_disparities_whitepaper_180214.pdf: the whitepaper outlining Reveal's methodology

What's the goal?#

Do banks provide mortgages at disparate rates between white applicants and people of color? We're going to look at the following variables to find out:

Race/Ethnicity
- Native American
- Asian
- Black
- Native Hawaiian
- Hispanic/Latino
- Race and ethnicity were not reported
Sex
Whether there was a co-applicant
Applicant’s annual income (includes co-applicant income)
Loan amount
Ratio between the loan amount and the applicant’s income
Ratio between the median income of the census tract and the median income of the Metro area
Racial and ethnic breakdown by percentage for each census tract
Regulating agency of the lending institution

Setup#

Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.

Some of our datasets have a lot of columns, so you'll also want to use pd.set_option to display up to 100 columns or so.

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 100)

What is each row of our data?#

If you aren't sure, you might need to look at either the whitepaper or the codebook. You'll need to look at them both eventually, so might as well get started now.

Read in your data#

Read in our Philadelphia mortgage data and take a peek at the first few rows.

Tip: As always, census tract columns like to cause problems if they're read in as numbers. Make sure pandas reads it in as a string.

# We're just looking at Philly
df = pd.read_csv("data/philadelphia-mortgages.csv", dtype={ 'census_tract': 'str'})
df.head(5)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex
0	0101.00	101	42	3	26	5	1	1	1	4	2	5	97.09	6	5
1	0264.00	101	42	2	26	40	1	1	1	4	2	5	98.27	3	5
2	0281.00	101	42	2	22	20	1	1	1	5	2	5	72.28	6	5
3	0158.00	101	42	2	57	36	1	1	1	5	3	5	105.87	6	5
4	0358.00	101	42	1	80	34	1	1	1	1	3	5	139.62	5	2

Check your column types#

I mentioned it above, but make sure census_tract is an object (a string) or merging isn't going to be any fun later on.

df.dtypes

census_tract                    object
county_code                      int64
state_code                       int64
applicant_sex                    int64
income                           int64
loan_amount                      int64
loan_type                        int64
property_type                    int64
occupancy                        int64
action_type                      int64
loan_purpose                     int64
agency_code                      int64
tract_to_msa_income_percent    float64
applicant_race                   int64
co_applicant_sex                 int64
dtype: object

Engineering and cleaning up features#

When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

A few of the columns you'll need to calculate yourselves. Calculate these values and assign them to three new columns.

Applicant’s adjusted annual income (includes co-applicant income)
Adjusted loan amount
Ratio between the loan amount and the applicant’s income

Instead of using the raw income and loan amount, you'll want the log of both income and loan amount. Call these new columns log_income and log_loan_amount. The third column will be loan_income_ratio.

Tip: np.log gives you the logarithm

df['log_income'] = np.log(df.income)
df['log_loan_amount'] = np.log(df.loan_amount)
df['loan_income_ratio'] = df.loan_amount / df.income
df.head()

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	log_income	log_loan_amount	loan_income_ratio
0	0101.00	101	42	3	26	5	1	1	1	4	2	5	97.09	6	5	3.258097	1.609438	0.192308
1	0264.00	101	42	2	26	40	1	1	1	4	2	5	98.27	3	5	3.258097	3.688879	1.538462
2	0281.00	101	42	2	22	20	1	1	1	5	2	5	72.28	6	5	3.091042	2.995732	0.909091
3	0158.00	101	42	2	57	36	1	1	1	5	3	5	105.87	6	5	4.043051	3.583519	0.631579
4	0358.00	101	42	1	80	34	1	1	1	1	3	5	139.62	5	2	4.382027	3.526361	0.425000

Co-applicants#

Right now we have a column about the co-applicant's sex (see the codebook for column details). We don't want the sex, though, we're interested in whether there is a co applicant or not. Use the co-applicant's sex to create a new column called co_applicant that is either 'yes', 'no', or 'unknown'.

Hint: If the co-applicant's sex was not provided or is not applicable, count it as unknown.
Hint: The easiest way is to use .replace on the co-applicant sex column, but store the result in your new column

df['co_applicant'] = df.co_applicant_sex.replace({
    1: 'yes',
    2: 'yes',
    3: 'unknown',
    4: 'unknown',
    5: 'no'
})
df.head()

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	log_income	log_loan_amount	loan_income_ratio	co_applicant
0	0101.00	101	42	3	26	5	1	1	1	4	2	5	97.09	6	5	3.258097	1.609438	0.192308	no
1	0264.00	101	42	2	26	40	1	1	1	4	2	5	98.27	3	5	3.258097	3.688879	1.538462	no
2	0281.00	101	42	2	22	20	1	1	1	5	2	5	72.28	6	5	3.091042	2.995732	0.909091	no
3	0158.00	101	42	2	57	36	1	1	1	5	3	5	105.87	6	5	4.043051	3.583519	0.631579	no
4	0358.00	101	42	1	80	34	1	1	1	1	3	5	139.62	5	2	4.382027	3.526361	0.425000	yes

Filter loan applicants#

If you read the whitepaper - lending_disparities_whitepaper_180214.pdf - many filters are used to get to the target dataset for analysis.

Loan type

While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.

Property type

Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.

Occupancy

We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.

Action Type

We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3)

Income

An applicant’s income isn’t always reported in the data. In other cases, the data cuts off any incomes over \$9.9 million and any loan amounts over \\$99.9 million, meaning there’s a value in the database, but it’s not precise. We focused only on those records where income and loan amount have an accurate estimation. This meant discarding about 1 percent of all conventional home loans in the country for 2016. [Note: I already edited this]

When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

Lien status

We included all cases in our analysis regardless of lien status.

Race and ethnicity

At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc. [Note: This has already been taken care of]

Loan purpose

We decided to look at home purchase, home improvement and refinance loans separately from each other. [Note: please look at home purchase loans.]

Use the text above (it's from the whitepaper) and the 2015HMDACodeSheet.pdf code book to filter the dataset.

Tip: there should be between 5-8 filters, depending on how you write them.

df = df[(df.loan_type == 1) & \
        (df.property_type == 1) & \
        (df.occupancy == 1) & \
        df.action_type.isin([1,3]) & \
        (df.income != 9999) & \
        (df.loan_amount != 99999) & \
        (df.loan_purpose == 1)]
df = df.copy()
df.shape

(10107, 19)

When you're done filtering, save your dataframe as a "copy" with df = df.copy() (if it's called df, of course). This will prevent irritating warnings when you're trying to create new columns.

Confirm that you have 10,107 loans with 19 columns#

df.shape

(10107, 19)

Create a "loan denied" column#

Right now the action_type category reflects whether the loan was granted or not, and either has a value of 1 or 3.

Create a new column called loan_denied, where the value is 0 if the loan was accepted and 1 if the loan was denied. This will be our target for the machine learning algorithm.

Tip: You should have 8,878 successful loans and 1,229 denied loans

df['loan_denied'] = (df.action_type == 3).astype(int)
df.loan_denied.value_counts()

0    8878
1    1229
Name: loan_denied, dtype: int64

Deal with categorical variables#

Let's go ahead and take a look at our categorical variables:

Applicant sex (male, female, na)
Applicant race
Mortgage agency
Co-applicant (yes, no, unknown)

Before we do anything crazy, let's use the codebook to turn them into strings.

Tip: We already did this with the co_applicant column, you only need to do the rest
Tip: Just use .replace

df.applicant_sex = df.applicant_sex.replace({
    1: 'male',
    2: 'female',
    3: 'na'
})
df.applicant_race = df.applicant_race.replace({
    1: 'native_amer',
    2: 'asian',
    3: 'black',
    4: 'hawaiian',
    5: 'white',
    6: 'na',
    7: 'na',
    8: 'na',
    99: 'latino'
})
df.agency_code = df.agency_code.replace({
    1: 'OCC',
    2: 'FRS',
    3: 'FDIC',
    5: 'NCUA',
    7: 'HUD',
    9: 'CFPB'
})
df.head(3)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	log_income	log_loan_amount	loan_income_ratio	co_applicant
42	4019.00	45	42	female	59	112	1	1	1	1	1	OCC	133.09	white	5	4.077537	4.718499	1.898305	no
43	4099.02	45	42	na	177	375	1	1	1	1	1	OCC	208.56	na	3	5.176150	5.926926	2.118644	unknown
46	4102.00	45	42	male	150	381	1	1	1	1	1	OCC	215.35	white	5	5.010635	5.942799	2.540000	no

Double-check these columns match these values in the first three rows (and yes, you should have a lot of other columns, too).

applicant_sex	agency_code	applicant_race	co_applicant
female	OCC	white	no
na	OCC	na	unknown
male	OCC	white	no

Dummy variables#

Let's say we're at the end of this, and we have a column called sex, where 0 is female and 1 is male. After we've done our regression, we can look at the coefficient/odds ratio for sex and say something like "being male gives you a 1.5x odds of being denied a loan."

We can say this because we're looking at one column, and changing sex from 0 to 1 would turn the applicant male and give them a 1.5x chance of being denied (the odds ratio).

But let's say we're looking at a column called race instead. We could do the same 0/1 thing with white/minority, but what about white/black/asian? If we try to give them 0/1/2 our coefficient/odds ratio interpretation stops working, because we don't have a nice True/False dichotomy any more, it's now a real number.

0: White
1: Black
2: Asian

Usually with numbers you can say "...for every increase of 1...", but we can't anymore - changing from White to Black (+1) isn't the same as changing from Black to Asian (+1). And you can't subtract Black from Asian to get White. And no, you also can't average together White and Asian to get Black. Just recognize that these aren't numbers, they're categories!

How can we turn races off and on like we can turn the sex variable off and on? A good option is to make a 0/1 column for each race. We can then flip each race off and on. These are called dummy variables.

pd.get_dummies(df.applicant_race, prefix='race').head()

	race_asian	race_na	race_white
42	0	0	1
43	0	1	0
46	0	0	1
48	1	0	0
51	1	0	0

Seems to take up a lot of space, but it works a lot better.

The first person is white, so they have a 1 for white and a 0 for every other race
The second person is N/A, so they have a 1 for N/A and a 0 for every other race
The next three are white, asian, and asian, so they have a 1 under the appropriate column.

When you're looking at the regression output, each column has its own coefficient (and odds ratio). Since each race now has a column, each race will also have its own odds ratio. Asian would have one, Black would have one, Latino would have one - now we can look at the effect of each race separately. For example, you could then say something like "being Asian (e.g., race_asian going from 0 to 1) gives you a 1.2x greater chance of being denied, and being Black gets you a 2.1x chance of being denied."

And no, you're never going to have more than one 1 in a row at the same time.

After you've created your dummy variables, there's one more step which has a real fun name: one-hot encoding.

One-hot encoding#

When we have two sexes - male and female - we can flip between them with one binary digit, 0 and 1.

If we had three races - White, Asian, Black - using pd.get_dummies would make three columns, which makes sense on the surface. But why can we put TWO values in ONE column for sex, and it takes THREE columns for the THREE race values?

The truth is, it doesn't have to!

Instead of having three columns, we're only going to have two: asian and black. And if both of them are 0? The applicant is white! This is called a reference category, and it means the coefficients/odds ratios for asian and black are in reference to a white applicant. So it isn't "being black gets you a 2.1x chance of being denied," it's being black gets you a 2.1x chance of being denied compared to a white person. For example:

race_asian	race_black	person's race
1	0	Asian
0	1	Black
0	0	White
1	1	Not possible if your source is a single race column

To create a one-hot encoded variable with a reference category, you write code like this:

pd.get_dummies(df.applicant_race, prefix='race').drop('race_white', axis=1).head()

	race_asian	race_na
42	0	0
43	0	1
46	0	0
48	1	0
51	1	0

We usually use .drop(columns=...) to drop columns, but I'm using axis=1 here because you should be familiar with it

Make a one-hot encoded `sex` category with `female` as the reference category#

You should end up with two columns: sex_male and sex_na.

pd.get_dummies(df.applicant_sex, prefix='sex').drop('sex_female', axis=1).head()

	sex_male	sex_na
42	0	0
43	0	1
46	1	0
48	0	0
51	0	0

Using one-hot encoded columns#

Since these one-hot encoded variables are standalone dataframes, we eventually need to combine them into our original dataframe.

We have four categorical variables - sex, race, co-applicant, and the loan agency - so we need you to make four one-hot encoded variables. Name them like this:

dummies_sex - reference category of white
dummies_race - reference category of female
dummies_co_applicant - reference category of no
dummies_agency - reference category of FDIC

Typically your reference category is the most common category, because it makes for the most interesting comparisons.

Tip: if you're cutting and pasting from above, watch out for .head()

Tip: After you've made them, use .head(2) to check the first couple rows of each to make sure they look okay

dummies_sex = pd.get_dummies(df.applicant_sex, prefix='sex').drop('sex_female', axis=1)
dummies_sex.head(2)

	sex_male	sex_na
42	0	0
43	0	1

dummies_race = pd.get_dummies(df.applicant_race, prefix='race').drop('race_white', axis=1)
dummies_race.head(2)

	race_asian	race_black	race_hawaiian	race_latino	race_na	race_native_amer
42	0	0	0	0	0	0
43	0	0	0	0	1	0

dummies_co_applicant = pd.get_dummies(df.co_applicant, prefix='co_applicant').drop('co_applicant_no', axis=1)
dummies_co_applicant.head(2)

	co_applicant_unknown	co_applicant_yes
42	0	0
43	1	0

dummies_agency = pd.get_dummies(df.agency_code, prefix='agency').drop('agency_FDIC', axis=1)
dummies_agency.head(2)

	agency_CFPB	agency_FRS	agency_HUD	agency_NCUA	agency_OCC
42	0	0	0	0	1
43	0	0	0	0	1

Cleaning up our old dataframe#

Take a look at your original dataframe real quick.

df.head(2)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	log_income	log_loan_amount	loan_income_ratio	co_applicant	loan_denied
42	4019.00	45	42	female	59	112	1	1	1	1	1	OCC	133.09	white	5	4.077537	4.718499	1.898305	no	0
43	4099.02	45	42	na	177	375	1	1	1	1	1	OCC	208.56	na	3	5.176150	5.926926	2.118644	unknown	0

We don't need all of those columns! If we look at the list of columns we'll be using for the regression:

Race/Ethnicity
Sex
Whether there was a co-applicant
Applicant’s annual income (includes co-applicant income)
Loan amount
Ratio between the loan amount and the applicant’s income
Ratio between the median income of the census tract and the median income of the Metro area
Racial and ethnic breakdown by percentage for each census tract
Regulating agency of the lending institution

We can keep anything in that list, and remove everything else. For example, we can drop the variables we used to create the dummy variables, as we'll be adding the one-hot encoded versions in for the next step.

For "Racial and ethnic breakdown by percentage for each census tract" we'll need to join with some census data later, so we need to also keep census tract, county code and state code.

Build a new dataframe with only the columns we're interested in, call it numeric. We're calling it numeric because it's mostly numeric columns after the categorical ones have been removed.

Tip: You can either use .drop(columns= to remove unwanted columns or df = df[['col1', 'col2', ... 'col12']] to only select the ones you're interested in

numeric = df[['census_tract', 'county_code', 'state_code', \
         'tract_to_msa_income_percent', 'log_income', \
         'log_loan_amount', 'loan_income_ratio', 'loan_denied']]
numeric.head()

	census_tract	county_code	state_code	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio
42	4019.00	45	42	133.09	4.077537	4.718499	1.898305
43	4099.02	45	42	208.56	5.176150	5.926926	2.118644
46	4102.00	45	42	215.35	5.010635	5.942799	2.540000
48	0312.00	101	42	93.11	4.174387	4.912655	2.092308
51	4036.01	45	42	141.83	4.007333	5.278115	3.563636

Confirm that numeric has 8 columns.

numeric.shape

(10107, 8)

Combining our features#

We now have 1 dataframe of numeric features (and some merge columns), and 4 one-hot-encoded variables (each with their own dataframe). Combine all five dataframes into one large dataframe called loan_features.

loan_features = pd.concat([
    numeric,
    dummies_co_applicant,
    dummies_sex,
    dummies_race,
    dummies_agency
], axis=1)
loan_features.head()

	census_tract	county_code	state_code	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio	co_applicant_unknown	sex_male	sex_na	race_asian	race_na	agency_OCC
42	4019.00	45	42	133.09	4.077537	4.718499	1.898305	0	0	0	0	0	1
43	4099.02	45	42	208.56	5.176150	5.926926	2.118644	1	0	1	0	1	1
46	4102.00	45	42	215.35	5.010635	5.942799	2.540000	0	1	0	0	0	1
48	0312.00	101	42	93.11	4.174387	4.912655	2.092308	0	0	0	1	0	1
51	4036.01	45	42	141.83	4.007333	5.278115	3.563636	0	0	0	1	0	1

Confirm that loan_features has 10,107 rows and 23 columns.

loan_features.shape

(10107, 23)

Census data#

Now we just need the final piece to the puzzle, the census data. Read in the census data file, calling the dataframe census.

Tip: As always, be sure to read the tract column in as a string. Interestingly, this time we don't need to worry about the state or county codes in the same way.

Tip: You're going to encounter a problem that you find every time you read in a file from the US government!

census = pd.read_csv("data/nhgis0007_ds215_20155_2015_tract.csv", encoding='latin-1', dtype={'TRACTA': 'str'})
census.head(2)

	GISJOIN	YEAR	REGIONA	DIVISIONA	STATE	STATEA	COUNTY	COUNTYA	COUSUBA	PLACEA	TRACTA	BLKGRPA	CONCITA	AIANHHA	RES_ONLYA	TRUSTA	AITSCEA	ANRCA	CBSAA	CSAA	METDIVA	NECTAA	CNECTAA	NECTADIVA	UAA	CDCURRA	SLDUA	SLDLA	ZCTA5A	SUBMCDA	SDELMA	SDSECA	SDUNIA	PUMA5A	BTTRA	BTBGA	NAME_E	ADK5E001	ADK5E002	ADK5E003	ADK5E004	ADK5E005	ADK5E006	ADK5E007	ADK5E008	ADK5E009	ADK5E010	ADK5E011	ADK5E012	ADK5E013	ADK5E014	ADK5E015	ADK5E016	ADK5E017	ADK5E018	ADK5E019	ADK5E020	ADK5E021	NAME_M	ADK5M001	ADK5M002	ADK5M003	ADK5M004	ADK5M005	ADK5M006	ADK5M007	ADK5M008	ADK5M009	ADK5M010	ADK5M011	ADK5M012	ADK5M013	ADK5M014	ADK5M015	ADK5M016	ADK5M017	ADK5M018	ADK5M019	ADK5M020	ADK5M021
0	G0100010020100	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020100	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 201, Autauga County, Alabama	1948	1931	1703	150	6	12	0	0	60	0	60	17	17	0	0	0	0	0	0	0	0	Census Tract 201, Autauga County, Alabama	203	212	229	126	8	16	11	11	44	11	44	21	21	11	11	11	11	11	11	11	11
1	G0100010020200	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020200	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 202, Autauga County, Alabama	2156	2139	872	1149	0	50	0	0	68	0	68	17	14	0	0	0	0	3	0	0	0	Census Tract 202, Autauga County, Alabama	268	268	207	250	11	61	11	11	62	11	62	25	23	11	11	11	11	7	11	11	11

Rename some columns#

If you like to keep your data extra clean, feel free to rename the columns you're interested in. If not, feel free to skip it!

Tip: Make sure you're using the estimates columns, not the margin of error columns

# join on STATEA-state code, COUNTYA-county code, TRACTA-census tract (cleaned)
census = census.rename(columns={
    'ADK5E001': 'Total',
    'ADK5E003': 'White alone',
    'ADK5E004': 'Black or African American alone',
    'ADK5E005': 'American Indian and Alaska Native alone',
    'ADK5E006': 'Asian alone',
    'ADK5E007': 'Native Hawaiian and Other Pacific Islander alone',
    'ADK5E012': 'Hispanic or Latino'
})
census.head(2)

	GISJOIN	YEAR	REGIONA	DIVISIONA	STATE	STATEA	COUNTY	COUNTYA	COUSUBA	PLACEA	TRACTA	BLKGRPA	CONCITA	AIANHHA	RES_ONLYA	TRUSTA	AITSCEA	ANRCA	CBSAA	CSAA	METDIVA	NECTAA	CNECTAA	NECTADIVA	UAA	CDCURRA	SLDUA	SLDLA	ZCTA5A	SUBMCDA	SDELMA	SDSECA	SDUNIA	PUMA5A	BTTRA	BTBGA	NAME_E	Total	ADK5E002	White alone	Black or African American alone	American Indian and Alaska Native alone	Asian alone	Native Hawaiian and Other Pacific Islander alone	ADK5E008	ADK5E009	ADK5E010	ADK5E011	Hispanic or Latino	ADK5E013	ADK5E014	ADK5E015	ADK5E016	ADK5E017	ADK5E018	ADK5E019	ADK5E020	ADK5E021	NAME_M	ADK5M001	ADK5M002	ADK5M003	ADK5M004	ADK5M005	ADK5M006	ADK5M007	ADK5M008	ADK5M009	ADK5M010	ADK5M011	ADK5M012	ADK5M013	ADK5M014	ADK5M015	ADK5M016	ADK5M017	ADK5M018	ADK5M019	ADK5M020	ADK5M021
0	G0100010020100	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020100	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 201, Autauga County, Alabama	1948	1931	1703	150	6	12	0	0	60	0	60	17	17	0	0	0	0	0	0	0	0	Census Tract 201, Autauga County, Alabama	203	212	229	126	8	16	11	11	44	11	44	21	21	11	11	11	11	11	11	11	11
1	G0100010020200	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020200	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 202, Autauga County, Alabama	2156	2139	872	1149	0	50	0	0	68	0	68	17	14	0	0	0	0	3	0	0	0	Census Tract 202, Autauga County, Alabama	268	268	207	250	11	61	11	11	62	11	62	25	23	11	11	11	11	7	11	11	11

Computed columns#

According to Reveal's regression output, you'll want to create the following columns:

Percent Black in tract
Percent Hispanic/Latino in tract (I hope you know how Hispanic/Latino + census data works by now)
Percent Asian in tract
Percent Native American in tract
Percent Native Hawaiian in tract

Notice that we don't include percent white - because all of the other columns add up to percent white, we ignore it! It's similar to a reference category.

If we want to use buzzwords here, the technical reason we're not using percent white is called collinearity. We'll talk more about it on Friday.

census['pct_black'] =  census['Black or African American alone'] / census['Total'] * 100
census['pct_hispanic'] = census['Hispanic or Latino'] / census['Total'] * 100
census['pct_amer_indian'] =  census['American Indian and Alaska Native alone'] / census['Total'] * 100
census['pct_asian'] =  census['Asian alone'] / census['Total'] * 100
census['pct_pac_islander'] = census['Native Hawaiian and Other Pacific Islander alone'] / census['Total'] * 100
census.head(3)

	GISJOIN	YEAR	REGIONA	DIVISIONA	STATE	STATEA	COUNTY	COUNTYA	COUSUBA	PLACEA	TRACTA	BLKGRPA	CONCITA	AIANHHA	RES_ONLYA	TRUSTA	AITSCEA	ANRCA	CBSAA	CSAA	METDIVA	NECTAA	CNECTAA	NECTADIVA	UAA	CDCURRA	SLDUA	SLDLA	ZCTA5A	SUBMCDA	SDELMA	SDSECA	SDUNIA	PUMA5A	BTTRA	BTBGA	NAME_E	Total	ADK5E002	White alone	Black or African American alone	American Indian and Alaska Native alone	Asian alone	Native Hawaiian and Other Pacific Islander alone	ADK5E009	ADK5E011	Hispanic or Latino	ADK5E013	ADK5E018	NAME_M	ADK5M001	ADK5M002	ADK5M003	ADK5M004	ADK5M005	ADK5M006	ADK5M007	ADK5M008	ADK5M009	ADK5M010	ADK5M011	ADK5M012	ADK5M013	ADK5M014	ADK5M015	ADK5M016	ADK5M017	ADK5M018	ADK5M019	ADK5M020	ADK5M021	pct_black	pct_hispanic	pct_amer_indian	pct_asian	pct_pac_islander
0	G0100010020100	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020100	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 201, Autauga County, Alabama	1948	1931	1703	150	6	12	0	60	60	17	17	0	Census Tract 201, Autauga County, Alabama	203	212	229	126	8	16	11	11	44	11	44	21	21	11	11	11	11	11	11	11	11	7.700205	0.872690	0.308008	0.616016	0.000000
1	G0100010020200	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020200	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 202, Autauga County, Alabama	2156	2139	872	1149	0	50	0	68	68	17	14	3	Census Tract 202, Autauga County, Alabama	268	268	207	250	11	61	11	11	62	11	62	25	23	11	11	11	11	7	11	11	11	53.293135	0.788497	0.000000	2.319109	0.000000
2	G0100010020300	2011-2015	NaN	NaN	Alabama	1	Autauga County	1	NaN	NaN	020300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Census Tract 203, Autauga County, Alabama	2968	2968	2212	551	15	41	8	141	141	0	0	0	Census Tract 203, Autauga County, Alabama	404	404	372	190	22	62	14	11	135	11	135	11	11	11	11	11	11	11	11	11	11	18.564690	0.000000	0.505391	1.381402	0.269542

Only keep what we need to join and process#

We're only interested in the percentage columns that we computed. Create a new dataframe called census_features that is only those columns along with the one we'll need for joining with the mortgage data.

Tip: Remember we saved state, county and tract codes when working on the loan data

census_features = census[['STATEA', 'COUNTYA', 'TRACTA', \
                          'pct_hispanic', 'pct_black', 'pct_amer_indian', \
                          'pct_asian', 'pct_pac_islander']]
census_features.head()

	STATEA	COUNTYA	TRACTA	pct_hispanic	pct_black	pct_amer_indian	pct_asian	pct_pac_islander
0	1	1	020100	0.872690	7.700205	0.308008	0.616016	0.000000
1	1	1	020200	0.788497	53.293135	0.000000	2.319109	0.000000
2	1	1	020300	0.000000	18.564690	0.505391	1.381402	0.269542
3	1	1	020400	10.490617	3.662672	1.560027	0.000000	0.000000
4	1	1	020500	0.743287	24.844374	0.000000	3.827929	0.000000

Confirm that your first few rows look something like this:

STATEA	COUNTYA	TRACTA	pct_hispanic	pct_black	pct_amer_indian	pct_asian	pct_pac_islander
1	1	020100	0.872690	7.700205	0.308008	0.616016	0.000000
1	1	020200	0.788497	53.293135	0.000000	2.319109	0.000000
1	1	020300	0.000000	18.564690	0.505391	1.381402	0.269542
1	1	020400	10.490617	3.662672	1.560027	0.000000	0.000000
1	1	020500	0.743287	24.844374	0.000000	3.827929	0.000000

Your column headers might be different but your numbers should match.

Merge datasets#

Merge loan_features and census_features into a new dataframe called merged.

Unfortunately something is a little different between our loan_features and census_features census tract columns. You'll need to fix it before you can merge.

Cleaning#

loan_features.head(2)

	census_tract	county_code	state_code	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio	loan_denied	co_applicant_unknown	co_applicant_yes	sex_male	sex_na	race_asian	race_black	race_hawaiian	race_latino	race_na	race_native_amer	agency_CFPB	agency_FRS	agency_HUD	agency_NCUA	agency_OCC
42	4019.00	45	42	133.09	4.077537	4.718499	1.898305	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
43	4099.02	45	42	208.56	5.176150	5.926926	2.118644	0	1	0	0	1	0	0	0	0	1	0	0	0	0	0	1

census_features.head(2)

	STATEA	COUNTYA	TRACTA	pct_hispanic	pct_black	pct_amer_indian	pct_asian	pct_pac_islander
0	1	1	020100	0.872690	7.700205	0.308008	0.616016	0.0
1	1	1	020200	0.788497	53.293135	0.000000	2.319109	0.0

loan_features['census_tract'] = loan_features['census_tract'].str.replace(".", "")
loan_features.head(2)

	census_tract	county_code	state_code	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio	loan_denied	co_applicant_unknown	co_applicant_yes	sex_male	sex_na	race_asian	race_black	race_hawaiian	race_latino	race_na	race_native_amer	agency_CFPB	agency_FRS	agency_HUD	agency_NCUA	agency_OCC
42	401900	45	42	133.09	4.077537	4.718499	1.898305	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
43	409902	45	42	208.56	5.176150	5.926926	2.118644	0	1	0	0	1	0	0	0	0	1	0	0	0	0	0	1

Do the merge#

merged = loan_features.merge(census_features,
                             left_on=['state_code', 'county_code', 'census_tract'],
                             right_on=['STATEA', 'COUNTYA', 'TRACTA'])
merged.head()

	census_tract	county_code	state_code	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio	loan_denied	co_applicant_yes	sex_male	race_black	race_na	agency_NCUA	agency_OCC	STATEA	COUNTYA	TRACTA	pct_hispanic	pct_black	pct_asian
0	401900	45	42	133.09	4.077537	4.718499	1.898305	0	0	0	0	0	0	1	42	45	401900	0.900687	40.744252	5.309315
1	401900	45	42	133.09	4.143135	5.257495	3.047619	0	0	1	0	1	1	0	42	45	401900	0.900687	40.744252	5.309315
2	401900	45	42	133.09	4.382027	4.653960	1.312500	0	1	0	1	0	0	0	42	45	401900	0.900687	40.744252	5.309315
3	401900	45	42	133.09	4.430817	4.852030	1.523810	0	1	0	0	0	1	0	42	45	401900	0.900687	40.744252	5.309315
4	401900	45	42	133.09	3.258097	5.123964	6.461538	1	0	0	1	0	1	0	42	45	401900	0.900687	40.744252	5.309315

Confirm you have 10107 rows and 31 columns in the merged dataframe.

merged.shape

(10107, 31)

Our final dataframe#

Drop all of the columns we merged on and save it as train_df.

train_df = merged.drop(columns=['census_tract','county_code','state_code','STATEA','COUNTYA','TRACTA'])
train_df.head()

	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio	loan_denied	co_applicant_yes	sex_male	race_black	race_na	agency_NCUA	agency_OCC	pct_hispanic	pct_black	pct_asian
0	133.09	4.077537	4.718499	1.898305	0	0	0	0	0	0	1	0.900687	40.744252	5.309315
1	133.09	4.143135	5.257495	3.047619	0	0	1	0	1	1	0	0.900687	40.744252	5.309315
2	133.09	4.382027	4.653960	1.312500	0	1	0	1	0	0	0	0.900687	40.744252	5.309315
3	133.09	4.430817	4.852030	1.523810	0	1	0	0	0	1	0	0.900687	40.744252	5.309315
4	133.09	3.258097	5.123964	6.461538	1	0	0	1	0	1	0	0.900687	40.744252	5.309315

Confirm that train_df has 10107 rows and 25 columns.

train_df.shape

(10107, 25)

Final cleanup#

Because we can't have missing data before we run a regression, check the size of train_df, then drop any missing data and check the size again. Confirm you don't lose any rows.

train_df.shape

(10107, 25)

train_df = train_df.dropna()
train_df.shape

(10107, 25)

train_df.head()

	tract_to_msa_income_percent	log_income	log_loan_amount	loan_income_ratio	loan_denied	co_applicant_yes	sex_male	race_black	race_na	agency_NCUA	agency_OCC	pct_hispanic	pct_black	pct_asian
0	133.09	4.077537	4.718499	1.898305	0	0	0	0	0	0	1	0.900687	40.744252	5.309315
1	133.09	4.143135	5.257495	3.047619	0	0	1	0	1	1	0	0.900687	40.744252	5.309315
2	133.09	4.382027	4.653960	1.312500	0	1	0	1	0	0	0	0.900687	40.744252	5.309315
3	133.09	4.430817	4.852030	1.523810	0	1	0	0	0	1	0	0.900687	40.744252	5.309315
4	133.09	3.258097	5.123964	6.461538	1	0	0	1	0	1	0	0.900687	40.744252	5.309315

Performing our regression#

Try with statsmodels#

First try to run a linear regression with statsmodels, because even though sometimes it complains and breaks, the output just looks so nice. Instead of sm.OLS we'll use sm.Logit.

import statsmodels.api as sm

X = train_df.drop(columns='loan_denied')
y = train_df.loan_denied

logit = sm.Logit(y, sm.add_constant(X))

result = logit.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.334016
         Iterations 7

Logit Regression Results
Dep. Variable:	loan_denied	No. Observations:	10107
Model:	Logit	Df Residuals:	10082
Method:	MLE	Df Model:	24
Date:	Mon, 06 Jan 2020	Pseudo R-squ.:	0.09749
Time:	12:59:39	Log-Likelihood:	-3375.9
converged:	True	LL-Null:	-3740.6
Covariance Type:	nonrobust	LLR p-value:	1.629e-138

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	-0.6024	0.306	-1.971	0.049	-1.202	-0.003
tract_to_msa_income_percent	0.0017	0.001	2.627	0.009	0.000	0.003
log_income	-0.3555	0.070	-5.104	0.000	-0.492	-0.219
log_loan_amount	-0.2283	0.056	-4.060	0.000	-0.338	-0.118
loan_income_ratio	0.0110	0.007	1.476	0.140	-0.004	0.026
co_applicant_unknown	0.3957	0.207	1.914	0.056	-0.010	0.801
co_applicant_yes	-0.0958	0.078	-1.222	0.222	-0.249	0.058
sex_male	0.1213	0.070	1.722	0.085	-0.017	0.259
sex_na	-0.1091	0.176	-0.618	0.537	-0.455	0.237
race_asian	0.3783	0.105	3.595	0.000	0.172	0.584
race_black	0.7496	0.115	6.506	0.000	0.524	0.975
race_hawaiian	1.0989	0.464	2.368	0.018	0.189	2.008
race_latino	0.3232	0.164	1.970	0.049	0.002	0.645
race_na	0.4511	0.119	3.798	0.000	0.218	0.684
race_native_amer	1.0546	0.591	1.785	0.074	-0.103	2.213
agency_CFPB	1.1066	0.136	8.158	0.000	0.841	1.372
agency_FRS	-0.1133	0.221	-0.513	0.608	-0.546	0.319
agency_HUD	0.1170	0.140	0.837	0.402	-0.157	0.391
agency_NCUA	1.3009	0.153	8.507	0.000	1.001	1.601
agency_OCC	0.3160	0.201	1.575	0.115	-0.077	0.709
pct_hispanic	0.0072	0.004	2.050	0.040	0.000	0.014
pct_black	0.0062	0.002	3.884	0.000	0.003	0.009
pct_amer_indian	-0.2571	0.097	-2.652	0.008	-0.447	-0.067
pct_asian	0.0107	0.004	2.427	0.015	0.002	0.019
pct_pac_islander	0.0904	0.159	0.568	0.570	-0.221	0.402

feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(result.params.values),
    'pvalue': result.pvalues
}).sort_values(by='odds ratio', ascending=False)
coefs

	coef	odds ratio	pvalue
agency_NCUA	1.300906	3.672623	1.784062e-17
agency_CFPB	1.106615	3.024103	3.400888e-16
race_hawaiian	1.098912	3.000899	1.786862e-02
race_native_amer	1.054614	2.870866	7.426412e-02
race_black	0.749633	2.116223	7.716072e-11
race_na	0.451117	1.570065	1.457235e-04
co_applicant_unknown	0.395704	1.485429	5.567672e-02
race_asian	0.378256	1.459737	3.244658e-04
race_latino	0.323176	1.381509	4.881353e-02
agency_OCC	0.315975	1.371597	1.152077e-01
sex_male	0.121333	1.129001	8.514478e-02
agency_HUD	0.117006	1.124126	4.024454e-01
pct_pac_islander	0.090384	1.094594	5.698835e-01
loan_income_ratio	0.010958	1.011018	1.398407e-01
pct_asian	0.010686	1.010744	1.521999e-02
pct_hispanic	0.007186	1.007211	4.040907e-02
pct_black	0.006232	1.006252	1.026711e-04
tract_to_msa_income_percent	0.001746	1.001748	8.622169e-03
co_applicant_yes	-0.095829	0.908620	2.215427e-01
sex_na	-0.109055	0.896681	5.365732e-01
agency_FRS	-0.113333	0.892854	6.076593e-01
log_loan_amount	-0.228254	0.795922	4.899515e-05
pct_amer_indian	-0.257062	0.773320	8.010064e-03
log_income	-0.355460	0.700851	3.330815e-07
const	-0.602424	0.547483	4.874236e-02

Try again with sci-kit learn#

But some people might like sklearn. Using the coefficient to build a dataframe seems much different, so let's give it a shot.

Tip: When you build your model, use LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000) - for if you don't increase max_iter (how long/hard it works) it'll complain it can't find an answer.

from sklearn.linear_model import LogisticRegression

X = train_df.drop(columns=['loan_denied'])
y = train_df.loan_denied

# In this case, if we don't increase max_iter it yells at us about 'not converging'
# but if we make it work harder/longer then it stops complaining and finds an answer
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Getting your coefficients and odds ratios#

After you run your regression using sklearn, you can use code like the below to print out an ordered list of features, coefficients, and odds ratios.

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
17	agency_NCUA	1.301626	3.675267
14	agency_CFPB	1.116460	3.054024
10	race_hawaiian	1.001828	2.723255
9	race_black	0.761099	2.140628
13	race_native_amer	0.754343	2.126214
12	race_na	0.444830	1.560226
8	race_asian	0.385108	1.469773
4	co_applicant_unknown	0.367030	1.443441
11	race_latino	0.336398	1.399896
18	agency_OCC	0.326864	1.386612
6	sex_male	0.123323	1.131250
16	agency_HUD	0.120085	1.127593
23	pct_pac_islander	0.094454	1.099058
3	loan_income_ratio	0.011171	1.011234
22	pct_asian	0.010573	1.010629
19	pct_hispanic	0.007106	1.007131
20	pct_black	0.006154	1.006173
0	tract_to_msa_income_percent	0.001716	1.001718
7	sex_na	-0.080131	0.922995
5	co_applicant_yes	-0.086285	0.917333
15	agency_FRS	-0.114435	0.891870
2	log_loan_amount	-0.227531	0.796498
21	pct_amer_indian	-0.256860	0.773476
1	log_income	-0.353471	0.702246

Wait, what's the odds ratio again?#

It's how much that variable affects the outcome if all other variables stay the same.

Interpreting and thinking about the analysis#

Question 1#

Our results aren't exactly the same as Reveal's, as I pulled a slightly different number of rows from the database and I'm not sure what exact dataset they used for census information. How are we feeling about this reproduction? You might want check their 2015 results in the whitepaper.

# I mean come on it's pretty close

Question 2#

In the opening paragraph to the flagship piece, Aaron and Emmanuel write:

Fifty years after the federal Fair Housing Act banned racial discrimination in lending, African Americans and Latinos continue to be routinely denied conventional mortgage loans at rates far higher than their white counterparts.

If you look at the results, Hawaiian/Pacific Islanders (and maybe Native Americans) have an even higher odds ratio. Why do they choose to talk about African Americans and Latinos instead?

# Not nearly as many of those two groups
# And I mean like REALLY not that many
train_df.loc[:,"pct_hispanic":"pct_pac_islander"].median()

pct_hispanic        3.831509
pct_black           7.453754
pct_amer_indian     0.000000
pct_asian           5.071502
pct_pac_islander    0.000000
dtype: float64

Question 3#

Write a sentence expressing the meaning of the odds ratio statistic for Black mortgage applicants. Find a line in the Reveal piece where they use the odds ratio.

# “I had a fair amount of savings and still had so much trouble just left and
# right,” said Rachelle Faroul, a 33-year-old black woman who was rejected twice
# by lenders when she tried to buy a brick row house close to Malcolm X Park in
# Philadelphia, where Reveal found African Americans were 2.7 times as likely as
# whites to be denied a conventional mortgage.

Question 4#

Write a similar sentence about men.

# Men are 12% more likely to be denied a conventional mortgage

Question 5#

Why did Aaron and Emmanuel choose to include the loan-to-income ratio statistic? You might want to read the whitepaper.

# Loan-to-income ratio: Looking at the raw numbers of an applicant’s income and
# loan amount doesn’t tell the whole picture. We needed to look at how much money
# applicants wanted to take out in relation to their income. This provides a proxy
# for whether or not the loan amount was manageable compared with the applicant’s income.
# Experts agreed that this variable should be included.

Question 6#

Credit score is a common reason why loans are denied. Why are credit scores not included in our analysis? You might want to read the whitepaper.

# They aren't available!

Question 7#

This data was just sitting out there for anyone to look at, they didn't even need to FOIA it. Why do you think this issue had not come up before Reveal's analysis?

# This is just an opinion! but I asked them:
# emmanuel - the data is just so big
# aaron - it's a blind spot of journalism, business press only writes about profit

Question 8#

As a result of this series, a lot has happened, although recent changes don't look so good. If you were reporting this story, what groups of people would you want to talk to in order to make sure you're getting the story right?

# aaron and emmanuel talked to experts in research, along with enforcement officers and others at CFPB

Question 9#

When they were consulting experts, Aaron and Emmanuel received a lot of conflicting accounts about whether they should include the "N/A" values for race (they ended up including it). If the experts disagreed about something like that, why do you think they went forward with their analysis?

# Experts will rarely agree, it eventually comes down to editorial judgment. There's not always
# a very clear delineation of right/wrong

Question 10#

What if we were working on this story, and our logistic regression or input dataset were flawed? What would be the repercussions?

# We would be making pretty big claims that weren't backed up - it wouldn't just be us having
# to research more, it would be us actually staking our credibility on the line