Lending disparities using Logistic Regression#

The story: https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/

Author: Aaron Glantz and Emmanuel Martinez

Topics: Logistic regression, odds ratios

Datasets

philadelphia-mortgages.csv: Philadelphia mortgage data for 2015
- A subset of HMDA LAR data from FFEIC
- Codebook is 2015HMDACodeSheet.pdf
- A guide to HMDA reporting
- I've massaged it slightly to make processing a bit easier
nhgis0006_ds233_20175_2017_tract.csv:
- Table B03002: Hispanic or Latino Origin by Race
- 2013-2017 American Community Survey data US Census Bureau, from NHGIS
- Codebook is nhgis0006_ds233_20175_2017_tract_codebook.txt
lending_disparities_whitepaper_180214.pdf: the whitepaper outlining Reveal's methodology

What's the goal?#

Do banks provide mortgages at disparate rates between white applicants and people of color? We're going to look at the following variables to find out:

Race/Ethnicity
- Native American
- Asian
- Black
- Native Hawaiian
- Hispanic/Latino
- Race and ethnicity were not reported
Sex
Whether there was a co-applicant
Applicant’s annual income (includes co-applicant income)
Loan amount
Ratio between the loan amount and the applicant’s income
Ratio between the median income of the census tract and the median income of the Metro area
Racial and ethnic breakdown by percentage for each census tract
Regulating agency of the lending institution

Read online Download notebook Interactive version

Setup#

Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.

Some of our datasets have a lot of columns, so you'll also want to use pd.set_option to display up to 100 columns or so.

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)

What is each row of our data?#

If you aren't sure, you might need to look at either the whitepaper or the codebook. You'll need to look at them both eventually, so might as well get started now.

Read in your data#

Read in our Philadelphia mortgage data and take a peek at the first few rows.

Tip: As always, census tract columns like to cause problems if they're read in as numbers. Make sure pandas reads it in as a string.

# We're just looking at Philly
mortgage = pd.read_csv("data/philadelphia-mortgages.csv", dtype={ 'census_tract': 'str'})
mortgage.head(5)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex
0	0101.00	101	42	3	26	5	1	1	1	4	2	5	97.09000	6	5
1	0264.00	101	42	2	26	40	1	1	1	4	2	5	98.27000	3	5
2	0281.00	101	42	2	22	20	1	1	1	5	2	5	72.28000	6	5
3	0158.00	101	42	2	57	36	1	1	1	5	3	5	105.87000	6	5
4	0358.00	101	42	1	80	34	1	1	1	1	3	5	139.62000	5	2

Check your column types#

I mentioned it above, but make sure census_tract is an object (a string) or merging isn't going to be any fun later on.

mortgage.dtypes

census_tract                    object
county_code                      int64
state_code                       int64
applicant_sex                    int64
income                           int64
loan_amount                      int64
loan_type                        int64
property_type                    int64
occupancy                        int64
action_type                      int64
loan_purpose                     int64
agency_code                      int64
tract_to_msa_income_percent    float64
applicant_race                   int64
co_applicant_sex                 int64
dtype: object

Engineering and cleaning up features#

When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

Traditionally we would calculate these values and save them in new columns. Since it's just simple math, though, we'll be able to do it when we run the regression.

Co-applicants#

Right now we have a column about the co-applicant's sex (see the codebook for column details). We don't want the sex, though, we're interested in whether there is a co applicant or not. Use the co-applicant's sex to create a new column called co_applicant that is either 'yes', 'no', or 'unknown'.

Hint: If the co-applicant's sex was not provided or is not applicable, count it as unknown.
Hint: The easiest way is to use .replace on the co-applicant sex column, but store the result in your new column

mortgage['co_applicant'] = mortgage.co_applicant_sex.replace({
    1: 'yes',
    2: 'yes',
    3: 'unknown',
    4: 'unknown',
    5: 'no'
})
mortgage.head()

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	co_applicant
0	0101.00	101	42	3	26	5	1	1	1	4	2	5	97.09000	6	5	no
1	0264.00	101	42	2	26	40	1	1	1	4	2	5	98.27000	3	5	no
2	0281.00	101	42	2	22	20	1	1	1	5	2	5	72.28000	6	5	no
3	0158.00	101	42	2	57	36	1	1	1	5	3	5	105.87000	6	5	no
4	0358.00	101	42	1	80	34	1	1	1	1	3	5	139.62000	5	2	yes

Filter loan applicants#

If you read the whitepaper - lending_disparities_whitepaper_180214.pdf - many filters are used to get to the target dataset for analysis.

Loan type

While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.

Property type

Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.

Occupancy

We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.

Action Type

We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3)

Income

An applicant’s income isn’t always reported in the data. In other cases, the data cuts off any incomes over \$9.9 million and any loan amounts over \\$99.9 million, meaning there’s a value in the database, but it’s not precise. We focused only on those records where income and loan amount have an accurate estimation. This meant discarding about 1 percent of all conventional home loans in the country for 2016. [Note: I already edited this]

When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

Lien status

We included all cases in our analysis regardless of lien status.

Race and ethnicity

At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc. [Note: This has already been taken care of]

Loan purpose

We decided to look at home purchase, home improvement and refinance loans separately from each other. [Note: please look at home purchase loans.]

Use the text above (it's from the whitepaper) and the 2015HMDACodeSheet.pdf code book to filter the dataset.

Tip: there should be between 5-8 filters, depending on how you write them.

mortgage = mortgage[(mortgage.loan_type == 1) & \
            (mortgage.property_type == 1) & \
            (mortgage.occupancy == 1) & \
            mortgage.action_type.isin([1,3]) & \
            (mortgage.income != 9999) & \
            (mortgage.loan_amount != 99999) & \
            (mortgage.loan_purpose == 1)]
mortgage = mortgage.copy()
mortgage.shape

(10107, 16)

When you're done filtering, save your dataframe as a "copy" with df = df.copy() (if it's called df, of course). This will prevent irritating warnings when you're trying to create new columns.

Confirm that you have 10,107 loans with 19 columns#

mortgage.shape

(10107, 16)

Create a "loan denied" column#

Right now the action_type category reflects whether the loan was granted or not, and either has a value of 1 or 3. We'll need to create a new column called loan_denied, where the value is 0 if the loan was accepted and 1 if the loan was denied.

While we're eventually going to do a bunch of crazy comparisons and math inside of our statsmodels formula, we do need the target of our regression to be a number. You'll see what I mean later on!

mortgage['loan_denied'] = (mortgage.action_type == 3).astype(int)

Deal with categorical variables#

Let's go ahead and take a look at our categorical variables:

Applicant sex (male, female, na)
Applicant race
Mortgage agency
Co-applicant (yes, no, unknown)

Before we do anything crazy, let's use the codebook to turn them into strings.

Tip: We already did this with the co_applicant column, you only need to do the rest
Tip: Just use .replace

mortgage.applicant_sex = mortgage.applicant_sex.replace({
    1: 'male',
    2: 'female',
    3: 'na'
})
mortgage.applicant_race = mortgage.applicant_race.replace({
    1: 'native_amer',
    2: 'asian',
    3: 'black',
    4: 'hawaiian',
    5: 'white',
    6: 'na',
    7: 'na',
    8: 'na',
    99: 'latino'
})
mortgage.agency_code = mortgage.agency_code.replace({
    1: 'OCC',
    2: 'FRS',
    3: 'FDIC',
    5: 'NCUA',
    7: 'HUD',
    9: 'CFPB'
})
mortgage.head(3)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	co_applicant
42	4019.00	45	42	female	59	112	1	1	1	1	1	OCC	133.09000	white	5	no
43	4099.02	45	42	na	177	375	1	1	1	1	1	OCC	208.56000	na	3	unknown
46	4102.00	45	42	male	150	381	1	1	1	1	1	OCC	215.35000	white	5	no

Double-check these columns match these values in the first three rows (and yes, you should have a lot of other columns, too).

applicant_sex	agency_code	applicant_race	co_applicant
female	OCC	white	no
na	OCC	na	unknown
male	OCC	white	no

Double-check our mortage data#

mortgage.head()

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	co_applicant
42	4019.00	45	42	female	59	112	1	1	1	1	1	OCC	133.09000	white	5	no
43	4099.02	45	42	na	177	375	1	1	1	1	1	OCC	208.56000	na	3	unknown
46	4102.00	45	42	male	150	381	1	1	1	1	1	OCC	215.35000	white	5	no
48	0312.00	101	42	female	65	136	1	1	1	1	1	OCC	93.11000	asian	5	no
51	4036.01	45	42	female	55	196	1	1	1	1	1	OCC	141.83000	asian	5	no

mortgage.shape

(10107, 17)

Census data#

Now we just need the final piece to the puzzle, the census data. Read in the census data file, calling the dataframe census.

Tip: As always, be sure to read the tract column in as a string. Interestingly, this time we don't need to worry about the state or county codes in the same way.

Tip: You're going to encounter a problem that you find every time you read in a file from the US government!

census = pd.read_csv("data/nhgis0007_ds215_20155_2015_tract.csv", encoding='latin-1', dtype={'TRACTA': 'str'})
census.head(2)

	GISJOIN	YEAR	REGIONA	DIVISIONA	STATE	STATEA	COUNTY	COUNTYA	COUSUBA	PLACEA	TRACTA	BLKGRPA	CONCITA	AIANHHA	RES_ONLYA	TRUSTA	AITSCEA	ANRCA	CBSAA	CSAA	METDIVA	NECTAA	CNECTAA	NECTADIVA	UAA	CDCURRA	SLDUA	SLDLA	ZCTA5A	SUBMCDA	SDELMA	SDSECA	SDUNIA	PUMA5A	BTTRA	BTBGA	NAME_E	ADK5E001	ADK5E002	ADK5E003	ADK5E004	ADK5E005	ADK5E006	ADK5E007	ADK5E008	ADK5E009	ADK5E010	ADK5E011	ADK5E012	ADK5E013	ADK5E014	ADK5E015	ADK5E016	ADK5E017	ADK5E018	ADK5E019	ADK5E020	ADK5E021	NAME_M	ADK5M001	ADK5M002	ADK5M003	ADK5M004	ADK5M005	ADK5M006	ADK5M007	ADK5M008	ADK5M009	ADK5M010	ADK5M011	ADK5M012	ADK5M013	ADK5M014	ADK5M015	ADK5M016	ADK5M017	ADK5M018	ADK5M019	ADK5M020	ADK5M021
0	G0100010020100	2011-2015	nan	nan	Alabama	1	Autauga County	1	nan	nan	020100	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 201, Autauga County, Alabama	1948	1931	1703	150	6	12	0	0	60	0	60	17	17	0	0	0	0	0	0	0	0	Census Tract 201, Autauga County, Alabama	203	212	229	126	8	16	11	11	44	11	44	21	21	11	11	11	11	11	11	11	11
1	G0100010020200	2011-2015	nan	nan	Alabama	1	Autauga County	1	nan	nan	020200	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 202, Autauga County, Alabama	2156	2139	872	1149	0	50	0	0	68	0	68	17	14	0	0	0	0	3	0	0	0	Census Tract 202, Autauga County, Alabama	268	268	207	250	11	61	11	11	62	11	62	25	23	11	11	11	11	7	11	11	11

Rename some columns#

If you like to keep your data extra clean, feel free to rename the columns you're interested in. If not, feel free to skip it!

Tip: Make sure you're using the estimates columns, not the margin of error columns

# join on STATEA-state code, COUNTYA-county code, TRACTA-census tract (cleaned)
census = census.rename(columns={
    'ADK5E001': 'pop_total',
    'ADK5E003': 'pop_white',
    'ADK5E004': 'pop_black',
    'ADK5E005': 'pop_amer_indian',
    'ADK5E006': 'pop_asian',
    'ADK5E007': 'pop_pac_islander',
    'ADK5E012': 'pop_hispanic'
})
census.head(2).T

	0	1
GISJOIN	G0100010020100	G0100010020200
YEAR	2011-2015	2011-2015
REGIONA	NaN	NaN
DIVISIONA	NaN	NaN
STATE	Alabama	Alabama
STATEA	1	1
COUNTY	Autauga County	Autauga County
COUNTYA	1	1
COUSUBA	NaN	NaN
PLACEA	NaN	NaN
TRACTA	020100	020200
BLKGRPA	NaN	NaN
CONCITA	NaN	NaN
AIANHHA	NaN	NaN
RES_ONLYA	NaN	NaN
TRUSTA	NaN	NaN
AITSCEA	NaN	NaN
ANRCA	NaN	NaN
CBSAA	NaN	NaN
CSAA	NaN	NaN
METDIVA	NaN	NaN
NECTAA	NaN	NaN
CNECTAA	NaN	NaN
NECTADIVA	NaN	NaN
UAA	NaN	NaN
CDCURRA	NaN	NaN
SLDUA	NaN	NaN
SLDLA	NaN	NaN
ZCTA5A	NaN	NaN
SUBMCDA	NaN	NaN
SDELMA	NaN	NaN
SDSECA	NaN	NaN
SDUNIA	NaN	NaN
PUMA5A	NaN	NaN
BTTRA	NaN	NaN
BTBGA	NaN	NaN
NAME_E	Census Tract 201, Autauga County, Alabama	Census Tract 202, Autauga County, Alabama
pop_total	1948	2156
ADK5E002	1931	2139
pop_white	1703	872
pop_black	150	1149
pop_amer_indian	6	0
pop_asian	12	50
pop_pac_islander	0	0
ADK5E008	0	0
ADK5E009	60	68
ADK5E010	0	0
ADK5E011	60	68
pop_hispanic	17	17
ADK5E013	17	14
ADK5E014	0	0
ADK5E015	0	0
ADK5E016	0	0
ADK5E017	0	0
ADK5E018	0	3
ADK5E019	0	0
ADK5E020	0	0
ADK5E021	0	0
NAME_M	Census Tract 201, Autauga County, Alabama	Census Tract 202, Autauga County, Alabama
ADK5M001	203	268
ADK5M002	212	268
ADK5M003	229	207
ADK5M004	126	250
ADK5M005	8	11
ADK5M006	16	61
ADK5M007	11	11
ADK5M008	11	11
ADK5M009	44	62
ADK5M010	11	11
ADK5M011	44	62
ADK5M012	21	25
ADK5M013	21	23
ADK5M014	11	11
ADK5M015	11	11
ADK5M016	11	11
ADK5M017	11	11
ADK5M018	11	7
ADK5M019	11	11
ADK5M020	11	11
ADK5M021	11	11

Computed columns#

According to Reveal's regression output, you'll want to create the following columns:

Percent Black in tract
Percent Hispanic/Latino in tract (I hope you know how Hispanic/Latino + census data works by now)
Percent Asian in tract
Percent Native American in tract
Percent Native Hawaiian in tract

Notice that we don't include percent white - because all of the other columns add up to percent white, we ignore it! It's similar to a reference category.

If we want to use buzzwords here, the technical reason we're not using percent white is called collinearity. We'll talk more about it on Friday.

Merge datasets#

Merge mortgage and census into a new dataframe called merged.

Unfortunately something is a little different between our mortgage and census census tract columns. You'll need to fix it before you can merge.

Cleaning#

mortgage.head(2)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	co_applicant	loan_denied
42	4019.00	45	42	female	59	112	1	1	1	1	1	OCC	133.09000	white	5	no	0
43	4099.02	45	42	na	177	375	1	1	1	1	1	OCC	208.56000	na	3	unknown	0

census.head(2)

	GISJOIN	YEAR	REGIONA	DIVISIONA	STATE	STATEA	COUNTY	COUNTYA	COUSUBA	PLACEA	TRACTA	BLKGRPA	CONCITA	AIANHHA	RES_ONLYA	TRUSTA	AITSCEA	ANRCA	CBSAA	CSAA	METDIVA	NECTAA	CNECTAA	NECTADIVA	UAA	CDCURRA	SLDUA	SLDLA	ZCTA5A	SUBMCDA	SDELMA	SDSECA	SDUNIA	PUMA5A	BTTRA	BTBGA	NAME_E	pop_total	ADK5E002	pop_white	pop_black	pop_amer_indian	pop_asian	pop_pac_islander	ADK5E008	ADK5E009	ADK5E010	ADK5E011	pop_hispanic	ADK5E013	ADK5E014	ADK5E015	ADK5E016	ADK5E017	ADK5E018	ADK5E019	ADK5E020	ADK5E021	NAME_M	ADK5M001	ADK5M002	ADK5M003	ADK5M004	ADK5M005	ADK5M006	ADK5M007	ADK5M008	ADK5M009	ADK5M010	ADK5M011	ADK5M012	ADK5M013	ADK5M014	ADK5M015	ADK5M016	ADK5M017	ADK5M018	ADK5M019	ADK5M020	ADK5M021
0	G0100010020100	2011-2015	nan	nan	Alabama	1	Autauga County	1	nan	nan	020100	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 201, Autauga County, Alabama	1948	1931	1703	150	6	12	0	0	60	0	60	17	17	0	0	0	0	0	0	0	0	Census Tract 201, Autauga County, Alabama	203	212	229	126	8	16	11	11	44	11	44	21	21	11	11	11	11	11	11	11	11
1	G0100010020200	2011-2015	nan	nan	Alabama	1	Autauga County	1	nan	nan	020200	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 202, Autauga County, Alabama	2156	2139	872	1149	0	50	0	0	68	0	68	17	14	0	0	0	0	3	0	0	0	Census Tract 202, Autauga County, Alabama	268	268	207	250	11	61	11	11	62	11	62	25	23	11	11	11	11	7	11	11	11

mortgage['census_tract'] = mortgage['census_tract'].str.replace(".", "")
mortgage.head(2)

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	co_applicant	loan_denied
42	401900	45	42	female	59	112	1	1	1	1	1	OCC	133.09000	white	5	no	0
43	409902	45	42	na	177	375	1	1	1	1	1	OCC	208.56000	na	3	unknown	0

Do the merge#

merged = mortgage.merge(census,
                         left_on=['state_code', 'county_code', 'census_tract'],
                         right_on=['STATEA', 'COUNTYA', 'TRACTA'])
merged.head()

	census_tract	county_code	state_code	applicant_sex	income	loan_amount	loan_type	property_type	occupancy	action_type	loan_purpose	agency_code	tract_to_msa_income_percent	applicant_race	co_applicant_sex	co_applicant	loan_denied	GISJOIN	YEAR	REGIONA	DIVISIONA	STATE	STATEA	COUNTY	COUNTYA	COUSUBA	PLACEA	TRACTA	BLKGRPA	CONCITA	AIANHHA	RES_ONLYA	TRUSTA	AITSCEA	ANRCA	CBSAA	CSAA	METDIVA	NECTAA	CNECTAA	NECTADIVA	UAA	CDCURRA	SLDUA	SLDLA	ZCTA5A	SUBMCDA	SDELMA	SDSECA	SDUNIA	PUMA5A	BTTRA	BTBGA	NAME_E	pop_total	ADK5E002	pop_white	pop_black	pop_asian	ADK5E009	ADK5E010	ADK5E011	pop_hispanic	ADK5E013	ADK5E015	NAME_M	ADK5M001	ADK5M002	ADK5M003	ADK5M004	ADK5M005	ADK5M006	ADK5M007	ADK5M008	ADK5M009	ADK5M010	ADK5M011	ADK5M012	ADK5M013	ADK5M014	ADK5M015	ADK5M016	ADK5M017	ADK5M018	ADK5M019	ADK5M020	ADK5M021
0	401900	45	42	female	59	112	1	1	1	1	1	OCC	133.09000	white	5	no	0	G4200450401900	2011-2015	nan	nan	Pennsylvania	42	Delaware County	45	nan	nan	401900	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 4019, Delaware County, Pennsylvania	4219	4181	2064	1719	224	174	21	153	38	21	17	Census Tract 4019, Delaware County, Pennsylvania	210	210	210	224	10	146	10	10	139	34	121	36	24	10	26	10	10	10	10	10	10
1	401900	45	42	male	63	192	1	1	1	1	1	NCUA	133.09000	na	5	no	0	G4200450401900	2011-2015	nan	nan	Pennsylvania	42	Delaware County	45	nan	nan	401900	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 4019, Delaware County, Pennsylvania	4219	4181	2064	1719	224	174	21	153	38	21	17	Census Tract 4019, Delaware County, Pennsylvania	210	210	210	224	10	146	10	10	139	34	121	36	24	10	26	10	10	10	10	10	10
2	401900	45	42	female	80	105	1	1	1	1	1	FDIC	133.09000	black	1	yes	0	G4200450401900	2011-2015	nan	nan	Pennsylvania	42	Delaware County	45	nan	nan	401900	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 4019, Delaware County, Pennsylvania	4219	4181	2064	1719	224	174	21	153	38	21	17	Census Tract 4019, Delaware County, Pennsylvania	210	210	210	224	10	146	10	10	139	34	121	36	24	10	26	10	10	10	10	10	10
3	401900	45	42	female	84	128	1	1	1	1	1	NCUA	133.09000	white	1	yes	0	G4200450401900	2011-2015	nan	nan	Pennsylvania	42	Delaware County	45	nan	nan	401900	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 4019, Delaware County, Pennsylvania	4219	4181	2064	1719	224	174	21	153	38	21	17	Census Tract 4019, Delaware County, Pennsylvania	210	210	210	224	10	146	10	10	139	34	121	36	24	10	26	10	10	10	10	10	10
4	401900	45	42	female	26	168	1	1	1	3	1	NCUA	133.09000	black	5	no	1	G4200450401900	2011-2015	nan	nan	Pennsylvania	42	Delaware County	45	nan	nan	401900	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	Census Tract 4019, Delaware County, Pennsylvania	4219	4181	2064	1719	224	174	21	153	38	21	17	Census Tract 4019, Delaware County, Pennsylvania	210	210	210	224	10	146	10	10	139	34	121	36	24	10	26	10	10	10	10	10	10

Confirm you have 10107 rows and 96 columns in the merged dataframe.

merged.shape

(10107, 97)

WE ARENT DROPPING NA DATA#

merged.head(2).T

	0	1
census_tract	401900	401900
county_code	45	45
state_code	42	42
applicant_sex	female	male
income	59	63
loan_amount	112	192
loan_type	1	1
property_type	1	1
occupancy	1	1
action_type	1	1
loan_purpose	1	1
agency_code	OCC	NCUA
tract_to_msa_income_percent	133.09000	133.09000
applicant_race	white	na
co_applicant_sex	5	5
co_applicant	no	no
loan_denied	0	0
GISJOIN	G4200450401900	G4200450401900
YEAR	2011-2015	2011-2015
REGIONA	NaN	NaN
DIVISIONA	NaN	NaN
STATE	Pennsylvania	Pennsylvania
STATEA	42	42
COUNTY	Delaware County	Delaware County
COUNTYA	45	45
COUSUBA	NaN	NaN
PLACEA	NaN	NaN
TRACTA	401900	401900
BLKGRPA	NaN	NaN
CONCITA	NaN	NaN
AIANHHA	NaN	NaN
RES_ONLYA	NaN	NaN
TRUSTA	NaN	NaN
AITSCEA	NaN	NaN
ANRCA	NaN	NaN
CBSAA	NaN	NaN
CSAA	NaN	NaN
METDIVA	NaN	NaN
NECTAA	NaN	NaN
CNECTAA	NaN	NaN
NECTADIVA	NaN	NaN
UAA	NaN	NaN
CDCURRA	NaN	NaN
SLDUA	NaN	NaN
SLDLA	NaN	NaN
ZCTA5A	NaN	NaN
SUBMCDA	NaN	NaN
SDELMA	NaN	NaN
SDSECA	NaN	NaN
SDUNIA	NaN	NaN
PUMA5A	NaN	NaN
BTTRA	NaN	NaN
BTBGA	NaN	NaN
NAME_E	Census Tract 4019, Delaware County, Pennsylvania	Census Tract 4019, Delaware County, Pennsylvania
pop_total	4219	4219
ADK5E002	4181	4181
pop_white	2064	2064
pop_black	1719	1719
pop_amer_indian	0	0
pop_asian	224	224
pop_pac_islander	0	0
ADK5E008	0	0
ADK5E009	174	174
ADK5E010	21	21
ADK5E011	153	153
pop_hispanic	38	38
ADK5E013	21	21
ADK5E014	0	0
ADK5E015	17	17
ADK5E016	0	0
ADK5E017	0	0
ADK5E018	0	0
ADK5E019	0	0
ADK5E020	0	0
ADK5E021	0	0
NAME_M	Census Tract 4019, Delaware County, Pennsylvania	Census Tract 4019, Delaware County, Pennsylvania
ADK5M001	210	210
ADK5M002	210	210
ADK5M003	210	210
ADK5M004	224	224
ADK5M005	10	10
ADK5M006	146	146
ADK5M007	10	10
ADK5M008	10	10
ADK5M009	139	139
ADK5M010	34	34
ADK5M011	121	121
ADK5M012	36	36
ADK5M013	24	24
ADK5M014	10	10
ADK5M015	26	26
ADK5M016	10	10
ADK5M017	10	10
ADK5M018	10	10
ADK5M019	10	10
ADK5M020	10	10
ADK5M021	10	10

Performing our regression#

Note: When working with statsmodels formulas, you don't need to drop missing data. It's handled automatically as part of the .fit() process.

Working with statsmodels formulas#

Statsmodels formulas are a fun (yes, fun! exciting! amazing!) way to write regressions. For example, I can write a formula to say "calculate the relationship between a loan being denied in relation to the loan amount and the applicant's income."

import statsmodels.formula.api as smf

model = smf.logit("""
    loan_denied ~ loan_amount + income
""", data=merged)

result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.365688
         Iterations 7

Logit Regression Results
Dep. Variable:	loan_denied	No. Observations:	10107
Model:	Logit	Df Residuals:	10104
Method:	MLE	Df Model:	2
Date:	Thu, 07 Nov 2019	Pseudo R-squ.:	0.01192
Time:	13:13:01	Log-Likelihood:	-3696.0
converged:	True	LL-Null:	-3740.6
Covariance Type:	nonrobust	LLR p-value:	4.380e-20

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-1.5269	0.058	-26.406	0.000	-1.640	-1.414
loan_amount	-0.0015	0.000	-5.093	0.000	-0.002	-0.001
income	-0.0010	0.000	-1.952	0.051	-0.002	4.04e-06

But that's child's play: we're here to get a little bit crazy.

Formulas and calculations in statsmodels formulas#

Let's do this! Instead of building new columns in pandas, we're just going to tell statsmodels to do it for us. This is using something called Patsy, imitating the programming language R.

description	pandas style	formula style
Multiply column	`df.colname * 100`	`np.multiply(colname, 100)`
Divide columns	`df.loan_amount / df.income`	`np.divide(loan_amount, income)`
Percentage	`df.pop_black / pop_total * 100`	`np.multiply(pop_black / pop_total, 100)`
Calculate log	`np.log(income)`	`np.log(income)`
One-hot encoding	`pd.get_dummies(df.agency_code).drop('FDIC', axis=1)`	`C(agency_code, Treatment('FDIC')`

If you haven't heard of one-hot encoding before, I recommend reading the longer version of this notebook! Or looking at what happens down below and thinking it through.

If we follow Reveal's methodology, we have a nice long list of features to include in our formula. Turning them all into a statsmodels/Patsy formula, the result looks like this:

import statsmodels.formula.api as smf

model = smf.logit("""
    loan_denied ~ 
        tract_to_msa_income_percent
        + np.log(income)
        + np.log(loan_amount)
        + np.divide(loan_amount, income)
        + C(co_applicant, Treatment('no'))
        + C(applicant_sex, Treatment('female'))
        + C(applicant_race, Treatment('white'))
        + C(agency_code, Treatment('FDIC'))
        + np.multiply(pop_hispanic / pop_total, 100)
        + np.multiply(pop_black / pop_total, 100)
        + np.multiply(pop_amer_indian / pop_total, 100)
        + np.multiply(pop_asian / pop_total, 100)
        + np.multiply(pop_pac_islander / pop_total, 100)
""", data=merged)

result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.334016
         Iterations 7

Logit Regression Results
Dep. Variable:	loan_denied	No. Observations:	10107
Model:	Logit	Df Residuals:	10082
Method:	MLE	Df Model:	24
Date:	Tue, 05 Nov 2019	Pseudo R-squ.:	0.09749
Time:	12:37:44	Log-Likelihood:	-3375.9
converged:	True	LL-Null:	-3740.6
Covariance Type:	nonrobust	LLR p-value:	1.629e-138

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-0.6024	0.306	-1.971	0.049	-1.202	-0.003
C(co_applicant, Treatment('no'))[T.unknown]	0.3957	0.207	1.914	0.056	-0.010	0.801
C(co_applicant, Treatment('no'))[T.yes]	-0.0958	0.078	-1.222	0.222	-0.249	0.058
C(applicant_sex, Treatment('female'))[T.male]	0.1213	0.070	1.722	0.085	-0.017	0.259
C(applicant_sex, Treatment('female'))[T.na]	-0.1091	0.176	-0.618	0.537	-0.455	0.237
C(applicant_race, Treatment('white'))[T.asian]	0.3783	0.105	3.595	0.000	0.172	0.584
C(applicant_race, Treatment('white'))[T.black]	0.7496	0.115	6.506	0.000	0.524	0.975
C(applicant_race, Treatment('white'))[T.hawaiian]	1.0989	0.464	2.368	0.018	0.189	2.008
C(applicant_race, Treatment('white'))[T.latino]	0.3232	0.164	1.970	0.049	0.002	0.645
C(applicant_race, Treatment('white'))[T.na]	0.4511	0.119	3.798	0.000	0.218	0.684
C(applicant_race, Treatment('white'))[T.native_amer]	1.0546	0.591	1.785	0.074	-0.103	2.213
C(agency_code, Treatment('FDIC'))[T.CFPB]	1.1066	0.136	8.158	0.000	0.841	1.372
C(agency_code, Treatment('FDIC'))[T.FRS]	-0.1133	0.221	-0.513	0.608	-0.546	0.319
C(agency_code, Treatment('FDIC'))[T.HUD]	0.1170	0.140	0.837	0.402	-0.157	0.391
C(agency_code, Treatment('FDIC'))[T.NCUA]	1.3009	0.153	8.507	0.000	1.001	1.601
C(agency_code, Treatment('FDIC'))[T.OCC]	0.3160	0.201	1.575	0.115	-0.077	0.709
tract_to_msa_income_percent	0.0017	0.001	2.627	0.009	0.000	0.003
np.log(income)	-0.3555	0.070	-5.104	0.000	-0.492	-0.219
np.log(loan_amount)	-0.2283	0.056	-4.060	0.000	-0.338	-0.118
np.divide(loan_amount, income)	0.0110	0.007	1.476	0.140	-0.004	0.026
np.multiply(pop_hispanic / pop_total, 100)	0.0072	0.004	2.050	0.040	0.000	0.014
np.multiply(pop_black / pop_total, 100)	0.0062	0.002	3.884	0.000	0.003	0.009
np.multiply(pop_amer_indian / pop_total, 100)	-0.2571	0.097	-2.652	0.008	-0.447	-0.067
np.multiply(pop_asian / pop_total, 100)	0.0107	0.004	2.427	0.015	0.002	0.019
np.multiply(pop_pac_islander / pop_total, 100)	0.0904	0.159	0.568	0.570	-0.221	0.402

It's beautiful! It's amazing! We just wrote our formulas inside of the statsmodels formula, and it did all the work for us.

I'm not crying, you're crying.

Sorting by odds ratio#

Since we're interested in the odds ratio, it makes sense to reformat our results, add an odds ratio column, and sort the output.

feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(result.params.values),
    'pvalue': result.pvalues
}).sort_values(by='odds ratio', ascending=False)
coefs

	coef	odds ratio	pvalue
C(agency_code, Treatment('FDIC'))[T.NCUA]	1.30091	3.67262	0.00000
C(agency_code, Treatment('FDIC'))[T.CFPB]	1.10661	3.02410	0.00000
C(applicant_race, Treatment('white'))[T.hawaiian]	1.09891	3.00090	0.01787
C(applicant_race, Treatment('white'))[T.native_amer]	1.05461	2.87087	0.07426
C(applicant_race, Treatment('white'))[T.black]	0.74963	2.11622	0.00000
C(applicant_race, Treatment('white'))[T.na]	0.45112	1.57007	0.00015
C(co_applicant, Treatment('no'))[T.unknown]	0.39570	1.48543	0.05568
C(applicant_race, Treatment('white'))[T.asian]	0.37826	1.45974	0.00032
C(applicant_race, Treatment('white'))[T.latino]	0.32318	1.38151	0.04881
C(agency_code, Treatment('FDIC'))[T.OCC]	0.31598	1.37160	0.11521
C(applicant_sex, Treatment('female'))[T.male]	0.12133	1.12900	0.08514
C(agency_code, Treatment('FDIC'))[T.HUD]	0.11701	1.12413	0.40245
np.multiply(pop_pac_islander / pop_total, 100)	0.09038	1.09459	0.56988
np.divide(loan_amount, income)	0.01096	1.01102	0.13984
np.multiply(pop_asian / pop_total, 100)	0.01069	1.01074	0.01522
np.multiply(pop_hispanic / pop_total, 100)	0.00719	1.00721	0.04041
np.multiply(pop_black / pop_total, 100)	0.00623	1.00625	0.00010
tract_to_msa_income_percent	0.00175	1.00175	0.00862
C(co_applicant, Treatment('no'))[T.yes]	-0.09583	0.90862	0.22154
C(applicant_sex, Treatment('female'))[T.na]	-0.10905	0.89668	0.53657
C(agency_code, Treatment('FDIC'))[T.FRS]	-0.11333	0.89285	0.60766
np.log(loan_amount)	-0.22825	0.79592	0.00005
np.multiply(pop_amer_indian / pop_total, 100)	-0.25706	0.77332	0.00801
np.log(income)	-0.35546	0.70085	0.00000
Intercept	-0.60242	0.54748	0.04874

And there we go! Except, of course, it's very ugly.

While writing the formula was a lot nicer and less error-prone (I think) than building a dataframe of our features, the output is horrendous. Those features names are terrible! It looks so so so bad.

We thought we had a beautiful world ahead of us in terms of writing nice precise formulas instead of wrangling new columns, but it looks like it won't be quite so simple.

Renaming our output fields#

But don't worry: even if we love the formula method, we don't have to suffer through those results.

Yes, we can understand the columns, but it's a lot of work to really read what's going on. Life would be much nicer if C(applicant_race, Treatment('white'))[T.latino] were just applicant_race_latino. You'd have to remember what the reference category is, but if we think we can handle it, let's head onward.

While statsmodels doesn't make it easy, it's definitely possible to reach in and rename our features. We do it like this:

# Copy the names to a pd.Series for easy search/replace
# We'll also keep a safe copy to make double-checking easy later
names = pd.Series(model.data.xnames)
originals = list(names.copy())

# Reformat 'C(agency_code, Treatment('FDIC'))[T.FRS]' as 'agency_code_FRS'
names = names.str.replace(r", ?Treatment\(.*\)", r"")
names = names.str.replace(r"C\(([\w]+)", r"\1_")
names = names.str.replace(r"\[T.(.*)\]", r"\1")

# Manually replace other ones
names = names.replace({
    'np.multiply(pop_hispanic / pop_total, 100)': 'pop_hispanic',
    'np.multiply(pop_black / pop_total, 100)': 'pop_black',
    'np.multiply(pop_amer_indian / pop_total, 100)': 'pop_amer_indian',
    'np.multiply(pop_asian / pop_total, 100)': 'pct_asian',
    'np.multiply(pop_pac_islander / pop_total, 100)': 'pop_pac_islander',
    'np.log(income)': 'log_income',
    'np.log(loan_amount)': 'log_loan',
    'np.divide(loan_amount, income)': 'loan_income_ratio',    
})

original_names = model.data.xnames
# Assign back into the model for display
model.data.xnames = list(names)

# Redo our summary, and we get nice output!
result.summary()

Logit Regression Results
Dep. Variable:	loan_denied	No. Observations:	10107
Model:	Logit	Df Residuals:	10082
Method:	MLE	Df Model:	24
Date:	Tue, 05 Nov 2019	Pseudo R-squ.:	0.09749
Time:	12:39:26	Log-Likelihood:	-3375.9
converged:	True	LL-Null:	-3740.6
Covariance Type:	nonrobust	LLR p-value:	1.629e-138

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-0.6024	0.306	-1.971	0.049	-1.202	-0.003
co_applicant_unknown	0.3957	0.207	1.914	0.056	-0.010	0.801
co_applicant_yes	-0.0958	0.078	-1.222	0.222	-0.249	0.058
applicant_sex_male	0.1213	0.070	1.722	0.085	-0.017	0.259
applicant_sex_na	-0.1091	0.176	-0.618	0.537	-0.455	0.237
applicant_race_asian	0.3783	0.105	3.595	0.000	0.172	0.584
applicant_race_black	0.7496	0.115	6.506	0.000	0.524	0.975
applicant_race_hawaiian	1.0989	0.464	2.368	0.018	0.189	2.008
applicant_race_latino	0.3232	0.164	1.970	0.049	0.002	0.645
applicant_race_na	0.4511	0.119	3.798	0.000	0.218	0.684
applicant_race_native_amer	1.0546	0.591	1.785	0.074	-0.103	2.213
agency_code_CFPB	1.1066	0.136	8.158	0.000	0.841	1.372
agency_code_FRS	-0.1133	0.221	-0.513	0.608	-0.546	0.319
agency_code_HUD	0.1170	0.140	0.837	0.402	-0.157	0.391
agency_code_NCUA	1.3009	0.153	8.507	0.000	1.001	1.601
agency_code_OCC	0.3160	0.201	1.575	0.115	-0.077	0.709
tract_to_msa_income_percent	0.0017	0.001	2.627	0.009	0.000	0.003
log_income	-0.3555	0.070	-5.104	0.000	-0.492	-0.219
log_loan	-0.2283	0.056	-4.060	0.000	-0.338	-0.118
loan_income_ratio	0.0110	0.007	1.476	0.140	-0.004	0.026
pop_hispanic	0.0072	0.004	2.050	0.040	0.000	0.014
pop_black	0.0062	0.002	3.884	0.000	0.003	0.009
pop_amer_indian	-0.2571	0.097	-2.652	0.008	-0.447	-0.067
pct_asian	0.0107	0.004	2.427	0.015	0.002	0.019
pop_pac_islander	0.0904	0.159	0.568	0.570	-0.221	0.402

Everything still works great! We can rebuild our coefficient/odds ratio/p-value situation without any trouble at all.

feature_names = result.params.index
coefficients = result.params.values

coefs = pd.DataFrame({
    'coef': coefficients,
    'odds ratio': np.exp(result.params.values),
    'pvalue': result.pvalues,
    'original': originals
}).sort_values(by='odds ratio', ascending=False)
coefs

	coef	odds ratio	pvalue	original
agency_code_NCUA	1.30091	3.67262	0.00000	C(agency_code, Treatment('FDIC'))[T.NCUA]
agency_code_CFPB	1.10661	3.02410	0.00000	C(agency_code, Treatment('FDIC'))[T.CFPB]
applicant_race_hawaiian	1.09891	3.00090	0.01787	C(applicant_race, Treatment('white'))[T.hawaiian]
applicant_race_native_amer	1.05461	2.87087	0.07426	C(applicant_race, Treatment('white'))[T.native...
applicant_race_black	0.74963	2.11622	0.00000	C(applicant_race, Treatment('white'))[T.black]
applicant_race_na	0.45112	1.57007	0.00015	C(applicant_race, Treatment('white'))[T.na]
co_applicant_unknown	0.39570	1.48543	0.05568	C(co_applicant, Treatment('no'))[T.unknown]
applicant_race_asian	0.37826	1.45974	0.00032	C(applicant_race, Treatment('white'))[T.asian]
applicant_race_latino	0.32318	1.38151	0.04881	C(applicant_race, Treatment('white'))[T.latino]
agency_code_OCC	0.31598	1.37160	0.11521	C(agency_code, Treatment('FDIC'))[T.OCC]
applicant_sex_male	0.12133	1.12900	0.08514	C(applicant_sex, Treatment('female'))[T.male]
agency_code_HUD	0.11701	1.12413	0.40245	C(agency_code, Treatment('FDIC'))[T.HUD]
pop_pac_islander	0.09038	1.09459	0.56988	np.multiply(pop_pac_islander / pop_total, 100)
loan_income_ratio	0.01096	1.01102	0.13984	np.divide(loan_amount, income)
pct_asian	0.01069	1.01074	0.01522	np.multiply(pop_asian / pop_total, 100)
pop_hispanic	0.00719	1.00721	0.04041	np.multiply(pop_hispanic / pop_total, 100)
pop_black	0.00623	1.00625	0.00010	np.multiply(pop_black / pop_total, 100)
tract_to_msa_income_percent	0.00175	1.00175	0.00862	tract_to_msa_income_percent
co_applicant_yes	-0.09583	0.90862	0.22154	C(co_applicant, Treatment('no'))[T.yes]
applicant_sex_na	-0.10905	0.89668	0.53657	C(applicant_sex, Treatment('female'))[T.na]
agency_code_FRS	-0.11333	0.89285	0.60766	C(agency_code, Treatment('FDIC'))[T.FRS]
log_loan	-0.22825	0.79592	0.00005	np.log(loan_amount)
pop_amer_indian	-0.25706	0.77332	0.00801	np.multiply(pop_amer_indian / pop_total, 100)
log_income	-0.35546	0.70085	0.00000	np.log(income)
Intercept	-0.60242	0.54748	0.04874	Intercept

Interpreting and thinking about the analysis#

Question 1#

Our results aren't exactly the same as Reveal's, as I pulled a slightly different number of rows from the database and I'm not sure what exact dataset they used for census information. How are we feeling about this reproduction? You might want check their 2015 results in the whitepaper.

# I mean come on it's pretty close

Question 2#

In the opening paragraph to the flagship piece, Aaron and Emmanuel write:

Fifty years after the federal Fair Housing Act banned racial discrimination in lending, African Americans and Latinos continue to be routinely denied conventional mortgage loans at rates far higher than their white counterparts.

If you look at the results, Hawaiian/Pacific Islanders (and maybe Native Americans) have an even higher odds ratio. Why do they choose to talk about African Americans and Latinos instead?

# Not nearly as many of those two groups
# And I mean like REALLY not that many
train_df.loc[:,"pct_hispanic":"pct_pac_islander"].median()

pct_hispanic        3.831509
pct_black           7.453754
pct_amer_indian     0.000000
pct_asian           5.071502
pct_pac_islander    0.000000
dtype: float64

Question 3#

Write a sentence expressing the meaning of the odds ratio statistic for Black mortgage applicants. Find a line in the Reveal piece where they use the odds ratio.

# “I had a fair amount of savings and still had so much trouble just left and
# right,” said Rachelle Faroul, a 33-year-old black woman who was rejected twice
# by lenders when she tried to buy a brick row house close to Malcolm X Park in
# Philadelphia, where Reveal found African Americans were 2.7 times as likely as
# whites to be denied a conventional mortgage.

Question 4#

Write a similar sentence about men.

# Men are 12% more likely to be denied a conventional mortgage

Question 5#

Why did Aaron and Emmanuel choose to include the loan-to-income ratio statistic? You might want to read the whitepaper.

# Loan-to-income ratio: Looking at the raw numbers of an applicant’s income and
# loan amount doesn’t tell the whole picture. We needed to look at how much money
# applicants wanted to take out in relation to their income. This provides a proxy
# for whether or not the loan amount was manageable compared with the applicant’s income.
# Experts agreed that this variable should be included.

Question 6#

Credit score is a common reason why loans are denied. Why are credit scores not included in our analysis? You might want to read the whitepaper.

# They aren't available!

Question 7#

This data was just sitting out there for anyone to look at, they didn't even need to FOIA it. Why do you think this issue had not come up before Reveal's analysis?

# This is just an opinion! but I asked them:
# emmanuel - the data is just so big
# aaron - it's a blind spot of journalism, business press only writes about profit

Question 8#

As a result of this series, a lot has happened, although recent changes don't look so good. If you were reporting this story, what groups of people would you want to talk to in order to make sure you're getting the story right?

# aaron and emmanuel talked to experts in research, along with enforcement officers and others at CFPB

Question 9#

When they were consulting experts, Aaron and Emmanuel received a lot of conflicting accounts about whether they should include the "N/A" values for race (they ended up including it). If the experts disagreed about something like that, why do you think they went forward with their analysis?

# Experts will rarely agree, it eventually comes down to editorial judgment. There's not always
# a very clear delineation of right/wrong

Question 10#

What if we were working on this story, and our logistic regression or input dataset were flawed? What would be the repercussions?

# We would be making pretty big claims that weren't backed up - it wouldn't just be us having
# to research more, it would be us actually staking our credibility on the line

Lending disparities using Logistic Regression#

What's the goal?#

Setup#

What is each row of our data?#

Read in your data#

Check your column types#

Engineering and cleaning up features#

Income-related columns#

Co-applicants#

Filter loan applicants#

Confirm that you have 10,107 loans with 19 columns#

Create a "loan denied" column#

Deal with categorical variables#

Double-check our mortage data#

Census data#

Rename some columns#

Computed columns#

Merge datasets#

Cleaning#

Do the merge#

WE ARENT DROPPING NA DATA#

WE ARENT DROPPING NA DATA#

WE ARENT DROPPING NA DATA#

WE ARENT DROPPING NA DATA#

Performing our regression#

Working with statsmodels formulas#

Formulas and calculations in statsmodels formulas#

Sorting by odds ratio#

Renaming our output fields#

Interpreting and thinking about the analysis#

Question 1#

Question 2#

Question 3#

Question 4#

Question 5#

Question 6#

Question 7#

Question 8#

Question 9#

Question 10#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects