Lending disparities using Logistic Regression#
The story: https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/
Authors: Aaron Glantz and Emmanuel Martinez
Topics: Logistic regression, odds ratios
Datasets
- philadelphia-mortgages.csv: Philadelphia mortgage data for 2015
  - A subset of HMDA LAR data from the FFIEC
  - Codebook is 2015HMDACodeSheet.pdf
  - A guide to HMDA reporting
  - I've massaged it slightly to make processing a bit easier
- nhgis0006_ds233_20175_2017_tract.csv:
  - Table B03002: Hispanic or Latino Origin by Race
  - 2013-2017 American Community Survey data from the US Census Bureau, via NHGIS
  - Codebook is nhgis0006_ds233_20175_2017_tract_codebook.txt
- lending_disparities_whitepaper_180214.pdf: the whitepaper outlining Reveal's methodology
What's the goal?#
Do banks provide mortgages at disparate rates between white applicants and people of color? We're going to look at the following variables to find out:
- Race/Ethnicity
- Native American
- Asian
- Black
- Native Hawaiian
- Hispanic/Latino
- Race and ethnicity were not reported
- Sex
- Whether there was a co-applicant
- Applicant’s annual income (includes co-applicant income)
- Loan amount
- Ratio between the loan amount and the applicant’s income
- Ratio between the median income of the census tract and the median income of the Metro area
- Racial and ethnic breakdown by percentage for each census tract
- Regulating agency of the lending institution
Setup#
Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.
Some of our datasets have a lot of columns, so you'll also want to use pd.set_option
to display up to 100 columns or so.
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format",'{:,.5f}'.format)
What is each row of our data?#
If you aren't sure, you might need to look at either the whitepaper or the codebook. You'll need to look at them both eventually, so might as well get started now.
Read in your data#
Read in our Philadelphia mortgage data and take a peek at the first few rows.
- Tip: As always, census tract columns like to cause problems if they're read in as numbers. Make sure pandas reads them in as strings.
# We're just looking at Philly
mortgage = pd.read_csv("data/philadelphia-mortgages.csv", dtype={ 'census_tract': 'str'})
mortgage.head(5)
Check your column types#
I mentioned it above, but make sure census_tract is an object (a string), or merging isn't going to be any fun later on.
mortgage.dtypes
Engineering and cleaning up features#
Income-related columns#
When we plotted the number of applicants, how much money they made and the size of the loan, we found that the distributions were heavily skewed: the majority of applicants were clustered on the lower end of the income and loan amount scales, with a long tail of much larger values. This was especially true for applicants of color. We took the logarithm of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.
Traditionally we would calculate these values and save them in new columns. Since it's just simple math, though, we'll be able to do it when we run the regression.
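Here's a minimal sketch of what that looks like: np.log() works element-wise on a column, and (as you'll see later) the very same expression can be written inline inside the statsmodels formula, so we don't need to store separate log-income or log-loan-amount columns.

# A quick peek, nothing is saved: np.log() applied to the income column.
# Later on the same math can go straight into the regression formula,
# e.g. "np.log(income)", instead of living in its own column.
np.log(mortgage.income).head()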
Co-applicants#
Right now we have a column about the co-applicant's sex (see the codebook for column details). We don't want the sex itself, though; we're interested in whether there is a co-applicant or not. Use the co-applicant's sex to create a new column called co_applicant that is either 'yes', 'no', or 'unknown'.
- Hint: If the co-applicant's sex was not provided or is not applicable, count it as unknown.
- Hint: The easiest way is to use .replace on the co-applicant sex column, but store the result in your new column.
# HMDA co-applicant sex codes: 1 = male, 2 = female,
# 3 = information not provided, 4 = not applicable, 5 = no co-applicant
mortgage['co_applicant'] = mortgage.co_applicant_sex.replace({
    1: 'yes',
    2: 'yes',
    3: 'unknown',
    4: 'unknown',
    5: 'no'
})
mortgage.head()
Filter loan applicants#
If you read the whitepaper (lending_disparities_whitepaper_180214.pdf), you'll see that many filters are used to get to the target dataset for analysis.
Loan type
While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.
Property type
Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.
Occupancy
We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.
Action Type
We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3).
Income
An applicant’s income isn’t always reported in the data. In other cases, the data cuts off any incomes over \$9.9 million and any loan amounts over \$99.9 million, meaning there’s a value in the database, but it’s not precise. We focused only on those records where income and loan amount have an accurate estimation. This meant discarding about 1 percent of all conventional home loans in the country for 2016. [Note: I already edited this]
When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.
Lien status
We included all cases in our analysis regardless of lien status.
Race and ethnicity
At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc. [Note: This has already been taken care of]
Loan purpose
We decided to look at home purchase, home improvement and refinance loans separately from each other. [Note: please look at home purchase loans.]
Use the text above (it's from the whitepaper) and the 2015HMDACodeSheet.pdf code book to filter the dataset.
- Tip: there should be between 5 and 8 filters, depending on how you write them.
# Filters from the whitepaper (codes are in 2015HMDACodeSheet.pdf)
mortgage = mortgage[
    (mortgage.loan_type == 1) &            # conventional loans only
    (mortgage.property_type == 1) &        # one- to four-unit properties
    (mortgage.occupancy == 1) &            # owner-occupied
    (mortgage.action_type.isin([1, 3])) &  # originated (1) or denied (3)
    (mortgage.income != 9999) &            # income isn't top-coded
    (mortgage.loan_amount != 99999) &      # loan amount isn't top-coded
    (mortgage.loan_purpose == 1)           # home purchase loans
]
mortgage = mortgage.copy()
mortgage.shape
When you're done filtering, save your dataframe as a "copy" with df = df.copy() (if it's called df, of course). This will prevent irritating warnings when you're trying to create new columns.
Confirm that you have 10,107 loans with 19 columns#
mortgage.shape
Create a "loan denied" column#
Right now the action_type column reflects whether the loan was granted or not, and has a value of either 1 or 3. We'll need to create a new column called loan_denied, where the value is 0 if the loan was accepted and 1 if the loan was denied.
While we're eventually going to do a bunch of crazy comparisons and math inside of our statsmodels formula, we do need the target of our regression to be a number. You'll see what I mean later on!
# action_type 3 means the application was denied; the comparison gives
# True/False, and .astype(int) turns that into 1/0
mortgage['loan_denied'] = (mortgage.action_type == 3).astype(int)
Deal with categorical variables#
Let's go ahead and take a look at our categorical variables:
- Applicant sex (male, female, na)
- Applicant race
- Mortgage agency
- Co-applicant (yes, no, unknown)
Before we do anything crazy, let's use the codebook to turn them into strings.
- Tip: We already did this with the co_applicant column, so you only need to do the rest
- Tip: Just use .replace
mortgage.applicant_sex = mortgage.applicant_sex.replace({
1: 'male',
2: 'female',
3: 'na'
})
mortgage.applicant_race = mortgage.applicant_race.replace({
1: 'native_amer',
2: 'asian',
3: 'black',
4: 'hawaiian',
5: 'white',
6: 'na',
7: 'na',
8: 'na',
99: 'latino'
})
mortgage.agency_code = mortgage.agency_code.replace({
1: 'OCC',
2: 'FRS',
3: 'FDIC',
5: 'NCUA',
7: 'HUD',
9: 'CFPB'
})
mortgage.head(3)
Double-check these columns match these values in the first three rows (and yes, you should have a lot of other columns, too).
| applicant_sex | agency_code | applicant_race | co_applicant |
|---|---|---|---|
| female | OCC | white | no |
| na | OCC | na | unknown |
| male | OCC | white | no |
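If you'd rather not scan the full .head() output, an optional check of just those four columns looks like this:

# Optional double-check: only the recoded columns, first three rows
mortgage[['applicant_sex', 'agency_code', 'applicant_race', 'co_applicant']].head(3)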
Double-check our mortgage data#
mortgage.head()
mortgage.shape
Census data#
Now we just need the final piece of the puzzle: the census data. Read in the census data file, calling the dataframe census.
Tip: As always, be sure to read the tract column in as a string. Interestingly, this time we don't need to worry about the state or county codes in the same way.
Tip: You're going to encounter a problem that you find every time you read in a file from the US government!
census = pd.read_csv("data/nhgis0007_ds215_20155_2015_tract.csv", encoding='latin-1', dtype={'TRACTA': 'str'})
census.head(2)
Rename some columns#
If you like to keep your data extra clean, feel free to rename the columns you're interested in. If not, feel free to skip it!
Tip: Make sure you're using the estimates columns, not the margin of error columns
# join on STATEA-state code, COUNTYA-county code, TRACTA-census tract (cleaned)
census = census.rename(columns={
'ADK5E001': 'pop_total',
'ADK5E003': 'pop_white',
'ADK5E004': 'pop_black',
'ADK5E005': 'pop_amer_indian',
'ADK5E006': 'pop_asian',
'ADK5E007': 'pop_pac_islander',
'ADK5E012': 'pop_hispanic'
})
census.head(2).T
Computed columns#
According to Reveal's regression output, you'll want to create the following columns:
- Percent Black in tract
- Percent Hispanic/Latino in tract (I hope you know how Hispanic/Latino + census data works by now)
- Percent Asian in tract
- Percent Native American in tract
- Percent Native Hawaiian in tract
Notice that we don't include percent white: since the group percentages in a tract add up to (roughly) 100 percent, percent white is already determined by the other columns, so we leave it out! It works like a reference category.
If we want to use buzzwords here, the technical reason we're not using percent white is called collinearity. We'll talk more about it on Friday.
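The regression itself comes later, but here's a rough sketch of how those tract-level percentages could be computed from the renamed census columns. The pct_ column names are placeholders I'm inventing here; name them however you like (and note that tracts with zero population will come out as NaN).

# Sketch with made-up pct_* names: share of each group in the tract,
# leaving percent white out as the implicit reference category
census['pct_black'] = census.pop_black / census.pop_total * 100
census['pct_hispanic'] = census.pop_hispanic / census.pop_total * 100
census['pct_asian'] = census.pop_asian / census.pop_total * 100
census['pct_amer_indian'] = census.pop_amer_indian / census.pop_total * 100
census['pct_pac_islander'] = census.pop_pac_islander / census.pop_total * 100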
mortgage.head(2)
census.head(2)
# The mortgage data's tract numbers contain a period while NHGIS's TRACTA doesn't,
# so strip it out. regex=False makes "." a literal character, not a regex wildcard.
mortgage['census_tract'] = mortgage['census_tract'].str.replace(".", "", regex=False)
mortgage.head(2)
Do the merge#
merged = mortgage.merge(census,
left_on=['state_code', 'county_code', 'census_tract'],
right_on=['STATEA', 'COUNTYA', 'TRACTA'])
merged.head()
Confirm you have 10,107 rows and 96 columns in the merged dataframe.
merged.shape
merged.to_csv("data/mortgage-census-cleaned-merged.csv", index=False)