1.5 Working with our data

1.5.1 Connecting to the database

We’ll start with a few imports. Even though we typically use pandas to read in CSV and Excel files, it can also read data from a Postgres database in (almost) exactly the same way.

import pandas as pd
import psycopg2
import statsmodels.api as sm
import numpy as np

In order to connect to the database, we’ll need to provide a connection string. You might need to change the username and password, although these are the defaults for a Postgres installation on OS X.

conn_str = "host='localhost' dbname='hmda' user='postgres' password=''"
conn = psycopg2.connect(conn_str)

Now that we’ve connected to the database, let’s try a sample query. We’ll just take the first 10 rows from the 2015 LAR dataset.

df = pd.read_sql('select * from hmda_public.lar_2015 limit 10', con=conn)
df
##   activity_year respondent_id agency_code loan_type  ...  one_to_four_units app_date_indicator
## 0          2015    0000000319           5         1  ...           00001849                  0
## 1          2015    0000001040           5         3  ...           00001268                  0
## 2          2015    0000001281           3         1  ...           00001797                  0
## 3          2015    0000001281           3         1  ...           00001754                  0
## 4          2015    0000001281           3         1  ...           00000327                  0
## 5          2015    0000001281           3         1  ...           00001268                  0
## 6          2015    0000001281           3         1  ...           00001754                  0
## 7          2015    0000001281           3         1  ...           00001306                  0
## 8          2015    0000001281           3         1  ...           00001306                  0
## 9          2015    0000001281           3         1  ...           00001942                  0
##
## [10 rows x 45 columns]

That’s a lot of columns! What are they all, again?

df.columns
## Index(['activity_year', 'respondent_id', 'agency_code', 'loan_type',
##        'property_type', 'loan_purpose', 'occupancy', 'loan_amount',
##        'preapproval', 'action_type', 'msa', 'state_code', 'county_code',
##        'census_tract', 'applicant_ethnicity', 'co_applicant_ethnicity',
##        'applicant_race_1', 'applicant_race_2', 'applicant_race_3',
##        'applicant_race_4', 'applicant_race_5', 'co_applicant_race_1',
##        'co_applicant_race_2', 'co_applicant_race_3', 'co_applicant_race_4',
##        'co_applicant_race_5', 'applicant_sex', 'co_applicant_sex', 'income',
##        'purchaser_type', 'denial_1', 'denial_2', 'denial_3', 'rate_spread',
##        'hoepa_status', 'lien_status', 'edit_status', 'sequence_num',
##        'population', 'minority_population_percent', 'ffiec_median_income',
##        'tract_to_msa_income_percent', 'owner_occupied_units',
##        'one_to_four_units', 'app_date_indicator'],
##       dtype='object')

1.5.2 Understanding our data

We’ll need to read definitions from both the documentation and the whitepaper so we can understand what the different columns mean.

For the documentation, you’ll use this file here.

For the whitepaper, you’ll want to scroll down to the HMDA variables section.

1.5.3 Following the whitepaper

We’ll need to translate the terms used in the whitepaper into what we’ll be doing with the SQL database.

Loan type

While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.

loan_type = '1'

Property type

Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.

property_type = '1'

Occupancy

We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.

occupancy = '1'

Action Type

We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3).

(action_type = '1' OR action_type = '3')

An applicant’s income isn’t always reported in the data. In other cases, the data caps incomes at $9.9 million and loan amounts at $99.9 million, meaning there’s a value in the database, but it isn’t precise. We focused only on those records where income and loan amount are reported precisely. This meant discarding about 1 percent of all conventional home loans in the country for 2016.

income != '9999' AND loan_amount != '99999'

When we plotted the number of applicants, how much money they made and the size of the loan, we found that the distributions were skewed to the right, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We applied a log transformation to income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

LOG(income), LOG(loan_amount)
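
The same transformation in pandas is straightforward with numpy, which we imported above. Here’s a minimal sketch, assuming df holds our filtered records; the column names log_income and log_loan_amount are just illustrative:

# The raw columns are zero-padded strings, so cast them to numbers first.
# errors='coerce' turns unreported values like 'NA' into NaN.
df['income'] = pd.to_numeric(df['income'], errors='coerce')
df['loan_amount'] = pd.to_numeric(df['loan_amount'], errors='coerce')

# Log-transform to rein in the long right tail.
df['log_income'] = np.log(df['income'])
df['log_loan_amount'] = np.log(df['loan_amount'])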

We included all cases in our analysis regardless of lien status.

Nothing to do.

At first, we looked at race separately from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc.

(CASE WHEN applicant_ethnicity = '1' THEN '99' ELSE applicant_race_1 END) AS custom_race
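
If you’d rather do this step in pandas after pulling the data, the equivalent is a one-liner with numpy’s where. A sketch, assuming the codes above ('1' means Hispanic or Latino in the ethnicity field) and using '99' as an arbitrary code of our own for the combined group:

# Group Hispanic applicants under our made-up code '99';
# everyone else keeps their first reported race code.
df['custom_race'] = np.where(df['applicant_ethnicity'] == '1',
                             '99',
                             df['applicant_race_1'])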

We decided to look at home purchase, home improvement and refinance loans separately from each other. We’ll start with home purchase loans:

loan_purpose = '1'
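
Putting it all together, here’s what the full pull could look like. This is a sketch that simply combines the fragments above; the table name comes from our earlier query, and the variable name loans is arbitrary:

query = """
select *,
       (case when applicant_ethnicity = '1'
             then '99'
             else applicant_race_1 end) as custom_race
from hmda_public.lar_2015
where loan_type = '1'                           -- conventional loans only
  and property_type = '1'                       -- one- to four-unit properties
  and occupancy = '1'                           -- owner-occupied
  and (action_type = '1' or action_type = '3')  -- originations or denials
  and income != '9999'                          -- drop capped incomes
  and loan_amount != '99999'                    -- drop capped loan amounts
  and loan_purpose = '1'                        -- home purchase loans
"""
loans = pd.read_sql(query, con=conn)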