1.5 Working with our data

1.5.1 Connecting to the database

We’ll start with a few imports. Even though we typically use pandas to read in CSV and Excel files, it can also read data from a Postgres database in (almost) exactly the same way.

import pandas as pd
import psycopg2
import statsmodels.api as sm
import numpy as np

In order to connect to the database, we’ll need to provide a connection string. You might need to change the username and password, although these are the defaults for a Postgres installation on OS X.

conn_str = "host='localhost' dbname='hmda' user='postgres' password=''"
conn = psycopg2.connect(conn_str)

Now that we’ve connected to the database, let’s try a sample query. We’ll just take the first 10 rows from the 2015 LAR dataset.

df = pd.read_sql('select * from hmda_public.lar_2015 limit 10', con=conn)
df
##   activity_year respondent_id agency_code loan_type  ...  one_to_four_units app_date_indicator
## 0          2015    0000000319           5         1  ...           00001849                  0
## 1          2015    0000001040           5         3  ...           00001268                  0
## 2          2015    0000001281           3         1  ...           00001797                  0
## 3          2015    0000001281           3         1  ...           00001754                  0
## 4          2015    0000001281           3         1  ...           00000327                  0
## 5          2015    0000001281           3         1  ...           00001268                  0
## 6          2015    0000001281           3         1  ...           00001754                  0
## 7          2015    0000001281           3         1  ...           00001306                  0
## 8          2015    0000001281           3         1  ...           00001306                  0
## 9          2015    0000001281           3         1  ...           00001942                  0
##
## [10 rows x 45 columns]

That’s a lot of columns! What are they all, again?

df.columns
## Index(['activity_year', 'respondent_id', 'agency_code', 'loan_type',
##        'property_type', 'loan_purpose', 'occupancy', 'loan_amount',
##        'preapproval', 'action_type', 'msa', 'state_code', 'county_code',
##        'census_tract', 'applicant_ethnicity', 'co_applicant_ethnicity',
##        'applicant_race_1', 'applicant_race_2', 'applicant_race_3',
##        'applicant_race_4', 'applicant_race_5', 'co_applicant_race_1',
##        'co_applicant_race_2', 'co_applicant_race_3', 'co_applicant_race_4',
##        'co_applicant_race_5', 'applicant_sex', 'co_applicant_sex', 'income',
##        'purchaser_type', 'denial_1', 'denial_2', 'denial_3', 'rate_spread',
##        'hoepa_status', 'lien_status', 'edit_status', 'sequence_num',
##        'population', 'minority_population_percent', 'ffiec_median_income',
##        'tract_to_msa_income_percent', 'owner_occupied_units',
##        'one_to_four_units', 'app_date_indicator'],
##       dtype='object')

1.5.2 Understanding our data

We’ll need to read definitions from both the documentation and the whitepaper so we can understand what the different columns mean.

For the documentation, you’ll use this file here.

For the whitepaper, you’ll want to scroll down to the HMDA variables section.

1.5.3 Following the whitepaper

We’ll need to translate the terms used in the whitepaper into what we’ll be doing with the SQL database.

Loan type

While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.

loan_type = '1'

Property type

Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.

property_type = '1'

Occupancy

We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.

occupancy = '1'

Action Type

We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3).

(action_type = '1' OR action_type = '3')

An applicant’s income isn’t always reported in the data. In other cases, the data caps incomes at $9.9 million and loan amounts at $99.9 million, meaning there’s a value in the database, but it isn’t precise. We focused only on those records where income and loan amount are reported precisely. This meant discarding about 1 percent of all conventional home loans in the country for 2016.

income != '9999' AND loan_amount != '99999'

When we plotted the number of applicants, how much money they made and the size of the loan, we found that the distributions were skewed to the right, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We applied a log transformation to income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

LOG(income), LOG(loan_amount)
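
The same transformation in pandas is straightforward with numpy, which we imported above. Here’s a minimal sketch, assuming df holds our filtered records; the column names log_income and log_loan_amount are just illustrative:

# The raw columns are zero-padded strings, so cast them to numbers first.
# errors='coerce' turns unreported values like 'NA' into NaN.
df['income'] = pd.to_numeric(df['income'], errors='coerce')
df['loan_amount'] = pd.to_numeric(df['loan_amount'], errors='coerce')

# Log-transform to rein in the long right tail.
df['log_income'] = np.log(df['income'])
df['log_loan_amount'] = np.log(df['loan_amount'])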

We included all cases in our analysis regardless of lien status.

Nothing to do.

At first, we looked at race separately from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc.

(CASE WHEN applicant_ethnicity = '1' THEN '99' ELSE applicant_race_1 END) AS custom_race
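
If you’d rather do this step in pandas after pulling the data, the equivalent is a one-liner with numpy’s where. A sketch, assuming the codes above ('1' means Hispanic or Latino in the ethnicity field) and using '99' as an arbitrary code of our own for the combined group:

# Group Hispanic applicants under our made-up code '99';
# everyone else keeps their first reported race code.
df['custom_race'] = np.where(df['applicant_ethnicity'] == '1',
                             '99',
                             df['applicant_race_1'])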

We decided to look at home purchase, home improvement and refinance loans separately from each other. We’ll start with home purchase loans:

loan_purpose = '1'
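
Putting it all together, here’s what the full pull could look like. This is a sketch that simply combines the fragments above; the table name comes from our earlier query, and the variable name loans is arbitrary:

query = """
select *,
       (case when applicant_ethnicity = '1'
             then '99'
             else applicant_race_1 end) as custom_race
from hmda_public.lar_2015
where loan_type = '1'                           -- conventional loans only
  and property_type = '1'                       -- one- to four-unit properties
  and occupancy = '1'                           -- owner-occupied
  and (action_type = '1' or action_type = '3')  -- originations or denials
  and income != '9999'                          -- drop capped incomes
  and loan_amount != '99999'                    -- drop capped loan amounts
  and loan_purpose = '1'                        -- home purchase loans
"""
loans = pd.read_sql(query, con=conn)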