1.5 Working with our data
1.5.1 Connecting to the database
We’ll start with a few imports. We typically use pandas to read CSV and Excel files, but it can also read data from a Postgres database in (almost) exactly the same way.
In order to connect to the database, we’ll need to provide a connection string. You may need to change the username and password, although these are the defaults for a Postgres installation on OS X.
import pandas as pd
import psycopg2

conn_str = "host='localhost' dbname='hmda' user='postgres' password=''"
conn = psycopg2.connect(conn_str)
Now that we’ve connected to the database, let’s try a sample query. We’ll just take the first 10 rows from the 2015 LAR dataset.
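A sketch of how that query might look with pandas follows. The table name lar_2015 is an assumption, since the actual schema isn’t shown here; substitute the table name from your own HMDA database.

```python
# NOTE: the table name "lar_2015" is an assumption -- replace it with
# the actual table name in your HMDA database.
query = "SELECT * FROM lar_2015 LIMIT 10"

# With the psycopg2 connection created above, pandas can run the query:
#   df = pd.read_sql(query, conn)
print(query)
```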
(Output: the first 10 rows of the 2015 LAR table. At 45 columns the table is too wide to reproduce legibly here, and empty columns were collapsed in the original layout, so the printed values no longer line up with the headers. The full column list appears just below.)
That’s a lot of columns! What are they all, again?
## Index(['activity_year', 'respondent_id', 'agency_code', 'loan_type',
## 'property_type', 'loan_purpose', 'occupancy', 'loan_amount',
## 'preapproval', 'action_type', 'msa', 'state_code', 'county_code',
## 'census_tract', 'applicant_ethnicity', 'co_applicant_ethnicity',
## 'applicant_race_1', 'applicant_race_2', 'applicant_race_3',
## 'applicant_race_4', 'applicant_race_5', 'co_applicant_race_1',
## 'co_applicant_race_2', 'co_applicant_race_3', 'co_applicant_race_4',
## 'co_applicant_race_5', 'applicant_sex', 'co_applicant_sex', 'income',
## 'purchaser_type', 'denial_1', 'denial_2', 'denial_3', 'rate_spread',
## 'hoepa_status', 'lien_status', 'edit_status', 'sequence_num',
## 'population', 'minority_population_percent', 'ffiec_median_income',
## 'tract_to_msa_income_percent', 'owner_occupied_units',
## 'one_to_four_units', 'app_date_indicator'],
## dtype='object')
1.5.2 Understanding our data
We’ll need to read definitions from both the documentation and the whitepaper so we can understand what the different columns mean.
For the documentation, you’ll use this file here.
For the whitepaper, you’ll want to scroll down to the HMDA variables section.
1.5.3 Following the whitepaper
We’ll need to translate the terms used in the whitepaper into what we’ll be doing with the SQL database.
Loan type
While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.
loan_type = '1'
Property type
Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multi-family properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.
property_type = '1'
Occupancy
We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.
occupancy = '1'
Action type
We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3).
(action_type = '1' OR action_type = '3')
Income and loan amount
An applicant’s income isn’t always reported in the data. In other cases, the data caps incomes at $9.9 million and loan amounts at $99.9 million, meaning there’s a value in the database, but it isn’t precise. We focused only on records where income and loan amount are reported accurately. This meant discarding about 1 percent of all conventional home loans in the country for 2016.
income != '9999' AND loan_amount != '99999'
Log transformation
When we plotted the number of applicants against how much money they made and the size of the loan, we found the distributions were skewed to the right: the majority of applicants were clustered at the lower end of the income and loan amount scales, with a long tail of high values. This was especially true for applicants of color. We took the logarithm of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.
LOG(income), LOG(loan_amount)
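As a rough sketch of why the transformation helps, note how the log compresses a skewed range of values. The incomes below are made up for illustration:

```python
import math

# Hypothetical incomes in thousands of dollars, with one extreme outlier
incomes = [35, 48, 62, 90, 4500]

logged = [math.log(x) for x in incomes]

# The raw outlier is 50x the next-largest value; after the log it is
# under 2x, so it no longer dominates the scale.
ratio_raw = incomes[-1] / incomes[-2]
ratio_log = logged[-1] / logged[-2]
print(ratio_raw, ratio_log)
```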
Lien status
We included all cases in our analysis regardless of lien status.
Nothing to do.
Race and ethnicity
At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc.
(CASE WHEN applicant_ethnicity = '1' THEN '99' ELSE applicant_race_1 END) AS custom_race
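The same grouping can be sketched in pandas. The column names follow the listing above, '99' is the combined Hispanic/Latino code from the CASE expression, and the rows here are toy stand-ins for real LAR records:

```python
import pandas as pd

# Toy rows standing in for real LAR records
df = pd.DataFrame({
    "applicant_ethnicity": ["1", "2", "2"],
    "applicant_race_1": ["5", "5", "3"],
})

# Hispanic applicants (ethnicity '1') get the combined code '99';
# everyone else keeps their reported race.
df["custom_race"] = df["applicant_race_1"].where(
    df["applicant_ethnicity"] != "1", "99"
)
print(df["custom_race"].tolist())
```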
Loan purpose
We decided to look at home purchase, home improvement and refinance loans separately from each other. Here we start with home purchase loans.
loan_purpose = '1'
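Putting the pieces above together, the whitepaper’s filters can be assembled into a single WHERE clause. This is a sketch, and the table name lar_2015 is an assumption:

```python
filters = [
    "loan_type = '1'",         # conventional loans
    "property_type = '1'",     # one- to four-unit properties
    "occupancy = '1'",         # owner-occupied
    "(action_type = '1' OR action_type = '3')",  # originations or denials
    "income != '9999'",        # drop capped incomes
    "loan_amount != '99999'",  # drop capped loan amounts
    "loan_purpose = '1'",      # home purchase loans
]

# NOTE: the table name "lar_2015" is an assumption.
query = "SELECT * FROM lar_2015 WHERE " + " AND ".join(filters)
print(query)
```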