1.4 Obtaining our data
1.4.1 Downloading HMDA Data Science Kit
You can browse around and download the actual files from HMDA, but it’s kind of a complicated process, and not terribly simple if you need to cross multiple years.
Thankfully, the Consumer Finance Protection Bureau has done a great job producing software products to help HMDA data be analyzed (and analyzed well!). One of these products is the HMDA Data Science Kit, which we’ll be using.
According to the README:
The primary goal of this repository is to provide data users with tools to enable them to produce accurate analytics results. Additionally, this repository provides an overview of HMDA resources, publications, and guidelines for proper use. This repository does not provide statutory interpretation or compliance assistance.
They’re missing out on the most important part: it has a bunch of scripts that help you download the data and process it into a Postgres database!
Why Postgres and not Excel or a CSV file? It’s a lot of data. A lot lot lot lot lot of data. Many, many, many gigs of data, especially if we’re dealing with more recent years or data across multiple years.
We aren’t going to have a good time if we’re trying to deal with it via the typical CSV+pandas setup — we’d either need a lot of RAM in our machine, to use something in the cloud, or work with a fancier setup that is able to chunk in large datasets to pandas.
Postgres is the perfect solution, as it allows us to segment our data into useful pieces quickly and easily, and take summary statistics without too much work.
To download HMDA Data Science Kit, you can either click this link or download it from their GitHub Repo.
1.4.2 Importing data into Postgres
According to Reveal’s writeup, “Reveal separately analyzed data from 2015 and 2016.” We’ll just stick with 2015 to keep things a bit simpler.
The HMDA Data Science README has a section on downloading and importing the data, but I’ve simplified the instructions for you below.
If you need to install the PostgreSQL database software, you might find this page helpful. Scroll down a bit for the PostgreSQL instructions.
First, we’ll unzip HMDA_Data_Science_Kit-master.zip
into a new folder. There is a lot of stuff in there, but we’re only interested in a few scripts that download datasets and import them into a database.
The first script will download the 2015 LAR HMDA data:
bash download_scripts/download_hmda.sh -Fs lar_2015
This next command extracts the data from the zip files and renames them to a standardized format. We only have one LAR file, but that’s fine! This one will take forever. Like, forever forever.
bash download_scripts/unzip_and_rename_lar.sh
You’ll receive a handful of error messages — you’re missing the other years, after all — but as long as it hung around on inflating:
for a long long time you’ll be all set.
Now we need to get started with our database. First we’ll create the standardized database…
bash load_scripts/create_hmda_db.sh
…then we’ll load in our data. Even though this command looks like it’s going to load in thirteen years of data, since we only have LAR data fro 2015 that’s all that will be imported into Postgres.
bash load_scripts/create_and_load_lar_2004_2017.sh
Again, it’s a lot of data, so it’ll sit at CREATE TABLE
for a long long time while it shovels it out of the CSV and into our database.