Cleaning EOIR dataset for analysis#
The EOIR dataset is bad, bad, bad, but let's try to clean it up anyway.
import pandas as pd
pd.set_option("display.max_columns", 50)
Preview the data#
You can find the EOIR dataset here, but beware it's about 2GB to download and even larger once you unzip it.
It's a big dataset, so we'll start by just reading in twenty rows of each of the tables we're interested in. Our tables of interest right now are:
- Cases
- Proceedings
- Charges
- Judges
It'll give us an idea of what the data looks like before we really start cleaning. We'll be using sep='\t'
since they're tab-separated.
# Cases
pd.read_csv("data/FOIA_TRAC_Report/A_TblCase.csv", sep='\t', nrows=20)
# Hearings
pd.read_csv("data/FOIA_TRAC_Report/B_TblProceeding.csv", sep='\t', nrows=20)
# Charges
pd.read_csv("data/FOIA_TRAC_Report/B_TblProceedCharges.csv", sep='\t', nrows=20)
# Judges
pd.read_csv("data/FOIA_TRAC_Report/Lookup/tblLookupJudge.csv", sep='\t', nrows=20)
Read in the data#
OK! So they're tab-delimited, they demand latin-1 encoding, and the pandas parser doesn't like some of the rows. There aren't that many rows that throw errors, so we're just going to skip them with error_bad_lines=False.
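Heads up: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0, replaced by on_bad_lines. A minimal sketch on toy data (not the EOIR files) showing the newer spelling:

```python
import io
import pandas as pd

# A toy tab-separated file where the third data row has an extra field
raw = "a\tb\n1\t2\n3\t4\t5\n6\t7\n"

# pandas >= 1.3: on_bad_lines='skip' replaces error_bad_lines=False
df = pd.read_csv(io.StringIO(raw), sep='\t', on_bad_lines='skip')
print(df.shape)  # the malformed row is dropped, leaving 2 rows x 2 columns
```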
%%time
cases = pd.read_csv("data/FOIA_TRAC_Report/A_TblCase.csv",
                    sep='\t',
                    dtype='str',
                    error_bad_lines=False,
                    warn_bad_lines=False,
                    encoding='latin-1',
                    na_values=' ',
                    usecols=['CUSTODY', 'IDNCASE', 'CASE_TYPE', 'DATE_OF_ENTRY', 'UPDATE_SITE', 'NAT'])
proceedings = pd.read_csv("data/FOIA_TRAC_Report/B_TblProceeding.csv",
                          sep='\t',
                          dtype='str',
                          error_bad_lines=False,
                          warn_bad_lines=False,
                          encoding='latin-1',  # same latin-1 encoding as the cases table
                          na_values=' ',
                          usecols=['IDNCASE', 'ABSENTIA', 'DEC_TYPE', 'DEC_CODE', 'CASE_TYPE', 'COMP_DATE', 'OSC_DATE', 'IJ_CODE'])
charges = pd.read_csv("data/FOIA_TRAC_Report/B_TblProceedCharges.csv",
                      sep='\t',
                      dtype='str',
                      na_values=' ',
                      encoding='latin-1',  # same latin-1 encoding as the cases table
                      error_bad_lines=False,
                      warn_bad_lines=False)
charge_lookup = pd.read_csv("data/FOIA_TRAC_Report/Lookup/tbllookupCharges.csv",
                            sep='\t',
                            dtype='str',
                            na_values=' ',
                            encoding='latin-1',  # same latin-1 encoding as the cases table
                            error_bad_lines=False)
How big is each of these tables?
print(f"Cases: {cases.shape[0]:,}")
print(f"Proceedings: {proceedings.shape[0]:,}")
print(f"Charges: {charges.shape[0]:,}")
print(f"Charge Lookup: {charge_lookup.shape[0]:,}")
Yeah, pretty big! It'd probably be best to put these into a Postgres database, but we're just going to use pandas for now. If you're interested in trying it out database-style, check out the post where we inserted bills into Postgres.
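If you do want to go the database route, pandas can push a DataFrame into a table through SQLAlchemy with to_sql. A quick sketch using an in-memory SQLite database so it runs anywhere; for Postgres you'd swap the connection string for something like 'postgresql://user:pass@localhost/eoir':

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite stand-in for a real Postgres connection string
engine = create_engine("sqlite:///:memory:")

# A tiny stand-in DataFrame; you'd pass `cases` instead
df = pd.DataFrame({"IDNCASE": ["1", "2"], "CASE_TYPE": ["RMV", "AOC"]})
df.to_sql("cases", engine, index=False, if_exists="replace")

# Query it back to confirm the load
result = pd.read_sql("SELECT COUNT(*) AS n FROM cases", engine)
print(result.n[0])  # 2
```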
Build the filters#
We're going to remove a lot of this data, as we're only looking for people who could plausibly be called refugees, haven't been charged with crimes, aren't in any non-standard situations, and can be compared across time with others in similar situations (post 1998 law change). And possibly some more adjustments, too!
We'll use this function to check how each filter does#
While we could just filter filter filter, I'm curious to see what percent of cases each filter removes. We'll test each filter first before we actually filter out data.
def filter_info(pandas_filter, dataframe):
    """Report how many rows a boolean filter would remove, without actually applying it."""
    cases_removed = dataframe.shape[0] - pandas_filter.sum()
    pct_removed = cases_removed / dataframe.shape[0]
    print(f"This filter would remove {pct_removed:.2%} of cases - {cases_removed:,}, to be specific")
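As a quick sanity check before pointing it at millions of rows, here's the function on toy data (the function is repeated so the snippet stands alone):

```python
import pandas as pd

# Repeated here so the snippet is self-contained
def filter_info(pandas_filter, dataframe):
    cases_removed = dataframe.shape[0] - pandas_filter.sum()
    pct_removed = cases_removed / dataframe.shape[0]
    print(f"This filter would remove {pct_removed:.2%} of cases - {cases_removed:,}, to be specific")

# Toy example: four rows, one of which is detained
df = pd.DataFrame({"CUSTODY": ["N", "R", "D", "N"]})
mask = df.CUSTODY != "D"
filter_info(mask, df)  # removes 25.00% of cases - 1, to be specific
```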
Detailed filter#
- Remove anyone who is currently detained
- N is never detained, R is released, D is detained
not_detained_filter = cases.CUSTODY != 'D'
filter_info(not_detained_filter, cases)
Case type filter#
- For case types, only RMV, AOC, WHO, NAC
case_types_filter = cases.CASE_TYPE.isin(['RMV', 'AOC', 'WHO', 'NAC'])
filter_info(case_types_filter, cases)
Absentia filter#
- No cases where any of the proceedings were done in absentia
- Need to pull the absentia case nos from proceedings first
- Then build the filter to reject the cases with those IDs
absentia_case_numbers = proceedings[proceedings.ABSENTIA == 'Y'].IDNCASE.unique()
not_absentia_filter = ~cases.IDNCASE.isin(absentia_case_numbers)
filter_info(not_absentia_filter, cases)
Post-1998 filter#
- Only cases with OSC dates in 1998 or later
- Need to pull the 1998-or-later proceedings first
- Then build the filter to accept the cases with those IDs
after_1998_case_nos = proceedings[proceedings.OSC_DATE >= '1998'].IDNCASE.unique()
after_1998_filter = cases.IDNCASE.isin(after_1998_case_nos)
filter_info(after_1998_filter, cases)
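Note that OSC_DATE >= '1998' is a plain string comparison, which only works because the dates are stored year-first. If you don't want to rely on that, a sketch (with hypothetical date values) that parses the column explicitly:

```python
import pandas as pd

# Toy proceedings with hypothetical OSC dates, including a missing one
proc = pd.DataFrame({
    "IDNCASE": ["1", "2", "3"],
    "OSC_DATE": ["1997-06-01 00:00:00", "1998-03-15 00:00:00", None],
})

# Parse to datetimes; unparseable values become NaT and fail the comparison
osc = pd.to_datetime(proc.OSC_DATE, errors="coerce")
after_1998 = proc[osc >= "1998"].IDNCASE.unique()
print(after_1998)  # ['2']
```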
Criminal charge filter#
- Filter out any case that has a criminal charge associated with it
- Find the charge codes that are criminal
- Find the rows in Proceed Charges that involve those charges, pull the case nos
- Then build the filter to reject the cases with those IDs
criminal_charge_codes = charge_lookup[charge_lookup.criminal_flag == '1'].strCode
criminal_case_numbers = charges[charges.CHARGE.isin(criminal_charge_codes)].IDNCASE.unique()
criminal_case_filter = ~cases.IDNCASE.isin(criminal_case_numbers)
filter_info(criminal_case_filter, cases)
Filter our cases#
# Actually run the filters to get a filtered dataset
filtered_cases = cases[not_detained_filter & case_types_filter & not_absentia_filter & after_1998_filter & criminal_case_filter]
print(f"Original case count: {cases.shape[0]:,}")
print(f"Final case count: {filtered_cases.shape[0]:,}")
Filter the proceedings#
filtered_proceedings = proceedings[proceedings.IDNCASE.isin(filtered_cases.IDNCASE)]
# Need to filter for case types here, too
filtered_proceedings = filtered_proceedings[filtered_proceedings.CASE_TYPE.isin(['RMV', 'AOC', 'WHO', 'NAC'])]
print(f"Original proceedings count: {proceedings.shape[0]:,}")
print(f"Filtered proceedings count: {filtered_proceedings.shape[0]:,}")
Save our data#
Honestly: we still have too many cases (or too few, depending on how you're counting). The dataset is a wreck anyway, but we can live with it.
filtered_cases.to_csv("data/cases-filtered.csv", index=False)
filtered_proceedings.to_csv("data/proceedings-filtered.csv", index=False)