Cleaning EOIR dataset for analysis#

The EOIR dataset is bad, bad, bad, but let's try to clean it up anyway.

import pandas as pd
pd.set_option("display.max_columns", 50)

Preview the data#

You can find the EOIR dataset here, but beware it's about 2GB to download and even larger once you unzip it.

It's a big dataset, so we'll start by reading in just twenty rows of each table we care about. Our tables of interest right now are:

  • Cases
  • Proceedings
  • Charges
  • Judges

This will give us an idea of what the data looks like before we really start cleaning. We'll be using sep='\t' since they're tab-separated.
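If you'd like to confirm the delimiter yourself before handing a file to pandas, peeking at a raw line or two works fine. A quick sketch of that one-off check (we're reading with latin-1 here because, as we'll see below, that's the encoding these files want):

# Peek at the first couple of raw lines to eyeball the delimiter
with open("data/FOIA_TRAC_Report/A_TblCase.csv", encoding="latin-1") as f:
    for _ in range(2):
        print(f.readline())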

# Cases
pd.read_csv("data/FOIA_TRAC_Report/A_TblCase.csv", sep='\t', nrows=20)
IDNCASE ALIEN_CITY ALIEN_STATE NAT LANG CUSTODY SITE_TYPE E_28_DATE ATTY_NBR CASE_TYPE UPDATE_SITE LATEST_HEARING LATEST_TIME LATEST_CAL_TYPE UP_BOND_DATE UP_BOND_RSN CORRECTIONAL_FAC RELEASE_MONTH RELEASE_YEAR INMATE_HOUSING DATE_OF_ENTRY C_ASY_TYPE C_BIRTHDATE C_RELEASE_DATE UPDATED_CITY UPDATED_STATE ADDRESS_CHANGEDON ZBOND_MRG_FLAG GENDER DATE_DETAINED DATE_RELEASED LPR DETENTION_DATE DETENTION_LOCATION DCO_LOCATION DETENTION_FACILITY_TYPE CASEPRIORITY_CODE
0 2048313 EL CENTRO CA GE ENG D NaN 0 RMV IMP 2001-08-08 00:00:00.000 830 M NaN 1950-06-14 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2048314 SAN FRANCISCO CA MT ENG D 1997-10-01 00:00:00.000 2 RMV SFR 1999-02-01 00:00:00.000 230 I NaN D 00 0000 1951-09-21 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2048315 SALINAS CA MX SP N 1987-06-15 00:00:00.000 1 DEP SFR 1987-12-28 00:00:00.000 230 I NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2048316 BATAVIA NY MX ENG D 1999-01-05 00:00:00.000 2 RMV BTV 1999-02-23 00:00:00.000 900 M NaN D 00 0000 1955-05-18 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2048317 JACKSON MI CA ENG D 2000-03-16 00:00:00.000 1 RMV DET 2000-08-17 00:00:00.000 100 I NaN S 08 2000 1953-01-12 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 2048318 FA FA MX SP N 1989-11-17 00:00:00.000 1 EXC SNA 1989-11-17 00:00:00.000 800 M NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 2048319 BOYNTON BEACH FL MX SP N M 2001-10-23 00:00:00.000 2 RMV MIA 2001-10-23 00:00:00.000 930 M NaN 1954-08-06 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 2048320 EL CENTRO CA GE ENG D NaN 0 RMV IMP 2001-10-03 00:00:00.000 830 I NaN 1950-06-14 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 2048321 HOUSTON TX MX SP N 1999-09-01 00:00:00.000 1 RMV HOU 2000-02-02 00:00:00.000 900 M NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 2048322 CHICAGO IL MX SP D NaN 0 DEP CHI 1998-02-03 00:00:00.000 900 M NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 2048325 SAN JUAN TX MX SP N 1984-06-08 00:00:00.000 1 DEP HLG 1985-04-30 00:00:00.000 900 M NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 2048326 GLENDALE, NY YO ENG N NaN 1 DEP NYC 1987-06-22 00:00:00.000 100 I NaN 00 0000 NaN E NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 2048327 EL PASO TX MX SP R 1986-03-12 00:00:00.000 1 DEP ELP 1989-12-06 00:00:00.000 900 I NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 2048328 HUNTSVILLE TX MX ENG D 2000-12-27 00:00:00.000 1 RMV CIC 2001-02-20 00:00:00.000 900 M NaN S 03 2001 HUN 1955-02-14 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 2048329 SEE REMARKS FA MX SP N NaN 0 EXC SNA 1992-11-30 00:00:00.000 1230 M NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 2048330 ATLANTA GA CA ENG D NaN 0 RMV ATL 1999-02-18 00:00:00.000 1100 M NaN 00 0000 1955-07-01 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 2048331 OAKDALE LA HO ENG D NaN 0 RMV OAK 2002-04-12 00:00:00.000 100 M NaN 00 0000 1955-09-24 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 2048332 MODESTO CA CA ENG N 1989-03-09 00:00:00.000 1 EXC SFR 1989-08-01 00:00:00.000 230 I NaN 00 0000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2048333 SAN DIEGO CA MX SP D NaN 0 RMV SND 2001-02-27 00:00:00.000 100 M NaN S 03 2001 1954-03-05 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 2048334 ENGLEWOOD NJ HU HUN N 1996-08-26 00:00:00.000 2 DEP NEW 1997-01-16 00:00:00.000 1100 M NaN 00 0000 1950-12-11 00:00:00.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# Proceedings
pd.read_csv("data/FOIA_TRAC_Report/B_TblProceeding.csv", sep='\t', nrows=20)
IDNPROCEEDING IDNCASE OSC_DATE INPUT_DATE BASE_CITY_CODE HEARING_LOC_CODE IJ_CODE TRANS_IN_DATE PREV_HEARING_LOC PREV_HEARING_BASE PREV_IJ_CODE TRANS_NBR HEARING_DATE HEARING_TIME DEC_TYPE DEC_CODE DEPORTED_1 DEPORTED_2 OTHER_COMP APPEAL_RSVD APPEAL_NOT_FILED COMP_DATE ABSENTIA VENUE_CHG_GRANTED TRANSFER_TO DATE_APPEAL_DUE_STATUS TRANSFER_STATUS CUSTODY CASE_TYPE NAT LANG SCHEDULED_HEAR_LOC CORRECTIONAL_FAC CRIM_IND IHP AGGRAVATE_FELON DATE_DETAINED DATE_RELEASED
0 830 3327498 1994-09-21 00:00:00.000 1995-02-23 15:48:00.000 KRO KRO NSF 1995-02-27 00:00:00.000 MIA MIA PAM 2 1995-03-07 00:00:00.000 100 C 1995-03-10 00:00:00.000 N 1995-03-10 00:00:00.000 MIA NaN V R DEP HO SP KRO N NaN NaN NaN
1 831 3327498 1994-09-21 00:00:00.000 1995-02-23 15:40:00.000 MIA MIA PAM NaN 1 1995-02-23 00:00:00.000 300 T 1995-02-23 00:00:00.000 N 1995-02-23 00:00:00.000 KRO NaN V R DEP HO SP MIA N NaN NaN NaN
2 832 3327498 1994-09-21 00:00:00.000 1994-10-27 10:11:00.000 MIA MIA PAM 1994-11-01 00:00:00.000 PIS PIS CAL 1 1994-12-27 00:00:00.000 1030 W D HO 1994-12-27 00:00:00.000 Y NaN NaN C R DEP HO SP MIA N NaN NaN NaN
3 833 3327498 1994-09-21 00:00:00.000 1994-09-27 00:00:00.000 PIS PIS CAL NaN 0 1994-10-13 00:00:00.000 900 C 1994-10-27 00:00:00.000 N 1994-10-27 00:00:00.000 MIA NaN V R DEP HO SP PIS N NaN NaN NaN
4 834 3327502 1994-09-21 00:00:00.000 1994-09-27 00:00:00.000 PIS PIS CAL NaN 0 1994-10-13 00:00:00.000 900 7 D HO 1994-10-27 00:00:00.000 N NaN NaN D DEP HO SP PIS N NaN NaN NaN
5 835 3327506 1994-09-21 00:00:00.000 1994-09-29 00:00:00.000 PIS PIS HEA NaN 0 1994-10-18 00:00:00.000 900 7 D HO 1994-11-08 00:00:00.000 N NaN NaN D DEP HO SP PIS N NaN NaN NaN
6 836 3327510 1994-09-21 00:00:00.000 1994-09-27 00:00:00.000 PIS PIS JZ NaN 0 1994-10-12 00:00:00.000 900 O D HO 1994-10-12 00:00:00.000 N NaN NaN D DEP HO SP PIS N NaN NaN NaN
7 837 3327514 1994-09-21 00:00:00.000 1994-09-27 00:00:00.000 PIS PIS CAL NaN 0 1994-10-13 00:00:00.000 900 W D HO 1995-01-31 00:00:00.000 Y NaN NaN R DEP HO SP PIS N NaN NaN NaN
8 838 3327535 1994-09-16 00:00:00.000 1994-11-22 10:26:00.000 SFR SFR BJH 1994-11-30 00:00:00.000 HLG HLG MB 2 1995-04-04 00:00:00.000 830 6 V IN O 1997-03-26 00:00:00.000 N NaN 1997-04-25 00:00:00.000 C R DEP IN PUN SFR N NaN NaN NaN
9 839 3327535 1994-09-16 00:00:00.000 1994-10-05 10:52:00.000 HLG HLG MB 1994-10-06 00:00:00.000 PIS PIS CLR 1 1994-11-21 00:00:00.000 900 C 1994-11-21 00:00:00.000 N 1994-11-22 00:00:00.000 SFR NaN V R DEP IN PUN HLG N NaN NaN NaN
10 840 3327535 1994-09-16 00:00:00.000 1994-09-26 00:00:00.000 PIS PIS CLR NaN 0 1994-10-05 00:00:00.000 900 T 1994-10-05 00:00:00.000 N 1994-10-05 00:00:00.000 HLG NaN V R DEP IN PUN PIS N NaN NaN NaN
11 841 3327539 1994-09-16 00:00:00.000 1994-10-31 15:26:00.000 HLG HLG HEA 1994-11-02 00:00:00.000 PIS PIS HEA 1 1995-01-11 00:00:00.000 900 W D IN 1995-03-08 00:00:00.000 Y NaN NaN C R DEP IN PUN HLG N NaN NaN NaN
12 842 3327539 1994-09-16 00:00:00.000 1994-09-26 00:00:00.000 PIS PIS HEA NaN 0 1994-10-12 00:00:00.000 900 T 1994-10-31 00:00:00.000 N 1994-10-31 00:00:00.000 HLG NaN V R DEP IN PUN PIS N NaN NaN NaN
13 843 3327543 1994-09-16 00:00:00.000 1994-09-26 11:25:00.000 HLG HLG MEP 1994-09-28 00:00:00.000 PIS PIS CLR 1 1994-10-25 00:00:00.000 900 W D IN 1994-10-25 00:00:00.000 Y NaN NaN C R DEP IN UNK HLG N NaN NaN NaN
14 844 3327543 1994-09-16 00:00:00.000 1994-09-21 00:00:00.000 PIS PIS CLR NaN 0 1994-09-26 00:00:00.000 900 T 1994-09-26 00:00:00.000 N 1994-09-26 00:00:00.000 HLG NaN V R DEP IN UNK PIS N NaN NaN NaN
15 845 3327176 1994-09-16 00:00:00.000 1994-10-06 09:17:00.000 HLG HLG HEA 1994-10-19 00:00:00.000 PIS PIS HEA 1 1995-01-20 00:00:00.000 900 W D IN 1995-01-20 00:00:00.000 Y NaN 1995-02-21 00:00:00.000 C R DEP IN UNK HLG N NaN NaN NaN
16 846 3327176 1994-09-16 00:00:00.000 1994-09-21 00:00:00.000 PIS PIS HEA NaN 0 1994-10-06 00:00:00.000 900 T 1994-10-06 00:00:00.000 N 1994-10-06 00:00:00.000 HLG NaN V R DEP IN UNK PIS N NaN NaN NaN
17 847 3327181 1994-09-16 00:00:00.000 1994-09-28 12:23:00.000 HLG HLG MB 1994-09-29 00:00:00.000 PIS PIS CLR 1 1994-11-16 00:00:00.000 900 O D IN 1994-11-16 00:00:00.000 Y NaN NaN C R DEP IN UNK HLG N NaN NaN NaN
18 848 3327181 1994-09-16 00:00:00.000 1994-09-21 00:00:00.000 PIS PIS CLR NaN 0 1994-09-28 00:00:00.000 900 T 1994-09-28 00:00:00.000 N 1994-09-28 00:00:00.000 HLG NaN V R DEP IN UNK PIS N NaN NaN NaN
19 849 3327185 1994-09-16 00:00:00.000 1994-10-06 09:17:00.000 HLG HLG HEA 1994-10-19 00:00:00.000 PIS PIS HEA 1 1995-01-20 00:00:00.000 900 W D IN 1995-01-20 00:00:00.000 Y NaN NaN C R DEP IN UNK HLG N NaN NaN NaN
# Charges
pd.read_csv("data/FOIA_TRAC_Report/B_TblProceedCharges.csv", sep='\t', nrows=20)
    IDNPRCDCHG  IDNCASE  IDNPROCEEDING      CHARGE  CHG_STATUS
0            1  2047228              1      212a20         NaN
1            2  2047228              1      212a26         NaN
2            3  2047232              2    212a06Ai           O
3            4  2047236              3    212a06Ai           S
4            5  2047240              4    241a02Bi           S
5            6  2047248              6     237a01B           O
6            7  2047252              7    212a06Ci           O
7            8  2047255              8      241a11         NaN
8            9  2047258              9      241a02         NaN
9           10  2047262             10     237a01B           S
10          11  2047262             11     237a01B           O
11          12  2047262             12     237a01B           O
12          13  2047266             13      241a04         NaN
13          14  2047270             14     241a01B           S
14          15  2047274             15  237a02Aiii           O
15          16  2047274             15     237a02C           O
16          17  2047278             16     237a02C           S
17          18  2047282             17     241a01A           S
18          19  2047282             17     241a01B           S
19          20  2047286             18      241a01         NaN
# Judges
pd.read_csv("data/FOIA_TRAC_Report/Lookup/tblLookupJudge.csv", sep='\t', nrows=20)
idnJudge JUDGE_CODE JUDGE_NAME JUDGE_ST_ADDRESS JUDGE_CITY JUDGE_STATE JUDGE_ZIP_1 JUDGE_ZIP_2 JUDGE_PHONE_NO datCreatedOn datModifiedOn blnActive blnSkippedOnWheel blnLastOnWheel blnSkippedOnWheelMA blnLastOnWheelMA intOrderMA intOrderMM
0 126 AA ANTHONY ATENAIDE NaN NaN NaN NaN NaN NaN 2003-08-10 11:31:57.427 2011-02-10 10:40:22.000 0 1 0 1 0 507 505
1 127 AAA <ALL JUDGES> NaN NaN NaN NaN NaN NaN 2003-08-10 11:31:57.427 NaN 0 1 0 1 0 1 1
2 493 AAK AMIENA A. KHAN NaN NaN NaN NaN NaN NaN 2010-11-12 13:49:14.000 2010-11-12 13:49:14.000 1 1 0 1 0 889 902
3 617 AAS ARWEN A. SWINK NaN NaN NaN NaN NaN NaN 2017-01-05 15:06:40.000 2017-01-05 15:06:40.000 1 1 0 1 0 886 -190
4 376 AAT A. ASHLEY TABADDOR NaN NaN NaN NaN NaN NaN 2005-12-09 13:30:02.000 2005-12-09 13:30:02.000 1 1 0 1 0 -1245 886
5 128 AAV ALAN A. VOMACKA NaN NaN NaN NaN NaN NaN 2003-08-10 11:31:57.427 NaN 1 1 0 1 0 -733 -15057
6 801 ABE AUDRA BEHNE NaN NaN NaN NaN NaN NaN 2018-08-30 09:31:52.000 2018-08-30 09:31:52.000 1 1 0 1 0 -4069 -7564
7 450 ABM ALISON M. BROWN NaN NaN NaN NaN NaN NaN 2010-04-23 16:30:30.000 2010-04-23 16:30:30.000 1 1 0 1 0 904 275
8 452 ABT ALISON M. BROWN (TEMP) NaN NaN NaN NaN NaN NaN 2010-04-27 14:21:38.000 2010-04-28 14:34:55.000 0 1 0 1 0 434 -165
9 540 AC2 AMIT CHUGH NaN NaN NaN NaN NaN NaN 2015-06-11 16:20:55.000 2015-06-11 16:20:55.000 1 1 0 1 0 -432 -3406
10 129 ACB ARVID C. BOYES NaN NaN NaN NaN NaN NaN 2003-08-10 11:31:57.427 NaN 0 1 0 1 0 1 737
11 472 ACH AMY C. HOOGASIAN NaN NaN NaN NaN NaN NaN 2010-10-08 15:25:59.000 2017-04-18 16:10:12.000 1 1 0 1 0 -538 807
12 867 ACL ANNA C. LITTLE NaN NaN NaN NaN NaN NaN 2018-12-19 14:24:58.420 2018-12-19 14:24:58.420 1 1 0 1 0 906 451
13 461 ADM ANIBAL D. MARTINEZ NaN NaN NaN NaN NaN NaN 2010-09-01 14:43:28.000 2010-09-01 14:43:28.000 1 1 0 1 0 344 856
14 130 ADP ANTHONY D. PETRONE, JR. NaN NaN NaN NaN NaN NaN 2003-08-10 11:31:57.427 NaN 0 1 0 1 0 1 475
15 383 AED ALISON E DAW NaN NaN NaN NaN NaN NaN 2006-05-18 11:45:42.000 2006-05-18 11:45:42.000 1 1 0 1 0 368 126
16 131 AEG ALBERTO E. GONZALEZ NaN NaN NaN NaN NaN NaN 2003-08-10 11:31:57.427 NaN 1 1 0 1 0 1 398
17 638 AEM ANTHONY E. MAINGOT NaN NaN NaN NaN NaN NaN 2017-03-03 09:57:17.000 2017-03-03 09:57:17.000 1 1 0 1 0 901 -2048
18 805 AEP ANNE PERRY NaN NaN NaN NaN NaN NaN 2018-08-31 11:46:09.000 2018-10-24 22:23:15.000 0 1 0 1 0 -299 -899
19 744 AJD ANGELO J. DICAMILLO NaN NaN NaN NaN NaN NaN 2018-01-25 15:50:28.000 2018-01-25 15:50:28.000 1 1 0 1 0 -538 -2109

Read in the data#

OK! So the files are tab-delimited, demand latin-1 encoding, and contain a few rows the pandas parser doesn't like. There aren't that many rows that throw errors, so we're just going to skip them with error_bad_lines=False.

%%time

cases = pd.read_csv("data/FOIA_TRAC_Report/A_TblCase.csv", 
                    sep='\t', 
                    dtype='str', 
                    error_bad_lines=False,
                    warn_bad_lines=False,
                    encoding='latin-1', 
                    na_values=' ',
                    usecols=['CUSTODY', 'IDNCASE', 'CASE_TYPE', 'DATE_OF_ENTRY', 'UPDATE_SITE', 'NAT'])

proceedings = pd.read_csv("data/FOIA_TRAC_Report/B_TblProceeding.csv", 
                          sep='\t', 
                          dtype='str',
                          error_bad_lines=False,
                          warn_bad_lines=False,
                          na_values=' ',
                          usecols=['IDNCASE', 'ABSENTIA', 'DEC_TYPE', 'DEC_CODE', 'CASE_TYPE', 'COMP_DATE', 'OSC_DATE', 'IJ_CODE'])

charges = pd.read_csv("data/FOIA_TRAC_Report/B_TblProceedCharges.csv", 
                          sep='\t',
                          dtype='str', 
                          na_values=' ',
                          error_bad_lines=False,
                          warn_bad_lines=False)

charge_lookup = pd.read_csv("data/FOIA_TRAC_Report/Lookup/tbllookupCharges.csv", 
                            sep='\t',
                            dtype='str',
                            na_values=' ',
                            error_bad_lines=False)
CPU times: user 55.9 s, sys: 22 s, total: 1min 17s
Wall time: 1min 39s
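A heads-up if you're following along on a newer pandas: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and later removed, replaced by a single on_bad_lines argument. A sketch of the equivalent call:

# pandas >= 1.3: on_bad_lines='skip' replaces
# error_bad_lines=False + warn_bad_lines=False
cases = pd.read_csv("data/FOIA_TRAC_Report/A_TblCase.csv",
                    sep='\t',
                    dtype='str',
                    on_bad_lines='skip',
                    encoding='latin-1',
                    na_values=' ',
                    usecols=['CUSTODY', 'IDNCASE', 'CASE_TYPE',
                             'DATE_OF_ENTRY', 'UPDATE_SITE', 'NAT'])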

How big is each of these tables?

print(f"Cases: {cases.shape[0]:,}")
print(f"Proceedings: {proceedings.shape[0]:,}")
print(f"Charges: {charges.shape[0]:,}")
print(f"Charge Lookup: {charge_lookup.shape[0]:,}")
Cases: 6,710,354
Proceedings: 8,890,744
Charges: 10,377,199
Charge Lookup: 251

Yeah, pretty big! It'd probably be best to put these into a Postgres database, but we're just going to use pandas for now. If you're interested in trying it out database-style, check out the time we inserted bills into Postgres.
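For the curious, here's a minimal sketch of the database route with SQLAlchemy. The connection string and table name are made up for illustration, so swap in your own:

from sqlalchemy import create_engine

# Hypothetical local database; adjust the connection string to taste
engine = create_engine("postgresql://localhost:5432/eoir")

# chunksize keeps pandas from building one giant INSERT
cases.to_sql("cases", engine, index=False, chunksize=10_000)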

Build the filters#

We're going to remove a lot of this data, since we're only looking for people who could plausibly be called refugees: people who haven't been charged with crimes, who aren't in any non-standard situations, and whose cases can be compared across time with others in similar situations (that is, cases from after the 1998 law change). And possibly some more adjustments, too!

We'll use this function to check how each filter does#

While we could just filter, filter, filter, I'm curious to see what percentage of cases each filter removes. So before we actually throw anything out, we'll test each filter on its own.

def filter_info(pandas_filter, dataframe):
    """Report how many rows a boolean filter would remove, without removing them."""
    cases_removed = dataframe.shape[0] - pandas_filter.sum()
    pct_removed = cases_removed / dataframe.shape[0]
    print(f"This filter would remove {pct_removed:.2%} of cases - {cases_removed:,}, to be specific")

Detained filter#

  • Remove anyone who is currently detained
  • N is never detained, R is released, D is detained
not_detained_filter = cases.CUSTODY != 'D'

filter_info(not_detained_filter, cases)
This filter would remove 35.11% of cases - 2,355,694, to be specific
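If you want to double-check those custody codes before trusting the filter, a quick value_counts shows the breakdown, missing values included:

# Distribution of custody codes, NaNs included
cases.CUSTODY.value_counts(dropna=False)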

Case type filter#

  • For case types, only RMV, AOC, WHO, NAC
case_types_filter = cases.CASE_TYPE.isin(['RMV', 'AOC', 'WHO', 'NAC'])

filter_info(case_types_filter, cases)
This filter would remove 22.46% of cases - 1,506,970, to be specific

Absentia filter#

  • No cases where any of the proceedings were done in absentia
  • Need to pull the absentia case nos from proceedings first
  • Then build the filter to reject the cases with those IDs
absentia_case_numbers = proceedings[proceedings.ABSENTIA == 'Y'].IDNCASE.unique()
not_absentia_filter = ~cases.IDNCASE.isin(absentia_case_numbers)

filter_info(not_absentia_filter, cases)
This filter would remove 17.43% of cases - 1,169,315, to be specific
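This pull-the-IDs-then-isin dance repeats for every cross-table filter in this notebook, so you could wrap it in a tiny helper if you like. A sketch (the helper name is my own invention, not from the dataset):

def case_ids_where(proceedings_filter):
    """Unique IDNCASE values whose proceedings match a condition."""
    return proceedings[proceedings_filter].IDNCASE.unique()

# The absentia filter, rewritten with the helper
not_absentia_filter = ~cases.IDNCASE.isin(case_ids_where(proceedings.ABSENTIA == 'Y'))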

Post-1998 filter#

  • Only cases with OSC dates in 1998 or later
  • Need to pull the 1998-or-later proceedings first
  • Then build the filter to accept the cases with those IDs
after_1998_case_nos = proceedings[proceedings.OSC_DATE >= '1998'].IDNCASE.unique()
after_1998_filter = cases.IDNCASE.isin(after_1998_case_nos)

filter_info(after_1998_filter, cases)
This filter would remove 24.45% of cases - 1,640,721, to be specific
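One note about that >= '1998' comparison: since we read everything in as strings, it's a lexicographic comparison, and it only works because the OSC_DATE values are zero-padded YYYY-MM-DD timestamps, so string order matches chronological order. If leaning on string ordering feels too clever, here's a sketch of the same filter with real datetimes:

# Parse the dates instead of relying on string ordering;
# errors='coerce' turns anything unparseable into NaT
osc_dates = pd.to_datetime(proceedings.OSC_DATE, errors='coerce')
after_1998_case_nos = proceedings[osc_dates >= '1998-01-01'].IDNCASE.unique()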

Criminal charge filter#

  • Filter out any case that has a criminal charge associated with it
  • Find the charge codes that are criminal
  • Find the rows in Proceed Charges that involve those charges, pull the case nos
  • Then build the filter to reject the cases with those IDs
criminal_charge_codes = charge_lookup[charge_lookup.criminal_flag == '1'].strCode
criminal_case_numbers = charges[charges.CHARGE.isin(criminal_charge_codes)].IDNCASE.unique()
criminal_case_filter = ~cases.IDNCASE.isin(criminal_case_numbers)

filter_info(criminal_case_filter, cases)
This filter would remove 7.15% of cases - 479,625, to be specific
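If you're curious which charge codes count as criminal, criminal_charge_codes is just a pandas Series, so you can peek at it directly:

# How many codes are flagged criminal, plus a few examples
print(f"{len(criminal_charge_codes)} criminal charge codes")
print(criminal_charge_codes.head(10).tolist())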

Filter our cases#

# Actually run the filters: detention, case type, absentia, and criminal charges
filtered_cases = cases[not_detained_filter & case_types_filter & not_absentia_filter & criminal_case_filter]

print(f"Original case count: {cases.shape[0]:,}")
print(f"Final case count: {filtered_cases.shape[0]:,}")
Original case count: 6,710,354
Final case count: 2,357,071
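Chaining four filters with & is perfectly readable, but if the list keeps growing you can combine them programmatically instead. A sketch of the same filtering with numpy:

import numpy as np

# Logical AND across any number of boolean filters
filters = [not_detained_filter, case_types_filter,
           not_absentia_filter, criminal_case_filter]
filtered_cases = cases[np.logical_and.reduce(filters)]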

Filter the proceedings#

filtered_proceedings = proceedings[proceedings.IDNCASE.isin(filtered_cases.IDNCASE)]
# Need to filter for case types here, too
filtered_proceedings = filtered_proceedings[filtered_proceedings.CASE_TYPE.isin(['RMV', 'AOC', 'WHO', 'NAC'])]

print(f"Original proceedings count: {proceedings.shape[0]:,}")
print(f"Filtered proceedings count: {filtered_proceedings.shape[0]:,}")
Original proceedings count: 8,890,744
Filtered proceedings count: 3,504,464
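As a quick sanity check that the filtering ran in the right direction, every surviving proceeding should point at a surviving case:

# Every filtered proceeding should reference a case that made the cut
assert filtered_proceedings.IDNCASE.isin(filtered_cases.IDNCASE).all()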

Save our data#

Honestly: we still have too many cases (or too few, depending on how you're counting). The dataset is a wreck anyway, but we can live with it.

filtered_cases.to_csv("data/cases-filtered.csv", index=False)
filtered_proceedings.to_csv("data/proceedings-filtered.csv", index=False)
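One parting thought: CSV forgets our dtypes, so everything gets re-parsed from scratch next time. If you have pyarrow installed, parquet is a faster round-trip that keeps the string dtypes intact. A sketch:

# Optional: parquet preserves dtypes and reloads much faster than CSV
# (requires pyarrow or fastparquet)
filtered_cases.to_parquet("data/cases-filtered.parquet", index=False)
filtered_proceedings.to_parquet("data/proceedings-filtered.parquet", index=False)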