A final cleaning, combining, and filtering of our vehicle data#

After combining from so many sources, we need to filter out the car crashes we're interested in. We're curious about 2-car accidents that happen between light vehicles (no tractor-trailers).

import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

Read in our data#

The vehicles involved in the crash#

vehicles = pd.read_csv("data/cleaned/combined-vehicle-data.csv")
vehicles.head(2)
AREA_DAMAGED_CODE1 AREA_DAMAGED_CODE2 AREA_DAMAGED_CODE3 AREA_DAMAGED_CODE_IMP1 AREA_DAMAGED_CODE_MAIN BODY_TYPE_CODE COMMERCIAL_FLAG CONTI_DIRECTION_CODE CV_BODY_TYPE_CODE DAMAGE_CODE DRIVERLESS_FLAG FIRE_FLAG GOING_DIRECTION_CODE GVW_CODE HARM_EVENT_CODE HAZMAT_SPILL_FLAG HIT_AND_RUN_FLAG HZM_NUM MOVEMENT_CODE NUM_AXLES PARKED_FLAG REPORT_NO SPEED_LIMIT TOWED_AWAY_FLAG TOWED_VEHICLE_CONFIG_CODE VEHICLE_ID VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO
0 8.0 9.0 10.0 10.0 10.0 23.08 N E NaN 5 N N E NaN 9.0 NaN N NaN 1.0 NaN N ADJ487004H 30 Y 0 000238fd-44fa-4cd5-8eb7-41ab30500bec CHEVY TAHOE 2005.0 1GNEK13Q2J285593
1 12.0 NaN NaN 12.0 12.0 2.00 N E NaN 4 N N E NaN 1.0 NaN N NaN 1.0 NaN N MCP2487000M 40 Y 0 00038116-1bf9-48cc-b317-4f4375d14b60 INFI 4S 2003.0 JNKCV51E63M013580

What are we interested in for this dataset? We probably don't need all those columns!

vehicles = vehicles[['REPORT_NO', 'SPEED_LIMIT', 'VEH_MAKE', 'VEH_MODEL', 'VEH_YEAR', 'VIN_NO', 'VEHICLE_ID']]
vehicles.head()
REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO VEHICLE_ID
0 ADJ487004H 30 CHEVY TAHOE 2005.0 1GNEK13Q2J285593 000238fd-44fa-4cd5-8eb7-41ab30500bec
1 MCP2487000M 40 INFI 4S 2003.0 JNKCV51E63M013580 00038116-1bf9-48cc-b317-4f4375d14b60
2 CB5190006B 55 TOYT TK 2011.0 5TFUY5F1XBX167340 0003b659-2785-4868-8877-0b786a284827
3 ADJ4590035 5 HONDA CIVIC 2015.0 2HGFG4A59FH702545 00050484-d08f-4b6e-bc7e-9ec270e94660
4 ADJ849000Z 10 HONDA ACCORD 2003.0 1HGCM66313A037175 00057af4-d848-4cee-b854-707f57581f4e

Details about the crash itself#

crashes = pd.read_csv("data/cleaned/combined-crash-data.csv")
crashes.head(2)
ACC_DATE ACC_TIME AGENCY_CODE AREA_CODE COLLISION_TYPE_CODE COUNTY_NO C_M_ZONE_FLAG DISTANCE DISTANCE_DIR_FLAG FEET_MILES_FLAG FIX_OBJ_CODE HARM_EVENT_CODE1 HARM_EVENT_CODE2 JUNCTION_CODE LANE_CODE LATITUDE LIGHT_CODE LOC_CODE LOGMILE_DIR_FLAG LOG_MILE LONGITUDE MAINROAD_NAME MUNI_CODE RD_COND_CODE RD_DIV_CODE REFERENCE_NO REFERENCE_ROAD_NAME REFERENCE_SUFFIX REFERENCE_TYPE_CODE REPORT_NO REPORT_TYPE ROUTE_TYPE_CODE RTE_NO RTE_SUFFIX SIGNAL_FLAG SURF_COND_CODE WEATHER_CODE
0 2018-04-10 00:00:00 01:50:00 MSP UNK 17 18.0 N 0.0 N F 22.03 16.0 11.0 1.0 NaN 38.277230 4.0 NaN N 2.09 -76.682876 NEWTOWNE NECK RD 0.0 1.0 1.0 166.0 ROSEBANK RD NaN CO MSP6188002Q Injury Crash MD 243.0 NaN N 2.0 6.01
1 2018-05-02 00:00:00 11:06:00 ELKTON UNK 5 7.0 N 15.0 S F 0.00 1.0 1.0 3.0 NaN 39.613747 1.0 NaN N 19.41 -75.837454 N BRIDGE ST 52.0 1.0 3.0 362.0 LAUREL DR NaN MU BK0227001M Injury Crash MD 213.0 NaN N 2.0 6.01

Do we need all of those fields? Again - probably not! What fields are we interested in?

crashes = crashes[['REPORT_NO']]
crashes.head()
REPORT_NO
0 MSP6188002Q
1 BK0227001M
2 ZU8005001W
3 MCP2891005S
4 MSP6743004Q

Weights of the cars#

weights = pd.read_csv("data/cleaned/vins_and_weights.csv")
weights.head(2)
VIN Make Model ModelYear weight
0 2FMDA5143TBB45576 FORD WINDSTAR 1996 3733.0
1 2G1WC5E37E1120089 CHEVROLET IMPALA 2014 3618.0

Seems good to me!

Person data#

people = pd.read_csv("data/cleaned/combined-person-data.csv")
people.head(2)
AIRBAG_DEPLOYED ALCOHOL_TESTTYPE_CODE ALCOHOL_TEST_CODE BAC_CODE CDL_FLAG CLASS CONDITION_CODE DATE_OF_BIRTH DRUG_TESTRESULT_CODE DRUG_TEST_CODE EJECT_CODE EMS_UNIT_LABEL EQUIP_PROB_CODE FAULT_FLAG INJ_SEVER_CODE LICENSE_STATE_CODE MOVEMENT_CODE OCC_SEAT_POS_CODE PED_LOCATION_CODE PED_OBEY_CODE PED_TYPE_CODE PED_VISIBLE_CODE PERSON_ID PERSON_TYPE REPORT_NO SAF_EQUIP_CODE SEX_CODE VEHICLE_ID
0 1.0 NaN 0.0 NaN N C 0.0 1952-04-20 00:00:00 NaN 0.0 1.0 NaN 1.0 N 1 PA NaN NaN NaN NaN NaN NaN 48dd00ee-e033-47e7-ad1e-0b734020301b D AB4284000S 13.0 F eb6aadb8-dacb-4744-a1a7-ab812c96f27f
1 1.0 NaN 0.0 NaN N NaN 0.0 1985-05-28 00:00:00 NaN 0.0 0.0 NaN 0.0 N 1 MD NaN NaN NaN NaN NaN NaN 166296bd-ffd3-4c16-aa74-4f4bf4139d8d D AB4313000X 13.0 F b463eb20-2f01-4200-9d6f-b18888ce2593
people = people[['INJ_SEVER_CODE', 'VEHICLE_ID']]
people.head()
INJ_SEVER_CODE VEHICLE_ID
0 1 eb6aadb8-dacb-4744-a1a7-ab812c96f27f
1 1 b463eb20-2f01-4200-9d6f-b18888ce2593
2 1 3c8629d0-d524-47c1-bfbc-b18e07f3087e
3 1 c4628cdb-f295-4a24-8a4b-653741ac6ae7
4 1 cdda1580-fd79-4358-8819-c2250f494591

Merging our data#

# We'll keep the ones without weights so we can drop those accidents from the analysis
vehicles_with_weights = vehicles.merge(weights,
                                       how='left',
                                       left_on='VIN_NO',
                                       right_on='VIN')
vehicles_with_weights.head(10)
REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO VEHICLE_ID VIN Make Model ModelYear weight
0 ADJ487004H 30 CHEVY TAHOE 2005.0 1GNEK13Q2J285593 000238fd-44fa-4cd5-8eb7-41ab30500bec 1GNEK13Q2J285593 CHEVROLET GMT-400 1988.0 4300.0
1 MCP2487000M 40 INFI 4S 2003.0 JNKCV51E63M013580 00038116-1bf9-48cc-b317-4f4375d14b60 JNKCV51E63M013580 INFINITI G35 2003.0 3468.0
2 CB5190006B 55 TOYT TK 2011.0 5TFUY5F1XBX167340 0003b659-2785-4868-8877-0b786a284827 5TFUY5F1XBX167340 TOYOTA TUNDRA 2011.0 5480.0
3 ADJ4590035 5 HONDA CIVIC 2015.0 2HGFG4A59FH702545 00050484-d08f-4b6e-bc7e-9ec270e94660 2HGFG4A59FH702545 HONDA CIVIC 2015.0 2754.0
4 ADJ849000Z 10 HONDA ACCORD 2003.0 1HGCM66313A037175 00057af4-d848-4cee-b854-707f57581f4e 1HGCM66313A037175 HONDA ACCORD 2003.0 3023.0
5 AE5207008Z 30 FORD 4D 2016.0 1FADP3K28GL258987 00089d4a-7038-4693-9e02-b402676631af 1FADP3K28GL258987 FORD FOCUS 2016.0 2932.0
6 MDTA1150000D 30 GILL BUS 2004.0 15GGD211X41076128 000abd04-058d-4e14-9f1c-4a100493a305 NaN NaN NaN NaN NaN
7 MSP5973003Q 35 CHEVROLET HHR 2008.0 3GNDA23DX8S558149 000e3b5d-b4ae-44a1-af02-df6cce381fc1 NaN NaN NaN NaN NaN
8 MCP27070015 25 ACUR TSX 2008.0 JH4CL96848C021626 000f45d8-bc0e-4f9c-820a-474212e669cd NaN NaN NaN NaN NaN
9 ADJ859000Y 15 HONDA CIVIC 2009.0 2HGFA16689H357624 0010076b-0a45-45f3-8796-f387b39cd85d 2HGFA16689H357624 HONDA CIVIC 2009.0 2678.0

Do the VEH_MAKE and Make columns match up, generally speaking? Do we trust what we did by searching out the VIN?

vehicles_with_weights[['VEH_MAKE', 'Make', 'VEH_MODEL', 'Model', 'VEH_YEAR', 'ModelYear', 'weight']].head(10)
VEH_MAKE Make VEH_MODEL Model VEH_YEAR ModelYear weight
0 CHEVY CHEVROLET TAHOE GMT-400 2005.0 1988.0 4300.0
1 INFI INFINITI 4S G35 2003.0 2003.0 3468.0
2 TOYT TOYOTA TK TUNDRA 2011.0 2011.0 5480.0
3 HONDA HONDA CIVIC CIVIC 2015.0 2015.0 2754.0
4 HONDA HONDA ACCORD ACCORD 2003.0 2003.0 3023.0
5 FORD FORD 4D FOCUS 2016.0 2016.0 2932.0
6 GILL NaN BUS NaN 2004.0 NaN NaN
7 CHEVROLET NaN HHR NaN 2008.0 NaN NaN
8 ACUR NaN TSX NaN 2008.0 NaN NaN
9 HONDA HONDA CIVIC CIVIC 2009.0 2009.0 2678.0

It looks some of these have weights, while others don't. How many do we have weights for?

# isna vs notna
vehicles_with_weights.weight.notna().value_counts()
True     534436
False    207697
Name: weight, dtype: int64

Some of our vehicles don't have weights or years!#

Let's get the report numbers of vehicles without weights and/or years, then exclude those reports (crashes) from our analysis.

# Find the ones missing weights
missing_weight = vehicles_with_weights[vehicles_with_weights.weight.isna()]
missing_weight.head()
REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO VEHICLE_ID VIN Make Model ModelYear weight
6 MDTA1150000D 30 GILL BUS 2004.0 15GGD211X41076128 000abd04-058d-4e14-9f1c-4a100493a305 NaN NaN NaN NaN NaN
7 MSP5973003Q 35 CHEVROLET HHR 2008.0 3GNDA23DX8S558149 000e3b5d-b4ae-44a1-af02-df6cce381fc1 NaN NaN NaN NaN NaN
8 MCP27070015 25 ACUR TSX 2008.0 JH4CL96848C021626 000f45d8-bc0e-4f9c-820a-474212e669cd NaN NaN NaN NaN NaN
14 DA3276000V 45 TOYT CAMRY 2018.0 JTNB11HK2J3004953 0018b4d3-b19b-4327-b5f9-f0725fd06844 NaN NaN NaN NaN NaN
15 CB5712001X 50 JEEP COMPASS 2015.0 W452866115351 001968e1-88be-424b-b7cf-bd816ffe340a NaN NaN NaN NaN NaN
# Get their report numbers
missing_weight.REPORT_NO.head()
6     MDTA1150000D
7      MSP5973003Q
8      MCP27070015
14      DA3276000V
15      CB5712001X
Name: REPORT_NO, dtype: object
 
# Filter those out of our 'crashes' dataset
# We need to outline this real bigtime beforehand, "what can we keep?"
# because if we're only looking at our own car, then we don't need to drop these
print("Before", crashes.shape)
crashes = crashes[~crashes.REPORT_NO.isin(missing_weight.REPORT_NO)]
print("After", crashes.shape)
Before (396419, 1)
After (218513, 1)

If we're getting rid of half of our data, is it still okay?

We only want two-vehicle crashes#

For analytic purposes, we decompose the data set into three sub-saw three-vehicle crashes, and single-vehicle crashes. The two-vehicle crash data set is the focus of most of our analyses

Each vehicle has a REPORT_NO. If two vehicles show up with the same REPORT_NO, we know there were two vehicles in the crash.

vehicles_with_weights.head()
REPORT_NO SPEED_LIMIT VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO VEHICLE_ID VIN Make Model ModelYear weight
0 ADJ487004H 30 CHEVY TAHOE 2005.0 1GNEK13Q2J285593 000238fd-44fa-4cd5-8eb7-41ab30500bec 1GNEK13Q2J285593 CHEVROLET GMT-400 1988.0 4300.0
1 MCP2487000M 40 INFI 4S 2003.0 JNKCV51E63M013580 00038116-1bf9-48cc-b317-4f4375d14b60 JNKCV51E63M013580 INFINITI G35 2003.0 3468.0
2 CB5190006B 55 TOYT TK 2011.0 5TFUY5F1XBX167340 0003b659-2785-4868-8877-0b786a284827 5TFUY5F1XBX167340 TOYOTA TUNDRA 2011.0 5480.0
3 ADJ4590035 5 HONDA CIVIC 2015.0 2HGFG4A59FH702545 00050484-d08f-4b6e-bc7e-9ec270e94660 2HGFG4A59FH702545 HONDA CIVIC 2015.0 2754.0
4 ADJ849000Z 10 HONDA ACCORD 2003.0 1HGCM66313A037175 00057af4-d848-4cee-b854-707f57581f4e 1HGCM66313A037175 HONDA ACCORD 2003.0 3023.0
# Which technique?

# counted = vehicles_with_weights.groupby('REPORT_NO').size().reset_index(name='car_count')
# counted.head()

counted = vehicles_with_weights.REPORT_NO.value_counts()
counted.head()
MDTA1229000H    69
CB53480003      35
MSP67460095     18
CE4636002C      17
MSP6063005N     17
Name: REPORT_NO, dtype: int64

There was a 69-car crash? That's pretty interesting. Better look that one up. To get the ones that have two cars, we just ask if == 2 - since it's a series, we don't have to ask for the column name.

#two_car_report_nos = counted[counted.car_count == 2].REPORT_NO
two_car_report_nos = counted[counted == 2].index
two_car_report_nos
Index(['MSP528100BT', 'MSP5179001M', 'AE5869002N', 'ADH2680025', 'MSP2312000H',
       'HS98300005', 'AC12830021', 'DA3892000P', 'AC2040003L', 'AC2127000B',
       ...
       'ADI179000X', 'DA3889002W', 'ADI946000L', 'ADJ380001Y', 'AC2034002L',
       'MSP61740086', 'AK00410003', 'MCP1366006C', 'ADJ186004B', 'ZH0339000W'],
      dtype='object', length=253011)

Filter for only light vehicles#

this data set contains all collisions involving two light vehicles built after 1980. We define a light vehicle as any car, pickup truck, SUV, or minivan that weights between 1500 and 6000 pounds.

is_not_light_vehicle = (vehicles_with_weights.weight <= 1500) | (vehicles_with_weights.weight > 6000)
is_before_1980 = vehicles_with_weights.ModelYear.astype(float) < 1980
not_light_vehicle_crash = vehicles_with_weights[is_not_light_vehicle | is_before_1980].REPORT_NO
crashes = crashes[~crashes.REPORT_NO.isin(not_light_vehicle_crash)]
crashes.shape
(217074, 1)