A final cleaning, combining, and filtering of our vehicle data#
After combining data from so many sources, we need to filter down to the car crashes we're interested in. We're curious about two-car accidents between light vehicles (no tractor-trailers).
import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
Read in our data#
The vehicles involved in the crash#
vehicles = pd.read_csv("data/cleaned/combined-vehicle-data.csv")
vehicles.head(2)
What are we interested in for this dataset? We probably don't need all those columns!
vehicles = vehicles[['REPORT_NO', 'SPEED_LIMIT', 'VEH_MAKE', 'VEH_MODEL', 'VEH_YEAR', 'VIN_NO', 'VEHICLE_ID']]
vehicles.head()
Details about the crash itself#
crashes = pd.read_csv("data/cleaned/combined-crash-data.csv")
crashes.head(2)
Do we need all of those fields? Again - probably not! What fields are we interested in?
crashes = crashes[['REPORT_NO']]
crashes.head()
Weights of the cars#
weights = pd.read_csv("data/cleaned/vins_and_weights.csv")
weights.head(2)
Seems good to me!
Person data#
people = pd.read_csv("data/cleaned/combined-person-data.csv")
people.head(2)
people = people[['INJ_SEVER_CODE', 'VEHICLE_ID']]
people.head()
Merging our data#
# We'll keep the ones without weights so we can drop those accidents from the analysis
vehicles_with_weights = vehicles.merge(weights,
how='left',
left_on='VIN_NO',
right_on='VIN')
vehicles_with_weights.head(10)
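As a sanity check on how that left merge behaves, here's a minimal sketch (the VINs, makes, and weights are made up) showing that rows with no match in the weights table survive the merge, just with NaN in the merged columns:

```python
import pandas as pd

# Hypothetical vehicles; 'VIN999' has no matching weight record
cars = pd.DataFrame({'VIN_NO': ['VIN111', 'VIN999'],
                     'VEH_MAKE': ['HONDA', 'MYSTERY']})
wts = pd.DataFrame({'VIN': ['VIN111'], 'weight': [2800]})

merged = cars.merge(wts, how='left', left_on='VIN_NO', right_on='VIN')

# The unmatched row is kept, with NaN for VIN and weight
print(merged)
```

That's exactly why `how='left'` is the right choice here: we want the weightless vehicles to stick around so we can find their report numbers later.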
Do the VEH_MAKE and Make columns match up, generally speaking? Do we trust what we did by searching out the VIN?
vehicles_with_weights[['VEH_MAKE', 'Make', 'VEH_MODEL', 'Model', 'VEH_YEAR', 'ModelYear', 'weight']].head(10)
It looks like some of these have weights, while others don't. How many do we have weights for?
# isna vs notna
vehicles_with_weights.weight.notna().value_counts()
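The `isna` vs `notna` distinction from that comment is easy to see on a toy Series (values invented):

```python
import pandas as pd
import numpy as np

s = pd.Series([2800.0, np.nan, 3100.0])

# isna flags missing values, notna flags present ones - exact opposites
print(s.isna().tolist())   # [False, True, False]
print(s.notna().tolist())  # [True, False, True]

# value_counts on the boolean mask counts True vs False
print(s.notna().value_counts())
```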
Some of our vehicles don't have weights or years!#
Let's get the report numbers of vehicles without weights and/or years, then exclude those reports (crashes) from our analysis.
# Find the ones missing weights
missing_weight = vehicles_with_weights[vehicles_with_weights.weight.isna()]
missing_weight.head()
# Get their report numbers
missing_weight.REPORT_NO.head()
# Filter those out of our 'crashes' dataset
# We need to outline this real bigtime beforehand, "what can we keep?"
# because if we're only looking at our own car, then we don't need to drop these
print("Before", crashes.shape)
crashes = crashes[~crashes.REPORT_NO.isin(missing_weight.REPORT_NO)]
print("After", crashes.shape)
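The `~...isin(...)` pattern (keep rows whose report number is *not* in the bad list) can be sketched on made-up report numbers:

```python
import pandas as pd

reports = pd.DataFrame({'REPORT_NO': ['A1', 'B2', 'C3']})
bad = pd.Series(['B2'])

# isin marks rows that appear in the bad list; ~ flips the mask
kept = reports[~reports.REPORT_NO.isin(bad)]
print(kept.REPORT_NO.tolist())  # ['A1', 'C3']
```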
If we're getting rid of half of our data, is it still okay?
We only want two-vehicle crashes#
For analytic purposes, we decompose the data set into three subsets: two-vehicle crashes, three-vehicle crashes, and single-vehicle crashes. The two-vehicle crash data set is the focus of most of our analyses.
Each vehicle has a REPORT_NO. If two vehicles show up with the same REPORT_NO, we know there were two vehicles in the crash.
vehicles_with_weights.head()
# Which technique?
# counted = vehicles_with_weights.groupby('REPORT_NO').size().reset_index(name='car_count')
# counted.head()
counted = vehicles_with_weights.REPORT_NO.value_counts()
counted.head()
There was a 69-car crash? That's pretty interesting. Better look that one up. To get the ones that have two cars, we just ask whether the count == 2 - since it's a Series, we don't have to ask for a column name.
#two_car_report_nos = counted[counted.car_count == 2].REPORT_NO
two_car_report_nos = counted[counted == 2].index
two_car_report_nos
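The same counting trick on invented data: `value_counts` gives one entry per report number, and comparing the whole Series to 2 pulls the two-car reports out of the `.index`:

```python
import pandas as pd

# One row per vehicle; 'R1' appears twice, so it was a two-car crash
vehicle_reports = pd.Series(['R1', 'R2', 'R1', 'R3', 'R3', 'R3'])

counts = vehicle_reports.value_counts()
two_car = counts[counts == 2].index
print(list(two_car))  # ['R1']
```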
Filter for only light vehicles#
This data set contains all collisions involving two light vehicles built after 1980. We define a light vehicle as any car, pickup truck, SUV, or minivan that weighs between 1500 and 6000 pounds.
is_not_light_vehicle = (vehicles_with_weights.weight <= 1500) | (vehicles_with_weights.weight > 6000)
is_before_1980 = vehicles_with_weights.ModelYear.astype(float) < 1980
not_light_vehicle_crash = vehicles_with_weights[is_not_light_vehicle | is_before_1980].REPORT_NO
crashes = crashes[~crashes.REPORT_NO.isin(not_light_vehicle_crash)]
crashes.shape
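The filter above combines two boolean masks with `|`; here's the same pattern on a tiny invented frame, where one under-weight, one pre-1980, and one over-weight vehicle each disqualify their report, while a normal car's report survives:

```python
import pandas as pd

cars = pd.DataFrame({
    'REPORT_NO': ['R1', 'R2', 'R3', 'R4'],
    'weight': [1200.0, 2800.0, 6500.0, 3000.0],
    'ModelYear': [1995, 1979, 2005, 2010],
})

too_light_or_heavy = (cars.weight <= 1500) | (cars.weight > 6000)
too_old = cars.ModelYear.astype(float) < 1980

# Any vehicle matching either condition flags its whole report
bad_reports = cars[too_light_or_heavy | too_old].REPORT_NO
print(bad_reports.tolist())  # ['R1', 'R2', 'R3']
```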