Understanding feature selection and feature engineering when performing regressions#

Even with a large and complete dataset, not all of the features end up being important, and you don't always have all the fields you need. Sometimes performing calculations between two or more columns can yield a large improvement in your predictor's performance.
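For example (hypothetical columns, not from this dataset), a calculated column can be a more useful feature than either of its inputs alone:

```python
import pandas as pd

# Hypothetical example: neither column may be very predictive on its own,
# but a calculation between them might be
df = pd.DataFrame({
    'vehicle_weight': [2700, 4300, 3400],
    'num_occupants': [1, 4, 2],
})
df['weight_per_occupant'] = df.vehicle_weight / df.num_occupants
print(df.weight_per_occupant.tolist())
```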

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_colwidth", 200)
%matplotlib inline

Reading in our data#

Every single person in a crash#

We'll start by reading in the list of people who were involved in an accident.

Call this dataframe people.

people = pd.read_csv("data/combined-person-data.csv")
people.head(2)
AIRBAG_DEPLOYED ALCOHOL_TESTTYPE_CODE ALCOHOL_TEST_CODE BAC_CODE CDL_FLAG CLASS CONDITION_CODE DATE_OF_BIRTH DRUG_TESTRESULT_CODE DRUG_TEST_CODE EJECT_CODE EMS_UNIT_LABEL EQUIP_PROB_CODE FAULT_FLAG INJ_SEVER_CODE LICENSE_STATE_CODE MOVEMENT_CODE OCC_SEAT_POS_CODE PED_LOCATION_CODE PED_OBEY_CODE PED_TYPE_CODE PED_VISIBLE_CODE PERSON_ID PERSON_TYPE REPORT_NO SAF_EQUIP_CODE SEX_CODE VEHICLE_ID
0 1.0 NaN 0.0 NaN N C 0.0 1952-04-20 00:00:00 NaN 0.0 1.0 NaN 1.0 N 1 PA NaN NaN NaN NaN NaN NaN 48dd00ee-e033-47e7-ad1e-0b734020301b D AB4284000S 13.0 F eb6aadb8-dacb-4744-a1a7-ab812c96f27f
1 1.0 NaN 0.0 NaN N NaN 0.0 1985-05-28 00:00:00 NaN 0.0 0.0 NaN 0.0 N 1 MD NaN NaN NaN NaN NaN NaN 166296bd-ffd3-4c16-aa74-4f4bf4139d8d D AB4313000X 13.0 F b463eb20-2f01-4200-9d6f-b18888ce2593

How often did each severity of injury show up? (e.g. not injured, non-incapacitating injury, etc)

people.INJ_SEVER_CODE.value_counts()
1    716806
2     82642
3     76801
4     10353
5      1681
Name: INJ_SEVER_CODE, dtype: int64

We're only interested in fatalities, so let's create a new had_fatality column for when people received a fatal injury.

Confirm there were 1681 people with fatal injuries.

people['had_fatality'] = (people.INJ_SEVER_CODE == 5).astype(int)
people.had_fatality.value_counts()
0    886602
1      1681
Name: had_fatality, dtype: int64

Working on Features#

Starting our analysis#

We're going to run a regression on the impact of being male vs female on crash fatalities. Prepare a dataframe called train_df with the appropriate information in it.

  • Tip: What column(s) are your input, and what is your output? Aka independent and dependent variables
  • Tip: You'll need to convert your input column into something numeric, I suggest using .replace
  • Tip: We aren't interested in the "Unknown" sex - either filtering or np.nan + .dropna() might be useful ways to get rid of those rows
people['SEX_CODE'] = people.SEX_CODE.replace({'M': 0, 'F': 1, 'U': np.nan})
train_df = people[['SEX_CODE', 'had_fatality']].copy()
train_df = train_df.dropna()

Confirm that your train_df has two columns and 815,827 rows.

Tip: If you have more rows, make sure you dropped all of the rows with Unknown sex.

Tip: If you have more columns, make sure you only have your input and output columns.

train_df.shape
(815827, 2)

Run your regression#

See the effect of sex on whether the person's injuries are fatal or not. After we train the regression, we can use my ✨favorite technique✨ to display features and their odds ratios:

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
0 SEX_CODE -0.71481 0.4893

Use words to interpret this result#

# Women have about half the odds of dying in a car crash compared to men (odds ratio ≈ 0.49)

Adding more features#

The actual crash data has more details - whether it was snowy/icy, whether it was a highway, etc.

Read in combined-crash-data.csv, calling it crashes, and merge it with our people dataset. I'll save you a lookup: the REPORT_NO is what matches between the two.

crashes = pd.read_csv("data/combined-crash-data.csv")
crashes.head(2)
ACC_DATE ACC_TIME AGENCY_CODE AREA_CODE COLLISION_TYPE_CODE COUNTY_NO C_M_ZONE_FLAG DISTANCE DISTANCE_DIR_FLAG FEET_MILES_FLAG FIX_OBJ_CODE HARM_EVENT_CODE1 HARM_EVENT_CODE2 JUNCTION_CODE LANE_CODE LATITUDE LIGHT_CODE LOC_CODE LOGMILE_DIR_FLAG LOG_MILE LONGITUDE MAINROAD_NAME MUNI_CODE RD_COND_CODE RD_DIV_CODE REFERENCE_NO REFERENCE_ROAD_NAME REFERENCE_SUFFIX REFERENCE_TYPE_CODE REPORT_NO REPORT_TYPE ROUTE_TYPE_CODE RTE_NO RTE_SUFFIX SIGNAL_FLAG SURF_COND_CODE WEATHER_CODE
0 2018-04-10 00:00:00 01:50:00 MSP UNK 17 18.0 N 0.0 N F 22.03 16.0 11.0 1.0 NaN 38.277230 4.0 NaN N 2.09 -76.682876 NEWTOWNE NECK RD 0.0 1.0 1.0 166.0 ROSEBANK RD NaN CO MSP6188002Q Injury Crash MD 243.0 NaN N 2.0 6.01
1 2018-05-02 00:00:00 11:06:00 ELKTON UNK 5 7.0 N 15.0 S F 0.00 1.0 1.0 3.0 NaN 39.613747 1.0 NaN N 19.41 -75.837454 N BRIDGE ST 52.0 1.0 3.0 362.0 LAUREL DR NaN MU BK0227001M Injury Crash MD 213.0 NaN N 2.0 6.01
merged = people.merge(crashes, on='REPORT_NO')
merged.head()
AIRBAG_DEPLOYED ALCOHOL_TESTTYPE_CODE ALCOHOL_TEST_CODE BAC_CODE CDL_FLAG CLASS CONDITION_CODE DATE_OF_BIRTH DRUG_TESTRESULT_CODE DRUG_TEST_CODE EJECT_CODE EMS_UNIT_LABEL EQUIP_PROB_CODE FAULT_FLAG INJ_SEVER_CODE LICENSE_STATE_CODE MOVEMENT_CODE OCC_SEAT_POS_CODE PED_LOCATION_CODE PED_OBEY_CODE PED_TYPE_CODE PED_VISIBLE_CODE PERSON_ID PERSON_TYPE REPORT_NO SAF_EQUIP_CODE SEX_CODE VEHICLE_ID had_fatality ACC_DATE ACC_TIME AGENCY_CODE AREA_CODE COLLISION_TYPE_CODE COUNTY_NO C_M_ZONE_FLAG DISTANCE DISTANCE_DIR_FLAG FEET_MILES_FLAG FIX_OBJ_CODE HARM_EVENT_CODE1 HARM_EVENT_CODE2 JUNCTION_CODE LANE_CODE LATITUDE LIGHT_CODE LOC_CODE LOGMILE_DIR_FLAG LOG_MILE LONGITUDE MAINROAD_NAME MUNI_CODE RD_COND_CODE RD_DIV_CODE REFERENCE_NO REFERENCE_ROAD_NAME REFERENCE_SUFFIX REFERENCE_TYPE_CODE REPORT_TYPE ROUTE_TYPE_CODE RTE_NO RTE_SUFFIX SIGNAL_FLAG SURF_COND_CODE WEATHER_CODE
0 1.0 NaN 0.0 NaN N C 0.0 1952-04-20 00:00:00 NaN 0.0 1.0 NaN 1.0 N 1 PA NaN NaN NaN NaN NaN NaN 48dd00ee-e033-47e7-ad1e-0b734020301b D AB4284000S 13.0 1.0 eb6aadb8-dacb-4744-a1a7-ab812c96f27f 0 2018-05-25 00:00:00 10:58:00 ANNAPOLIS UNK 3 2.0 N 20.0 S F 0.0 1.0 0.0 3.0 1.0 38.950029 1.0 NaN S 0.00 -76.492578 HILLSMERE DR 3.0 1.0 5.01 2930.0 BAY RIDGE RD NaN CO Property Damage Crash CO 2885.0 NaN Y 2.0 6.01
1 0.0 NaN 0.0 NaN N C 1.0 1994-12-13 00:00:00 NaN 0.0 1.0 NaN 1.0 Y 1 MD NaN NaN NaN NaN NaN NaN 974e41b6-480e-431f-9585-4b87f4c29b19 D AB4284000S 13.0 1.0 e95fc5fb-a269-4cce-9e09-643cd6a5bde7 0 2018-05-25 00:00:00 10:58:00 ANNAPOLIS UNK 3 2.0 N 20.0 S F 0.0 1.0 0.0 3.0 1.0 38.950029 1.0 NaN S 0.00 -76.492578 HILLSMERE DR 3.0 1.0 5.01 2930.0 BAY RIDGE RD NaN CO Property Damage Crash CO 2885.0 NaN Y 2.0 6.01
2 1.0 NaN 0.0 NaN N NaN 0.0 1985-05-28 00:00:00 NaN 0.0 0.0 NaN 0.0 N 1 MD NaN NaN NaN NaN NaN NaN 166296bd-ffd3-4c16-aa74-4f4bf4139d8d D AB4313000X 13.0 1.0 b463eb20-2f01-4200-9d6f-b18888ce2593 0 2018-04-19 00:00:00 15:50:00 ANNAPOLIS UNK 3 2.0 N 0.0 S F 0.0 1.0 0.0 0.0 1.0 38.976882 1.0 NaN N 0.01 -76.486787 COMPROMISE ST 3.0 0.0 1.00 960.0 DUKE OF GLOUCHESTER ST NaN MU Property Damage Crash MU 740.0 NaN N 0.0 7.01
3 1.0 NaN 0.0 NaN N C 0.0 1960-10-04 00:00:00 NaN 0.0 0.0 NaN 0.0 Y 1 MD NaN NaN NaN NaN NaN NaN f3b2743f-fbc3-4345-9419-56a0ca29102c D AB4313000X 13.0 1.0 3c8629d0-d524-47c1-bfbc-b18e07f3087e 0 2018-04-19 00:00:00 15:50:00 ANNAPOLIS UNK 3 2.0 N 0.0 S F 0.0 1.0 0.0 0.0 1.0 38.976882 1.0 NaN N 0.01 -76.486787 COMPROMISE ST 3.0 0.0 1.00 960.0 DUKE OF GLOUCHESTER ST NaN MU Property Damage Crash MU 740.0 NaN N 0.0 7.01
4 1.0 NaN 0.0 NaN N B 0.0 1971-05-28 00:00:00 NaN 0.0 1.0 NaN 0.0 N 1 MD NaN NaN NaN NaN NaN NaN 5bbe589b-a2db-4dbb-be9b-17a800d69a08 D AB4313000Y 0.0 1.0 c4628cdb-f295-4a24-8a4b-653741ac6ae7 0 2018-05-16 00:00:00 13:41:00 ANNAPOLIS UNK 4 2.0 N 0.0 W F 0.0 1.0 0.0 2.0 2.0 38.978461 1.0 NaN W 0.03 -76.493329 CHURCH CIR 3.0 0.0 2.00 2300.0 NORTHWEST ST NaN MU Property Damage Crash MU 660.0 NaN Y 1.0 3.00

Examining more possible features#

How often was it wet, dry, snowy, icy, etc? What was the most common condition?

merged.SURF_COND_CODE.value_counts()
2.00     619569
1.00     147803
0.00      30921
3.00       9418
4.00       9111
99.00      3616
6.03       1697
88.00      1064
5.00        447
7.01        258
9.88        141
8.05         36
Name: SURF_COND_CODE, dtype: int64

Do you feel that a Dry road condition should be the average of Wet and Snow?

# Nope

The answer to that should be no, which means we can't use this data as numeric data. We want a different coefficient for each of these - I want to know the impact of dry, the impact of wet, the impact of snow, all separately.
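That one-coefficient-per-category setup is exactly what one-hot encoding gives us, and we'll apply it to the real data below. A minimal sketch on a made-up series, just to show the shape of the output:

```python
import pandas as pd

# Toy series of surface conditions (made-up values, not the crash data)
surfaces = pd.Series(['Dry', 'Wet', 'Snow', 'Dry'])

# One column per category, so a regression can fit a separate
# coefficient for each condition instead of treating them as numbers
dummies = pd.get_dummies(surfaces, prefix='surface')
print(list(dummies.columns))
```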

Start by replacing each code with a proper description. I'll even include them here:

  • 00 - Not Applicable
  • 01 - Wet
  • 02 - Dry
  • 03 - Snow
  • 04 - Ice
  • 05 - Mud, Dirt, Gravel
  • 06 - Slush
  • 07 - Water (standing/moving)
  • 08 - Sand
  • 09 - Oil
  • 88 - Other
  • 99 - Unknown

But watch out, pandas read the column in as numbers so they might have come through a little differently than their codes.

merged['SURF_COND_CODE'] = merged.SURF_COND_CODE.replace({
    0: 'Not Applicable',
    1: 'Wet',
    2: 'Dry',
    3: 'Snow',
    4: 'Ice',
    5: 'Mud, Dirt, Gravel',
    6: 'Slush',
    7: 'Water (standing/moving)',
    8: 'Sand',
    9: 'Oil',
    88: 'Other',
    99: 'Unknown',
})

Confirm you have 147,803 wet, and a few codes you can't understand, like 6.03 and 7.01.

merged.SURF_COND_CODE.value_counts()
Dry                  619569
Wet                  147803
Not Applicable        30921
Snow                   9418
Ice                    9111
Unknown                3616
6.03                   1697
Other                  1064
Mud, Dirt, Gravel       447
7.01                    258
9.88                    141
8.05                     36
Name: SURF_COND_CODE, dtype: int64

Replace the codes you don't understand with Other.

merged['SURF_COND_CODE'] = merged.SURF_COND_CODE.replace({
    6.03: 'Other',
    7.01: 'Other',
    9.88: 'Other',
    8.05: 'Other'
})

Confirm you have 3,196 'Other'.

merged.SURF_COND_CODE.value_counts()
Dry                  619569
Wet                  147803
Not Applicable        30921
Snow                   9418
Ice                    9111
Unknown                3616
Other                  3196
Mud, Dirt, Gravel       447
Name: SURF_COND_CODE, dtype: int64

One-hot encoding#

We're going to use pd.get_dummies to build a variable you'll call surf_dummies. Each surface condition should be a 0 or 1 as to whether it was that condition (dry, icy, wet, etc).

Use a prefix= so we know they are surface conditions.

You'll want to drop the column you'll use as the reference category.

Before we do this: which column works best as the reference?

# Dry

Now build your surf_dummies variable.

surf_dummies = pd.get_dummies(merged.SURF_COND_CODE, prefix='surface').drop(columns='surface_Dry')
surf_dummies.head()
surface_Ice surface_Mud, Dirt, Gravel surface_Not Applicable surface_Other surface_Snow surface_Unknown surface_Wet
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 0 0 0 0 1

Confirm your surf_dummies looks roughly like this:

surface_Ice surface_Mud, Dirt, Gravel surface_Not Applicable ... surface_Wet
0 0 0 ... 0
0 0 0 ... 0
0 0 1 ... 0
0 0 1 ... 0
0 0 0 ... 1

Another regression#

Let's run another regression to see the impact of both sex and surface condition on fatalities.

Build your train_df#

To build your train_df, I recommend doing it in either of these two ways. Both first select the important columns, then add in the one-hot encoded surf_dummies columns.

train_df = pd.DataFrame({
    'col1': merged.col1,
    'col2': merged.col2,
    'col3': merged.col3,
})
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

or like this:

train_df = merged[['col1','col2','col3']].copy()
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

The second one is shorter, but the first one makes it easier to use comments to remove columns later.

train_df = pd.DataFrame({
    'sex': merged.SEX_CODE,
    'had_fatality': merged.had_fatality
})
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

Run your regression and check your odds ratios#

Actually no, wait, first - what kind of surface do you think will have the highest fatality rate?

# Wet
train_df.shape
(815843, 9)

Confirm your train_df has 815,843 rows and 9 columns.

  • Tip: When you run your regression, if you get an error about not knowing what to do with U, it's because you didn't convert your sex to numbers (or if you did, you didn't do it in your original dataframe)
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
2 surface_Mud, Dirt, Gravel 1.115150 3.0500
4 surface_Other 0.596230 1.8153
6 surface_Unknown 0.340200 1.4052
7 surface_Wet -0.098825 0.9059
1 surface_Ice -0.557708 0.5725
5 surface_Snow -0.652480 0.5208
0 sex -0.718133 0.4877
3 surface_Not Applicable -0.730582 0.4816

Is this what you expected? Why do you think this result might be the case?

 

More features: Vehicles#

Maybe whether someone survives is related to the car they were in. Luckily, we have this information - read in combined-vehicle-data.csv, calling it vehicles.

vehicles = pd.read_csv("data/combined-vehicle-data.csv")
vehicles.head(2)
AREA_DAMAGED_CODE1 AREA_DAMAGED_CODE2 AREA_DAMAGED_CODE3 AREA_DAMAGED_CODE_IMP1 AREA_DAMAGED_CODE_MAIN BODY_TYPE_CODE COMMERCIAL_FLAG CONTI_DIRECTION_CODE CV_BODY_TYPE_CODE DAMAGE_CODE DRIVERLESS_FLAG FIRE_FLAG GOING_DIRECTION_CODE GVW_CODE HARM_EVENT_CODE HAZMAT_SPILL_FLAG HIT_AND_RUN_FLAG HZM_NUM MOVEMENT_CODE NUM_AXLES PARKED_FLAG REPORT_NO SPEED_LIMIT TOWED_AWAY_FLAG TOWED_VEHICLE_CONFIG_CODE VEHICLE_ID VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO
0 8.0 9.0 10.0 10.0 10.0 23.08 N E NaN 5 N N E NaN 9.0 NaN N NaN 1.0 NaN N ADJ487004H 30 Y 0 000238fd-44fa-4cd5-8eb7-41ab30500bec CHEVY TAHOE 2005.0 1GNEK13Q2J285593
1 12.0 NaN NaN 12.0 12.0 2.00 N E NaN 4 N N E NaN 1.0 NaN N NaN 1.0 NaN N MCP2487000M 40 Y 0 00038116-1bf9-48cc-b317-4f4375d14b60 INFI 4S 2003.0 JNKCV51E63M013580

Weights of those cars#

The car weights are stored in another file since the info had to come from an API. I looked up the VINs - vehicle identification numbers - in a government database to try to get data for each of them.

Read them and build a new dataframe that is both the vehicle data along with their weights. You can call it vehicles since you don't need the original weightless vehicle data any more.

weights = pd.read_csv("data/vins_and_weights.csv")
weights.head(2)
VIN Make Model ModelYear weight
0 2FMDA5143TBB45576 FORD WINDSTAR 1996 3733.0
1 2G1WC5E37E1120089 CHEVROLET IMPALA 2014 3618.0
vehicles = vehicles.merge(weights, left_on='VIN_NO', right_on='VIN')
vehicles.head(2)
AREA_DAMAGED_CODE1 AREA_DAMAGED_CODE2 AREA_DAMAGED_CODE3 AREA_DAMAGED_CODE_IMP1 AREA_DAMAGED_CODE_MAIN BODY_TYPE_CODE COMMERCIAL_FLAG CONTI_DIRECTION_CODE CV_BODY_TYPE_CODE DAMAGE_CODE DRIVERLESS_FLAG FIRE_FLAG GOING_DIRECTION_CODE GVW_CODE HARM_EVENT_CODE HAZMAT_SPILL_FLAG HIT_AND_RUN_FLAG HZM_NUM MOVEMENT_CODE NUM_AXLES PARKED_FLAG REPORT_NO SPEED_LIMIT TOWED_AWAY_FLAG TOWED_VEHICLE_CONFIG_CODE VEHICLE_ID VEH_MAKE VEH_MODEL VEH_YEAR VIN_NO VIN Make Model ModelYear weight
0 8.0 9.0 10.0 10.0 10.0 23.08 N E NaN 5 N N E NaN 9.0 NaN N NaN 1.0 NaN N ADJ487004H 30 Y 0 000238fd-44fa-4cd5-8eb7-41ab30500bec CHEVY TAHOE 2005.0 1GNEK13Q2J285593 1GNEK13Q2J285593 CHEVROLET GMT-400 1988 4300.0
1 12.0 NaN NaN 12.0 12.0 2.00 N E NaN 4 N N E NaN 1.0 NaN N NaN 1.0 NaN N MCP2487000M 40 Y 0 00038116-1bf9-48cc-b317-4f4375d14b60 INFI 4S 2003.0 JNKCV51E63M013580 JNKCV51E63M013580 INFINITI G35 2003 3468.0

Confirm that your combined vehicles dataset has 534,436 rows and 35 columns. And yes, that's less than we were working with before - you haven't combined it with the people/crashes dataset yet.

vehicles.shape
(534436, 35)

Filter your data#

We only want vehicles that are "normal" - somewhere between 1500 and 6000 pounds. Filter your vehicles to only include those in that weight range.

vehicles = vehicles[(vehicles.weight >= 1500) & (vehicles.weight <= 6000)]

Confirm that you have 532,370 vehicles in the dataset.

vehicles.shape
(532370, 35)

Add this vehicle information to your merged data#

Now we'll have a dataframe that contains information on:

  • The people themselves and their injuries
  • The crash
  • The vehicles

Every person came with a VEHICLE_ID column that is the vehicle they were in. You'll want to merge on that.

merged = merged.merge(vehicles, on='VEHICLE_ID')

Confirm you have 99 columns and 616,212 rows. That is a lot of possible features!

merged.shape
(616212, 99)

Another regression, because we can't get enough#

Build another train_df and run another regression about how car weight impacts the chance of fatalities. You'll want to confirm that your dataset has 616,212 rows and 2 columns.

train_df = merged[['weight', 'had_fatality']]
train_df = train_df.dropna()
train_df.shape
(616212, 2)
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
0 weight -0.000147 0.9999

Can you translate that into plain English? Remember weight is in pounds.

 

I feel like pounds isn't the best measure for something like this. Remember how we had to adjust percentages with AP and life expectancy, and then change around the way we said things? It sounded like this:

Every 10% increase in unemployment translates to a year and a half loss of life expectancy

Instead of every single pound, maybe we could do every... some other number of pounds? One hundred? One thousand?

Run another regression with weight in thousands of pounds. Get another odds ratio. Give me another sentence in English.

train_df = merged[['weight', 'had_fatality']].copy()
train_df['weight'] = train_df['weight'] / 1000
train_df = train_df.dropna()
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
0 weight -0.146881 0.8634
# Every additional thousand pounds of vehicle weight translates to roughly a 14% decrease in the odds of a fatality
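To sanity-check a sentence like that, you can convert the coefficient back into a percent change in the odds by hand:

```python
import numpy as np

# Coefficient from the regression above: log odds ratio for weight,
# measured in thousands of pounds
coef = -0.146881
odds_ratio = np.exp(coef)
pct_change = (odds_ratio - 1) * 100
print(round(odds_ratio, 4))   # 0.8634
print(round(pct_change, 1))   # about -13.7, i.e. roughly 14% lower odds per 1,000 lbs
```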

Two-car accidents, struck and striker#

Here's the thing, though: it isn't just the weight of your car. It's the weight of both cars! If I'm in a big car and I have a wreck with a smaller car, it's the smaller car that's in trouble.

To get that value, we need to do some feature engineering, some calculating of new variables from our existing variables.

We need to jump through some hoops to do that.

Two-car accidents#

First we're going to count how many vehicles were in each accident. Since we're looking to compare the weight of two cars hitting each other, we're only going to want crashes with only two cars.

counted = vehicles.REPORT_NO.value_counts()
counted.head(10)
CB53480003      28
MDTA1229000H    26
AS0286000B      14
CE4636002C      14
MSP6063006C     13
MSP67460095     13
AE5607002W      12
CC0237003W      12
MSP63070049     12
AC1811000Q      11
Name: REPORT_NO, dtype: int64

By using .value_counts I can see how many cars were in each crash, and now I'm going to filter to get a list of all of the ones with two vehicles.

two_car_report_nos = counted[counted == 2].index
two_car_report_nos
Index(['MSP5688001T', 'ZW0166000H', 'AC2154000Z', 'AE5447007F', 'ADI489000B',
       'AE5364000K', 'MSP59510031', 'CD0633001C', 'AE5633000H', 'MCP9386003X',
       ...
       'ADI452002W', 'CB5541000B', 'GS04490005', 'AE40860053', 'AC12880051',
       'AC1695001G', 'CE39090017', 'MCP27830021', 'AC19440016', 'DA3965002D'],
      dtype='object', length=137214)

And now we'll filter my vehicles so we only have those that were in two-vehicle crashes.

vehicles = vehicles[vehicles.REPORT_NO.isin(two_car_report_nos)]

Struck and striker#

To do the math correctly, we need both the risk of someone dying in the smaller car and the risk of someone dying in the bigger car. To do this we need to separate our cars into two groups:

  • The 'struck' vehicle: did the person die inside?
  • The 'striker' vehicle: how much heavier was it than the struck car?

But we don't know which car was which, so we have to try out both versions - pretending car A was the striker, then pretending car B was the striker. It's hard to explain, but you can read Pounds That Kill - The External Costs of Vehicle Weight.pdf for more details on how it works.

cars_1 = vehicles.drop_duplicates(subset='REPORT_NO', keep='first')
cars_2 = vehicles.drop_duplicates(subset='REPORT_NO', keep='last')
cars_merged_1 = cars_1.merge(cars_2, on='REPORT_NO', suffixes=['_striker', '_struck'])
cars_merged_2 = cars_2.merge(cars_1, on='REPORT_NO', suffixes=['_striker', '_struck'])
vehicles_complete = pd.concat([cars_merged_1, cars_merged_2])
vehicles_complete.head()
AREA_DAMAGED_CODE1_striker AREA_DAMAGED_CODE2_striker AREA_DAMAGED_CODE3_striker AREA_DAMAGED_CODE_IMP1_striker AREA_DAMAGED_CODE_MAIN_striker BODY_TYPE_CODE_striker COMMERCIAL_FLAG_striker CONTI_DIRECTION_CODE_striker CV_BODY_TYPE_CODE_striker DAMAGE_CODE_striker DRIVERLESS_FLAG_striker FIRE_FLAG_striker GOING_DIRECTION_CODE_striker GVW_CODE_striker HARM_EVENT_CODE_striker HAZMAT_SPILL_FLAG_striker HIT_AND_RUN_FLAG_striker HZM_NUM_striker MOVEMENT_CODE_striker NUM_AXLES_striker PARKED_FLAG_striker REPORT_NO SPEED_LIMIT_striker TOWED_AWAY_FLAG_striker TOWED_VEHICLE_CONFIG_CODE_striker VEHICLE_ID_striker VEH_MAKE_striker VEH_MODEL_striker VEH_YEAR_striker VIN_NO_striker VIN_striker Make_striker Model_striker ModelYear_striker weight_striker AREA_DAMAGED_CODE1_struck AREA_DAMAGED_CODE2_struck AREA_DAMAGED_CODE3_struck AREA_DAMAGED_CODE_IMP1_struck AREA_DAMAGED_CODE_MAIN_struck BODY_TYPE_CODE_struck COMMERCIAL_FLAG_struck CONTI_DIRECTION_CODE_struck CV_BODY_TYPE_CODE_struck DAMAGE_CODE_struck DRIVERLESS_FLAG_struck FIRE_FLAG_struck GOING_DIRECTION_CODE_struck GVW_CODE_struck HARM_EVENT_CODE_struck HAZMAT_SPILL_FLAG_struck HIT_AND_RUN_FLAG_struck HZM_NUM_struck MOVEMENT_CODE_struck NUM_AXLES_struck PARKED_FLAG_struck SPEED_LIMIT_struck TOWED_AWAY_FLAG_struck TOWED_VEHICLE_CONFIG_CODE_struck VEHICLE_ID_struck VEH_MAKE_struck VEH_MODEL_struck VEH_YEAR_struck VIN_NO_struck VIN_struck Make_struck Model_struck ModelYear_struck weight_struck
0 6.0 7.0 NaN 6.0 6.0 20.0 N N NaN 3 N N N NaN 1.0 NaN N NaN 6.0 NaN N CB5190006B 55 N 0 0003b659-2785-4868-8877-0b786a284827 TOYT TK 2011.0 5TFUY5F1XBX167340 5TFUY5F1XBX167340 TOYOTA TUNDRA 2011 5480.0 3.0 5.0 6.0 6.0 6.0 23.08 N N NaN 5 N N N NaN 1.0 NaN N NaN 6.0 NaN N 55 Y 0 0e7da302-5f53-45c8-8b28-393cdc38171d JEEP CHEROKEE 2015.0 1C4RJFBGXFC751955 1C4RJFBGXFC751955 JEEP GRAND CHEROKEE 2015 4545.0
1 8.0 NaN NaN 8.0 8.0 2.0 N W NaN 2 N NaN W NaN 1.0 NaN N NaN 8.0 NaN N ADJ4590035 5 N 0 00050484-d08f-4b6e-bc7e-9ec270e94660 HONDA CIVIC 2015.0 2HGFG4A59FH702545 2HGFG4A59FH702545 HONDA CIVIC 2015 2754.0 99.0 NaN NaN 99.0 99.0 2.00 N W NaN 99 N N W NaN 1.0 NaN Y NaN 1.0 NaN N 5 N 0 3ebfe7fc-4832-414d-a5e8-391b65a2f280 HONDA ACCORD 2013.0 1HGCT2B81DA005695 1HGCT2B81DA005695 HONDA ACCORD 2013 3163.0
2 12.0 NaN NaN 12.0 12.0 2.0 N N NaN 4 N NaN N NaN 0.0 NaN N NaN 1.0 NaN N ADJ849000Z 10 N 0 00057af4-d848-4cee-b854-707f57581f4e HONDA ACCORD 2003.0 1HGCM66313A037175 1HGCM66313A037175 HONDA ACCORD 2003 3023.0 12.0 NaN NaN 12.0 12.0 2.00 N NaN NaN 3 Y NaN NaN NaN 1.0 NaN N NaN 10.0 NaN Y 0 N 0 4cf496cc-f537-447e-bb8e-3ef48c8c3ddf HONDA TK 2010.0 5J6RE3H73AL008454 5J6RE3H73AL008454 HONDA CR-V 2010 3389.0
3 1.0 11.0 12.0 12.0 12.0 2.0 N S NaN 4 N N S NaN 1.0 NaN N NaN 12.0 NaN N AE5207008Z 30 Y 0 00089d4a-7038-4693-9e02-b402676631af FORD 4D 2016.0 1FADP3K28GL258987 1FADP3K28GL258987 FORD FOCUS 2016 2932.0 11.0 12.0 NaN 12.0 12.0 23.08 N N NaN 4 N N N NaN 1.0 NaN N NaN 1.0 NaN N 30 Y 0 1e69b284-719f-429b-baae-6ae4976a546b BUICK TK 2002.0 3G5DB03E52S528316 3G5DB03E52S528316 BUICK RENDEZVOUS 2002 4024.0
4 2.0 NaN NaN 2.0 2.0 2.0 N S NaN 99 N N S NaN 1.0 NaN Y NaN 1.0 NaN N ADJ859000Y 15 N 0 0010076b-0a45-45f3-8796-f387b39cd85d HONDA CIVIC 2009.0 2HGFA16689H357624 2HGFA16689H357624 HONDA CIVIC 2009 2678.0 10.0 NaN NaN 10.0 10.0 2.00 N NaN NaN 3 Y N NaN NaN 1.0 NaN N NaN 10.0 NaN Y 15 N 0 b14273cb-b3e5-42c7-af59-df069e8c2f0f HONDA CIVIC 2011.0 2HGFG1B87BH517469 2HGFG1B87BH517469 HONDA CIVIC 2011 2703.0

Put people in their cars#

Which car was each person in? We'll assign that now.

merged = people.merge(vehicles_complete, left_on='VEHICLE_ID', right_on='VEHICLE_ID_struck')
merged.head(3)
AIRBAG_DEPLOYED ALCOHOL_TESTTYPE_CODE ALCOHOL_TEST_CODE BAC_CODE CDL_FLAG CLASS CONDITION_CODE DATE_OF_BIRTH DRUG_TESTRESULT_CODE DRUG_TEST_CODE EJECT_CODE EMS_UNIT_LABEL EQUIP_PROB_CODE FAULT_FLAG INJ_SEVER_CODE LICENSE_STATE_CODE MOVEMENT_CODE OCC_SEAT_POS_CODE PED_LOCATION_CODE PED_OBEY_CODE PED_TYPE_CODE PED_VISIBLE_CODE PERSON_ID PERSON_TYPE REPORT_NO_x SAF_EQUIP_CODE SEX_CODE VEHICLE_ID had_fatality AREA_DAMAGED_CODE1_striker AREA_DAMAGED_CODE2_striker AREA_DAMAGED_CODE3_striker AREA_DAMAGED_CODE_IMP1_striker AREA_DAMAGED_CODE_MAIN_striker BODY_TYPE_CODE_striker COMMERCIAL_FLAG_striker CONTI_DIRECTION_CODE_striker CV_BODY_TYPE_CODE_striker DAMAGE_CODE_striker DRIVERLESS_FLAG_striker FIRE_FLAG_striker GOING_DIRECTION_CODE_striker GVW_CODE_striker HARM_EVENT_CODE_striker HAZMAT_SPILL_FLAG_striker HIT_AND_RUN_FLAG_striker HZM_NUM_striker MOVEMENT_CODE_striker NUM_AXLES_striker PARKED_FLAG_striker REPORT_NO_y SPEED_LIMIT_striker TOWED_AWAY_FLAG_striker TOWED_VEHICLE_CONFIG_CODE_striker VEHICLE_ID_striker VEH_MAKE_striker VEH_MODEL_striker VEH_YEAR_striker VIN_NO_striker VIN_striker Make_striker Model_striker ModelYear_striker weight_striker AREA_DAMAGED_CODE1_struck AREA_DAMAGED_CODE2_struck AREA_DAMAGED_CODE3_struck AREA_DAMAGED_CODE_IMP1_struck AREA_DAMAGED_CODE_MAIN_struck BODY_TYPE_CODE_struck COMMERCIAL_FLAG_struck CONTI_DIRECTION_CODE_struck CV_BODY_TYPE_CODE_struck DAMAGE_CODE_struck DRIVERLESS_FLAG_struck FIRE_FLAG_struck GOING_DIRECTION_CODE_struck GVW_CODE_struck HARM_EVENT_CODE_struck HAZMAT_SPILL_FLAG_struck HIT_AND_RUN_FLAG_struck HZM_NUM_struck MOVEMENT_CODE_struck NUM_AXLES_struck PARKED_FLAG_struck SPEED_LIMIT_struck TOWED_AWAY_FLAG_struck TOWED_VEHICLE_CONFIG_CODE_struck VEHICLE_ID_struck VEH_MAKE_struck VEH_MODEL_struck VEH_YEAR_struck VIN_NO_struck VIN_struck Make_struck Model_struck ModelYear_struck weight_struck
0 1.0 NaN 0.0 NaN N NaN 0.0 1985-05-28 00:00:00 NaN 0.0 0.0 NaN 0.0 N 1 MD NaN NaN NaN NaN NaN NaN 166296bd-ffd3-4c16-aa74-4f4bf4139d8d D AB4313000X 13.0 1.0 b463eb20-2f01-4200-9d6f-b18888ce2593 0 3.0 NaN NaN 3.0 3.0 2.0 N E NaN 2 N NaN E NaN 1.0 NaN Y NaN 2.0 NaN N AB4313000X 25 N 0 3c8629d0-d524-47c1-bfbc-b18e07f3087e LEXU TK 2004.0 2T2HA31U24C031048 2T2HA31U24C031048 LEXUS RX 2004 3900.0 7.0 NaN NaN 7.0 7.0 2.00 N S NaN 2 N N S NaN 1.0 NaN N NaN 6.0 NaN N 25 N 0 b463eb20-2f01-4200-9d6f-b18888ce2593 CHEVY TK 2007.0 2CNDL13F576252855 2CNDL13F576252855 CHEVROLET EQUINOX 2007 3776.0
1 1.0 NaN 0.0 NaN N C 0.0 1960-10-04 00:00:00 NaN 0.0 0.0 NaN 0.0 Y 1 MD NaN NaN NaN NaN NaN NaN f3b2743f-fbc3-4345-9419-56a0ca29102c D AB4313000X 13.0 1.0 3c8629d0-d524-47c1-bfbc-b18e07f3087e 0 7.0 NaN NaN 7.0 7.0 2.0 N S NaN 2 N N S NaN 1.0 NaN N NaN 6.0 NaN N AB4313000X 25 N 0 b463eb20-2f01-4200-9d6f-b18888ce2593 CHEVY TK 2007.0 2CNDL13F576252855 2CNDL13F576252855 CHEVROLET EQUINOX 2007 3776.0 3.0 NaN NaN 3.0 3.0 2.00 N E NaN 2 N NaN E NaN 1.0 NaN Y NaN 2.0 NaN N 25 N 0 3c8629d0-d524-47c1-bfbc-b18e07f3087e LEXU TK 2004.0 2T2HA31U24C031048 2T2HA31U24C031048 LEXUS RX 2004 3900.0
2 1.0 NaN 0.0 NaN N C 0.0 1990-11-05 00:00:00 NaN 0.0 1.0 NaN 1.0 N 1 MD NaN NaN NaN NaN NaN NaN 21fff0f9-ed13-4b87-ac04-29d018aff9d4 D AB5218001Y 13.0 1.0 4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f 0 10.0 11.0 12.0 11.0 11.0 2.0 N N NaN 3 N N N NaN 1.0 NaN N NaN 1.0 NaN N AB5218001Y 25 N 0 057a3e63-3009-4d07-b956-78f53a0ca992 LINC TOWNCAR 2003.0 1LNHM82W53Y644084 1LNHM82W53Y644084 LINCOLN TOWN CAR 2003 4000.0 9.0 10.0 11.0 11.0 11.0 23.08 N S NaN 3 N N W NaN 1.0 NaN N NaN 12.0 NaN N 25 N 0 4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f HYUN SW 2010.0 5NMSG3AB9AH400386 5NMSG3AB9AH400386 HYUNDAI SANTA FE 2010 3900.0

Add the crash details#

You did this already! I'm going to do it for you. We're merging on REPORT_NO_x because there are so many REPORT_NO columns duplicated across our files that pandas started giving them weird names.

merged = merged.merge(crashes, left_on='REPORT_NO_x', right_on='REPORT_NO')
merged.head(3)
AIRBAG_DEPLOYED ALCOHOL_TESTTYPE_CODE ALCOHOL_TEST_CODE BAC_CODE CDL_FLAG CLASS CONDITION_CODE DATE_OF_BIRTH DRUG_TESTRESULT_CODE DRUG_TEST_CODE EJECT_CODE EMS_UNIT_LABEL EQUIP_PROB_CODE FAULT_FLAG INJ_SEVER_CODE LICENSE_STATE_CODE MOVEMENT_CODE OCC_SEAT_POS_CODE PED_LOCATION_CODE PED_OBEY_CODE PED_TYPE_CODE PED_VISIBLE_CODE PERSON_ID PERSON_TYPE REPORT_NO_x SAF_EQUIP_CODE SEX_CODE VEHICLE_ID had_fatality AREA_DAMAGED_CODE1_striker AREA_DAMAGED_CODE2_striker AREA_DAMAGED_CODE3_striker AREA_DAMAGED_CODE_IMP1_striker AREA_DAMAGED_CODE_MAIN_striker BODY_TYPE_CODE_striker COMMERCIAL_FLAG_striker CONTI_DIRECTION_CODE_striker CV_BODY_TYPE_CODE_striker DAMAGE_CODE_striker DRIVERLESS_FLAG_striker FIRE_FLAG_striker GOING_DIRECTION_CODE_striker GVW_CODE_striker HARM_EVENT_CODE_striker HAZMAT_SPILL_FLAG_striker HIT_AND_RUN_FLAG_striker HZM_NUM_striker MOVEMENT_CODE_striker NUM_AXLES_striker PARKED_FLAG_striker REPORT_NO_y SPEED_LIMIT_striker TOWED_AWAY_FLAG_striker TOWED_VEHICLE_CONFIG_CODE_striker VEHICLE_ID_striker VEH_MAKE_striker VEH_MODEL_striker VEH_YEAR_striker VIN_NO_striker VIN_striker Make_striker Model_striker ModelYear_striker weight_striker AREA_DAMAGED_CODE1_struck AREA_DAMAGED_CODE2_struck AREA_DAMAGED_CODE3_struck AREA_DAMAGED_CODE_IMP1_struck AREA_DAMAGED_CODE_MAIN_struck BODY_TYPE_CODE_struck COMMERCIAL_FLAG_struck CONTI_DIRECTION_CODE_struck CV_BODY_TYPE_CODE_struck DAMAGE_CODE_struck DRIVERLESS_FLAG_struck FIRE_FLAG_struck GOING_DIRECTION_CODE_struck GVW_CODE_struck HARM_EVENT_CODE_struck HAZMAT_SPILL_FLAG_struck HIT_AND_RUN_FLAG_struck HZM_NUM_struck MOVEMENT_CODE_struck NUM_AXLES_struck PARKED_FLAG_struck SPEED_LIMIT_struck TOWED_AWAY_FLAG_struck TOWED_VEHICLE_CONFIG_CODE_struck VEHICLE_ID_struck VEH_MAKE_struck VEH_MODEL_struck VEH_YEAR_struck VIN_NO_struck VIN_struck Make_struck Model_struck ModelYear_struck weight_struck ACC_DATE ACC_TIME AGENCY_CODE AREA_CODE COLLISION_TYPE_CODE COUNTY_NO C_M_ZONE_FLAG DISTANCE DISTANCE_DIR_FLAG FEET_MILES_FLAG 
FIX_OBJ_CODE HARM_EVENT_CODE1 HARM_EVENT_CODE2 JUNCTION_CODE LANE_CODE LATITUDE LIGHT_CODE LOC_CODE LOGMILE_DIR_FLAG LOG_MILE LONGITUDE MAINROAD_NAME MUNI_CODE RD_COND_CODE RD_DIV_CODE REFERENCE_NO REFERENCE_ROAD_NAME REFERENCE_SUFFIX REFERENCE_TYPE_CODE REPORT_NO REPORT_TYPE ROUTE_TYPE_CODE RTE_NO RTE_SUFFIX SIGNAL_FLAG SURF_COND_CODE WEATHER_CODE
0 1.0 NaN 0.0 NaN N NaN 0.0 1985-05-28 00:00:00 NaN 0.0 0.0 NaN 0.0 N 1 MD NaN NaN NaN NaN NaN NaN 166296bd-ffd3-4c16-aa74-4f4bf4139d8d D AB4313000X 13.0 1.0 b463eb20-2f01-4200-9d6f-b18888ce2593 0 3.0 NaN NaN 3.0 3.0 2.0 N E NaN 2 N NaN E NaN 1.0 NaN Y NaN 2.0 NaN N AB4313000X 25 N 0 3c8629d0-d524-47c1-bfbc-b18e07f3087e LEXU TK 2004.0 2T2HA31U24C031048 2T2HA31U24C031048 LEXUS RX 2004 3900.0 7.0 NaN NaN 7.0 7.0 2.00 N S NaN 2 N N S NaN 1.0 NaN N NaN 6.0 NaN N 25 N 0 b463eb20-2f01-4200-9d6f-b18888ce2593 CHEVY TK 2007.0 2CNDL13F576252855 2CNDL13F576252855 CHEVROLET EQUINOX 2007 3776.0 2018-04-19 00:00:00 15:50:00 ANNAPOLIS UNK 3 2.0 N 0.0 S F 0.0 1.0 0.0 0.0 1.0 38.976882 1.0 NaN N 0.01 -76.486787 COMPROMISE ST 3.0 0.0 1.0 960.0 DUKE OF GLOUCHESTER ST NaN MU AB4313000X Property Damage Crash MU 740.0 NaN N 0.0 7.01
1 1.0 NaN 0.0 NaN N C 0.0 1960-10-04 00:00:00 NaN 0.0 0.0 NaN 0.0 Y 1 MD NaN NaN NaN NaN NaN NaN f3b2743f-fbc3-4345-9419-56a0ca29102c D AB4313000X 13.0 1.0 3c8629d0-d524-47c1-bfbc-b18e07f3087e 0 7.0 NaN NaN 7.0 7.0 2.0 N S NaN 2 N N S NaN 1.0 NaN N NaN 6.0 NaN N AB4313000X 25 N 0 b463eb20-2f01-4200-9d6f-b18888ce2593 CHEVY TK 2007.0 2CNDL13F576252855 2CNDL13F576252855 CHEVROLET EQUINOX 2007 3776.0 3.0 NaN NaN 3.0 3.0 2.00 N E NaN 2 N NaN E NaN 1.0 NaN Y NaN 2.0 NaN N 25 N 0 3c8629d0-d524-47c1-bfbc-b18e07f3087e LEXU TK 2004.0 2T2HA31U24C031048 2T2HA31U24C031048 LEXUS RX 2004 3900.0 2018-04-19 00:00:00 15:50:00 ANNAPOLIS UNK 3 2.0 N 0.0 S F 0.0 1.0 0.0 0.0 1.0 38.976882 1.0 NaN N 0.01 -76.486787 COMPROMISE ST 3.0 0.0 1.0 960.0 DUKE OF GLOUCHESTER ST NaN MU AB4313000X Property Damage Crash MU 740.0 NaN N 0.0 7.01
2 1.0 NaN 0.0 NaN N C 0.0 1990-11-05 00:00:00 NaN 0.0 1.0 NaN 1.0 N 1 MD NaN NaN NaN NaN NaN NaN 21fff0f9-ed13-4b87-ac04-29d018aff9d4 D AB5218001Y 13.0 1.0 4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f 0 10.0 11.0 12.0 11.0 11.0 2.0 N N NaN 3 N N N NaN 1.0 NaN N NaN 1.0 NaN N AB5218001Y 25 N 0 057a3e63-3009-4d07-b956-78f53a0ca992 LINC TOWNCAR 2003.0 1LNHM82W53Y644084 1LNHM82W53Y644084 LINCOLN TOWN CAR 2003 4000.0 9.0 10.0 11.0 11.0 11.0 23.08 N S NaN 3 N N W NaN 1.0 NaN N NaN 12.0 NaN N 25 N 0 4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f HYUN SW 2010.0 5NMSG3AB9AH400386 5NMSG3AB9AH400386 HYUNDAI SANTA FE 2010 3900.0 2018-06-08 00:00:00 16:44:00 ANNAPOLIS UNK 11 2.0 N 30.0 N F 0.0 1.0 0.0 0.0 1.0 38.978495 1.0 NaN N 0.00 -76.496738 W WASHINGTON ST 3.0 1.0 1.0 450.0 WEST ST NaN MD AB5218001Y Property Damage Crash MU 3380.0 NaN N 2.0 6.01

Filter#

We already filtered out vehicles by weight, so we don't have to do that again.

Calculated features#

I'm sure you've forgotten what all the features are by now, so we'll start by bringing back our flag for whether there was a fatality or not.

Feature: Accident was fatal#

merged['had_fatality'] = (merged.INJ_SEVER_CODE == 5).astype(int)
merged.had_fatality.value_counts()
0    334152
1       244
Name: had_fatality, dtype: int64
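The trick here is that comparing a column to a value gives you a boolean Series, and `.astype(int)` turns those True/False values into the 1/0 flags a regression needs. A minimal sketch of the pattern with toy data:

```python
import pandas as pd

# Toy severity codes: 5 means a fatal injury
codes = pd.Series([1, 2, 5, 3, 5])

# The comparison gives booleans; .astype(int) converts them to 0/1 flags
flags = (codes == 5).astype(int)
print(flags.tolist())  # → [0, 0, 1, 0, 1]
print(flags.sum())     # → 2
```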

Feature: Weight difference#

Remove every row that's missing a weight for either the striker or the struck vehicle. You might need to check merged.columns to remind yourself what the column names are.

merged = merged.dropna(subset=['weight_striker', 'weight_struck'])

Confirm your dataset has roughly 335,000 rows.

merged.shape
(334396, 135)
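Note that `dropna(subset=...)` only checks the columns you list, so rows with missing values in other fields survive. A quick sketch with toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'weight_striker': [3000.0, np.nan, 4000.0],
    'weight_struck':  [3500.0, 3200.0, np.nan],
    'other':          [np.nan, 1.0, 2.0],
})

# Only rows missing one of the two weight columns get dropped
kept = df.dropna(subset=['weight_striker', 'weight_struck'])
print(len(kept))  # → 1
print(np.isnan(kept['other'].iloc[0]))  # → True (NaNs elsewhere survive)
```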

Create a new feature called weight_diff: how much heavier the striking car was than the struck car. Make sure you've done the math correctly!

merged['weight_diff'] = merged['weight_striker'] - merged['weight_struck']

Feature adjustment#

Make all of your weight columns in thousands of pounds instead of just in pounds. It'll help you interpret your results much better.

merged.weight_striker = merged.weight_striker / 1000
merged.weight_struck = merged.weight_struck / 1000
merged.weight_diff = merged.weight_diff / 1000
merged[['weight_striker', 'weight_struck', 'weight_diff']].describe()
weight_striker weight_struck weight_diff
count 334396.000000 334396.000000 334396.000000
mean 3.608778 3.629094 -0.020316
std 0.758489 0.769407 1.077867
min 1.701000 1.701000 -3.916000
25% 3.087000 3.093000 -0.710000
50% 3.450000 3.450000 -0.009000
75% 4.057000 4.096000 0.668000
max 5.985000 5.985000 3.916000

Another regression!!!#

What is the impact of weight difference on fatality rate? Create your train_df, drop missing values, run your regression, analyze your odds ratios.

train_df = merged[['weight_diff', 'had_fatality']].copy()
train_df = train_df.dropna()
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
0 weight_diff 0.475534 1.6089
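If you want to convince yourself this kind of model actually recovers a real relationship, you can fit the same regression on synthetic data where you know the true log-odds slope in advance. This is a sketch on made-up data, not the crash dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: fatalities get more likely as weight_diff grows,
# with a true log-odds slope of 0.5 (odds ratio exp(0.5) ≈ 1.65)
rng = np.random.default_rng(0)
weight_diff = rng.uniform(-3, 3, 5000)
p = 1 / (1 + np.exp(-(-4 + 0.5 * weight_diff)))
y = rng.random(5000) < p

clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(weight_diff.reshape(-1, 1), y)

# The fitted odds ratio should land near the true exp(0.5) ≈ 1.65
print(np.exp(clf.coef_[0][0]))
```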

Please translate your odds ratio into plain English.

# For every 1,000 lbs heavier the striking car is, the odds of a fatality are about 60% higher
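The arithmetic behind that sentence: exponentiate the coefficient to get the odds ratio, then subtract 1 to read it as a percent change in the odds.

```python
import numpy as np

coef = 0.475534  # log odds ratio for weight_diff, from the table above
odds_ratio = np.exp(coef)
pct_change = (odds_ratio - 1) * 100

print(round(odds_ratio, 4))  # → 1.6089
print(round(pct_change, 1))  # → 60.9
```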

Adding in more features#

How about speed limit? That's important, right? We can add the speed limit of the striking vehicle with SPEED_LIMIT_striker.

train_df = merged[['weight_diff', 'had_fatality', 'SPEED_LIMIT_striker']].copy()
train_df = train_df.dropna()
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
0 weight_diff 0.465337 1.5926
1 SPEED_LIMIT_striker 0.054314 1.0558

Can you translate the speed limit odds ratio into plain English?

 

Feature engineering: Speed limits#

Honestly, that's a pretty bad way to go about things. What's more fun is if we translate speed limits into bins.

First, we'll use pd.cut to assign each speed limit a category.

speed_bins = [-np.inf, 10, 20, 30, 40, 50, np.inf]
merged['speed_bin'] = pd.cut(merged.SPEED_LIMIT_striker, bins=speed_bins)
merged[['SPEED_LIMIT_striker', 'speed_bin']].head(10)
SPEED_LIMIT_striker speed_bin
0 25 (20.0, 30.0]
1 25 (20.0, 30.0]
2 25 (20.0, 30.0]
3 25 (20.0, 30.0]
4 25 (20.0, 30.0]
5 5 (-inf, 10.0]
6 45 (40.0, 50.0]
7 45 (40.0, 50.0]
8 45 (40.0, 50.0]
9 45 (40.0, 50.0]
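One detail worth knowing about pd.cut: the intervals are closed on the right by default, so a 30mph limit lands in (20, 30], not (30, 40]. A minimal sketch:

```python
import pandas as pd
import numpy as np

speed_bins = [-np.inf, 10, 20, 30, 40, 50, np.inf]
limits = pd.Series([5, 25, 30, 45])
binned = pd.cut(limits, bins=speed_bins)

# Intervals are closed on the right: both 25 and 30 fall in (20.0, 30.0]
print(binned.astype(str).tolist())
```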

Then we'll one-hot encode the bins, dropping the 20-30mph bin as the reference category. (That's why the rows below, which all had 25mph limits, come out as all zeros.)

speed_dummies = pd.get_dummies(merged.speed_bin, 
                               prefix='speed').drop('speed_(20.0, 30.0]', axis=1)
speed_dummies.head()
speed_(-inf, 10.0] speed_(10.0, 20.0] speed_(30.0, 40.0] speed_(40.0, 50.0] speed_(50.0, inf]
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0

Running a regression#

I like this layout for creating train_df: it allows us to easily add dummies and do a little replacing/encoding when we're building binary features like sex.

If the below gives you an error, it's because SEX_CODE is already a number. In that case, just remove .replace({'M': 1, 'F': 0, 'U': np.nan }).
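One heads-up: surf_dummies isn't built anywhere in this section, it came from earlier in the notebook. If you need to re-create it, the sketch below shows one plausible construction; the code-to-label mapping here is hypothetical, so check your data dictionary for the real SURF_COND_CODE values.

```python
import pandas as pd

# Hypothetical SURF_COND_CODE mapping — check the data dictionary for the real one
surf_labels = {1: 'Dry', 2: 'Wet', 3: 'Snow', 4: 'Ice',
               5: 'Mud, Dirt, Gravel', 6: 'Other',
               0: 'Not Applicable', 99: 'Unknown'}

# Toy stand-in for merged.SURF_COND_CODE
codes = pd.Series([1, 2, 4, 1])

surf_dummies = pd.get_dummies(codes.map(surf_labels), prefix='surface').astype(int)

# Drop 'Dry' so it becomes the reference category, matching the regression output
surf_dummies = surf_dummies.drop('surface_Dry', axis=1)
print(list(surf_dummies.columns))  # → ['surface_Ice', 'surface_Wet']
```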

train_df = pd.DataFrame({
    'weight_diff': merged.weight_diff,
    'sex': merged.SEX_CODE,#.replace({'M': 1, 'F': 0, 'U': np.nan }),
    'had_fatality': merged.had_fatality,
})
train_df = train_df.join(speed_dummies)
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()
train_df.head()
weight_diff sex had_fatality speed_(-inf, 10.0] speed_(10.0, 20.0] speed_(30.0, 40.0] speed_(40.0, 50.0] speed_(50.0, inf] surface_Ice surface_Mud, Dirt, Gravel surface_Not Applicable surface_Other surface_Snow surface_Unknown surface_Wet
0 0.124 1.0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 -0.124 1.0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0.100 1.0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 -0.100 1.0 0 0 0 0 0 0 0 0 1 0 0 0 0
4 -0.100 0.0 0 0 0 0 0 0 0 0 0 0 0 0 1
X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
5 speed_(40.0, 50.0] 1.713997 5.5511
6 speed_(50.0, inf] 1.405104 4.0760
7 surface_Ice 0.647203 1.9102
4 speed_(30.0, 40.0] 0.630550 1.8786
0 weight_diff 0.469332 1.5989
10 surface_Other 0.357723 1.4301
9 surface_Not Applicable 0.235205 1.2652
13 surface_Wet -0.130009 0.8781
11 surface_Snow -0.170967 0.8428
1 sex -0.198966 0.8196
2 speed_(-inf, 10.0] -1.017386 0.3615
8 surface_Mud, Dirt, Gravel -7.091968 0.0008
3 speed_(10.0, 20.0] -29.768925 0.0000
12 surface_Unknown -22.617279 0.0000

Describe the impact of the different variables in simple language. What has the largest impact?

 

Now you pick the features#

Up above you have examples of:

  • Creating features from numbers (speed limits)
  • Creating features from 0/1 (sex)
  • Creating features from binning numbers that are one-hot encoded (speed limit bins - speed_bins)
  • Creating features from categories that are one-hot encoded (surface - surf_dummies)

What else do you think matters? Try to plug in more features and see if you can get anything interesting.

  • Hot tip: The funniest/most interesting feature you can add is also the dumbest. Ask me about it in #algorithms if you end up getting down here.