Understanding feature selection and feature engineering when performing regressions#

Even with a large and complete dataset, not all of the features end up being important, and you don't always have all the fields you need. Sometimes performing calculations between two or more columns can yield a large improvement in your predictor's performance.

Read online Download notebook Interactive version

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_colwidth", 200)
%matplotlib inline

Reading in our data#

Every single person in a crash#

We'll start by reading in the list of people who were involved in an accident.

Call this dataframe people.

people = pd.read_csv("data/combined-person-data.csv")
people.head(2)

	AIRBAG_DEPLOYED	ALCOHOL_TESTTYPE_CODE	ALCOHOL_TEST_CODE	BAC_CODE	CDL_FLAG	CLASS	CONDITION_CODE	DATE_OF_BIRTH	DRUG_TESTRESULT_CODE	DRUG_TEST_CODE	EJECT_CODE	EMS_UNIT_LABEL	EQUIP_PROB_CODE	FAULT_FLAG	INJ_SEVER_CODE	LICENSE_STATE_CODE	MOVEMENT_CODE	OCC_SEAT_POS_CODE	PED_LOCATION_CODE	PED_OBEY_CODE	PED_TYPE_CODE	PED_VISIBLE_CODE	PERSON_ID	PERSON_TYPE	REPORT_NO	SAF_EQUIP_CODE	SEX_CODE	VEHICLE_ID
0	1.0	NaN	0.0	NaN	N	C	0.0	1952-04-20 00:00:00	NaN	0.0	1.0	NaN	1.0	N	1	PA	NaN	NaN	NaN	NaN	NaN	NaN	48dd00ee-e033-47e7-ad1e-0b734020301b	D	AB4284000S	13.0	F	eb6aadb8-dacb-4744-a1a7-ab812c96f27f
1	1.0	NaN	0.0	NaN	N	NaN	0.0	1985-05-28 00:00:00	NaN	0.0	0.0	NaN	0.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	166296bd-ffd3-4c16-aa74-4f4bf4139d8d	D	AB4313000X	13.0	F	b463eb20-2f01-4200-9d6f-b18888ce2593

How often did each severity of injury show up? (e.g. not injured, non-incapacitating injury, etc)

people.INJ_SEVER_CODE.value_counts()

1    716806
2     82642
3     76801
4     10353
5      1681
Name: INJ_SEVER_CODE, dtype: int64

We're only interested in fatalities, so let's create a new had_fatality column for when people received a fatal injury.

Confirm there were 1681 people with fatal injuries.

people['had_fatality'] = (people.INJ_SEVER_CODE == 5).astype(int)
people.had_fatality.value_counts()

0    886602
1      1681
Name: had_fatality, dtype: int64

Working on Features#

Starting our analysis#

We're going to run a regression on the impact of being male vs female on crash fatalities. Prepare a dataframe called train_df with the appropriate information in it.

Tip: What column(s) are your input, and what is your output? Aka independent and dependent variables
Tip: You'll need to convert your input column into something numeric, I suggest using .replace
Tip: We aren't interested in the "Unknown" sex - either filtering or np.nan + .dropna() might be useful ways to get rid of those columns

people['SEX_CODE'] = people.SEX_CODE.replace({'M': 0, 'F': 1, 'U': np.nan})

train_df = people[['SEX_CODE', 'had_fatality']].copy()
train_df = train_df.dropna()

Confirm that your train_df has two columns and 815,827 rows.

Tip: If you have more rows, make sure you dropped all of the rows with Unknown sex.

Tip: If you have more columns, make sure you only have your input and output columns.

train_df.shape

(815827, 2)

Run your regression#

See the effect of sex on whether the person's injuries are fatal or not. After we train the regression, we can use my ✨favorite technique✨ to display features and their odds ratios:

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
0	SEX_CODE	-0.71481	0.4893

Use words to interpret this result#

# Women are about 50% less likely to die in a car crash

Adding more features#

The actual crash data has more details - whether it was snowy/icy, whether it was a highway, etc.

Read in combined-crash-data.csv, calling it crashes, and merge it with our people dataset. I'll save you a lookup: the REPORT_NO is what matches between the two.

crashes = pd.read_csv("data/combined-crash-data.csv")
crashes.head(2)

	ACC_DATE	ACC_TIME	AGENCY_CODE	AREA_CODE	COLLISION_TYPE_CODE	COUNTY_NO	C_M_ZONE_FLAG	DISTANCE	DISTANCE_DIR_FLAG	FEET_MILES_FLAG	FIX_OBJ_CODE	HARM_EVENT_CODE1	HARM_EVENT_CODE2	JUNCTION_CODE	LANE_CODE	LATITUDE	LIGHT_CODE	LOC_CODE	LOGMILE_DIR_FLAG	LOG_MILE	LONGITUDE	MAINROAD_NAME	MUNI_CODE	RD_COND_CODE	RD_DIV_CODE	REFERENCE_NO	REFERENCE_ROAD_NAME	REFERENCE_SUFFIX	REFERENCE_TYPE_CODE	REPORT_NO	REPORT_TYPE	ROUTE_TYPE_CODE	RTE_NO	RTE_SUFFIX	SIGNAL_FLAG	SURF_COND_CODE	WEATHER_CODE
0	2018-04-10 00:00:00	01:50:00	MSP	UNK	17	18.0	N	0.0	N	F	22.03	16.0	11.0	1.0	NaN	38.277230	4.0	NaN	N	2.09	-76.682876	NEWTOWNE NECK RD	0.0	1.0	1.0	166.0	ROSEBANK RD	NaN	CO	MSP6188002Q	Injury Crash	MD	243.0	NaN	N	2.0	6.01
1	2018-05-02 00:00:00	11:06:00	ELKTON	UNK	5	7.0	N	15.0	S	F	0.00	1.0	1.0	3.0	NaN	39.613747	1.0	NaN	N	19.41	-75.837454	N BRIDGE ST	52.0	1.0	3.0	362.0	LAUREL DR	NaN	MU	BK0227001M	Injury Crash	MD	213.0	NaN	N	2.0	6.01

merged = people.merge(crashes, on='REPORT_NO')
merged.head()

	AIRBAG_DEPLOYED	ALCOHOL_TESTTYPE_CODE	BAC_CODE	CDL_FLAG	CLASS	CONDITION_CODE	DATE_OF_BIRTH	DRUG_TESTRESULT_CODE	EJECT_CODE	EMS_UNIT_LABEL	EQUIP_PROB_CODE	FAULT_FLAG	INJ_SEVER_CODE	LICENSE_STATE_CODE	MOVEMENT_CODE	OCC_SEAT_POS_CODE	PED_LOCATION_CODE	PED_OBEY_CODE	PED_TYPE_CODE	PED_VISIBLE_CODE	PERSON_ID	PERSON_TYPE	REPORT_NO	SAF_EQUIP_CODE	SEX_CODE	VEHICLE_ID	ACC_DATE	ACC_TIME	AGENCY_CODE	AREA_CODE	COLLISION_TYPE_CODE	COUNTY_NO	C_M_ZONE_FLAG	DISTANCE	DISTANCE_DIR_FLAG	FEET_MILES_FLAG	HARM_EVENT_CODE1	JUNCTION_CODE	LANE_CODE	LATITUDE	LIGHT_CODE	LOC_CODE	LOGMILE_DIR_FLAG	LOG_MILE	LONGITUDE	MAINROAD_NAME	MUNI_CODE	RD_COND_CODE	RD_DIV_CODE	REFERENCE_NO	REFERENCE_ROAD_NAME	REFERENCE_SUFFIX	REFERENCE_TYPE_CODE	REPORT_TYPE	ROUTE_TYPE_CODE	RTE_NO	RTE_SUFFIX	SIGNAL_FLAG	SURF_COND_CODE	WEATHER_CODE
0	1.0	NaN	NaN	N	C	0.0	1952-04-20 00:00:00	NaN	1.0	NaN	1.0	N	1	PA	NaN	NaN	NaN	NaN	NaN	NaN	48dd00ee-e033-47e7-ad1e-0b734020301b	D	AB4284000S	13.0	1.0	eb6aadb8-dacb-4744-a1a7-ab812c96f27f	2018-05-25 00:00:00	10:58:00	ANNAPOLIS	UNK	3	2.0	N	20.0	S	F	1.0	3.0	1.0	38.950029	1.0	NaN	S	0.00	-76.492578	HILLSMERE DR	3.0	1.0	5.01	2930.0	BAY RIDGE RD	NaN	CO	Property Damage Crash	CO	2885.0	NaN	Y	2.0	6.01
1	0.0	NaN	NaN	N	C	1.0	1994-12-13 00:00:00	NaN	1.0	NaN	1.0	Y	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	974e41b6-480e-431f-9585-4b87f4c29b19	D	AB4284000S	13.0	1.0	e95fc5fb-a269-4cce-9e09-643cd6a5bde7	2018-05-25 00:00:00	10:58:00	ANNAPOLIS	UNK	3	2.0	N	20.0	S	F	1.0	3.0	1.0	38.950029	1.0	NaN	S	0.00	-76.492578	HILLSMERE DR	3.0	1.0	5.01	2930.0	BAY RIDGE RD	NaN	CO	Property Damage Crash	CO	2885.0	NaN	Y	2.0	6.01
2	1.0	NaN	NaN	N	NaN	0.0	1985-05-28 00:00:00	NaN	0.0	NaN	0.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	166296bd-ffd3-4c16-aa74-4f4bf4139d8d	D	AB4313000X	13.0	1.0	b463eb20-2f01-4200-9d6f-b18888ce2593	2018-04-19 00:00:00	15:50:00	ANNAPOLIS	UNK	3	2.0	N	0.0	S	F	1.0	0.0	1.0	38.976882	1.0	NaN	N	0.01	-76.486787	COMPROMISE ST	3.0	0.0	1.00	960.0	DUKE OF GLOUCHESTER ST	NaN	MU	Property Damage Crash	MU	740.0	NaN	N	0.0	7.01
3	1.0	NaN	NaN	N	C	0.0	1960-10-04 00:00:00	NaN	0.0	NaN	0.0	Y	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	f3b2743f-fbc3-4345-9419-56a0ca29102c	D	AB4313000X	13.0	1.0	3c8629d0-d524-47c1-bfbc-b18e07f3087e	2018-04-19 00:00:00	15:50:00	ANNAPOLIS	UNK	3	2.0	N	0.0	S	F	1.0	0.0	1.0	38.976882	1.0	NaN	N	0.01	-76.486787	COMPROMISE ST	3.0	0.0	1.00	960.0	DUKE OF GLOUCHESTER ST	NaN	MU	Property Damage Crash	MU	740.0	NaN	N	0.0	7.01
4	1.0	NaN	NaN	N	B	0.0	1971-05-28 00:00:00	NaN	1.0	NaN	0.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	5bbe589b-a2db-4dbb-be9b-17a800d69a08	D	AB4313000Y	0.0	1.0	c4628cdb-f295-4a24-8a4b-653741ac6ae7	2018-05-16 00:00:00	13:41:00	ANNAPOLIS	UNK	4	2.0	N	0.0	W	F	1.0	2.0	2.0	38.978461	1.0	NaN	W	0.03	-76.493329	CHURCH CIR	3.0	0.0	2.00	2300.0	NORTHWEST ST	NaN	MU	Property Damage Crash	MU	660.0	NaN	Y	1.0	3.00

Examining more possible features#

How often was it wet, dry, snowy, icy, etc? What was the most common condition?

merged.SURF_COND_CODE.value_counts()

2.00     619569
1.00     147803
0.00      30921
3.00       9418
4.00       9111
99.00      3616
6.03       1697
88.00      1064
5.00        447
7.01        258
9.88        141
8.05         36
Name: SURF_COND_CODE, dtype: int64

Do you feel that a Dry road condition should be the average of Wet and Snow?

# Nope

The answer to that should be no, which means we can't use this data as numeric data. We want a different coefficient for each of these - I want to know the impact of dry, the impact of wet, the impact of snow, all separately.

Start by replacing each code with a proper description. I'll even include them here:

00 - Not Applicable
01 - Wet
02 - Dry
03 - Snow
04 - Ice
05 - Mud, Dirt, Gravel
06 - Slush
07 - Water (standing/moving)
08 - Sand
09 - Oil
88 - Other
99 - Unknown

But watch out, pandas read the column in as numbers so they might have come through a little differently than their codes.

merged['SURF_COND_CODE'] = merged.SURF_COND_CODE.replace({
    0: 'Not Applicable',
    1: 'Wet',
    2: 'Dry',
    3: 'Snow',
    4: 'Ice',
    5: 'Mud, Dirt, Gravel',
    6: 'Slush',
    7: 'Water (standing/moving)',
    8: 'Sand',
    9: 'Oil',
    88: 'Other',
    99: 'Unknown',
})

Confirm you have 147,803 wet, and a few codes you can't understand, like 6.03 and 7.01.

merged.SURF_COND_CODE.value_counts()

Dry                  619569
Wet                  147803
Not Applicable        30921
Snow                   9418
Ice                    9111
Unknown                3616
6.03                   1697
Other                  1064
Mud, Dirt, Gravel       447
7.01                    258
9.88                    141
8.05                     36
Name: SURF_COND_CODE, dtype: int64

Replace the codes you don't understand with Other.

merged['SURF_COND_CODE'] = merged.SURF_COND_CODE.replace({
    6.03: 'Other',
    7.01: 'Other',
    9.88: 'Other',
    8.05: 'Other'
})

Confirm you have 3,196 'Other'.

merged.SURF_COND_CODE.value_counts()

Dry                  619569
Wet                  147803
Not Applicable        30921
Snow                   9418
Ice                    9111
Unknown                3616
Other                  3196
Mud, Dirt, Gravel       447
Name: SURF_COND_CODE, dtype: int64

One-hot encoding#

We're going to use pd.get_dummies to build a variable you'll call surf_dummies. Each surface condition should be a 0 or 1 as to whether it was that condition (dry, icy, wet, etc).

Use a prefix= so we know they are surface conditions.

You'll want to drop the column you'll use as the reference category.

Before we do this: which column works best as the reference?

# Dry

Now build your surf_dummies variable.

surf_dummies = pd.get_dummies(merged.SURF_COND_CODE, prefix='surface').drop(columns='surface_Dry')
surf_dummies.head()

	surface_Not Applicable	surface_Wet
0	0	0
1	0	0
2	1	0
3	1	0
4	0	1

Confirm your surf_dummies looks roughly like this:

surface_Not Applicable	...	surface_Wet
0	...	0
0	...	0
1	...	0
1	...	0
0	...	1

Another regression#

Let's run another regression to see the impact of both sex and surface condition on fatalities.

Build your `train_df`#

To build your train_df, I recommend doing it either of these two ways. They both first select the important columns, then add in the one-hot encoded surf_dummies columns.

train_df = pd.DataFrame({
    'col1': merged.col1,
    'col2': merged.col2,
    'col3': merged.col3,
})
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

or like this:

train_df = train_df[['col1','col2','col3']].copy()
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

The second one is shorter, but the first one makes it easier to use comments to remove columns later.

train_df = pd.DataFrame({
    'sex': merged.SEX_CODE,
    'had_fatality': merged.had_fatality
})
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()

Run your regression and check your odds ratios#

Actually no, wait, first - what kind of surface do you think will have the highest fatality rate?

# Wet

train_df.shape

(815843, 9)

Confirm your train_df has 815,843 rows and 9 columns.

Tip: When you run your regression, if you get an error about not knowing what to do with U, it's because you didn't convert your sex to numbers (or if you did, you didn't do it in your original dataframe)

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
2	surface_Mud, Dirt, Gravel	1.115150	3.0500
4	surface_Other	0.596230	1.8153
6	surface_Unknown	0.340200	1.4052
7	surface_Wet	-0.098825	0.9059
1	surface_Ice	-0.557708	0.5725
5	surface_Snow	-0.652480	0.5208
0	sex	-0.718133	0.4877
3	surface_Not Applicable	-0.730582	0.4816

Is this what you expected? Why do you think this result might be the case?

More features: Vehicles#

Maybe the car they're in is related to the car they were in. Luckily, we have this information - read in combined_vehicle_data as vehicles.

vehicles = pd.read_csv("data/combined-vehicle-data.csv")
vehicles.head(2)

	AREA_DAMAGED_CODE1	AREA_DAMAGED_CODE2	AREA_DAMAGED_CODE3	AREA_DAMAGED_CODE_IMP1	AREA_DAMAGED_CODE_MAIN	BODY_TYPE_CODE	COMMERCIAL_FLAG	CONTI_DIRECTION_CODE	CV_BODY_TYPE_CODE	DAMAGE_CODE	DRIVERLESS_FLAG	FIRE_FLAG	GOING_DIRECTION_CODE	GVW_CODE	HARM_EVENT_CODE	HAZMAT_SPILL_FLAG	HIT_AND_RUN_FLAG	HZM_NUM	MOVEMENT_CODE	NUM_AXLES	PARKED_FLAG	REPORT_NO	SPEED_LIMIT	TOWED_AWAY_FLAG	TOWED_VEHICLE_CONFIG_CODE	VEHICLE_ID	VEH_MAKE	VEH_MODEL	VEH_YEAR	VIN_NO
0	8.0	9.0	10.0	10.0	10.0	23.08	N	E	NaN	5	N	N	E	NaN	9.0	NaN	N	NaN	1.0	NaN	N	ADJ487004H	30	Y	0	000238fd-44fa-4cd5-8eb7-41ab30500bec	CHEVY	TAHOE	2005.0	1GNEK13Q2J285593
1	12.0	NaN	NaN	12.0	12.0	2.00	N	E	NaN	4	N	N	E	NaN	1.0	NaN	N	NaN	1.0	NaN	N	MCP2487000M	40	Y	0	00038116-1bf9-48cc-b317-4f4375d14b60	INFI	4S	2003.0	JNKCV51E63M013580

Weights of those cars#

The car weights are stored in another file since the info had to come from an API. I looked up the VINs - vehicle identification numbers - in a government database to try to get data for each of them.

Read them and build a new dataframe that is both the vehicle data along with their weights. You can call it vehicles since you don't need the original weightless vehicle data any more.

weights = pd.read_csv("data/vins_and_weights.csv")
weights.head(2)

	VIN	Make	Model	ModelYear	weight
0	2FMDA5143TBB45576	FORD	WINDSTAR	1996	3733.0
1	2G1WC5E37E1120089	CHEVROLET	IMPALA	2014	3618.0

vehicles = vehicles.merge(weights, left_on='VIN_NO', right_on='VIN')
vehicles.head(2)

	AREA_DAMAGED_CODE1	AREA_DAMAGED_CODE2	AREA_DAMAGED_CODE3	AREA_DAMAGED_CODE_IMP1	AREA_DAMAGED_CODE_MAIN	BODY_TYPE_CODE	COMMERCIAL_FLAG	CONTI_DIRECTION_CODE	CV_BODY_TYPE_CODE	DAMAGE_CODE	DRIVERLESS_FLAG	FIRE_FLAG	GOING_DIRECTION_CODE	GVW_CODE	HARM_EVENT_CODE	HAZMAT_SPILL_FLAG	HIT_AND_RUN_FLAG	HZM_NUM	MOVEMENT_CODE	NUM_AXLES	PARKED_FLAG	REPORT_NO	SPEED_LIMIT	TOWED_AWAY_FLAG	TOWED_VEHICLE_CONFIG_CODE	VEHICLE_ID	VEH_MAKE	VEH_MODEL	VEH_YEAR	VIN_NO	VIN	Make	Model	ModelYear	weight
0	8.0	9.0	10.0	10.0	10.0	23.08	N	E	NaN	5	N	N	E	NaN	9.0	NaN	N	NaN	1.0	NaN	N	ADJ487004H	30	Y	0	000238fd-44fa-4cd5-8eb7-41ab30500bec	CHEVY	TAHOE	2005.0	1GNEK13Q2J285593	1GNEK13Q2J285593	CHEVROLET	GMT-400	1988	4300.0
1	12.0	NaN	NaN	12.0	12.0	2.00	N	E	NaN	4	N	N	E	NaN	1.0	NaN	N	NaN	1.0	NaN	N	MCP2487000M	40	Y	0	00038116-1bf9-48cc-b317-4f4375d14b60	INFI	4S	2003.0	JNKCV51E63M013580	JNKCV51E63M013580	INFINITI	G35	2003	3468.0

Confirm that your combined vehicles dataset should have 534,436 rows and 35 columns. And yes, that's less than we were working with before - you haven't combined it with the people/crashes dataset yet.

vehicles.shape

(534436, 35)

Filter your data#

We only want vehicles that are "normal" - somewhere between 1500 and 6000 pounds. Filter your vehicles to only include those in that weight range.

vehicles = vehicles[(vehicles.weight >= 1500) & (vehicles.weight <= 6000)]

Confirm that you have 532,370 vehicles in the dataset.

vehicles.shape

(532370, 35)

Add this vehicle information to your merged data#

Now we'll have a dataframe that contains information on:

The people themselves and their injuries
The crash
The vehicles

Every person came with a VEHICLE_ID column that is the vehicle they were in. You'll want to merge on that.

merged = merged.merge(vehicles, on='VEHICLE_ID')

Confirm you have 99 columns and 616,212 rows. That is a lot of possible features!

merged.shape

(616212, 99)

Another regression, because we can't get enough#

Build another train_df and run another regression about how car weight impacts the chance of fatalities. You'll want to confirm that your dataset has 616,212 and 2 columns.

train_df = merged[['weight', 'had_fatality']]
train_df = train_df.dropna()

train_df.shape

(616212, 2)

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
0	weight	-0.000147	0.9999

Can you translate that into plain English? Remember weight is in pounds.

I feel like pounds isn't the best measure for something like this. Remember how we had to adjust percentages with AP and life expectancy, and then change around the way we said things? It sounded like this:

Every 10% increase in unemployment translates to a year and a half loss of life expectancy

Instead of every single pound, maybe we could do every... some other number of pounds? One hundred? One thousand?

Run another regression with weight in thousands of pounds. Get another odds ratio. Give me another sentence English.

train_df = merged[['weight', 'had_fatality']]
train_df['weight'] = train_df['weight'] / 1000
train_df = train_df.dropna()

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
0	weight	-0.146881	0.8634

# Every thousand pounds heavier your car is increase translates to a 15% decrease in fatalities

Two-car accidents, struck and striker#

Here's the thing, though: it isn't just the weight of your car. It's the weight of both cars! If I'm in a big car and I have a wreck with a smaller car, it's the smaller car that's in trouble.

To get that value, we need to do some feature engineering, some calculating of new variables from our existing variables.

We need to jump through some hoops to do that.

Two-car accidents#

First we're going to count how many vehicles were in each accident. Since we're looking to compare the weight of two cars hitting each other, we're only going to want crashes with only two cars.

counted = vehicles.REPORT_NO.value_counts()
counted.head(10)

CB53480003      28
MDTA1229000H    26
AS0286000B      14
CE4636002C      14
MSP6063006C     13
MSP67460095     13
AE5607002W      12
CC0237003W      12
MSP63070049     12
AC1811000Q      11
Name: REPORT_NO, dtype: int64

By using .value_counts I can see how many cars were in each crash, and now I'm going to filter to get a list of all of the ones with two vehicles.

two_car_report_nos = counted[counted == 2].index
two_car_report_nos

Index(['MSP5688001T', 'ZW0166000H', 'AC2154000Z', 'AE5447007F', 'ADI489000B',
       'AE5364000K', 'MSP59510031', 'CD0633001C', 'AE5633000H', 'MCP9386003X',
       ...
       'ADI452002W', 'CB5541000B', 'GS04490005', 'AE40860053', 'AC12880051',
       'AC1695001G', 'CE39090017', 'MCP27830021', 'AC19440016', 'DA3965002D'],
      dtype='object', length=137214)

And now we'll filter my vehicles so we only have those that were in two-vehicle crashes.

vehicles = vehicles[vehicles.REPORT_NO.isin(two_car_report_nos)]

Struck and striker#

To do the math correctly, we need both the risk of someone dying in the smaller car and the risk of someone dying in the bigger car. To do this we need to separate our cars into two groups:

The 'struck' vehicle: did the person die inside?
The 'striker' vehicle: how much heavier was it than the struck car?

But we don't know which car was which, so we have to try out both versions - pretending car A was the striker, then pretending car B was the striker. It's hard to explain, but you can read Pounds That Kill - The External Costs of Vehicle Weight.pdf for more details on how it works.

cars_1 = vehicles.drop_duplicates(subset='REPORT_NO', keep='first')
cars_2 = vehicles.drop_duplicates(subset='REPORT_NO', keep='last')

cars_merged_1 = cars_1.merge(cars_2, on='REPORT_NO', suffixes=['_striker', '_struck'])
cars_merged_2 = cars_2.merge(cars_1, on='REPORT_NO', suffixes=['_striker', '_struck'])
vehicles_complete = pd.concat([cars_merged_1, cars_merged_2])
vehicles_complete.head()

	AREA_DAMAGED_CODE1_striker	AREA_DAMAGED_CODE2_striker	AREA_DAMAGED_CODE3_striker	AREA_DAMAGED_CODE_IMP1_striker	AREA_DAMAGED_CODE_MAIN_striker	BODY_TYPE_CODE_striker	COMMERCIAL_FLAG_striker	CONTI_DIRECTION_CODE_striker	CV_BODY_TYPE_CODE_striker	DAMAGE_CODE_striker	DRIVERLESS_FLAG_striker	FIRE_FLAG_striker	GOING_DIRECTION_CODE_striker	GVW_CODE_striker	HARM_EVENT_CODE_striker	HAZMAT_SPILL_FLAG_striker	HIT_AND_RUN_FLAG_striker	HZM_NUM_striker	MOVEMENT_CODE_striker	NUM_AXLES_striker	PARKED_FLAG_striker	REPORT_NO	SPEED_LIMIT_striker	TOWED_AWAY_FLAG_striker	VEHICLE_ID_striker	VEH_MAKE_striker	VEH_MODEL_striker	VEH_YEAR_striker	VIN_NO_striker	VIN_striker	Make_striker	Model_striker	ModelYear_striker	weight_striker	AREA_DAMAGED_CODE1_struck	AREA_DAMAGED_CODE2_struck	AREA_DAMAGED_CODE3_struck	AREA_DAMAGED_CODE_IMP1_struck	AREA_DAMAGED_CODE_MAIN_struck	BODY_TYPE_CODE_struck	COMMERCIAL_FLAG_struck	CONTI_DIRECTION_CODE_struck	CV_BODY_TYPE_CODE_struck	DAMAGE_CODE_struck	DRIVERLESS_FLAG_struck	FIRE_FLAG_struck	GOING_DIRECTION_CODE_struck	GVW_CODE_struck	HARM_EVENT_CODE_struck	HAZMAT_SPILL_FLAG_struck	HIT_AND_RUN_FLAG_struck	HZM_NUM_struck	MOVEMENT_CODE_struck	NUM_AXLES_struck	PARKED_FLAG_struck	SPEED_LIMIT_struck	TOWED_AWAY_FLAG_struck	VEHICLE_ID_struck	VEH_MAKE_struck	VEH_MODEL_struck	VEH_YEAR_struck	VIN_NO_struck	VIN_struck	Make_struck	Model_struck	ModelYear_struck	weight_struck
0	6.0	7.0	NaN	6.0	6.0	20.0	N	N	NaN	3	N	N	N	NaN	1.0	NaN	N	NaN	6.0	NaN	N	CB5190006B	55	N	0003b659-2785-4868-8877-0b786a284827	TOYT	TK	2011.0	5TFUY5F1XBX167340	5TFUY5F1XBX167340	TOYOTA	TUNDRA	2011	5480.0	3.0	5.0	6.0	6.0	6.0	23.08	N	N	NaN	5	N	N	N	NaN	1.0	NaN	N	NaN	6.0	NaN	N	55	Y	0e7da302-5f53-45c8-8b28-393cdc38171d	JEEP	CHEROKEE	2015.0	1C4RJFBGXFC751955	1C4RJFBGXFC751955	JEEP	GRAND CHEROKEE	2015	4545.0
1	8.0	NaN	NaN	8.0	8.0	2.0	N	W	NaN	2	N	NaN	W	NaN	1.0	NaN	N	NaN	8.0	NaN	N	ADJ4590035	5	N	00050484-d08f-4b6e-bc7e-9ec270e94660	HONDA	CIVIC	2015.0	2HGFG4A59FH702545	2HGFG4A59FH702545	HONDA	CIVIC	2015	2754.0	99.0	NaN	NaN	99.0	99.0	2.00	N	W	NaN	99	N	N	W	NaN	1.0	NaN	Y	NaN	1.0	NaN	N	5	N	3ebfe7fc-4832-414d-a5e8-391b65a2f280	HONDA	ACCORD	2013.0	1HGCT2B81DA005695	1HGCT2B81DA005695	HONDA	ACCORD	2013	3163.0
2	12.0	NaN	NaN	12.0	12.0	2.0	N	N	NaN	4	N	NaN	N	NaN	0.0	NaN	N	NaN	1.0	NaN	N	ADJ849000Z	10	N	00057af4-d848-4cee-b854-707f57581f4e	HONDA	ACCORD	2003.0	1HGCM66313A037175	1HGCM66313A037175	HONDA	ACCORD	2003	3023.0	12.0	NaN	NaN	12.0	12.0	2.00	N	NaN	NaN	3	Y	NaN	NaN	NaN	1.0	NaN	N	NaN	10.0	NaN	Y	0	N	4cf496cc-f537-447e-bb8e-3ef48c8c3ddf	HONDA	TK	2010.0	5J6RE3H73AL008454	5J6RE3H73AL008454	HONDA	CR-V	2010	3389.0
3	1.0	11.0	12.0	12.0	12.0	2.0	N	S	NaN	4	N	N	S	NaN	1.0	NaN	N	NaN	12.0	NaN	N	AE5207008Z	30	Y	00089d4a-7038-4693-9e02-b402676631af	FORD	4D	2016.0	1FADP3K28GL258987	1FADP3K28GL258987	FORD	FOCUS	2016	2932.0	11.0	12.0	NaN	12.0	12.0	23.08	N	N	NaN	4	N	N	N	NaN	1.0	NaN	N	NaN	1.0	NaN	N	30	Y	1e69b284-719f-429b-baae-6ae4976a546b	BUICK	TK	2002.0	3G5DB03E52S528316	3G5DB03E52S528316	BUICK	RENDEZVOUS	2002	4024.0
4	2.0	NaN	NaN	2.0	2.0	2.0	N	S	NaN	99	N	N	S	NaN	1.0	NaN	Y	NaN	1.0	NaN	N	ADJ859000Y	15	N	0010076b-0a45-45f3-8796-f387b39cd85d	HONDA	CIVIC	2009.0	2HGFA16689H357624	2HGFA16689H357624	HONDA	CIVIC	2009	2678.0	10.0	NaN	NaN	10.0	10.0	2.00	N	NaN	NaN	3	Y	N	NaN	NaN	1.0	NaN	N	NaN	10.0	NaN	Y	15	N	b14273cb-b3e5-42c7-af59-df069e8c2f0f	HONDA	CIVIC	2011.0	2HGFG1B87BH517469	2HGFG1B87BH517469	HONDA	CIVIC	2011	2703.0

Put people in their cars#

Which car was each person in? We'll assign that now.

merged = people.merge(vehicles_complete, left_on='VEHICLE_ID', right_on='VEHICLE_ID_struck')
merged.head(3)

	AIRBAG_DEPLOYED	ALCOHOL_TESTTYPE_CODE	BAC_CODE	CDL_FLAG	CLASS	DATE_OF_BIRTH	DRUG_TESTRESULT_CODE	EJECT_CODE	EMS_UNIT_LABEL	EQUIP_PROB_CODE	FAULT_FLAG	INJ_SEVER_CODE	LICENSE_STATE_CODE	MOVEMENT_CODE	OCC_SEAT_POS_CODE	PED_LOCATION_CODE	PED_OBEY_CODE	PED_TYPE_CODE	PED_VISIBLE_CODE	PERSON_ID	PERSON_TYPE	REPORT_NO_x	SAF_EQUIP_CODE	SEX_CODE	VEHICLE_ID	AREA_DAMAGED_CODE1_striker	AREA_DAMAGED_CODE2_striker	AREA_DAMAGED_CODE3_striker	AREA_DAMAGED_CODE_IMP1_striker	AREA_DAMAGED_CODE_MAIN_striker	BODY_TYPE_CODE_striker	COMMERCIAL_FLAG_striker	CONTI_DIRECTION_CODE_striker	CV_BODY_TYPE_CODE_striker	DAMAGE_CODE_striker	DRIVERLESS_FLAG_striker	FIRE_FLAG_striker	GOING_DIRECTION_CODE_striker	GVW_CODE_striker	HARM_EVENT_CODE_striker	HAZMAT_SPILL_FLAG_striker	HIT_AND_RUN_FLAG_striker	HZM_NUM_striker	MOVEMENT_CODE_striker	NUM_AXLES_striker	PARKED_FLAG_striker	REPORT_NO_y	SPEED_LIMIT_striker	TOWED_AWAY_FLAG_striker	VEHICLE_ID_striker	VEH_MAKE_striker	VEH_MODEL_striker	VEH_YEAR_striker	VIN_NO_striker	VIN_striker	Make_striker	Model_striker	ModelYear_striker	weight_striker	AREA_DAMAGED_CODE1_struck	AREA_DAMAGED_CODE2_struck	AREA_DAMAGED_CODE3_struck	AREA_DAMAGED_CODE_IMP1_struck	AREA_DAMAGED_CODE_MAIN_struck	BODY_TYPE_CODE_struck	COMMERCIAL_FLAG_struck	CONTI_DIRECTION_CODE_struck	CV_BODY_TYPE_CODE_struck	DAMAGE_CODE_struck	DRIVERLESS_FLAG_struck	FIRE_FLAG_struck	GOING_DIRECTION_CODE_struck	GVW_CODE_struck	HARM_EVENT_CODE_struck	HAZMAT_SPILL_FLAG_struck	HIT_AND_RUN_FLAG_struck	HZM_NUM_struck	MOVEMENT_CODE_struck	NUM_AXLES_struck	PARKED_FLAG_struck	SPEED_LIMIT_struck	TOWED_AWAY_FLAG_struck	VEHICLE_ID_struck	VEH_MAKE_struck	VEH_MODEL_struck	VEH_YEAR_struck	VIN_NO_struck	VIN_struck	Make_struck	Model_struck	ModelYear_struck	weight_struck
0	1.0	NaN	NaN	N	NaN	1985-05-28 00:00:00	NaN	0.0	NaN	0.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	166296bd-ffd3-4c16-aa74-4f4bf4139d8d	D	AB4313000X	13.0	1.0	b463eb20-2f01-4200-9d6f-b18888ce2593	3.0	NaN	NaN	3.0	3.0	2.0	N	E	NaN	2	N	NaN	E	NaN	1.0	NaN	Y	NaN	2.0	NaN	N	AB4313000X	25	N	3c8629d0-d524-47c1-bfbc-b18e07f3087e	LEXU	TK	2004.0	2T2HA31U24C031048	2T2HA31U24C031048	LEXUS	RX	2004	3900.0	7.0	NaN	NaN	7.0	7.0	2.00	N	S	NaN	2	N	N	S	NaN	1.0	NaN	N	NaN	6.0	NaN	N	25	N	b463eb20-2f01-4200-9d6f-b18888ce2593	CHEVY	TK	2007.0	2CNDL13F576252855	2CNDL13F576252855	CHEVROLET	EQUINOX	2007	3776.0
1	1.0	NaN	NaN	N	C	1960-10-04 00:00:00	NaN	0.0	NaN	0.0	Y	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	f3b2743f-fbc3-4345-9419-56a0ca29102c	D	AB4313000X	13.0	1.0	3c8629d0-d524-47c1-bfbc-b18e07f3087e	7.0	NaN	NaN	7.0	7.0	2.0	N	S	NaN	2	N	N	S	NaN	1.0	NaN	N	NaN	6.0	NaN	N	AB4313000X	25	N	b463eb20-2f01-4200-9d6f-b18888ce2593	CHEVY	TK	2007.0	2CNDL13F576252855	2CNDL13F576252855	CHEVROLET	EQUINOX	2007	3776.0	3.0	NaN	NaN	3.0	3.0	2.00	N	E	NaN	2	N	NaN	E	NaN	1.0	NaN	Y	NaN	2.0	NaN	N	25	N	3c8629d0-d524-47c1-bfbc-b18e07f3087e	LEXU	TK	2004.0	2T2HA31U24C031048	2T2HA31U24C031048	LEXUS	RX	2004	3900.0
2	1.0	NaN	NaN	N	C	1990-11-05 00:00:00	NaN	1.0	NaN	1.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	21fff0f9-ed13-4b87-ac04-29d018aff9d4	D	AB5218001Y	13.0	1.0	4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f	10.0	11.0	12.0	11.0	11.0	2.0	N	N	NaN	3	N	N	N	NaN	1.0	NaN	N	NaN	1.0	NaN	N	AB5218001Y	25	N	057a3e63-3009-4d07-b956-78f53a0ca992	LINC	TOWNCAR	2003.0	1LNHM82W53Y644084	1LNHM82W53Y644084	LINCOLN	TOWN CAR	2003	4000.0	9.0	10.0	11.0	11.0	11.0	23.08	N	S	NaN	3	N	N	W	NaN	1.0	NaN	N	NaN	12.0	NaN	N	25	N	4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f	HYUN	SW	2010.0	5NMSG3AB9AH400386	5NMSG3AB9AH400386	HYUNDAI	SANTA FE	2010	3900.0

Add the crash details#

You did this already! I'm going to do it for you. We're merging on REPORT_NO_x because there are so many REPORT_NO columns duplicated across our files that pandas started giving them weird names.

merged = merged.merge(crashes, left_on='REPORT_NO_x', right_on='REPORT_NO')
merged.head(3)

	AIRBAG_DEPLOYED	ALCOHOL_TESTTYPE_CODE	BAC_CODE	CDL_FLAG	CLASS	DATE_OF_BIRTH	DRUG_TESTRESULT_CODE	EJECT_CODE	EMS_UNIT_LABEL	EQUIP_PROB_CODE	FAULT_FLAG	INJ_SEVER_CODE	LICENSE_STATE_CODE	MOVEMENT_CODE	OCC_SEAT_POS_CODE	PED_LOCATION_CODE	PED_OBEY_CODE	PED_TYPE_CODE	PED_VISIBLE_CODE	PERSON_ID	PERSON_TYPE	REPORT_NO_x	SAF_EQUIP_CODE	SEX_CODE	VEHICLE_ID	AREA_DAMAGED_CODE1_striker	AREA_DAMAGED_CODE2_striker	AREA_DAMAGED_CODE3_striker	AREA_DAMAGED_CODE_IMP1_striker	AREA_DAMAGED_CODE_MAIN_striker	BODY_TYPE_CODE_striker	COMMERCIAL_FLAG_striker	CONTI_DIRECTION_CODE_striker	CV_BODY_TYPE_CODE_striker	DAMAGE_CODE_striker	DRIVERLESS_FLAG_striker	FIRE_FLAG_striker	GOING_DIRECTION_CODE_striker	GVW_CODE_striker	HARM_EVENT_CODE_striker	HAZMAT_SPILL_FLAG_striker	HIT_AND_RUN_FLAG_striker	HZM_NUM_striker	MOVEMENT_CODE_striker	NUM_AXLES_striker	PARKED_FLAG_striker	REPORT_NO_y	SPEED_LIMIT_striker	TOWED_AWAY_FLAG_striker	VEHICLE_ID_striker	VEH_MAKE_striker	VEH_MODEL_striker	VEH_YEAR_striker	VIN_NO_striker	VIN_striker	Make_striker	Model_striker	ModelYear_striker	weight_striker	AREA_DAMAGED_CODE1_struck	AREA_DAMAGED_CODE2_struck	AREA_DAMAGED_CODE3_struck	AREA_DAMAGED_CODE_IMP1_struck	AREA_DAMAGED_CODE_MAIN_struck	BODY_TYPE_CODE_struck	COMMERCIAL_FLAG_struck	CONTI_DIRECTION_CODE_struck	CV_BODY_TYPE_CODE_struck	DAMAGE_CODE_struck	DRIVERLESS_FLAG_struck	FIRE_FLAG_struck	GOING_DIRECTION_CODE_struck	GVW_CODE_struck	HARM_EVENT_CODE_struck	HAZMAT_SPILL_FLAG_struck	HIT_AND_RUN_FLAG_struck	HZM_NUM_struck	MOVEMENT_CODE_struck	NUM_AXLES_struck	PARKED_FLAG_struck	SPEED_LIMIT_struck	TOWED_AWAY_FLAG_struck	VEHICLE_ID_struck	VEH_MAKE_struck	VEH_MODEL_struck	VEH_YEAR_struck	VIN_NO_struck	VIN_struck	Make_struck	Model_struck	ModelYear_struck	weight_struck	ACC_DATE	ACC_TIME	AGENCY_CODE	AREA_CODE	COLLISION_TYPE_CODE	COUNTY_NO	C_M_ZONE_FLAG	DISTANCE	DISTANCE_DIR_FLAG	FEET_MILES_FLAG	HARM_EVENT_CODE1	LANE_CODE	LATITUDE	LIGHT_CODE	LOC_CODE	LOGMILE_DIR_FLAG	LOG_MILE	LONGITUDE	MAINROAD_NAME	MUNI_CODE	RD_COND_CODE	RD_DIV_CODE	REFERENCE_NO	REFERENCE_ROAD_NAME	REFERENCE_SUFFIX	REFERENCE_TYPE_CODE	REPORT_NO	REPORT_TYPE	ROUTE_TYPE_CODE	RTE_NO	RTE_SUFFIX	SIGNAL_FLAG	SURF_COND_CODE	WEATHER_CODE
0	1.0	NaN	NaN	N	NaN	1985-05-28 00:00:00	NaN	0.0	NaN	0.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	166296bd-ffd3-4c16-aa74-4f4bf4139d8d	D	AB4313000X	13.0	1.0	b463eb20-2f01-4200-9d6f-b18888ce2593	3.0	NaN	NaN	3.0	3.0	2.0	N	E	NaN	2	N	NaN	E	NaN	1.0	NaN	Y	NaN	2.0	NaN	N	AB4313000X	25	N	3c8629d0-d524-47c1-bfbc-b18e07f3087e	LEXU	TK	2004.0	2T2HA31U24C031048	2T2HA31U24C031048	LEXUS	RX	2004	3900.0	7.0	NaN	NaN	7.0	7.0	2.00	N	S	NaN	2	N	N	S	NaN	1.0	NaN	N	NaN	6.0	NaN	N	25	N	b463eb20-2f01-4200-9d6f-b18888ce2593	CHEVY	TK	2007.0	2CNDL13F576252855	2CNDL13F576252855	CHEVROLET	EQUINOX	2007	3776.0	2018-04-19 00:00:00	15:50:00	ANNAPOLIS	UNK	3	2.0	N	0.0	S	F	1.0	1.0	38.976882	1.0	NaN	N	0.01	-76.486787	COMPROMISE ST	3.0	0.0	1.0	960.0	DUKE OF GLOUCHESTER ST	NaN	MU	AB4313000X	Property Damage Crash	MU	740.0	NaN	N	0.0	7.01
1	1.0	NaN	NaN	N	C	1960-10-04 00:00:00	NaN	0.0	NaN	0.0	Y	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	f3b2743f-fbc3-4345-9419-56a0ca29102c	D	AB4313000X	13.0	1.0	3c8629d0-d524-47c1-bfbc-b18e07f3087e	7.0	NaN	NaN	7.0	7.0	2.0	N	S	NaN	2	N	N	S	NaN	1.0	NaN	N	NaN	6.0	NaN	N	AB4313000X	25	N	b463eb20-2f01-4200-9d6f-b18888ce2593	CHEVY	TK	2007.0	2CNDL13F576252855	2CNDL13F576252855	CHEVROLET	EQUINOX	2007	3776.0	3.0	NaN	NaN	3.0	3.0	2.00	N	E	NaN	2	N	NaN	E	NaN	1.0	NaN	Y	NaN	2.0	NaN	N	25	N	3c8629d0-d524-47c1-bfbc-b18e07f3087e	LEXU	TK	2004.0	2T2HA31U24C031048	2T2HA31U24C031048	LEXUS	RX	2004	3900.0	2018-04-19 00:00:00	15:50:00	ANNAPOLIS	UNK	3	2.0	N	0.0	S	F	1.0	1.0	38.976882	1.0	NaN	N	0.01	-76.486787	COMPROMISE ST	3.0	0.0	1.0	960.0	DUKE OF GLOUCHESTER ST	NaN	MU	AB4313000X	Property Damage Crash	MU	740.0	NaN	N	0.0	7.01
2	1.0	NaN	NaN	N	C	1990-11-05 00:00:00	NaN	1.0	NaN	1.0	N	1	MD	NaN	NaN	NaN	NaN	NaN	NaN	21fff0f9-ed13-4b87-ac04-29d018aff9d4	D	AB5218001Y	13.0	1.0	4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f	10.0	11.0	12.0	11.0	11.0	2.0	N	N	NaN	3	N	N	N	NaN	1.0	NaN	N	NaN	1.0	NaN	N	AB5218001Y	25	N	057a3e63-3009-4d07-b956-78f53a0ca992	LINC	TOWNCAR	2003.0	1LNHM82W53Y644084	1LNHM82W53Y644084	LINCOLN	TOWN CAR	2003	4000.0	9.0	10.0	11.0	11.0	11.0	23.08	N	S	NaN	3	N	N	W	NaN	1.0	NaN	N	NaN	12.0	NaN	N	25	N	4dea42c3-e02c-4c6b-8ea0-c2f8c17f147f	HYUN	SW	2010.0	5NMSG3AB9AH400386	5NMSG3AB9AH400386	HYUNDAI	SANTA FE	2010	3900.0	2018-06-08 00:00:00	16:44:00	ANNAPOLIS	UNK	11	2.0	N	30.0	N	F	1.0	1.0	38.978495	1.0	NaN	N	0.00	-76.496738	W WASHINGTON ST	3.0	1.0	1.0	450.0	WEST ST	NaN	MD	AB5218001Y	Property Damage Crash	MU	3380.0	NaN	N	2.0	6.01

Filter#

We already filtered out vehicles by weight, so we don't have to do that again.

Calculated features#

I'm sure you forgot what all the features are, so we'll bring back whether there was a fatality or not

Feature: Accident was fatal#

merged['had_fatality'] = (merged.INJ_SEVER_CODE == 5).astype(int)
merged.had_fatality.value_counts()

0    334152
1       244
Name: had_fatality, dtype: int64

Feature: Weight difference#

Remove everything missing weights for strikers or struck vehicles. You might need to merged.columns to remind yourself what the column names are.

merged = merged.dropna(subset=['weight_striker', 'weight_struck'])

Confirm your dataset has roughly 335,000 rows.

merged.shape

(334396, 135)

Create a new feature called weight_diff about how much heavier the striking car was compared to the struck car. Make sure you've done the math correctly!

merged['weight_diff'] = merged['weight_striker'] - merged['weight_struck']

Feature adjustment#

Make all of your weight columns in thousands of pounds instead of just in pounds. It'll help you interpret your results much better.

merged.weight_striker = merged.weight_striker / 1000
merged.weight_struck = merged.weight_struck / 1000
merged.weight_diff = merged.weight_diff / 1000
merged[['weight_striker', 'weight_struck', 'weight_diff']].describe()

	weight_striker	weight_struck	weight_diff
count	334396.000000	334396.000000	334396.000000
mean	3.608778	3.629094	-0.020316
std	0.758489	0.769407	1.077867
min	1.701000	1.701000	-3.916000
25%	3.087000	3.093000	-0.710000
50%	3.450000	3.450000	-0.009000
75%	4.057000	4.096000	0.668000
max	5.985000	5.985000	3.916000

Another regression!!!#

What is the impact of weight difference on fatality rate? Create your train_df, drop missing values, run your regression, analyze your odds ratios.

train_df = merged[['weight_diff', 'had_fatality']].copy()
train_df = train_df.dropna()

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
0	weight_diff	0.475534	1.6089

Please translate your odds ratio into plain English.

# For every 1000 lbs heavier the striking car is, you have a 60% greater chance of dying

Adding in more features#

How about speed limit? That's important, right? We can add the speed limit of the striking vehicle with SPEED_LIMIT_striker.

train_df = merged[['weight_diff', 'had_fatality', 'SPEED_LIMIT_striker']].copy()
train_df = train_df.dropna()

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
0	weight_diff	0.465337	1.5926
1	SPEED_LIMIT_striker	0.054314	1.0558

Can you translate the speed limit odds ratio into plain English?

Feature engineering: Speed limits#

Honestly, that's a pretty bad way to go about things. What's more fun is if we translate speed limits into bins.

First, we'll use pd.cut to assign each speed limit a category.

speed_bins = [-np.inf, 10, 20, 30, 40, 50, np.inf]
merged['speed_bin'] = pd.cut(merged.SPEED_LIMIT_struck, bins=speed_bins)
merged[['SPEED_LIMIT_striker', 'speed_bin']].head(10)

	SPEED_LIMIT_striker	speed_bin
0	25	(20.0, 30.0]
1	25	(20.0, 30.0]
2	25	(20.0, 30.0]
3	25	(20.0, 30.0]
4	25	(20.0, 30.0]
5	5	(-inf, 10.0]
6	45	(40.0, 50.0]
7	45	(40.0, 50.0]
8	45	(40.0, 50.0]
9	45	(40.0, 50.0]

Then we'll one-hot encode around 20-30mph speed limits.

speed_dummies = pd.get_dummies(merged.speed_bin, 
                               prefix='speed').drop('speed_(20.0, 30.0]', axis=1)
speed_dummies.head()

	speed_(-inf, 10.0]	speed_(10.0, 20.0]	speed_(30.0, 40.0]	speed_(40.0, 50.0]	speed_(50.0, inf]
0	0	0	0	0	0
1	0	0	0	0	0
2	0	0	0	0	0
3	0	0	0	0	0
4	0	0	0	0	0

Running a regression#

I like this layout for creating train_df, it allows us to easily add dummies and do a little replacing/encoding when we're building binary features like for sex.

If the below gives you an error, it's because SEX_CODE is already a number. In that case, just remove .replace({'M': 1, 'F': 0, 'U': np.nan }).

train_df = pd.DataFrame({
    'weight_diff': merged.weight_diff,
    'sex': merged.SEX_CODE,#.replace({'M': 1, 'F': 0, 'U': np.nan }),
    'had_fatality': merged.had_fatality,
})
train_df = train_df.join(speed_dummies)
train_df = train_df.join(surf_dummies)
train_df = train_df.dropna()
train_df.head()

	weight_diff	sex	surface_Not Applicable	surface_Wet
0	0.124	1.0	0	0
1	-0.124	1.0	0	0
2	0.100	1.0	1	0
3	-0.100	1.0	1	0
4	-0.100	0.0	0	1

X = train_df.drop(columns='had_fatality')
y = train_df.had_fatality

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)

	feature	coefficient (log odds ratio)	odds ratio
5	speed_(40.0, 50.0]	1.713997	5.5511
6	speed_(50.0, inf]	1.405104	4.0760
7	surface_Ice	0.647203	1.9102
4	speed_(30.0, 40.0]	0.630550	1.8786
0	weight_diff	0.469332	1.5989
10	surface_Other	0.357723	1.4301
9	surface_Not Applicable	0.235205	1.2652
13	surface_Wet	-0.130009	0.8781
11	surface_Snow	-0.170967	0.8428
1	sex	-0.198966	0.8196
2	speed_(-inf, 10.0]	-1.017386	0.3615
8	surface_Mud, Dirt, Gravel	-7.091968	0.0008
3	speed_(10.0, 20.0]	-29.768925	0.0000
12	surface_Unknown	-22.617279	0.0000

Describe the impact of the different variables in simple language. What has the largest impact?

Now you pick the features#

Up above you have examples of:

Creating features from numbers (speed limits)
Creating features from 0/1 (sex)
Creating features from binning numbers that are one-hot encoded (speed limit bins - speed_bins)
Creating features from categories that are one-hot encoded (surface - surf_dummies

What else do you think matters? Try to plug in more features and see if you can get anything interesting.

Hot tip: The funniest/most interesting thing feature you can add is also the dumbest. Ask me about it in #algorithms if you end up getting down here.

Understanding feature selection and feature engineering when performing regressions#

Reading in our data#

Every single person in a crash#

Working on Features#

Starting our analysis#

Run your regression#

Use words to interpret this result#

Adding more features#

Examining more possible features#

One-hot encoding#

Another regression#

Build your `train_df`#

Run your regression and check your odds ratios#

More features: Vehicles#

Weights of those cars#

Filter your data#

Add this vehicle information to your merged data#

Another regression, because we can't get enough#

Two-car accidents, struck and striker#

Two-car accidents#

Struck and striker#

Put people in their cars#

Add the crash details#

Filter#

Calculated features#

Feature: Accident was fatal#

Feature: Weight difference#

Feature adjustment#

Another regression!!!#

Adding in more features#

Feature engineering: Speed limits#

Running a regression#

Now you pick the features#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects

Understanding feature selection and feature engineering when performing regressions#

Reading in our data#

Every single person in a crash#

Working on Features#

Starting our analysis#

Run your regression#

Use words to interpret this result#

Adding more features#

Examining more possible features#

One-hot encoding#

Another regression#

Build your train_df#

Run your regression and check your odds ratios#

More features: Vehicles#

Weights of those cars#

Filter your data#

Add this vehicle information to your merged data#

Another regression, because we can't get enough#

Two-car accidents, struck and striker#

Two-car accidents#

Struck and striker#

Put people in their cars#

Add the crash details#

Filter#

Calculated features#

Feature: Accident was fatal#

Feature: Weight difference#

Feature adjustment#

Another regression!!!#

Adding in more features#

Feature engineering: Speed limits#

Running a regression#

Now you pick the features#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects

Build your `train_df`#