Tickets vs warnings with the Boston Globe#
In 2003 and 2004, the Boston Globe published a series called Speed Trap: Who Gets a Ticket, Who Gets a Break?. Using speeding data from Massachusetts, we'll reproduce some segments of this analysis using logistic regression.
This piece was brought to my attention by Jonathan Stray, who put together the notebook this section is based on. As to what's special about this project, let's just go right to the source:
Why such old data? Well, for one thing it's a classic example of multi-variable regression in journalism. But also, for a brief period in April and May 2001, the state government entered both tickets and warnings from the paper citations. In Massachusetts at that time, both were written on the same form, with only a checkbox indicating which it was. A ticket means a fine and raises your insurance premium. A warning means nothing. Having data on both means we can ask who gets a ticket vs. a warning under the same circumstances.
Generally ticketing data just says who got a ticket, full stop. The fact that we have a yes/no column - ticketed/warned - makes this dataset ripe for logistic regression.
We can find The Boston Globe's full methodology here. It's a great document, full of tables and details and even a link at the bottom to download the original dataset (the link is broken now, but hey, it's over 15 years later).
Using this large dataset, the Globe attempted to control for different variables by finding drivers who fit into the same buckets, then comparing their outcomes:
The Globe relied primarily on comparisons of carefully controlled groups of drivers. For example, how many drivers who lived in a town received a ticket, and how many received a warning, when speeding at the same speed, in the same speed zone?
Not fully satisfied with that approach, they also reached out to a statistician for further analysis. She performed a logistic regression, same as what we've been doing:
To be confident that no other factor, or combination of factors, recorded on the citations accounted for the differences (such as age, sex, time of day), the Globe asked a professor of statistics at Babson College, Elaine I. Allen, to look at the database. She found that the differences of race, sex and age were statistically significant, controlling for all other factors. The report on her analysis, using a statistical technique called logistic regression, is also printed in full below.
Let's get to work!
Our data#
We'll pull in pandas, statsmodels for the regression, and numpy for computing odds ratios.
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
Then we'll pull in our data. Unknown data is coded as U, so we'll just pass that to na_values to be treated as missing data.
# Unknown data is always coded as 'U', so we'll pass that to na_values
df = pd.read_csv('data/tickets-warnings.csv', na_values='U')
df.head()
TYPE is the column for whether they received a ticket or a warning - T vs W - and most of the other columns are self-explanatory.
It's a lot lot lot of data, but we're only interested in a few columns. Let's take a look at the output of the Globe's regression to see what we should keep.
Along with the columns needed for our regression, we're also going to keep MPH for a little trick later on.
df = df[['TYPE','AGENCY3','SEX','BLACK','ASIAN','HISPANIC','MINORITY','AGE','MPH','MPHOVER','INTOWN', 'INSTATE', 'DAYNIGHT']].copy()
df.head()
Definitely looks cleaner!
Non-logistic analysis#
When you first start investigating the dataset, you probably (hopefully?) won't leap right to logistic regression. It's useful to poke around a little bit, see what's going on, check whether anything seems wrong or if you can find any interesting leads.
Let's say we have a suspicion that race plays a role in whether someone is ticketed as opposed to just getting a warning. We might start by grouping by MINORITY and seeing how many people fall into each ticketing category.
pd.crosstab(df.MINORITY, df.TYPE)
Unfortunately we're terrible at reading numbers and we just love graphics, so we're going to throw a .plot on that code to turn it into a chart.
pd.crosstab(df.MINORITY, df.TYPE).plot(kind='bar')
Looks like minorities have fewer tickets and fewer warnings! But that's not too interesting, since there are probably fewer minorities overall. We're probably more interested in the percentage of stops that become a ticket compared to those that are just warnings.
pd.crosstab(df.MINORITY, df.TYPE, normalize='index')
pd.crosstab(df.MINORITY, df.TYPE, normalize='index').plot(kind='bar')
Over half of minorities stopped receive tickets instead of warnings, while it's the opposite for white drivers. A lot of investigations might stop right here, publish a "Minorities more likely to be ticketed when stopped" headline, and call it a day. We can do better!
If we wanted to dig around in the data some more, we could see if it reveals driving differences as logged by the police. Maybe the minority population is disproportionately young, inexperienced drivers, or maybe on average they drive more miles per hour over the speed limit. What we want to know is, controlling for all other factors, are minorities more likely to be ticketed when stopped?
When we think "controlling for" a little bell goes off in our heads and we immediately get started on a regression.
The regression#
Preparing our data#
The TYPE column - whether the driver received a ticket or not - comes in two options:

T for ticketed
W for warning

These labels are going to be what we're predicting.
We've dealt with categorical variables before, using C() to break each category out into its own feature. But C() only works on inputs - you can't use it on a categorical output label. When using statsmodels, you'll need to convert the output into a number yourself.

To get around this, we'll create a new column called ticketed, which will be 1 if the driver received a ticket and 0 if they received a warning.
df['ticketed'] = df.TYPE.replace({'T': 1, 'W': 0})
df.head()
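As a quick sanity check, we can run the same replace on a tiny made-up version of the column and confirm every T became a 1 and every W became a 0:

```python
import pandas as pd

# Toy stand-in for the real TYPE column, just to test the mapping
check = pd.DataFrame({'TYPE': ['T', 'W', 'T', 'W', 'W']})
check['ticketed'] = check.TYPE.replace({'T': 1, 'W': 0})

# Every T should line up with 1 and every W with 0
print(pd.crosstab(check.TYPE, check.ticketed))
```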
First regression#
Now that our data is prepared, we can perform our regression. Let's start by wrapping each of our categorical variables in C() to see how statsmodels does without any guidance.
model = smf.logit("""
ticketed ~
C(AGENCY3)
+ C(SEX)
+ C(MINORITY)
+ AGE
+ MPHOVER
+ C(INTOWN)
+ C(INSTATE)
+ C(DAYNIGHT)
""", data=df)
results = model.fit()
results.summary()
Seems like a solid regression - check out that p-value of 0.00! That's some excellent statistical significance.
The p-values of all of our features look pretty good, too. Since we're confident in our overall regression, we can move on to check out the odds ratios.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
While these are solid results, the problem is they don't match the Boston Globe's regression. Take a look at their output:
Most notably, they've partitioned both AGE and MPHOVER into brackets. This allows them to compare each age group against the others, instead of our absurd sweeping statement that for every year older you are, your odds of getting a ticket go down by 0.015.
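To see why the per-year framing is so clunky: the coefficient is a change in log odds for one extra year, so talking about bigger age gaps means exponentiating a sum. A sketch with a made-up coefficient of -0.015:

```python
import numpy as np

# Hypothetical AGE coefficient: change in log odds per extra year
age_coef = -0.015

# Odds ratio for one extra year - barely below 1
print(np.exp(age_coef))

# Odds ratio for a 20-year age gap: exponentiate the summed log odds
print(np.exp(age_coef * 20))
```

Binning sidesteps all of this by giving each bracket its own, directly comparable odds ratio.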
Feature engineering#
Creating new columns that you'll use for regression is feature engineering, and it's a very useful way to make more accurate and more interesting comparisons on your dataset.
Remember when we used np.divide to convert from 1's of miles to 1000's of miles? That was feature engineering, too! This one's just a little more advanced.
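In case that earlier trick is hazy, it looked something like this (the column names here are hypothetical):

```python
import pandas as pd
import numpy as np

# Made-up mileage data, just to illustrate
cars = pd.DataFrame({'miles': [12000, 45000, 88000]})

# Rescale so a regression coefficient would read "per 1,000 miles"
# instead of "per single mile" - simple, but still feature engineering
cars['miles_thousands'] = np.divide(cars.miles, 1000)
print(cars)
```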
Let's bin both AGE and MPHOVER in the same way that the Globe did.
labels = [
'under 25',
'26-39',
'40-55',
'56-70',
'over 70'
]
breaks = [0, 25, 39, 55, 70, 999]
df['age_bin'] = pd.cut(df.AGE, bins=breaks, labels=labels)
# Confirm it worked by looking at a random 10
df[['AGE', 'age_bin']].sample(10)
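One thing worth knowing before we bin the speeds, too: pd.cut quietly turns anything that falls outside your breaks into a missing value instead of complaining. A tiny sketch of how you'd catch that, using toy numbers:

```python
import pandas as pd

# -3 falls below the lowest break, so it won't land in any bin
speeds = pd.Series([5, 12, 25, -3])
binned = pd.cut(speeds, bins=[0, 10, 20, 30])

# Count how many values didn't get binned
print(binned.isna().sum())
```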
labels = [
'<10mph over',
'10-15mph over',
'16-20mph over',
'over 20mph over'
]
breaks = [0, 9, 15, 20, 999]  # 20 (not 19) so '16-20mph over' actually covers 16-20
df['mphover_bin'] = pd.cut(df.MPHOVER, bins=breaks, labels=labels)
# Confirm it worked by looking at a random ten
df[['MPHOVER', 'mphover_bin']].sample(10)
Looks good! We've now created the missing features. Next up is tuning our reference categories to match what The Globe used.
Setting reference categories#
When you're using C() with categorical variables, there's always a reference category that the odds ratios are calculated against (if you don't set one, statsmodels picks it for you). For example, our regression output listed these two features:

C(SEX)[T.M], sex when M (so the reference category is female)
C(MINORITY)[T.W], minority when W (so the reference category is minority)
The Globe's regression used male as the reference category for sex, and white as the reference category for race. While it's all the same in terms of math, you typically want to set your reference category to be the most common category, as it makes more sense to compare things to a baseline. If we had multiple races in our dataset - for example white, Asian, African American, Hispanic - we wouldn't compare everything to the Asian baseline, we'd most probably use white.
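That "all the same in terms of math" claim is easy to verify: swapping the reference category flips the sign of the coefficient, so the odds ratio just becomes its reciprocal. A sketch with a made-up coefficient for sex:

```python
import numpy as np

# Hypothetical coefficient for C(SEX)[T.M], with female as the reference
coef_m_vs_f = 0.4
print(np.exp(coef_m_vs_f))   # odds ratio for men compared to women

# Flip the reference to male: the coefficient changes sign...
coef_f_vs_m = -coef_m_vs_f
# ...and the odds ratio is the reciprocal of the one above
print(np.exp(coef_f_vs_m), 1 / np.exp(coef_m_vs_f))
```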
If we wanted to do things "right" or "normal," we'd want to check what the most common category is in each feature (just ignore the numeric ones).
# If you remember mean/median/mode, mode is the one no
# one ever talks about - it's the most common value!
df.mode()
Note that the most common mphover_bin is 10-15, but when thinking about what everything changes in reference to, you might want to use under 10, as it makes the most sense in terms of what you'd end up writing in your article. Arguments could be made both ways, though!
In this case we just want to match what the Globe did, so we'll just copy over their reference categories.
model = smf.logit("""
ticketed ~
C(AGENCY3, Treatment('S'))
+ C(SEX, Treatment('F'))
+ C(MINORITY, Treatment('W'))
+ C(mphover_bin, Treatment('<10mph over'))
+ C(age_bin, Treatment('under 25'))
+ C(INTOWN, Treatment('N'))
+ C(INSTATE)
+ C(DAYNIGHT)
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Turn them binary#
Instead of breaking out categories with C(), you can take the statements you want to make and just type them straight into the formula as True/False comparisons:
model = smf.logit("""
ticketed ~
SEX == 'F'
+ MINORITY == 'M'
+ AGE
+ MPHOVER
+ INTOWN == 'N'
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Engineer features#
You can even get fancier. What about people going VERY fast? They probably get ticketed, right?
model = smf.logit("""
ticketed ~
SEX == 'F'
+ MINORITY == 'M'
+ AGE
+ MPHOVER
+ INTOWN == 'N'
+ MPH > 80
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
But let's just edit this one sec...#
model = smf.logit("""
ticketed ~
C(AGENCY3, Treatment('L'))
+ C(SEX, Treatment('M'))
+ C(MINORITY, Treatment('W'))
+ AGE
+ MPHOVER
+ C(INTOWN, Treatment('Y'))
+ MPH > 80
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs