Tickets vs warnings with the Boston Globe#
In 2003 and 2004, the Boston Globe published a series called Speed Trap: Who Gets a Ticket, Who Gets a Break?. Using speeding data from Massachusetts, we'll reproduce some segments of this analysis using logistic regression.
This piece was brought to my attention by Jonathan Stray, who put together the notebook this section is based on. As to what's special about this project, let's just go right to the source:
Why such old data? Well, for one thing it's a classic example of multi-variable regression in journalism. But also, for a brief period in April and May 2001, the state government entered both tickets and warnings from the paper citations. In Massachusetts at that time, both were written on the same form, with only a checkbox indicating which it was. A ticket means a fine and raises your insurance premium. A warning means nothing. Having data on both means we can ask who gets a ticket vs. a warning under the same circumstances.
Generally ticketing data just says who got a ticket, full stop. The fact that we have a yes/no column - ticketed/warned - makes this dataset ripe for logistic regression.
We can find The Boston Globe's full methodology here. It's a great document, full of tables and details and even a link at the bottom to download the original dataset (the link is broken now, but hey, it's over 15 years later).
Using this large dataset, the Globe attempted to control for different variables by finding drivers who fit into the same buckets, then comparing their outcomes:
The Globe relied primarily on comparisons of carefully controlled groups of drivers. For example, how many drivers who lived in a town received a ticket, and how many received a warning, when speeding at the same speed, in the same speed zone?
Not fully satisfied with that approach, they also reached out to a statistician for further analysis. She performed a logistic regression, same as what we've been doing:
To be confident that no other factor, or combination of factors, recorded on the citations accounted for the differences (such as age, sex, time of day), the Globe asked a professor of statistics at Babson College, Elaine I. Allen, to look at the database. She found that the differences of race, sex and age were statistically significant, controlling for all other factors. The report on her analysis, using a statistical technique called logistic regression, is also printed in full below.
Let's get to work!
Our data#
We'll pull in pandas, statsmodels for the regression, and numpy for computing odds ratios.
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
Then we'll pull in our data. Unknown data is coded as U, so we'll just pass that to na_values to be treated as missing data.
# Unknown data is always coded as 'U', so we'll pass that to na_values
df = pd.read_csv('data/tickets-warnings.csv', na_values='U')
df.head()
TYPE is the column for whether they received a ticket or a warning - T vs W - and most of the other columns are self-explanatory.
It's a lot lot lot of data, but we're only interested in a few columns. Let's take a look at the output of the Globe's regression to see what we should keep.
Along with the columns needed for our regression, we're also going to keep MPH for a little trick later on.
df = df[['TYPE','AGENCY3','SEX','BLACK','ASIAN','HISPANIC','MINORITY','AGE','MPH','MPHOVER','INTOWN', 'INSTATE', 'DAYNIGHT']].copy()
df.head()
Definitely looks cleaner!
Non-logistic analysis#
When you first start investigating the dataset, you probably (hopefully?) won't leap right to logistic regression. It's useful to poke around a little bit, see what's going on, check whether anything seems wrong or if you can find any interesting leads.
Let's say we have a suspicion that race plays a role in whether someone is ticketed as opposed to just getting a warning. We might start by grouping by MINORITY and seeing how many people fall into each ticketing category.
pd.crosstab(df.MINORITY, df.TYPE)
Unfortunately we're terrible at reading numbers and we just love graphics, so we're going to throw a .plot on that code to turn it into a chart.
pd.crosstab(df.MINORITY, df.TYPE).plot(kind='bar')
Looks like minorities have fewer tickets and fewer warnings! But that's not too interesting, since there are probably fewer minorities overall. We're probably more interested in the percentage of stops that become a ticket compared to those that are just warnings.
pd.crosstab(df.MINORITY, df.TYPE, normalize='index')
pd.crosstab(df.MINORITY, df.TYPE, normalize='index').plot(kind='bar')
Over half of minorities stopped receive tickets instead of warnings, while it's the opposite for white drivers. A lot of investigations might stop right here, publish a "Minorities more likely to be ticketed when stopped" headline, and call it a day. We can do better!
If we wanted to dig around in the data some more, we could see if it reveals driving differences as logged by the police. Maybe the minority population is disproportionately young, inexperienced drivers, or maybe on average they drive more miles per hour over the speed limit. What we want to know is, controlling for all other factors, are minorities more likely to be ticketed when stopped?
When we think "controlling for" a little bell goes off in our heads and we immediately get started on a regression.
The regression#
Preparing our data#
The TYPE column - whether the driver received a ticket or not - comes in two options:

T for ticketed
W for warning

These labels are going to be what we're predicting.
We've dealt with categorical variables before, using C() to break each category out into its own feature. But C() only works on inputs - you can't use it on a categorical output label. When using statsmodels, you'll need to convert the output into a number yourself.

To get around this, we'll create a new column called ticketed, which will be 1 if the driver received a ticket and 0 if they received a warning.
df['ticketed'] = df.TYPE.replace({'T': 1, 'W': 0})
df.head()
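As a quick sanity check, we can run the same replace on a tiny made-up version of the column and confirm every T became a 1 and every W became a 0:

```python
import pandas as pd

# Toy stand-in for the real TYPE column, just to test the mapping
check = pd.DataFrame({'TYPE': ['T', 'W', 'T', 'W', 'W']})
check['ticketed'] = check.TYPE.replace({'T': 1, 'W': 0})

# Every T should line up with 1 and every W with 0
print(pd.crosstab(check.TYPE, check.ticketed))
```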
First regression#
Now that our data is prepared, we can perform our regression. Let's start by wrapping each of our categorical variables in C() to see how statsmodels does without any guidance.
model = smf.logit("""
ticketed ~
C(AGENCY3)
+ C(SEX)
+ C(MINORITY)
+ AGE
+ MPHOVER
+ C(INTOWN)
+ C(INSTATE)
+ C(DAYNIGHT)
""", data=df)
results = model.fit()
results.summary()
Seems like a solid regression - check out that p-value of 0.00! That's some excellent statistical significance.
The p-values of all of our features look pretty good, too. Since we're confident in our overall regression, we can move on to check out the odds ratios.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
While these are solid results, the problem is they don't match the Boston Globe's regression. Take a look at their output:
Most notably, they've partitioned both AGE and MPHOVER into brackets. This allows them to compare each age group against the others, instead of our absurd sweeping statement that for every year older you are, your odds of getting a ticket go down by 0.015.
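To see why the per-year framing is so clunky: the coefficient is a change in log odds for one extra year, so talking about bigger age gaps means exponentiating a sum. A sketch with a made-up coefficient of -0.015:

```python
import numpy as np

# Hypothetical AGE coefficient: change in log odds per extra year
age_coef = -0.015

# Odds ratio for one extra year - barely below 1
print(np.exp(age_coef))

# Odds ratio for a 20-year age gap: exponentiate the summed log odds
print(np.exp(age_coef * 20))
```

Binning sidesteps all of this by giving each bracket its own, directly comparable odds ratio.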
Feature engineering#
Creating new columns that you'll use for regression is feature engineering, and it's a very useful way to make more accurate and more interesting comparisons on your dataset.
Remember when we used np.divide to convert from 1's of miles to 1000's of miles? That was feature engineering, too! This one's just a little more advanced.
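In case that earlier trick is hazy, it looked something like this (the column names here are hypothetical):

```python
import pandas as pd
import numpy as np

# Made-up mileage data, just to illustrate
cars = pd.DataFrame({'miles': [12000, 45000, 88000]})

# Rescale so a regression coefficient would read "per 1,000 miles"
# instead of "per single mile" - simple, but still feature engineering
cars['miles_thousands'] = np.divide(cars.miles, 1000)
print(cars)
```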
Let's bin both AGE and MPHOVER in the same way that the Globe did.
labels = [
'under 25',
'26-39',
'40-55',
'56-70',
'over 70'
]
breaks = [0, 25, 39, 55, 70, 999]
df['age_bin'] = pd.cut(df.AGE, bins=breaks, labels=labels)
# Confirm it worked by looking at a random 10
df[['AGE', 'age_bin']].sample(10)
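One thing worth knowing before we bin the speeds, too: pd.cut quietly turns anything that falls outside your breaks into a missing value instead of complaining. A tiny sketch of how you'd catch that, using toy numbers:

```python
import pandas as pd

# -3 falls below the lowest break, so it won't land in any bin
speeds = pd.Series([5, 12, 25, -3])
binned = pd.cut(speeds, bins=[0, 10, 20, 30])

# Count how many values didn't get binned
print(binned.isna().sum())
```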
labels = [
'<10mph over',
'10-15mph over',
'16-20mph over',
'over 20mph over'
]
breaks = [0, 9, 15, 20, 999]  # 20 (not 19) so '16-20mph over' actually covers 16-20
df['mphover_bin'] = pd.cut(df.MPHOVER, bins=breaks, labels=labels)
# Confirm it worked by looking at a random ten
df[['MPHOVER', 'mphover_bin']].sample(10)
Looks good! We've now created the missing features. Next up is tuning our reference categories to match what The Globe used.
Setting reference categories#
When you're using C() with categorical variables, there's always a reference category that the odds ratios are calculated against (if you don't set one, statsmodels picks it for you). For example, our regression output listed these two features:

C(SEX)[T.M], sex when M (so the reference category is female)
C(MINORITY)[T.W], minority when W (so the reference category is minority)
The Globe's regression used male as the reference category for sex, and white as the reference category for race. While it's all the same in terms of math, you typically want to set your reference category to be the most common category, as it makes more sense to compare things to a baseline. If we had multiple races in our dataset - for example white, Asian, African American, Hispanic - we wouldn't compare everything to the Asian baseline, we'd most probably use white.
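That "all the same in terms of math" claim is easy to verify: swapping the reference category flips the sign of the coefficient, so the odds ratio just becomes its reciprocal. A sketch with a made-up coefficient for sex:

```python
import numpy as np

# Hypothetical coefficient for C(SEX)[T.M], with female as the reference
coef_m_vs_f = 0.4
print(np.exp(coef_m_vs_f))   # odds ratio for men compared to women

# Flip the reference to male: the coefficient changes sign...
coef_f_vs_m = -coef_m_vs_f
# ...and the odds ratio is the reciprocal of the one above
print(np.exp(coef_f_vs_m), 1 / np.exp(coef_m_vs_f))
```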
If we wanted to do things "right" or "normal," we'd want to check what the most common category is in each feature (just ignore the numeric ones).
# If you remember mean/median/mode, mode is the one no
# one ever talks about - it's the most common value!
df.mode()
Note that the most common mphover_bin is 10-15, but when thinking about what everything changes in reference to, you might want to use under 10, as it makes the most sense in terms of what you'd end up writing in your article. Arguments could be made both ways, though!
In this case we just want to match what the Globe did, so we'll just copy over their reference categories.
model = smf.logit("""
ticketed ~
C(AGENCY3, Treatment('S'))
+ C(SEX, Treatment('F'))
+ C(MINORITY, Treatment('W'))
+ C(mphover_bin, Treatment('<10mph over'))
+ C(age_bin, Treatment('under 25'))
+ C(INTOWN, Treatment('N'))
+ C(INSTATE)
+ C(DAYNIGHT)
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Turn them binary#
Instead of breaking out categories with C(), you can take the statements you want to make and just type them straight into the formula as True/False comparisons:
model = smf.logit("""
ticketed ~
SEX == 'F'
+ MINORITY == 'M'
+ AGE
+ MPHOVER
+ INTOWN == 'N'
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Engineer features#
You can even get fancier. What about people going VERY fast? They probably get ticketed, right?
model = smf.logit("""
ticketed ~
SEX == 'F'
+ MINORITY == 'M'
+ AGE
+ MPHOVER
+ INTOWN == 'N'
+ MPH > 80
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
But let's just edit this one sec...#
model = smf.logit("""
ticketed ~
C(AGENCY3, Treatment('L'))
+ C(SEX, Treatment('M'))
+ C(MINORITY, Treatment('W'))
+ AGE
+ MPHOVER
+ C(INTOWN, Treatment('Y'))
+ MPH > 80
""", data=df)
results = model.fit()
results.summary()
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs