# Evaluating linear regressions
Not all linear regressions are created equal! And sometimes the things inside of them just don't make sense.
In this section we'll discuss what you can do to make sure your regressions are halfway decent, how to pick between them, and whether or not to include particular features.
## Our dataset
Let's go back to our car crash regression: we're predicting how many crashes a set of very, very terrible drivers get into based on how much they drive.
I went ahead and added an extra car_age column to make our regressions a little more interesting.
```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame([
    {'miles': 2000, 'car_age': 4, 'crashes': 2},
    {'miles': 2000, 'car_age': 2, 'crashes': 0},
    {'miles': 2000, 'car_age': 6, 'crashes': 3},
    {'miles': 5000, 'car_age': 10, 'crashes': 3},
    {'miles': 5000, 'car_age': 3, 'crashes': 6},
    {'miles': 5000, 'car_age': 6, 'crashes': 5}
])
```

```python
# What effect does the number of miles driven have on the number of crashes?
model = smf.ols(formula='crashes ~ np.divide(miles, 1000)', data=df)
results = model.fit()
results.summary()
```
Now let's ask some questions about it.
## Regression quality
The first question we can ask is about the regression itself. How good is it?
When you ask how "good" a regression is, you're probably asking how descriptive it is. That's reported in the upper right-hand corner of the summary output, as the R-squared value.
- If our R-squared were 1.0, 100% of the variation in crashes could be explained by the number of miles driven
- If our R-squared were 0.0, 0% of the variation in crashes could be explained by the number of miles driven

In this case, our R-squared is 0.591, which means the number of miles driven explains about 60% of the variation in how many crashes someone gets into. The other 40% comes down to things like weather, luck, driving skill - other things we aren't measuring here, but which might factor in.
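If you'd rather not squint at the summary table, the R-squared value is also available directly on the fitted results object. A quick sketch using the `results` we fit above:

```python
# R-squared is exposed on the fitted results, same number as the summary table
print(results.rsquared)  # roughly 0.59
```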
## Comparing regressions
Let's say we ran another regression, taking into account miles driven as well as car age. Does this new regression do a better job explaining the variation in number of crashes?
```python
# What effect do the number of miles driven AND the car's age have on the number of crashes?
model = smf.ols(formula='crashes ~ np.divide(miles, 1000) + car_age', data=df)
results = model.fit()
results.summary()
```
If we look at R-squared, the new one has a higher value. This means the new regression explains more of the variation in crashes.
An extra 2 percentage points covered! Amazing! ...except for the fact that every time you add new parameters to a regression, the R-squared is going to go up. Every single time. Every single time.
This means a regression with a thousand features is going to explain more than one with one feature, even if the extra features are only explaining random noise. As a result, no, we can't compare our regressions with R-squared.
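To see that in action, here's a quick aside (not part of the original walkthrough) where we toss a column of pure random noise into the regression. The noise column and the seed are made up purely for illustration:

```python
# Add a column of random noise that has nothing to do with crashes
noisy_df = df.assign(noise=np.random.default_rng(0).normal(size=len(df)))

# R-squared still creeps up even though the noise explains nothing real
noise_results = smf.ols(formula='crashes ~ np.divide(miles, 1000) + noise', data=noisy_df).fit()
print(noise_results.rsquared)      # at least as high as the miles-only R-squared
print(noise_results.rsquared_adj)  # usually drops, since the extra feature gets penalized
```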
We have a few other options to compare regressions, though! And by "a few" I mean "potentially a million."
You could use all of these in one way or another, but the one to pay attention to here is adjusted R-squared. It's R-squared, but adjusted for the number of features we're giving the model. Plain R-squared always goes up when you add new features; adjusted R-squared only goes up if the extra features are actually useful.
If adjusted R-squared goes up, we (probably) have a better model.
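If you're curious where the adjustment comes from, it's a penalty based on how many features the model uses. Here's a rough by-hand check against the miles-plus-car-age model we just fit (the variable names below are mine, not statsmodels'):

```python
# Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1),
# where n is the number of rows and p is the number of features
n, p = len(df), 2  # two features: miles and car_age
by_hand = 1 - (1 - results.rsquared) * (n - 1) / (n - p - 1)

print(by_hand)               # computed by hand
print(results.rsquared_adj)  # statsmodels' version - the two should agree
```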
"What about the other things in that list? AIC? BIC?" Good question: you can use all the other ones when you talk to someone who knows statistics. We'll talk about Prob (F-statistic) a little bit later, down below.
Our adjusted R-squared for the original regression was 0.489, while our adjusted R-squared for the new regression was 0.367. Since our adjusted R-squared went down with the new regression, adding `car_age` didn't actually add anything to our model!
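To make the comparison concrete, here's a short sketch that fits both versions side by side and pulls out their adjusted R-squared values. The `results_miles` and `results_both` names are just mine, for keeping the two fits straight:

```python
# Fit both regressions so we can compare adjusted R-squared directly
results_miles = smf.ols(formula='crashes ~ np.divide(miles, 1000)', data=df).fit()
results_both = smf.ols(formula='crashes ~ np.divide(miles, 1000) + car_age', data=df).fit()

print(results_miles.rsquared_adj)  # around 0.49
print(results_both.rsquared_adj)   # around 0.37 - lower, so car_age isn't pulling its weight
```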
But hey - shouldn't more data always be better data? To figure out what happened, let's look at the features themselves instead of the overall model.
## Feature quality
Since we've figured out that adding new features to our model doesn't necessarily improve it, let's take a look at how to measure features. How can we tell a good and useful feature from a bad one?
It's easy to get excited about a big coefficient, something that claims to have a nice large effect on our output. But caution is required: features will always have a coefficient, but that doesn't mean that the coefficient is valid. I can tell you all day every day that my uncle works for Nintendo, but unless he actually works there then it doesn't matter how often I say it.
Let's examine the feature descriptions from the statsmodels output, all those rows down at the bottom. We're interested in `P>|t|`, which is the feature's p-value.
The p-value is the most common way to talk about whether a feature (and its coefficient) is meaningful or not. P-values are commonly described as the chance that a result was just a lucky/unlucky accident.
The standard threshold for a p-value is 0.05, which a layperson might describe as "we'd get this result accidentally only 5% of the time." It's what people usually mean when they refer to statistical significance.
Note: P-values have plenty of flaws and the "accidental results" description isn't really the most rigorous, but we're going to live with it.
If we look at the p-value for our miles, it's 0.074, which does NOT MEET THE 0.05 THRESHOLD FOR STATISTICAL SIGNIFICANCE!!! It's saying there's a 7.4% chance of this just being an accident, as opposed to under a 5% chance. Horrifying, terrifying, really.
...horrifying except for the fact that 0.05 is a completely arbitrary (yet well-accepted) cutoff, and some people use 0.1 (10%) or even 0.01 (1%) instead! So depending on how we're feeling when we wake up in the morning, we can totally feel free to use this. See the 'discussion topics' section for a bit more on this.
Our data is very small and very fake, so let's not stress out too much here.
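If you want the number without scanning the summary table, `.pvalues` on a fitted result gives you the same `P>|t|` column as a pandas Series. A quick sketch using the `results_miles` fit from earlier:

```python
# The P>|t| column from the summary, as a pandas Series indexed by feature name
print(results_miles.pvalues)
# The entry for np.divide(miles, 1000) should be roughly 0.074
```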
## Change in p-values

Let's step back to our two different regressions, one that used just miles and one that used miles plus the age of the car. When we compare the outputs, the p-value for miles changed dramatically between the old and new regressions.
The p-value for `miles` jumped from 0.074 to 0.121 when we moved to the new regression, and the car's age came in at a mind-boggling-high 0.664 (about 66%)! Remember: higher p-values are worse - they mean a greater chance that the relationship to the output (the number of crashes) is just an accident.
Even if we're feeling gracious about miles, there's no way we can believe the car's age has anything to do with the number of crashes with its 0.664 p-value. Exposed! Car age's uncle does not work for Nintendo.
It's important to note that adding car age wasn't just useless, it was actively harmful: by including the car's age in our regression we negatively impacted the p-value of miles driven, almost doubling it! It might be tempting to add in a million and one features to make your regression "more informed," but we can see from this example that useless features just confuse the regression. Because our regression also had to take the car's age into account, it couldn't listen to `miles` as much as it should have.
As a result of adding `car_age`, the adjusted R-squared dropped and p-values went crazy. Bad regression!
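Here's the same story pulled out programmatically, again leaning on the two fits from the earlier sketch:

```python
# How the p-values shift once car_age joins the regression
print(results_miles.pvalues['np.divide(miles, 1000)'])  # roughly 0.074
print(results_both.pvalues['np.divide(miles, 1000)'])   # roughly 0.121
print(results_both.pvalues['car_age'])                  # roughly 0.664 - nobody believes you, car_age
```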
Picking what features to use - "feature selection" - is a big and complicated part of machine learning. If you generally stick to only adding features that make sense and removing ones that have high p-values, you'll probably be in a good place when you present your findings to a stats person to review.
## Regression p-values
It isn't just our features that have p-values - the overall regression gets a p-value, too! This is where people who publish papers will get up on a cardboard box and exclaim "MY RESULTS ARE STATISTICALLY SIGNIFICANT!"
You can find the p-value for the entire regression in the top left of the summary output, listed as Prob (F-statistic). In our original regression, the p-value for the overall regression is actually the same as it was for our `miles` feature, about 0.074. Compare that to the regression that included the car's age.
Another point that shows how bad that second regression was! The p-value jumped from 0.074 to 0.234 - even if that regression had the most exciting coefficients in the world, a 0.234 p-value is way way way too high. In no universe could it be considered statistically significant, sorry!
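Prob (F-statistic) is also available directly on the fitted results as `.f_pvalue`, if you'd rather grab it in code. A quick sketch using the two fits from earlier:

```python
# The overall regression p-value, a.k.a. Prob (F-statistic) in the summary
print(results_miles.f_pvalue)  # roughly 0.074
print(results_both.f_pvalue)   # roughly 0.234 - not statistically significant by anyone's cutoff
```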
Note that you won't use p-values to choose between models. You'll just use a p-value to determine whether the linear regression model meets your standard of statistical significance. If it doesn't, throw it out!
## Review
In this section, we talked about evaluating both models and features.
For a linear regression model, R-squared can be used to see how much of the variation in the output is explained by the regression. Every time you add features, though, the R-squared will go up! To compare models with different numbers of features you'll need to use adjusted R-squared, which is smart enough to take into account how many features there are.
When evaluating individual features, you're typically interested in a p-value that's less than 0.05. Removing bad features can improve your model, as your regression can start paying attention to the things that matter. More data isn't necessarily better!
Models also have p-values that determine whether the result can be considered "statistically significant." You won't use them to decide between models, but they're important in evaluating whether you can trust a model.
## Discussion topics
TODO