Finding outliers with standard deviation and regression#
To reproduce this finding from the Dallas Morning News, we'll need to use standard deviation and regression to identify schools that performed suspiciously well on certain standardized tests.
Finding suspicious behavior by tracking down outliers#
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import numpy as np
from statsmodels.sandbox.regression.predstd import wls_prediction_std
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", 200)
Reading in our data#
We'll start by opening up our dataset - standardized test performance at each school, for fourth graders in 2004.
df = pd.read_csv("data/cfy04e4.dat", usecols=['r_all_rs', 'CNAME', 'CAMPUS'])
df = df.set_index('CAMPUS').add_suffix('_fourth')
df.head()
That dataset had a lot of columns, but we're only interested in the reading scores.
How can we find out who did suspiciously well? We should probably start by figuring out which schools performed normally. Statistically speaking, "normal performance" is probably the median (one of the three types of averages).
df.r_all_rs_fourth.median()
The median is the 50% mark: half of the schools scored better than 2226, and half scored worse.
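Just to convince ourselves, we can check what fraction of schools scored below the median - it should be right around half (ties and missing scores can nudge it slightly).
# What fraction of schools scored below the median? Should be ~0.5
(df.r_all_rs_fourth < df.r_all_rs_fourth.median()).mean()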
As a school's score gets further and further from the median, the school gets further and further from being an "average school." If a school gets a lot lot lot of points, it should probably be looked at. But how many extra points are enough to make a school suspicious?
df.r_all_rs_fourth.describe()
If 2226 points is the average, 2273 puts a school at the 75% mark. 75% still doesn't seem very suspicious, though, as that's still one out of every four schools. We want something higher! 95%? 99.7%?
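For context, we can translate those percentiles into actual scores with .quantile - this is just exploration, not part of the final analysis.
# Score cutoffs at the 75th, 95th, and 99.7th percentiles
df.r_all_rs_fourth.quantile([0.75, 0.95, 0.997])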
df.r_all_rs_fourth.hist(bins=20)
We want those very few datapoints on the far right!
Instead of picking an arbitrary number or percentile, we're going to use a statistical measure called the standard deviation. It's a measurement of how spread out the data is (std in the list above). To explain how unusual a data point is, you can say "it's 1.5 standard deviations from the mean" or "2.75 standard deviations from the mean."
# How many standard deviations is each school's score from the mean?
df['std_dev_rs_fourth'] = (df.r_all_rs_fourth - df.r_all_rs_fourth.mean()) / df.r_all_rs_fourth.std()
df.head()
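A standardized column like this should have a mean of almost exactly zero and a standard deviation of one - a quick way to double-check our math.
# Sanity check: standardized scores should have mean ~0 and std ~1
df.std_dev_rs_fourth.agg(['mean', 'std'])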
Now every school has a new column that explains how many standard deviations away from the average score it is. We can easily look at the top and bottom performers!
df.sort_values(by='std_dev_rs_fourth').head(20)
df.sort_values(by='std_dev_rs_fourth', ascending=False).head(20)
But let's not get ahead of ourselves!#
The thing is, though: some schools are just going to be better. Getting a good score doesn't mean a school is cheating, it just means, well, that they got a good score!
We could use this list of standard deviations to investigate each and every school that did well, seeing if it makes sense that they did so well. That'd be good reporting! But it might also be a waste of time, as there wasn't anything "unusual" about these schools, they just... did well.
Now we need to ask ourselves: when could a test score be suspicious? The Dallas Morning News realized they could look at scores across years at the same school - we were looking at fourth graders just now, but how did those students perform when they were in third grade? Did they just do average, and now suddenly they're geniuses? Suspicious!
Since we were working with 2004's fourth grader data, we'll now combine it with 2003's third-grader data.
third_graders = pd.read_csv("data/cfy03e3.dat", usecols=['CAMPUS', 'r_all_rs'])
third_graders = third_graders.set_index('CAMPUS').add_suffix('_third')
merged = df.join(third_graders)
merged.head()
Graphing for research#
One of the things we could do is plot the third-grade scores against the fourth-grade scores and see if anything stands out.
fig, ax = plt.subplots(figsize=(4,4))
ax.set_xlim(2000, 2500)
ax.set_ylim(1900, 2500)
ax.set_facecolor('lightgrey')
ax.grid(True, color='white')
ax.set_axisbelow(True)
sns.regplot(x='r_all_rs_third',
            y='r_all_rs_fourth',
            data=merged,
            marker='.',
            line_kws={"color": "black", "linewidth": 1},
            scatter_kws={"color": "grey"})
highlight = merged.loc[57905115]
plt.plot(highlight.r_all_rs_third, highlight.r_all_rs_fourth, 'ro')
This is one of the graphics from the Dallas Morning News piece, and we've highlighted one of the suspicious schools, Harrell Budd Elementary.
merged.loc[57905115]
You'd expect a school to perform similarly between third and fourth grade, but somehow this one school jumped from about 2150 points to 2500 points in one year! Seems suspicious.
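We can subtract the two scores to see exactly how big that jump was.
# How many points did the school gain between third and fourth grade?
highlight.r_all_rs_fourth - highlight.r_all_rs_third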
But why is that suspicious? It's because we don't expect a school to make a jump of 350 points in one year, going from average to a high performer.
But what do we expect a school to do? I'm not sure, but just like The Dallas Morning News we can surely ask statistics for help!
Using a regression to predict fourth-grade scores#
The Dallas Morning News decided to run a regression, which is a way of measuring the relationship between two variables (and using one to predict the other). In this case, we want to see the relationship between a school's third-grade score and its fourth-grade score.
First we'll need to get rid of missing data, because regressions hate hate hate missing data.
print("Before dropping missing data", merged.shape)
merged = merged.dropna()
print("After dropping missing data", merged.shape)
And now we can ask what the relationship is between third-grade scores and fourth-grade scores.
import statsmodels.formula.api as smf
model = smf.ols("r_all_rs_fourth ~ r_all_rs_third", data=merged)
results = model.fit()
results.summary()
What's this all mean? It doesn't matter! What matters is that if we know what third-grade score a school got, we can predict its fourth-grade score.
# What should these schools have gotten?
sample = pd.DataFrame({ 'r_all_rs_third': [2140, 2200, 2500] })
results.predict(sample)
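Under the hood, the prediction is just a line: an intercept plus a slope times the third-grade score. If you're curious, you can rebuild one of those predictions by hand from results.params (we're reusing 2140 as an example score from above).
# Rebuild a prediction manually: intercept + slope * third-grade score
intercept = results.params['Intercept']
slope = results.params['r_all_rs_third']
intercept + slope * 2140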
merged['predicted_fourth'] = results.predict()
merged.head()
Using standard deviations with regression error#
Notice how there's a difference between the actual fourth-grade score and the predicted fourth-grade score. This is called the error or residual. The bigger the error, the bigger the difference between what was expected and what actually happened.
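We don't have to take statsmodels' word for it - we can compute the error ourselves by subtracting the prediction from the actual score. (The residual column here is just for illustration; it should match results.resid.)
# The residual is just the actual score minus the predicted score
merged['residual'] = merged.r_all_rs_fourth - merged.predicted_fourth
merged[['r_all_rs_fourth', 'predicted_fourth', 'residual']].head()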
Remember how we were suspicious of that one school because it performed normally, but then performed really well? A school like that is going to have a really big error!
To calculate what a "big error" is, we're going to use our old friend standard deviation. Before, we used standard deviation to see how far a school's score was from the average score. This time we're going to use standard deviation to see how far the school's error is from the average error!
# Just trust me, this is how you do it: divide each school's residual by
# the regression's residual standard error (the square root of the mean
# squared error) to get the error in standard deviations
merged['error_std_dev'] = results.resid / np.sqrt(results.mse_resid)
merged.head()
The more standard deviations away from the mean a school's error is, the more suspicious its fourth-grade performance is.
merged.sort_values(by='error_std_dev', ascending=False).head(10)
Reproducing the story#
From The Dallas Morning News:
"In statistician's lingo, these schools had at least one average score that was more than three standard deviations away from what would be predicted based on their scores in other grades or on other tests
While we've been talking about schools with a major increase between the two years, we're also interested in schools with a major drop. That could indicate cheating in 2003 and a return to "real" testing in 2004.
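That's why we'll use .abs() below - but as a quick aside, we can also count the big jumps and big drops separately.
# Big improvements (3+ std devs above prediction) vs. big drops (3+ below)
print("Jumps:", (merged.error_std_dev > 3).sum())
print("Drops:", (merged.error_std_dev < -3).sum())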
Let's check out all of the schools that count as suspicious according to the three-standard-deviations test the News used.
merged[merged.error_std_dev.abs() > 3]
But then they level things up a bit:
"Using a stricter standard - four standard deviations from predictions - 41 schools have suspect scores."
merged[merged.error_std_dev.abs() > 4]
Our dataset isn't as thorough as theirs - we're only looking at one combination of tests - but it's the same idea.
Finding other suspicious scores#
We might assume a school that does well in reading probably also does well in math.
What if they did well in one, but not the other? While the school might just have a strong department in one particular field, such discrepancies could be worth investigating.
Let's look at fifth graders' math and reading scores from 2004.
df = pd.read_csv("data/cfy04e5.dat", usecols=['CAMPUS', 'CNAME', 'm_all_rs', 'r_all_rs'])
df = df.set_index('CAMPUS').add_suffix('_fifth')
df.head()
Building the graphic#
While it isn't necessary, reproducing the graphics is always fun.
fig, ax = plt.subplots(figsize=(4,4))
ax.set_xlim(1900, 2500)
ax.set_ylim(1800, 2750)
ax.set_facecolor('lightgrey')
ax.grid(True, color='white')
ax.set_axisbelow(True)
sns.regplot(x='r_all_rs_fifth',
            y='m_all_rs_fifth',
            data=df,
            marker='.',
            line_kws={"color": "black", "linewidth": 1},
            scatter_kws={"color": "grey"})
highlight = df.loc[101912236]
plt.plot(highlight.r_all_rs_fifth, highlight.m_all_rs_fifth, 'ro')
Running the regression#
We can't be exactly sure of the relationship between math and reading scores - it's a lot of schools! - so we'll run a regression to figure out how the two scores typically interact.
print("Before dropping missing data", df.shape)
df = df.dropna()
print("After dropping missing data", df.shape)
import statsmodels.formula.api as smf
model = smf.ols("m_all_rs_fifth ~ r_all_rs_fifth", data=df)
results = model.fit()
results.summary()
And now, just like last time, we calculate how many standard deviations away the actual score was from the predicted score. A large number of standard deviations away means a school is worth a look!
df['error_std_dev'] = results.resid / np.sqrt(results.mse_resid)
df[df.error_std_dev.abs() > 3].sort_values(by='error_std_dev', ascending=False)
Wow, look at that! Sanderson Elementary looks like they either have a really exceptional math program or something suspicious is going on.
Review#
First, we learned about using standard deviation as a measure of how unusual a data point might be. Data points that fall many standard deviations from the mean - either above or below - might be worth investigating, whether as bad data or from other suspicious angles (cheating schools, in this case).
Then we learned how a linear regression can determine the relationship between two numbers. In this case, it was how third-grade scores relate to fourth-grade scores, and then how math and reading scores relate to one another. By using a regression, you can use one variable to predict what the other should be.
Finally, we used the residual or error from the regression to see how far off each prediction was. Just like we did with the original scores, we used standard deviation to find suspiciously large errors. And yes, even though our regression might not be perfect, the times when it's very wrong probably call for an investigation.
Discussion points#
- Why would this analysis be based on standard deviations away from the predicted value instead of just the predicted value?
- Standard deviation is how far away from the "average" a school is. Let's say you scored 3 standard deviations away from the average, but it was only a 5-point difference. What kind of situation could lead to that? Is it as important as being 3 standard deviations away but with a 50-point difference?
- The Dallas Morning News specifically called out schools with scores "more than three standard deviations away from what would be predicted based on their scores in other grades or on other tests." Do you think they ignored schools that were 2.99 standard deviations away?
- Did we ignore those schools? If we did, how could we be more cautious in the future?
- What are the pros and cons of selecting a cutoff like three standard deviations away from the predicted value? Note that three standard deviations is a typical cutoff in statistics.
- What's the difference between a school that scores 3 standard deviations below its predicted value and one that scores 3 standard deviations above it? Do we need to pay attention to both, or only one?
- What next steps should we take after we've calculated these findings?
- If a school did have a strong math department and a weak English department, it would definitely be predicted incorrectly. What happens to a school like that after being flagged by research like this?