The Tampa Bay Times and school performance#
Story: The story, and a critique
Author: Various parts are by various people. Nathaniel Lash did the original investigation, but we're reproducing a less-than-stellar approach here.
Topics: Linear Regression, Residuals
Datasets
- 0066897-gr04_rsch_2014.xls: 4th grader pass rates for standardized tests, from Florida Dept of Education
- FRL-1314-School-Web-Survey-3-Final.xls: Free and reduced price lunch data, from Florida Dept of Education
- MembershipSchoolRaceGender1415.xls: School population by gender, from Florida Dept of Education
What's the story?#
We're trying to see what kind of effect things like race and poverty might have on school test score data. In this section, our analysis is a naive approach that yields inaccurate results. While the math is correct the data behind it contains a fatal flaw.
Imports#
We'll want pandas and seaborn. You'll want to set pandas to display a lot of columns and rows at a time.
import pandas as pd
import seaborn as sns
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)
Reading in our data#
We'll start by reading in the dataset, being sure to read the district and school numbers in as strings in case we need to merge on anything later. If pandas gets its way, it will read the district/school numbers in as integers and turn something like 0001 into 1. This problem is unbelievably common when reading datasets with IDs into Excel or pandas, and is always something you should watch out for!
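Here's a minimal sketch of the leading-zero problem, using a tiny made-up CSV rather than the real file:

```python
from io import StringIO

import pandas as pd

# A tiny made-up CSV standing in for the real dataset
csv = "District Number,School Number,pct_passing\n0001,0021,55\n"

# Without a dtype, pandas infers integers and silently drops the leading zeros
naive = pd.read_csv(StringIO(csv))
print(naive["District Number"].iloc[0])  # 1

# Forcing str keeps the IDs exactly as written, so later merges line up
safe = pd.read_csv(StringIO(csv), dtype={"District Number": str, "School Number": str})
print(safe["District Number"].iloc[0])  # 0001
```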
df = pd.read_csv("data/data-cleaned-merged.csv", dtype={'District Number': str, 'School Number': 'str'})
df.head()
This dataset includes school information, as well as
- The percentage of students who passed their 4th grade standardized reading test
- The percentage of students receiving free or reduced price lunch, as a proxy for poverty
- The percentage of students that are Black/African-American
Converting to percentages#
It's really easy to get mixed up later if we don't have our percentage columns as actual percentages. Multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead.
- Tip: Make sure your numbers are 0-100 after you multiply!
df.pct_free_or_reduced = df.pct_free_or_reduced * 100
df.pct_black = df.pct_black * 100
df.head()
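If you'd rather check programmatically than eyeball df.head(), a between() guard works. The toy frame below is made up to stand in for df:

```python
import pandas as pd

# Made-up values standing in for the real columns
df = pd.DataFrame({"pct_free_or_reduced": [0.45, 0.90],
                   "pct_black": [0.10, 0.55]})
df.pct_free_or_reduced = df.pct_free_or_reduced * 100
df.pct_black = df.pct_black * 100

# After multiplying, everything should sit between 0 and 100
assert df.pct_free_or_reduced.between(0, 100).all()
assert df.pct_black.between(0, 100).all()
```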
Graphing our data#
Use seaborn's regplot to plot the relationship between free/reduced lunch and percent passing, and the same with percent black and percent passing.
- Tip: You can use scatter_kws={'alpha':0.3} to see things a bit more clearly
sns.regplot(data=df, x='pct_free_or_reduced', y='pct_passing', scatter_kws={'alpha':0.3})
sns.regplot(data=df, x='pct_black', y='pct_passing', scatter_kws={'alpha':0.3})
Linear regression#
Now let's be a little more exact: run a linear regression for percent passing that takes into account both percent black and percent free or reduced.
import statsmodels.formula.api as smf
model = smf.ols("pct_passing ~ pct_black + pct_free_or_reduced", data=df)
result = model.fit()
result.summary()
Describe the relationship coefficient using "real" words#
For example, "For every X change, we get Y change"
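One way to generate those sentences automatically is to loop over result.params. The data below is synthetic, with coefficients of -0.1 and -0.4 baked in, so we know roughly what the sentences should say:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic schools with known coefficients baked in
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "pct_black": rng.uniform(0, 100, 200),
    "pct_free_or_reduced": rng.uniform(0, 100, 200),
})
toy["pct_passing"] = (80
                      - 0.1 * toy.pct_black
                      - 0.4 * toy.pct_free_or_reduced
                      + rng.normal(0, 2, 200))

result = smf.ols("pct_passing ~ pct_black + pct_free_or_reduced", data=toy).fit()

# "For every X change, we get Y change," one sentence per coefficient
for name, coef in result.params.items():
    if name != "Intercept":
        print(f"For every 1-point increase in {name}, "
              f"percent passing changes by {coef:.2f} points")
```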
Overperformers and underperformers#
The point of the regression is to predict the percent passing, right? We can use result.predict() to get the predicted passing rate for each school. Try to run it below:
result.predict()
Now, let's save that value into a new column; we can call it predicted_passing. It won't work for schools that are missing pct_black or pct_free_or_reduced, though, so first we'll need to drop those rows.
df.shape
df = df.dropna(subset=['pct_black', 'pct_free_or_reduced'])
df.shape
df['predicted_passing'] = result.predict()
df.head()
Confirm that Charles W. Duval had a predicted passing rate of 32.#
df[df['School Name'].str.contains("CHARLES")]
Now let's find the difference between the actual passing rate and the predicted passing rate#
If we're being stats-y, this is called the residual. Save it into a new column called... residual.
You could calculate it as df.pct_passing - df.predicted_passing (residuals are actual minus predicted), but you can also just ask for result.resid.
df['residual'] = result.resid
df.head()
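The two versions only match if you subtract in the right order: result.resid is actual minus predicted. A quick sketch on made-up numbers confirms it:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Five made-up points with a roughly linear relationship
toy = pd.DataFrame({"x": [1.0, 2, 3, 4, 5],
                    "y": [2.1, 3.9, 6.2, 8.0, 9.9]})
result = smf.ols("y ~ x", data=toy).fit()
toy["predicted"] = result.predict()

# result.resid is actual minus predicted, so the manual version
# has to subtract in that order too
manual = toy.y - toy.predicted
assert np.allclose(manual, result.resid)
```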
Find the 10 schools that did much worse than predicted, based on the demographics we're looking at#
- PRINCETON HOUSE CHARTER should be the worst, with PEPIN ACADEMIES below that
df.sort_values(by='residual').head(10)
Find the top 10 schools that did better than predicted, based on the demographics we're looking at#
- PARKWAY MIDDLE SCHOOL should be the best, and PATHWAYS should be second
df.sort_values(by='residual', ascending=False).head(10)
What problems might our analysis have?#
We brought in two things we thought would do a good job covering socio-economic and demographic patterns. What else might we be missing?
- Tip: Pay attention to the names of the schools
Is there a difference between using the raw number for the residual as opposed to standard deviations? (See the Texas schools cheating scandal.)
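One way to explore that question: dividing each residual by the standard deviation of all the residuals re-expresses "how far off was the prediction" in how-unusual-is-this-school units instead of raw percentage points. The numbers here are made up to stand in for df['residual']:

```python
import pandas as pd

# Made-up residuals standing in for df['residual']
resid = pd.Series([-12.0, -3.0, 0.5, 4.0, 10.5])

# Dividing by the standard deviation puts every school on a common
# scale; the rescaled residuals have a standard deviation of 1
z = resid / resid.std()
print(z.round(2))
```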