The Tampa Bay Times and school performance#

Story: The story, and a critique

Author: Various parts are various people! Nathaniel Lash did the good investigation, but we're reproducing a less-than-stellar approach here.

Topics: Linear Regression, Residuals

Datasets

  • 0066897-gr04_rsch_2014.xls: 4th grader pass rates for standardized tests, from Florida Dept of Education
  • FRL-1314-School-Web-Survey-3-Final.xls: Free and reduced price lunch data, from Florida Dept of Education
  • MembershipSchoolRaceGender1415.xls: School population by gender, from Florida Dept of Education

What's the story?#

We're trying to see what kind of effect things like race and poverty might have on school test score data. In this section, our analysis is a naive approach that yields inaccurate results. While the math is correct the data behind it contains a fatal flaw.

Imports#

We'll want pandas and seaborn. You'll want want to set pandas to display a lot of columns and rows at a time.

import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)

Reading in our data#

We'll start by reading in the dataset, being sure to read in the district and school number as strings in case we need to merge on anything later. If pandas gets its way, it would read the district/school numbers in as integers and turn something like 0001 into 1. This is unbelievably common when reading datasets with IDs into Excel or pandas, and is always something you should watch out for!

df = pd.read_csv("data/data-cleaned-merged.csv", dtype={'District Number': str, 'School Number': 'str'})
df.head()
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black
0 01 ALACHUA 0021 CHARLES W. DUVAL ELEM SCHOOL 36.0 0.959119 0.903226
1 01 ALACHUA 0031 J. J. FINLEY ELEMENTARY SCHOOL 74.0 0.546689 0.287375
2 01 ALACHUA 0041 STEPHEN FOSTER ELEMENTARY SCHOOL 71.0 0.606987 0.383158
3 01 ALACHUA 0052 A.QUINN JONES/EXCEP.STUDENT CENTER NaN 0.802817 0.666667
4 01 ALACHUA 0071 LAKE FOREST ELEMENTARY SCHOOL 19.0 0.957655 0.849231

This dataset includes school information, as well as

  • The percentage of students who passed their 4th grade standardized reading test
  • The percentage of students receiving free or reduced price lunch, as a proxy for poverty
  • The percentage of students that are Black/African-American

Converting to percentages#

It's really easy to get mixed up later if we don't have our percentage columns as actual presents. Multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead.

  • Tip: Make sure your numbers are 1-100 after you multiply!
df.pct_free_or_reduced = df.pct_free_or_reduced * 100
df.pct_black = df.pct_black * 100
df.head()
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black
0 01 ALACHUA 0021 CHARLES W. DUVAL ELEM SCHOOL 36.0 95.911950 90.322581
1 01 ALACHUA 0031 J. J. FINLEY ELEMENTARY SCHOOL 74.0 54.668930 28.737542
2 01 ALACHUA 0041 STEPHEN FOSTER ELEMENTARY SCHOOL 71.0 60.698690 38.315789
3 01 ALACHUA 0052 A.QUINN JONES/EXCEP.STUDENT CENTER NaN 80.281690 66.666667
4 01 ALACHUA 0071 LAKE FOREST ELEMENTARY SCHOOL 19.0 95.765472 84.923077

Graphing our data#

Use seaborn's regplot to plot the relationship between free/reduced lunch and percent passing, and the same with percent black and percent passing.

  • Tip: You can use scatter_kws={'alpha':0.3} to see things a bit more nicely
sns.regplot(data=df, x='pct_free_or_reduced', y='pct_passing', scatter_kws={'alpha':0.3})
<matplotlib.axes._subplots.AxesSubplot at 0x117d69358>
sns.regplot(data=df, x='pct_black', y='pct_passing', scatter_kws={'alpha':0.3})
<matplotlib.axes._subplots.AxesSubplot at 0x117ea7208>

Linear regression#

Now let's be a little more exact: run a linear regression for percent passing that takes into account both percent black and percent free or reduced.

import statsmodels.formula.api as smf

model = smf.ols("pct_passing ~ pct_black + pct_free_or_reduced", data=df)
result = model.fit()
result.summary()
OLS Regression Results
Dep. Variable: pct_passing R-squared: 0.575
Model: OLS Adj. R-squared: 0.575
Method: Least Squares F-statistic: 1398.
Date: Thu, 07 Nov 2019 Prob (F-statistic): 0.00
Time: 12:58:10 Log-Likelihood: -7963.4
No. Observations: 2070 AIC: 1.593e+04
Df Residuals: 2067 BIC: 1.595e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 89.3659 0.707 126.460 0.000 87.980 90.752
pct_black -0.2041 0.011 -18.669 0.000 -0.226 -0.183
pct_free_or_reduced -0.3984 0.012 -34.271 0.000 -0.421 -0.376
Omnibus: 178.385 Durbin-Watson: 1.569
Prob(Omnibus): 0.000 Jarque-Bera (JB): 560.988
Skew: -0.423 Prob(JB): 1.52e-122
Kurtosis: 5.406 Cond. No. 213.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Describe the relationship coefficient using "real" words#

For example, "For every X change, we get Y change"

 

Overperformers and underperformers#

The point of the regression is to predict the percent passing, right? We can use result.predict() to get the predicted passing rate for each school. Try to run it below:

result.predict()

Now, let's save that value into a new column, we can call it predicted_passing. It won't work for schools that are missing pct_black or pct_free_or_reduced, though, so first we'll need to drop those rows.

df.shape
(2070, 7)
df = df.dropna(subset=['pct_black', 'pct_free_or_reduced'])
df.shape
(2070, 7)
df['predicted_passing'] = result.predict()
df.head()
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black predicted_passing
0 01 ALACHUA 0021 CHARLES W. DUVAL ELEM SCHOOL 36.0 95.911950 90.322581 32.722663
1 01 ALACHUA 0031 J. J. FINLEY ELEMENTARY SCHOOL 74.0 54.668930 28.737542 61.722168
2 01 ALACHUA 0041 STEPHEN FOSTER ELEMENTARY SCHOOL 71.0 60.698690 38.315789 57.365169
4 01 ALACHUA 0071 LAKE FOREST ELEMENTARY SCHOOL 19.0 95.765472 84.923077 33.883060
6 01 ALACHUA 0091 LITTLEWOOD ELEMENTARY SCHOOL 56.0 59.394904 30.733229 59.432166

Confirm that Charles W. Duval had a predicted passing rate of 32.#

df[df['School Name'].str.contains("CHARLES")]
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black predicted_passing
0 01 ALACHUA 0021 CHARLES W. DUVAL ELEM SCHOOL 36.0 95.911950 90.322581 32.722663
249 06 BROWARD 3221 CHARLES DREW ELEMENTARY SCHOOL 38.0 96.194825 66.998342 37.370480
349 10 CLAY 0071 CHARLES E. BENNETT ELEMENTARY SCHOO 57.0 78.465347 16.586538 54.722458
495 13 MIAMI DADE 1401 CHARLES R DREW K-8 CENTER 25.0 98.039216 91.964286 31.540153
530 13 MIAMI DADE 2331 CHARLES R HADLEY ELEM SCHOOL 66.0 84.563107 0.203459 55.637109
691 13 MIAMI DADE 5991 CHARLES DAVID WYCHE JR ELEMENTARY 46.0 89.185393 17.622378 50.240512

Now let's find the difference between the predicted passing rate and the actual passing rate#

If we're being stats-y, this is called the residual. Save it into a new column called.... residual.

You could calculate it as df.predicted_passing - df.pct_passing but you can also just ask for result.resid.

df['residual'] = result.resid
df.head()
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black predicted_passing residual
0 01 ALACHUA 0021 CHARLES W. DUVAL ELEM SCHOOL 36.0 95.911950 90.322581 32.722663 3.277337
1 01 ALACHUA 0031 J. J. FINLEY ELEMENTARY SCHOOL 74.0 54.668930 28.737542 61.722168 12.277832
2 01 ALACHUA 0041 STEPHEN FOSTER ELEMENTARY SCHOOL 71.0 60.698690 38.315789 57.365169 13.634831
4 01 ALACHUA 0071 LAKE FOREST ELEMENTARY SCHOOL 19.0 95.765472 84.923077 33.883060 -14.883060
6 01 ALACHUA 0091 LITTLEWOOD ELEMENTARY SCHOOL 56.0 59.394904 30.733229 59.432166 -3.432166

Find the 10 schools that did much worse than predicted, based on the demographics we're looking at#

  • PRINCETON HOUSE CHARTER should be the worst, with PEPIN ACADEMIES below that
df.sort_values(by='residual').head(10)
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black predicted_passing residual
1410 48 ORANGE 0055 PRINCETON HOUSE CHARTER 9.0 25.000000 14.743590 76.397521 -67.397521
1073 29 HILLSBOROUGH 6609 PEPIN ACADEMIES 4.0 37.653240 24.747475 69.315062 -65.315062
2180 68 FSDB 0011 DEAF ELEMENTARY SCHOOL (FSDB) 7.0 69.491525 14.782609 58.665531 -51.665531
1717 50 PALM BEACH 4037 LEARNING PATH ACADEMY 0.0 93.650794 14.569536 49.084725 -49.084725
1948 53 POLK 9203 B.E.S.T. 7.0 69.767442 50.000000 51.367702 -44.367702
420 12 COLUMBIA 0400 SHINING STAR ACADEMY OF THE ARTS 43.0 4.721030 4.950495 86.474808 -43.474808
1424 48 ORANGE 0185 RENAISSANCE CHTR SCH AT CHICKASAW 44.0 0.696056 10.146444 87.017732 -43.017732
1819 52 PINELLAS 3231 RICHARD L. SANDERS SCHOOL 17.0 50.000000 51.886792 58.857333 -41.857333
879 22 GLADES 0056 PEMAYETV EMAHAKV CHARTER OUR WAY SC 48.0 0.000000 0.581395 89.247257 -41.247257
1810 52 PINELLAS 2441 CHI CHI RODRIQUEZ ACADEMY 15.0 74.025974 26.760563 54.414434 -39.414434

Find the top 10 schools that did better than predicted, based on the demographics we're looking at#

  • PARKWAY MIDDLE SCHOOL should be the best, and PATHWAYS should be second
df.sort_values(by='residual', ascending=False).head(10)
District Number District Name School Number School Name pct_passing pct_free_or_reduced pct_black predicted_passing residual
165 06 BROWARD 0701 PARKWAY MIDDLE SCHOOL 100.0 85.758706 86.286788 37.591106 62.408894
304 06 BROWARD 5372 PATHWAYS ACADEMY K-8 CENTER 83.0 95.652174 64.206642 38.156444 44.843556
661 13 MIAMI DADE 5131 N DADE CENTER FOR MODERN LANGUAGE 89.0 76.767677 57.323232 47.084347 41.915653
566 13 MIAMI DADE 3101 FRANK CRAWFORD MARTIN K-8 CENTER 91.0 54.096916 59.079284 55.757254 35.242746
2187 74 FAMU LAB SCH 0351 FAMU DEVELOP RESEARCH 77.0 68.710359 96.881497 42.220239 34.779761
281 06 BROWARD 5021 SOMERSET NEIGHBORHOOD SCHOOL 77.0 73.611111 80.834915 43.543061 33.456939
1943 53 POLK 8121 HARTRIDGE ACADEMY 96.0 61.157025 9.243697 63.116233 32.883767
481 13 MIAMI DADE 1001 CORAL PARK ELEMENTARY SCHOOL 90.0 77.669903 0.688468 58.284153 31.715847
285 06 BROWARD 5031 CHARTER SCHOOL OF EXCELLENCE 77.0 74.817518 68.592058 45.561248 31.438752
677 13 MIAMI DADE 5561 FRANCES S. TUCKER ELEM. SCHOOL 75.0 93.253012 41.504854 43.745658 31.254342

What problems might our analysis have?#

We brought in two things we thought would do a good job covering socio-economic and demographic patterns. What else might we be missing?

  • Tip: Pay attention to the names of the schools

Is there a between using the raw number for the residual as opposed to standard deviation? (See Texas schools cheating scandal)