The Tampa Bay Times and school performance#
Story: The story, and a critique
Author: Various parts are by various people! Nathaniel Lash did the good investigation, but we're reproducing a less-than-stellar approach here.
Topics: Linear Regression, Residuals
Datasets
- 0066897-gr04_rsch_2014.xls: 4th grader pass rates for standardized tests, from Florida Dept of Education
- FRL-1314-School-Web-Survey-3-Final.xls: Free and reduced price lunch data, from Florida Dept of Education
- MembershipSchoolRaceGender1415.xls: School population by gender, from Florida Dept of Education
What's the story?#
We're trying to see what kind of effect things like race and poverty might have on school test score data by reproducing (more or less) a critique of the Tampa Bay Times analysis. We're doing this analysis first because it's a rather typical analysis, and it will highlight what's special about the Tampa Bay Times' approach in the next section.
It's important to note that this analysis is a naive approach that yields inaccurate results. While the math is correct and the analysis is a standard approach to issues of economics and school performance, the data contains a fatal flaw and the analysis kind of misses the point. You can skip to the bottom if you'd like to see what's so wrong.
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
junk = pd.read_excel("data/0066897-gr04_rsch_2014.xls", nrows=12)
junk
The real headers don't begin until the row with "Grade," "District Number," etc., so we're going to skip the first 10 rows. We'll also need to drop the row after the header, as it's more notes.
Sometimes the school number starts with 0 - for example, Charles W. Duval Elementary is School Number 0021. Pandas (and Excel!) will assume it's a number and drop the leading zeroes, leading to a School Number of 21 instead. We'll tell .read_excel to read the column in as a string instead.
Finally, we're ignoring any cell with * in it, which is used for data that is missing or has been stripped out.
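As a quick self-contained illustration of the leading-zero problem (a toy CSV with invented numbers, not the Florida file - the same dtype trick applies to .read_excel):

```python
import io
import pandas as pd

# Toy data: school numbers with leading zeroes (hypothetical, not the real file)
csv = "school_no,score\n0021,85\n0456,72"

# Without a dtype, pandas parses school_no as an integer and drops the zeroes
numeric = pd.read_csv(io.StringIO(csv))
print(numeric.school_no.tolist())  # [21, 456]

# Forcing the column to string preserves them
as_str = pd.read_csv(io.StringIO(csv), dtype={'school_no': 'str'})
print(as_str.school_no.tolist())  # ['0021', '0456']
```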
reading_scores = pd.read_excel("data/0066897-gr04_rsch_2014.xls",
skiprows=10,
dtype={'School Number': 'str'},
na_values=['*'])
reading_scores.head(4)
We'll remove that Number of Points Possible row and the state totals row by asking for everything from row index 2 onward.
reading_scores = reading_scores.loc[2:]
reading_scores.head(4)
Checking out our data#
How does our data look? Let's use .describe()
since it's almost all numeric.
reading_scores.describe()
Lunch data#
When performing analysis of schools and poverty, lunch data is oftentimes very important.
A typical question one would ask of a school system might go something like, "how do students who live in poverty perform?" In a world with zero privacy, you'd take the test scores of these kids along with their household income and do a study to see how the two relate. Fortunately, that data is not quite available!
Instead, you typically need to start from school-level average test scores. That takes care of the academic part, and then you need to somehow relate in economic forces.
As much as your research might want it, schools don't publish the income of their students' families. You also can't reliably use the Census income data of the surrounding areas, as you don't know where exactly these students come from, or what tracts do or do not have children.
Instead, you typically use a metric that schools do publish: the rate of free and reduced lunch. According to the USDA, in schools that participate in the National School Lunch Program:
Free lunches are available to children in households with incomes at or below 130 percent of poverty.
Reduced-price lunches are available to children in households with incomes between 130 and 185 percent of poverty.
Other students may qualify as well, but roughly speaking it's a well-known proxy for poverty in education research.
The free or reduced lunch percentage is then compared to the test scores using regression, and there are your results.
First we're going to do the analysis, then we're going to talk about the limitations of this metric, along with a fatal flaw that destroys its use in this situation.
Reading in lunch data#
Again, we have * for missing data, 0021-style school identifiers that need to be read in as strings, and finally some weird column names. We'll skip the unnecessary header rows and manually label the rest of them.
colnames = ['district_no', 'district_name', 'school_no',
'school_name', 'total_members', 'num_free',
'num_reduced', 'provision_2', 'direct_cert']
lunch = pd.read_excel("data/FRL-1314-School-Web-Survey-3-Final.xls",
sheet_name='School',
skiprows=5,
dtype={'school_no': 'str', 'district_no': 'str'},
names=colnames,
na_values=['*'])
lunch.head()
Notice how even though we specified dtype
as string when we read in the district and school codes, pandas still dropped the leading zeroes! We'll use .str.pad
to add them back in.
lunch.school_no = lunch.school_no.str.pad(width=4, side='left', fillchar='0')
lunch.district_no = lunch.district_no.str.pad(width=2, side='left', fillchar='0')
lunch.head(3)
Feature engineering#
Let's add in a new column that is the percent of students who are eligible for free or reduced-price lunch.
- Free, reduced-price, Provision 2, and direct certification are all ways students qualify for free or reduced-price lunch.
- Total members is the total number of students at the school.
- If we run a regression, percentages work a lot better as 0-100 instead of 0-1, so we multiply by 100.
lunch['pct_free_or_reduced'] = (lunch.num_free + lunch.num_reduced + lunch.provision_2
+ lunch.direct_cert) / lunch.total_members * 100
lunch.head()
Race data#
Again, the xls file is a little dirty so we need to set a whole bushel of options when reading it in.
race = pd.read_excel("data/MembershipSchoolRaceGender1415.xls",
sheet_name='SCHOOL',
skiprows=4,
dtype={'School #': 'str', 'District #': 'str'},
na_values=[' ', '*'])
race.head()
Each row seems to be a different Grade, from pre-Kindergarten to 12th grade. But if you poke down far enough, there's one weird one: TOTAL!
race[13:17]
The TOTAL row adds up all the other rows for that school and provides an aggregate. Instead of doing the math ourselves to compute the racial statistics for each school, we'll use these TOTAL rows.
race = race[race.Grade == 'TOTAL']
race.head()
Notice the school name isn't the actual name of the school! It's not a problem, though - as long as we have the District # and School # we'll be able to match it up with the schools in the end.
Cleaning up race counts#
When a school has no students of a certain race, it just doesn't put anything in the column. This means a lot of NaN
values that should be zeros! Let's fill in those NaN
values with 0.
race = race.fillna(0)
race.head(3)
Feature engineering#
We'll now create a new column called pct_black that is the percentage of black students. Typically you'd take a larger view of race issues, but in this case we're just trying to reproduce what the critique's author did.
race['pct_black'] = race['Black or African American'] / race['Total']
race.head()
Merging our datasets#
Let's take a look at the first couple rows of our three datasets:
- Our reading score data
- Our free lunch data
- Our race data
reading_scores.head(2)
lunch.head(2)
race.head(2)
We need to merge them, but school numbers repeat in different districts. You'll need to join on district AND school number to successfully perform each merge.
merged = reading_scores.merge(lunch,
left_on=['District Number', 'School Number'],
right_on=['district_no', 'school_no'])
merged.shape
merged = merged.merge(race,
left_on=['district_no', 'school_no'],
right_on=['District #', 'School #'])
merged.shape
merged.head()
Cleaning up our columns#
We're only interested in a few of the columns:
- District number
- School number
- Percent passing
- Percent free or reduced lunch
- Percent Black
So let's just select only those columns.
cols = [
'District Number',
'District Name',
'School Number',
'School Name',
'Percentage Passing (Achievement Levels 3 and Above)',
'pct_free_or_reduced',
'pct_black'
]
merged = merged[cols]
merged.head()
While we're at it, we should probably rename Percentage Passing (Achievement Levels 3 and Above) to pct_passing because it's so long.
merged = merged.rename(columns={'Percentage Passing (Achievement Levels 3 and Above)': 'pct_passing'})
merged.head()
Converting to percentages#
It's really easy to get mixed up later if we don't have our percentage columns as actual percentages. Let's multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead. It looks like pct_black is the only one that needs attention.
merged['pct_black'] = merged.pct_black * 100
merged.head()
Graphing our data#
We'll use seaborn's regplot
to plot the relationship between free/reduced lunch and percent passing, and the same with percent black.
sns.regplot(data=merged,
x='pct_free_or_reduced',
y='pct_passing',
scatter_kws={'alpha':0.3})
sns.regplot(data=merged,
x='pct_black',
y='pct_passing',
scatter_kws={'alpha':0.3})
This is when people start yelling "Look at that line! It goes down! Negative correlation, case closed!" But no, we're better than this. We know about linear regression! We'll use it to quantify the relationship.
Linear regression#
Now let's be a little more exact, and run a linear regression that takes into account both percent black and percent free or reduced.
# We'll remove any rows missing data
print("Before removing missing data:", merged.shape)
merged = merged.dropna()
print("After removing missing data:", merged.shape)
import statsmodels.formula.api as smf
model = smf.ols('pct_passing ~ pct_black + pct_free_or_reduced', data=merged)
result = model.fit()
result.summary()
Understanding the regression#
Most of the time when performing a regression, we're looking for a "when X goes up this much, Y goes up this much" kind of sentence. In this case, we see our pct_free_or_reduced coefficient is about -0.4. This can be explained by a sentence like:
Controlling for race, every 10 percentage point increase in free or reduced lunch translates to a 4 point drop in passing rate
We could do the same thing with the pct_black coefficient, too.
Controlling for free or reduced lunch, every 10 percentage point increase in black students translates to a 2 point drop in passing rate
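To make that "coefficient as a sentence" idea concrete, here's a self-contained sketch on synthetic data (invented numbers, not the Florida schools) where we know the true coefficients and can watch statsmodels recover them:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (not the Florida file) with known coefficients:
# passing = 90 - 0.4 * lunch - 0.2 * black, plus a little noise
rng = np.random.default_rng(0)
fake = pd.DataFrame({
    'pct_free_or_reduced': rng.uniform(0, 100, 500),
    'pct_black': rng.uniform(0, 100, 500),
})
fake['pct_passing'] = (90
                       - 0.4 * fake.pct_free_or_reduced
                       - 0.2 * fake.pct_black
                       + rng.normal(0, 2, 500))

result = smf.ols('pct_passing ~ pct_free_or_reduced + pct_black', data=fake).fit()

# The fitted coefficient gives us the "10 points up, 4 points down" sentence
lunch_coef = result.params['pct_free_or_reduced']
print(10 * lunch_coef)  # roughly -4
```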
There's only one issue with performing linear regressions as a journalist, though: these numbers don't tell us much that's actually useful. While it's certainly an extra number you could try to throw into your story, it doesn't really mean much more than "income is more important than race."
Before we find out the flaws in our data, let's first see how to put this regression to use.
Overperformers and underperformers#
One of the most useful things you can do with a regression is make predictions. We can actually use our regression to say hey, now that you've looked at all of these schools and figured out the relationship between these variables, what should each school have scored?
Making predictions#
We can use result.predict()
to get the predicted passing rate for each school.
result.predict()
Now, let's save that value into a new column called predicted_passing.
merged['predicted_passing'] = result.predict()
merged.head()
We can use this predicted passing rate to find the difference between the actual passing rate and the predicted passing rate. That way we can see which schools did better or worse than they were predicted to do.
If we're being stats-y, this is called the residual.
merged['residual'] = merged['pct_passing'] - merged['predicted_passing']
merged.head()
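As a sanity check on the idea, here's a self-contained sketch on toy data (invented numbers, not the Florida schools) showing that "actual minus predicted" is exactly what statsmodels itself exposes as result.resid:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: a noisy linear relationship
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': np.arange(50.0)})
toy['y'] = 2 * toy.x + 5 + rng.normal(0, 3, 50)

result = smf.ols('y ~ x', data=toy).fit()

# "Actual minus predicted" matches the built-in residuals
by_hand = toy.y - result.predict()
print(np.allclose(by_hand, result.resid))  # True
```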
Using predictions#
Now we could find, for example, the top 5 schools that did better or worse than expected.
# Negative residual, did worse than expected
merged.sort_values(by='residual').head(5)
# Positive residual, did better than predicted
merged.sort_values(by='residual', ascending=False).head(5)
featured_names = [
'LAKEWOOD ELEMENTARY SCHOOL',
'CAMPBELL PARK ELEMENTARY SCHOOL',
'MAXIMO ELEMENTARY SCHOOL',
'MELROSE ELEMENTARY SCHOOL',
'FAIRMOUNT PARK ELEMENTARY SCHOOL'
]
is_pinellas = merged['District Name'] == 'PINELLAS'
is_featured = merged['School Name'].isin(featured_names)
featured = merged[is_featured & is_pinellas]
featured
From looking at our results, all of the featured Pinellas County schools are performing well below expectations. We have a few problems, though!
Exceptional student education#
If you read through the critique, it involves a feature that we don't have - exceptional students minus gifted. Exceptional students include students with all manner of disabilities or special circumstances - deaf students, emotional disabilities, autism spectrum disorders, developmental disabilities, and many more categories.
If a school has an especially high or low number of these students, it's sure to affect their passing rates. Why don't we have it in our analysis? I couldn't find the data on a school level, only on the county level. According to the author, just by adding that one column suddenly three of our residuals swing to positive, and our huge negatives come much closer to the center:
How were we supposed to know that ESE students have such an impact? This is why you always run your analysis by the experts! They'll be able to suggest other columns you might want to add to your regression, and help you understand whether their impact is meaningful or not.
For example, whether ESE students are required to take tests or not would have a large impact on whether that additional information is meaningful.
Free lunch gotchas (a big one!)#
Speaking of experts: free lunch seems pretty straightforward, right? Even though it's a popular benchmark for analysis, it turns out that things aren't so simple.
It turns out that many schools and school systems across the US have simplified the free lunch program: instead of students needing to meet certain income requirements, lunch for all students is federally funded. While these are typically high-poverty areas, you can no longer use the percent free lunch as a meaningful proxy for poverty if all of the students can receive it, regardless of income.
Additionally, this new system was adopted at different rates and in different ways across districts and states, which makes measurements even less useful!
So where does that leave us? If we can't use free lunch, we're now stuck with just using race as our only regression feature. Seems a little simplistic!
More readings on free lunch as a proxy for poverty:
So what do we do?#
Seems like it might be time for a new approach.
A different approach#
Instead of trying to see what schools performed well and what schools performed poorly, let's step back and examine our underlying assumptions. Powered by statistics, we've come to the conclusion that schools with more minority students and more poverty have lower passing rates.
This could lead to a pretty typical path for investigation: what are the over-performing or under-performing schools doing differently? And while it's a valid approach, it takes for granted that an individual student's performance is anchored by their race and income level - that rich, white students perform well, and poor, minority students perform poorly. Under that assumption, you can "fix" a school by adding rich, white kids to bring up the average, but only because their higher test scores pull that average up.
The Tampa Bay Times wanted to investigate a question that's slightly more complicated than that: were these black students performing worse in highly segregated schools than they would have in integrated schools? So not only would the school's average go up thanks to the higher average scores of rich, white students, but would integration also lift the students that the regression would expect to be low performers?
Where's the data?#
To answer this question, you need to be able to see how similar students perform as they are moved through schools with different racial makeups - segregated vs integrated. If students start at the same place, how do they perform down the road when their education has been in different environments?
The Tampa Bay Times was able to crack this nut using a test given to new kindergarteners that assessed their readiness to learn how to read. By tracking students' performance between kindergarten and their first reading test in third grade, the Times could see what kind of differences might exist between the students in each environment.
Our new analysis#
Let's start by reading in our pre-cleaned data. Note that this is a completely different dataset from what we were using before, as it was organized and cleaned up for a presentation from NICAR.
The column names make it pretty obvious as to what's what.
We've cleaned the dataset from that notebook a bit to make all percentages 0-100 instead of 0-1, but that's the only change.
df = pd.read_csv("data/FL_elementaries_adjusted.csv")
df = df.dropna()
df.head()
The percent_pre_kinder_ready
column is going to be our judge of a cohort's starting point. From the presentation:
How well the kids do before they’ve spent a day in the class is critical. Because what if the kids are coming to school less prepared than every other kid in the state of Florida? How do we show it’s not just something in the water that the school board has no control over?
Definitely read their walkthrough, as it's a great visual approach that's very different than what we're doing here!
The question we're going to be looking at here is: given a set of students that start off at the same place, do they end up performing worse if a school is less integrated? We'll first try making our integration judgement using percent_black_enrollment
(note that we're ignoring free lunch here because of the reasons mentioned above).
# How does kindergarten readiness and percent black relate to 3rd grade passing rates?
model = smf.ols("percent_passing_reading ~ percent_pre_kinder_ready + percent_black_enrollment", data=df)
result = model.fit()
result.summary()
For students that start off at the same place: as black enrollment climbs, test scores go down. But this isn't actually what we're looking for, since our question isn't so much "what is the effect of a certain percentage of black students," and more a question of integration. A simplification might be: what happens if a school is majority black?
While we could compute majority black easily, it's also one of the columns!
# How do kindergarten readiness and being a majority-black school relate to 3rd grade passing rates?
model = smf.ols("percent_passing_reading ~ percent_pre_kinder_ready + majority_black_enrollment_flag", data=df)
result = model.fit()
result.summary()
The coefficient of -15.58 for majority_black_enrollment_flag[T.True]
means that when that column switches from 0 (minority black) to 1 (majority black), the passing rate drops by 15.58 points. According to our analysis, even if a school's students start off equally prepared, being a majority-black school translates into a drop of over 15 points in the reading test pass rate.
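The [T.True] in the coefficient name comes from statsmodels (via patsy) treating a boolean column as categorical. A self-contained sketch on toy data (invented numbers, not the Florida schools) showing the naming and the interpretation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: a boolean flag that shifts scores down by 15 points
rng = np.random.default_rng(2)
toy = pd.DataFrame({'flag': rng.random(400) < 0.5})
toy['score'] = 60 - 15 * toy.flag + rng.normal(0, 2, 400)

result = smf.ols('score ~ flag', data=toy).fit()

# patsy treats the boolean as categorical, hence the [T.True] label:
# "switching flag from False to True moves score by this much"
print(result.params['flag[T.True]'])  # close to -15
```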
And those specific Pinellas schools we're interested in?
# What was the predicted passing rate for each school
df['predicted_passing'] = result.predict()
# How much different they performed compared to the expectation
df['residual'] = df['percent_passing_reading'] - df['predicted_passing']
df[df.FF_school_flag]
Even when adjusting for initial kindergarten readiness and majority/minority black enrollment, the featured schools are still underperforming.
In the end, all this analysis boiled down to these two sentences:
A Times analysis of statewide kindergarten readiness data shows that new students in Pinellas County’s most segregated schools show up no less prepared than students in scores of other struggling, high-poverty schools. It’s only after a few years in Pinellas classrooms that they’re falling behind their peers statewide.
Discussion topics#
Do ESE students really have an effect on our results? Check out the names of some of the schools that performed far below predicted. Do the names tell you anything about the students who go there?
When looking at the difference between a school's predicted and actual passing rate, we used the raw residual. What might be a downside? Think about schools with a high predicted score compared to those with a low predicted score.
An alternative to the raw residual would be the standard deviation, which we used to analyze cheating Texas schools. Why might standard deviation be a more effective measure of under/over performance?
In our second regression, we found that even after testing equally while entering kindergarten, students at a majority-black school tend to perform worse on their third-grade reading tests. How is this different from the findings of our first regression, where passing rates fell as the percentage of black students rose?
What policy changes or investigation angles would you suggest after our initial analysis? What would change after our second analysis?
Our final analysis only resulted in two sentences in the final published piece. Despite that, how might the findings have colored the rest of our research?
Which regression does a better job representing the state of passing reading tests? It isn't an emotional appeal, there's actually a mathematical answer! You might need to check back in our evaluating regressions page if you forget how to compare regressions.
In our second analysis, why is there a difference between using percent black enrollment as compared to majority black?
In this analysis we used 50% black as the dividing line between segregated and integrated schools. What might have changed if we pushed this number to something higher, like 75% or even 90%?
While we couldn't reproduce it, the critique stated that three of the five "failure factories" schools outperformed their peers. Let's say this was true: does it change our findings on the relationship between school performance and majority/minority black enrollment?
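For the raw-residual question above, here's one hedged sketch (toy data, invented numbers) of how you might standardize residuals so that "how unusual is this school?" means the same thing everywhere:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: a noisy linear relationship
rng = np.random.default_rng(3)
toy = pd.DataFrame({'x': rng.uniform(0, 100, 200)})
toy['y'] = 80 - 0.5 * toy.x + rng.normal(0, 5, 200)

result = smf.ols('y ~ x', data=toy).fit()
toy['residual'] = result.resid

# Express each residual in standard deviations from the mean residual,
# so a value of 1.5 means "1.5 sd above expectation" for every school
toy['std_residual'] = (toy.residual - toy.residual.mean()) / toy.residual.std()

print(toy.std_residual.std())  # ≈ 1.0
```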