# The Tampa Bay Times and school performance#

**Story:** The story, and a critique

**Author:** Various parts are various people! Nathaniel Lash did the good investigation, but we're reproducing a less-than-stellar approach here.

**Topics:** Linear Regression, Residuals

**Datasets**

**0066897-gr04_rsch_2014.xls:**4th grader pass rates for standardized tests, from Florida Dept of Education**FRL-1314-School-Web-Survey-3-Final.xls:**Free and reduced price lunch data, from Florida Dept of Education**MembershipSchoolRaceGender1415.xls:**School population by gender, from Florida Dept of Education

## What's the story?#

We're trying to see what kind of effect things like race and poverty might have on school test score data. In this section, **our analysis is a naive approach that yields inaccurate results.** While the *math* is correct the data behind it contains a fatal flaw.

### Imports#

We'll want pandas and seaborn. You'll want want to set pandas to display a lot of columns and rows at a time.

```
import pandas as pd
import seaborn as sns
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)
```

## Reading in our data#

We'll start by reading in the dataset, being sure to read in the **district and school number as strings** in case we need to merge on anything later. If pandas gets its way, it would read the district/school numbers in as integers and turn something like `0001`

into `1`

. This is unbelievably common when reading datasets with IDs into Excel or pandas, and is always something you should watch out for!

```
df = pd.read_csv("data/data-cleaned-merged.csv", dtype={'District Number': str, 'School Number': 'str'})
df.head()
```

This dataset includes school information, as well as

- The percentage of students who passed their 4th grade standardized reading test
- The percentage of students receiving free or reduced price lunch, as a proxy for poverty
- The percentage of students that are Black/African-American

### Converting to percentages#

It's really easy to get mixed up later if we don't have our percentage columns as actual presents. Multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead.

**Tip:**Make sure your numbers are 1-100 after you multiply!

```
df.pct_free_or_reduced = df.pct_free_or_reduced * 100
df.pct_black = df.pct_black * 100
df.head()
```

# Graphing our data#

Use seaborn's `regplot`

to plot the relationship between free/reduced lunch and percent passing, and the same with percent black and percent passing.

**Tip:**You can use`scatter_kws={'alpha':0.3}`

to see things a bit more nicely

```
sns.regplot(data=df, x='pct_free_or_reduced', y='pct_passing', scatter_kws={'alpha':0.3})
```

```
sns.regplot(data=df, x='pct_black', y='pct_passing', scatter_kws={'alpha':0.3})
```

# Linear regression#

Now let's be a little more exact: run a linear regression for percent passing that takes into account both percent black and percent free or reduced.

```
import statsmodels.formula.api as smf
model = smf.ols("pct_passing ~ pct_black + pct_free_or_reduced", data=df)
result = model.fit()
result.summary()
```

## Describe the relationship coefficient using "real" words#

For example, "For every X change, we get Y change"

```
```

# Overperformers and underperformers#

The point of the regression is to predict the percent passing, right? We can use `result.predict()`

to get the predicted passing rate for each school. Try to run it below:

```
result.predict()
```

Now, let's **save that value into a new column**, we can call it `predicted_passing`

. It won't work for schools that are missing `pct_black`

or `pct_free_or_reduced`

, though, so first we'll need to drop those rows.

```
df.shape
```

```
df = df.dropna(subset=['pct_black', 'pct_free_or_reduced'])
df.shape
```

```
df['predicted_passing'] = result.predict()
df.head()
```

### Confirm that Charles W. Duval had a predicted passing rate of 32.#

```
df[df['School Name'].str.contains("CHARLES")]
```

## Now let's find the difference between the predicted passing rate and the actual passing rate#

If we're being stats-y, this is called **the residual**. Save it into a new column called.... `residual`

.

You could calculate it as `df.predicted_passing - df.pct_passing`

but you can also just ask for `result.resid`

.

```
df['residual'] = result.resid
df.head()
```

### Find the 10 schools that did much worse than predicted, based on the demographics we're looking at#

- PRINCETON HOUSE CHARTER should be the worst, with PEPIN ACADEMIES below that

```
df.sort_values(by='residual').head(10)
```

### Find the top 10 schools that did better than predicted, based on the demographics we're looking at#

- PARKWAY MIDDLE SCHOOL should be the best, and PATHWAYS should be second

```
df.sort_values(by='residual', ascending=False).head(10)
```

# What problems might our analysis have?#

We brought in two things we thought would do a good job covering socio-economic and demographic patterns. What else might we be missing?

**Tip:**Pay attention to the names of the schools

Is there a between using the raw number for the residual as opposed to standard deviation? (See Texas schools cheating scandal)

```
```