# The Tampa Bay Times and school performance

**Story:** The story, and a critique

**Author:** A million people, but I believe Nathaniel Lash did the data analysis

**Topics:** Linear Regression, Residuals

**Datasets**

- **data-cleaned-merged.csv:** The three files below, cleaned and merged
- **0066897-gr04_rsch_2014.xls:** 4th-grade pass rates for standardized tests, from the Florida Dept of Education
- **FRL-1314-School-Web-Survey-3-Final.xls:** Free and reduced-price lunch data, from the Florida Dept of Education
- **MembershipSchoolRaceGender1415.xls:** School population by race and gender, from the Florida Dept of Education

# What's the story?

We're trying to see what kind of effect race and poverty have on school test score data. The published story doesn't include a regression, but the Times used one behind the scenes for research.

## Imports

You'll want pandas and seaborn. You'll also want to set pandas to display a lot of columns and rows at a time.

```
import pandas as pd
import seaborn as sns
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)
```


# Reading in our data

Read in the dataset, being sure to read in the district and school numbers as strings in case we need to merge on anything later. If pandas gets its way, it will read them in as integers and turn `0001` into `1`. This dataset includes school information, as well as:

- The percentage of students who passed their 4th grade standardized reading test
- The percentage of students receiving free or reduced price lunch, as a proxy for poverty
- The percentage of students that are Black/African-American
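As a quick illustration of why the string dtype matters, here's a sketch using a tiny inline CSV (made-up district and school codes, not the real file):

```python
import io
import pandas as pd

csv = io.StringIO("District Number,School Number,pct_passing\n01,0021,54\n13,0341,61\n")

# Without dtype, pandas infers integers and drops the leading zeros
defaults = pd.read_csv(csv)
print(defaults['School Number'].tolist())   # [21, 341]

csv.seek(0)
# Forcing str keeps the codes exactly as written, so later merges line up
as_text = pd.read_csv(csv, dtype={'District Number': str, 'School Number': str})
print(as_text['School Number'].tolist())    # ['0021', '0341']
```

A merge between `21` and `"0021"` silently matches nothing, which is why we force strings up front.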

```
df = pd.read_csv("data/data-cleaned-merged.csv", dtype={'District Number': str, 'School Number': str})
df.head()
```

### Converting to percentages

It's really easy to get mixed up later if we don't have our percentage columns as actual percentages. Multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead.

**Tip:** Make sure your numbers are 0-100 after you multiply!

```
df.pct_free_or_reduced = df.pct_free_or_reduced * 100
df.pct_black = df.pct_black * 100
df.head()
```

# Graphing our data

Use seaborn's `regplot` to plot the relationship between free/reduced lunch and percent passing, and do the same with percent black and percent passing.

**Tip:** You can use `scatter_kws={'alpha':0.3}` to see things a bit more nicely.

```
sns.regplot(data=df, x='pct_free_or_reduced', y='pct_passing', scatter_kws={'alpha':0.3})
```

```
sns.regplot(data=df, x='pct_black', y='pct_passing', scatter_kws={'alpha':0.3})
```

# Linear regression

Now let's be a little more exact: run a linear regression for percent passing that takes into account both percent black and percent free or reduced.

```
import statsmodels.formula.api as smf
model = smf.ols("pct_passing ~ pct_black + pct_free_or_reduced", data=df)
result = model.fit()
result.summary()
```

## Describe the relationship coefficients using "real" words

For example, "For every X change, we get Y change."

```
```
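If it helps, the fitted coefficients can be pulled straight out of `result.params`. A minimal sketch on made-up data with a known relationship (the real file isn't loaded here):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data where passing drops exactly 0.5 points per point of
# pct_black and 0.25 points per point of pct_free_or_reduced
fake = pd.DataFrame({
    'pct_black': [10, 20, 30, 40, 50, 60],
    'pct_free_or_reduced': [20, 35, 50, 40, 70, 80],
})
fake['pct_passing'] = 90 - 0.5 * fake.pct_black - 0.25 * fake.pct_free_or_reduced

result = smf.ols("pct_passing ~ pct_black + pct_free_or_reduced", data=fake).fit()

# result.params maps each term in the formula to its coefficient
print(result.params['pct_black'])            # roughly -0.5
print(result.params['pct_free_or_reduced'])  # roughly -0.25
```

On the real dataset you'd read the same two entries off `result.params` and phrase each one as "holding the other variable constant, every one-point increase in X is associated with a Y-point change in the passing rate."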

# Overperformers and underperformers

The point of the regression is to predict the percent passing, right? We can use `result.predict()` to get the predicted passing rate for each school. Try to run it below:

```
result.predict()
```

Now, let's **save that value into a new column**; we can call it `predicted_passing`. It won't work for schools that are missing `pct_black` or `pct_free_or_reduced`, though, so first we'll need to drop those rows.

```
df.shape
```

```
df = df.dropna(subset=['pct_black', 'pct_free_or_reduced'])
df.shape
```

```
df['predicted_passing'] = result.predict()
df.head()
```

### Confirm that Charles W. Duval had a predicted passing rate of 32.

```
df[df['School Name'].str.contains("CHARLES")]
```

## Now let's find the difference between the predicted passing rate and the actual passing rate

If we're being stats-y, this is called **the residual**. Save it into a new column called.... `residual`.

You could calculate it as `df.pct_passing - df.predicted_passing`, but you can also just ask for `result.resid`.

```
df['residual'] = result.resid
df.head()
```

### Find the 10 schools that did much worse than predicted, based on the demographics we're looking at

- PRINCETON HOUSE CHARTER should be the worst, with PEPIN ACADEMIES below that

```
df.sort_values(by='residual').head(10)
```

### Find the top 10 schools that did better than predicted, based on the demographics we're looking at

- PARKWAY MIDDLE SCHOOL should be the best, and PATHWAYS should be second

```
df.sort_values(by='residual', ascending=False).head(10)
```

# What problems might our analysis have?

We brought in two variables we thought would do a good job covering socio-economic and demographic patterns. What else might we be missing?

**Tip:** Pay attention to the names of the schools

Is there a difference between using the raw number for the residual as opposed to standard deviations? (See the Texas schools cheating scandal.)

```
```
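One way to explore that last question: divide each residual by the standard deviation of all the residuals, so schools are ranked by how *unusual* they are rather than by raw percentage points. A minimal sketch on invented numbers (the column names mirror the ones above, but the data is made up):

```python
import pandas as pd

# Invented residuals standing in for the real ones computed above
df = pd.DataFrame({
    'School Name': ['A', 'B', 'C', 'D'],
    'residual': [-20.0, -5.0, 5.0, 20.0],
})

# z-score: how many standard deviations each school sits from the trend line
# (OLS residuals average to zero, so dividing by the std is enough)
df['residual_z'] = df.residual / df.residual.std()
print(df.sort_values('residual_z'))
```

Within a single regression the z-scored ranking matches the raw one, since it's just a rescaling, but it starts to matter when you compare across different models or years, where a 10-point miss might be routine in one and extraordinary in another.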