The Tampa Bay Times and school performance#
Story: The story, and a critique
Author: Various parts are by various people! Nathaniel Lash did the good investigation, but we're reproducing a less-than-stellar approach here.
Topics: Linear Regression, Residuals
Datasets
- 0066897-gr04_rsch_2014.xls: 4th grader pass rates for standardized tests, from Florida Dept of Education
- FRL-1314-School-Web-Survey-3-Final.xls: Free and reduced price lunch data, from Florida Dept of Education
- MembershipSchoolRaceGender1415.xls: School population by gender, from Florida Dept of Education
What's the story?#
We're trying to see what kind of effect things like race and poverty might have on school test score data by reproducing (more or less) a critique of the Tampa Bay Times analysis. We're doing this analysis first because it's a rather typical analysis, and it will highlight what's special about the Tampa Bay Times' approach in the next section.
It's important to note that this analysis is a naive approach that yields inaccurate results. While the math is correct and the analysis is a standard approach to issues of economics and school performance, the data contains a fatal flaw and the analysis kind of misses the point. You can skip to the bottom if you'd like to see what's so wrong.
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
junk = pd.read_excel("data/0066897-gr04_rsch_2014.xls", nrows=12)
junk
The real headers don't begin until the row with "Grade," "District Number," etc., so we're going to skip the first 10 rows. We'll also need to drop the row after the header, as it's more notes.
Sometimes the school number starts with 0 - for example, Charles W. Duval Elementary is School Number 0021. Pandas (and Excel!) will assume it's a number and drop the leading zeroes, leading to a School Number of 21 instead. We'll tell .read_excel to read the column in as a string instead.
Finally, we're ignoring any cell with * in it, which is used for data that is missing or has been stripped out.
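As a quick self-contained illustration of the leading-zero problem (a toy CSV with invented numbers, not the Florida file - the same dtype trick applies to .read_excel):

```python
import io
import pandas as pd

# Toy data: school numbers with leading zeroes (hypothetical, not the real file)
csv = "school_no,score\n0021,85\n0456,72"

# Without a dtype, pandas parses school_no as an integer and drops the zeroes
numeric = pd.read_csv(io.StringIO(csv))
print(numeric.school_no.tolist())  # [21, 456]

# Forcing the column to string preserves them
as_str = pd.read_csv(io.StringIO(csv), dtype={'school_no': 'str'})
print(as_str.school_no.tolist())  # ['0021', '0456']
```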
reading_scores = pd.read_excel("data/0066897-gr04_rsch_2014.xls",
skiprows=10,
dtype={'School Number': 'str'},
na_values=['*'])
reading_scores.head(4)
We'll remove that Number of Points Possible row and the state totals row by asking for everything from row index 2 onward.
reading_scores = reading_scores.loc[2:]
reading_scores.head(4)
Checking out our data#
How does our data look? Let's use .describe()
since it's almost all numeric.
reading_scores.describe()
Lunch data#
When performing analysis of schools and poverty, lunch data is oftentimes very important.
A typical question one would ask of a school system might go something like, "how do students who live in poverty perform?" In a world with zero privacy, you'd take the test scores of these kids along with their household income and do a study to see how the two relate. Fortunately, that data is not quite available!
Instead, you typically need to start from school-level average test scores. That takes care of the academic part, and then you need to somehow relate in economic forces.
As much as your research might want it, schools don't publish the income of their students' families. You also can't reliably use the Census income data of the surrounding areas, as you don't know where exactly these students come from, or what tracts do or do not have children.
Instead, you typically use a metric that schools do publish: the rate of free and reduced lunch. According to the USDA, in schools that participate in the National School Lunch Program:
Free lunches are available to children in households with incomes at or below 130 percent of poverty.
Reduced-price lunches are available to children in households with incomes between 130 and 185 percent of poverty.
Other students may qualify as well, but roughly speaking it's a well-known proxy for poverty in education research.
The free or reduced lunch percentage is then compared to the test scores using regression, and there are your results.
First we're going to do the analysis, then we're going to talk about the limitations of this metric, along with a fatal flaw that destroys its use in this situation.
Reading in lunch data#
Again, we have * for missing data, 0021-style school identifiers that need to be read in as strings, and finally some weird column names. We'll skip the unnecessary header rows and manually label the rest of them.
colnames = ['district_no', 'district_name', 'school_no',
'school_name', 'total_members', 'num_free',
'num_reduced', 'provision_2', 'direct_cert']
lunch = pd.read_excel("data/FRL-1314-School-Web-Survey-3-Final.xls",
sheet_name='School',
skiprows=5,
dtype={'school_no': 'str', 'district_no': 'str'},
names=colnames,
na_values=['*'])
lunch.head()
Notice how even though we specified dtype
as string when we read in the district and school codes, pandas still dropped the leading zeroes! We'll use .str.pad
to add them back in.
lunch.school_no = lunch.school_no.str.pad(width=4, side='left', fillchar='0')
lunch.district_no = lunch.district_no.str.pad(width=2, side='left', fillchar='0')
lunch.head(3)
Feature engineering#
Let's add in a new column that is the percent of students who are eligible for free or reduced-price lunch.
- Free, reduced-price, Provision 2, and direct certification are all ways students qualify for free or reduced-price lunch.
- Total members is the total number of students at the school.
- If we run a regression, percentages work a lot better as 0-100 instead of 0-1, so we multiply by 100.
lunch['pct_free_or_reduced'] = (lunch.num_free + lunch.num_reduced + lunch.provision_2
+ lunch.direct_cert) / lunch.total_members * 100
lunch.head()
Race data#
Again, the xls file is a little dirty so we need to set a whole bushel of options when reading it in.
race = pd.read_excel("data/MembershipSchoolRaceGender1415.xls",
sheet_name='SCHOOL',
skiprows=4,
dtype={'School #': 'str', 'District #': 'str'},
na_values=[' ', '*'])
race.head()
Each row seems to be a different Grade, from pre-Kindergarten to 12th grade. But if you poke down far enough, there's one weird one: TOTAL!
race[13:17]
The TOTAL row adds up all the other rows for that school and provides an aggregate. Instead of doing the math ourselves to compute the racial statistics for each school, we'll use these TOTAL rows.
race = race[race.Grade == 'TOTAL']
race.head()
Notice the school name isn't the actual name of the school! It's not a problem, though - as long as we have the District # and School # we'll be able to match it up with the schools in the end.
Cleaning up race counts#
When a school has no students of a certain race, it just doesn't put anything in the column. This means a lot of NaN
values that should be zeros! Let's fill in those NaN
values with 0.
race = race.fillna(0)
race.head(3)
Feature engineering#
We'll now create a new column called pct_black that is the percentage of black students. Typically you'd take a larger view of race issues, but in this case we're just trying to reproduce what the critique's author did.
race['pct_black'] = race['Black or African American'] / race['Total']
race.head()
Merging our datasets#
Let's take a look at the first couple rows of our three datasets:
- Our reading score data
- Our free lunch data
- Our race data
reading_scores.head(2)
lunch.head(2)
race.head(2)
We need to merge them, but school numbers repeat in different districts. You'll need to join on district AND school number to successfully perform each merge.
merged = reading_scores.merge(lunch,
left_on=['District Number', 'School Number'],
right_on=['district_no', 'school_no'])
merged.shape
merged = merged.merge(race,
left_on=['district_no', 'school_no'],
right_on=['District #', 'School #'])
merged.shape
merged.head()
Cleaning up our columns#
We're only interested in a few of the columns:
- District number
- School number
- Percent passing
- Percent free or reduced lunch
- Percent Black
So let's just select only those columns.
cols = [
'District Number',
'District Name',
'School Number',
'School Name',
'Percentage Passing (Achievement Levels 3 and Above)',
'pct_free_or_reduced',
'pct_black'
]
merged = merged[cols]
merged.head()
While we're at it, we should probably rename Percentage Passing (Achievement Levels 3 and Above) to pct_passing because it's so long.
merged = merged.rename(columns={'Percentage Passing (Achievement Levels 3 and Above)': 'pct_passing'})
merged.head()
Converting to percentages#
It's really easy to get mixed up later if we don't have our percentage columns as actual percentages. Let's multiply any percentages that go 0-1 by 100 to turn them into 0-100 instead. It looks like pct_black is the only one that needs attention.
merged['pct_black'] = merged.pct_black * 100
merged.head()
Graphing our data#
We'll use seaborn's regplot
to plot the relationship between free/reduced lunch and percent passing, and the same with percent black.
sns.regplot(data=merged,
x='pct_free_or_reduced',
y='pct_passing',
scatter_kws={'alpha':0.3})
sns.regplot(data=merged,
x='pct_black',
y='pct_passing',
scatter_kws={'alpha':0.3})
This is when people start yelling "Look at that line! It goes down! Negative correlation, case closed!" But no, we're better than this. We know about linear regression! We'll use it to quantify the relationship.
Linear regression#
Now let's be a little more exact, and run a linear regression that takes into account both percent black and percent free or reduced.
# We'll remove any rows missing data
print("Before removing missing data:", merged.shape)
merged = merged.dropna()
print("After removing missing data:", merged.shape)
import statsmodels.formula.api as smf
model = smf.ols('pct_passing ~ pct_black + pct_free_or_reduced', data=merged)
result = model.fit()
result.summary()
Understanding the regression#
Most of the time when performing a regression, we're looking for a "when X goes up this much, Y goes up this much" kind of sentence. In this case, we see our pct_free_or_reduced coefficient is about -0.4. This can be explained by a sentence like:
Controlling for race, every 10 percentage point increase in free or reduced lunch translates to a 4 point drop in passing rate
We could do the same thing with the pct_black coefficient, too.
Controlling for free or reduced lunch, every 10 percentage point increase in black students translates to a 2 point drop in passing rate
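To make that "coefficient as a sentence" idea concrete, here's a self-contained sketch on synthetic data (invented numbers, not the Florida schools) where we know the true coefficients and can watch statsmodels recover them:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (not the Florida file) with known coefficients:
# passing = 90 - 0.4 * lunch - 0.2 * black, plus a little noise
rng = np.random.default_rng(0)
fake = pd.DataFrame({
    'pct_free_or_reduced': rng.uniform(0, 100, 500),
    'pct_black': rng.uniform(0, 100, 500),
})
fake['pct_passing'] = (90
                       - 0.4 * fake.pct_free_or_reduced
                       - 0.2 * fake.pct_black
                       + rng.normal(0, 2, 500))

result = smf.ols('pct_passing ~ pct_free_or_reduced + pct_black', data=fake).fit()

# The fitted coefficient gives us the "10 points up, 4 points down" sentence
lunch_coef = result.params['pct_free_or_reduced']
print(10 * lunch_coef)  # roughly -4
```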
There's only one issue with performing linear regressions as a journalist, though: these numbers don't tell us much that's actually useful. While it's certainly an extra number you could try to throw into your story, it doesn't really mean much more than "income is more important than race."
Before we find out the flaws in our data, let's first see how to put this regression to use.
Overperformers and underperformers#
One of the most useful things you can do with a regression is make predictions. We can actually use our regression to say hey, now that you've looked at all of these schools and figured out the relationship between these variables, what should each school have scored?
Making predictions#
We can use result.predict()
to get the predicted passing rate for each school.
result.predict()
Now, let's save that value into a new column called predicted_passing.
merged['predicted_passing'] = result.predict()
merged.head()
We can use this predicted passing rate to find the difference between the actual passing rate and the predicted passing rate. That way we can see which schools did better or worse than they were predicted to do.
If we're being stats-y, this is called the residual.
merged['residual'] = merged['pct_passing'] - merged['predicted_passing']
merged.head()
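As a sanity check on the idea, here's a self-contained sketch on toy data (invented numbers, not the Florida schools) showing that "actual minus predicted" is exactly what statsmodels itself exposes as result.resid:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: a noisy linear relationship
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': np.arange(50.0)})
toy['y'] = 2 * toy.x + 5 + rng.normal(0, 3, 50)

result = smf.ols('y ~ x', data=toy).fit()

# "Actual minus predicted" matches the built-in residuals
by_hand = toy.y - result.predict()
print(np.allclose(by_hand, result.resid))  # True
```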
Using predictions#
Now we could find, for example, the top 5 schools that did better or worse than expected.
# Negative residual, did worse than expected
merged.sort_values(by='residual').head(5)
# Positive residual, did better than predicted
merged.sort_values(by='residual', ascending=False).head(5)
featured_names = [
'LAKEWOOD ELEMENTARY SCHOOL',
'CAMPBELL PARK ELEMENTARY SCHOOL',
'MAXIMO ELEMENTARY SCHOOL',
'MELROSE ELEMENTARY SCHOOL',
'FAIRMOUNT PARK ELEMENTARY SCHOOL'
]
is_pinellas = merged['District Name'] == 'PINELLAS'
is_featured = merged['School Name'].isin(featured_names)
featured = merged[is_featured & is_pinellas]
featured
From looking at our results, all of the featured Pinellas County schools are performing well below expectations. We have a few problems, though!
Exceptional student education#
If you read through the critique, it involves a feature that we don't have - exceptional students minus gifted. Exceptional students include students with all manner of disabilities or special circumstances - deaf students, emotional disabilities, autism spectrum disorders, developmental disabilities, and many more categories.
If a school has an especially high or low number of these students, it's sure to affect their passing rates. Why don't we have it in our analysis? I couldn't find the data on a school level, only on the county level. According to the author, just by adding that one column suddenly three of our residuals swing to positive, and our huge negatives come much closer to the center:
How were we supposed to know that ESE students have such an impact? This is why you always run your analysis by the experts! They'll be able to suggest other columns you might want to add to your regression, and help you understand whether their impact is meaningful or not.
For example, whether ESE students are required to take tests or not would have a large impact on whether that additional information is meaningful.
Free lunch gotchas (a big one!)#
Speaking of experts: free lunch seems pretty straightforward, right? Even though it's a popular benchmark for analysis, it turns out that things aren't so simple.
It turns out that many schools and school systems across the US have simplified the free lunch program: instead of students needing to meet certain income requirements, lunch for all students is federally funded. While these are typically high-poverty areas, you can no longer use the percent free lunch as a meaningful proxy for poverty if all of the students can receive it, regardless of income.
Additionally, this new system was adopted at different rates and in different ways across districts and states, which makes measurements even less useful!
So where does that leave us? If we can't use free lunch, we're now stuck with just using race as our only regression feature. Seems a little simplistic!
More readings on free lunch as a proxy for poverty:
So what do we do?#
Seems like it might be time for a new approach.
A different approach#
Instead of trying to see what schools performed well and what schools performed poorly, let's step back and examine our underlying assumptions. Powered by statistics, we've come to the conclusion that schools with more minority students and more poverty have lower passing rates.
This could lead to a pretty typical path for investigation: what are the over-performing or under-performing schools doing differently? And while it's a valid approach, it takes for granted that an individual student's performance is anchored by their race and income level - that rich, white students perform well, and poor, minority students perform poorly. Under that assumption, you can "fix" a school by adding rich, white kids to bring up the average, but only because their higher test scores pull that average up.
The Tampa Bay Times wanted to investigate a question that's slightly more complicated than that: were these black students performing worse in highly segregated schools than they would have in integrated schools? So not only would the school's average go up thanks to the higher average scores of rich, white students, but would integration also lift the students that the regression would expect to be low performers?
Where's the data?#
To answer this question, you need to be able to see how similar students perform as they are moved through schools with different racial makeups - segregated vs integrated. If students start at the same place, how do they perform down the road when their education has been in different environments?
The Tampa Bay Times was able to crack this nut using a test given to new kindergarteners that assessed their readiness to learn how to read. By tracking students' performance between kindergarten and their first reading test in third grade, the Times could see what kind of differences might exist between the students in each environment.
Our new analysis#
Let's start by reading in our pre-cleaned data. Note that this is a completely different dataset from what we were using before, as it was organized and cleaned up for a presentation from NICAR.
The column names make it pretty obvious as to what's what.
We've cleaned the dataset from that notebook a bit to make all percentages 0-100 instead of 0-1, but that's the only change.
df = pd.read_csv("data/FL_elementaries_adjusted.csv")
df = df.dropna()
df.head()
The percent_pre_kinder_ready
column is going to be our judge of a cohort's starting point. From the presentation:
How well the kids do before they’ve spent a day in the class is critical. Because what if the kids are coming to school less prepared than every other kid in the state of Florida? How do we show it’s not just something in the water that the school board has no control over?
Definitely read their walkthrough, as it's a great visual approach that's very different than what we're doing here!
The question we're going to be looking at here is: given a set of students that start off at the same place, do they end up performing worse if a school is less integrated? We'll first try making our integration judgement using percent_black_enrollment
(note that we're ignoring free lunch here because of the reasons mentioned above).
# How does kindergarten readiness and percent black relate to 3rd grade passing rates?
model = smf.ols("percent_passing_reading ~ percent_pre_kinder_ready + percent_black_enrollment", data=df)
result = model.fit()
result.summary()
For students that start off at the same place: as black enrollment climbs, test scores go down. But this isn't actually what we're looking for, since our question isn't so much "what is the effect of a certain percentage of black students," and more a question of integration. A simplification might be: what happens if a school is majority black?
While we could compute majority black easily, it's also one of the columns!
# How do kindergarten readiness and being a majority-black school relate to 3rd grade passing rates?
model = smf.ols("percent_passing_reading ~ percent_pre_kinder_ready + majority_black_enrollment_flag", data=df)
result = model.fit()
result.summary()
The coefficient of -15.58 for majority_black_enrollment_flag[T.True]
means that when that column switches from 0 (minority black) to 1 (majority black), the passing rate drops by 15.58 points. According to our analysis, even if a school's students start off equally prepared, being a majority-black school translates into a drop of over 15 points in the reading test pass rate.
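The [T.True] in the coefficient name comes from statsmodels (via patsy) treating a boolean column as categorical. A self-contained sketch on toy data (invented numbers, not the Florida schools) showing the naming and the interpretation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: a boolean flag that shifts scores down by 15 points
rng = np.random.default_rng(2)
toy = pd.DataFrame({'flag': rng.random(400) < 0.5})
toy['score'] = 60 - 15 * toy.flag + rng.normal(0, 2, 400)

result = smf.ols('score ~ flag', data=toy).fit()

# patsy treats the boolean as categorical, hence the [T.True] label:
# "switching flag from False to True moves score by this much"
print(result.params['flag[T.True]'])  # close to -15
```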
And those specific Pinellas schools we're interested in?
# What was the predicted passing rate for each school
df['predicted_passing'] = result.predict()
# How much different they performed compared to the expectation
df['residual'] = df['percent_passing_reading'] - df['predicted_passing']
df[df.FF_school_flag]
Even when adjusting for initial kindergarten readiness and majority/minority black enrollment, the featured schools are still underperforming.
In the end, all this analysis boiled down to these two sentences:
A Times analysis of statewide kindergarten readiness data shows that new students in Pinellas County’s most segregated schools show up no less prepared than students in scores of other struggling, high-poverty schools. It’s only after a few years in Pinellas classrooms that they’re falling behind their peers statewide.
Discussion topics#
Do ESE students really have an effect on our results? Check out the names of some of the schools that performed far below predicted. Do the names tell you anything about the students who go there?
When looking at the difference between a school's predicted and actual passing rate, we used the raw residual. What might be a downside? Think about schools with a high predicted score compared to those with a low predicted score.
An alternative to the raw residual would be the standard deviation, which we used to analyze cheating Texas schools. Why might standard deviation be a more effective measure of under/over performance?
In our second regression, we found that even after testing equally while entering kindergarten, students at a majority-black school tend to perform worse on their third-grade reading tests. How is this different from the findings of our first regression, where passing rates fell as the percentage of black students rose?
What policy changes or investigation angles would you suggest after our initial analysis? What would change after our second analysis?
Our final analysis only resulted in two sentences in the final published piece. Despite that, how might the findings have colored the rest of our research?
Which regression does a better job representing the state of passing reading tests? It isn't an emotional appeal, there's actually a mathematical answer! You might need to check back in our evaluating regressions page if you forget how to compare regressions.
In our second analysis, why is there a difference between using percent black enrollment as compared to majority black?
In this analysis we used 50% black as the dividing line between segregated and integrated schools. What might have changed if we pushed this number to something higher, like 75% or even 90%?
While we couldn't reproduce it, the critique stated that three of the five "failure factories" schools outperformed their peers. Let's say this was true: does it change our findings on the relationship between school performance and majority/minority black enrollment?
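For the raw-residual question above, here's one hedged sketch (toy data, invented numbers) of how you might standardize residuals so that "how unusual is this school?" means the same thing everywhere:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: a noisy linear relationship
rng = np.random.default_rng(3)
toy = pd.DataFrame({'x': rng.uniform(0, 100, 200)})
toy['y'] = 80 - 0.5 * toy.x + rng.normal(0, 5, 200)

result = smf.ols('y ~ x', data=toy).fit()
toy['residual'] = result.resid

# Express each residual in standard deviations from the mean residual,
# so a value of 1.5 means "1.5 sd above expectation" for every school
toy['std_residual'] = (toy.residual - toy.residual.mean()) / toy.residual.std()

print(toy.std_residual.std())  # ≈ 1.0
```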