Reproducing test score graphics from The Dallas Morning News' investigation of TAKS scores

While text-based analysis can take you far, a good graphic can help you see patterns in your data.


import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

%matplotlib inline

2003 third-grade reading scores vs 2004 fourth-grade reading scores

We'll read in two years of data and combine them, roughly tracking students at the same school as they move between third and fourth grade.

From The Dallas Morning News:

Harrell Budd scored poorly in third and fifth grade. But its fourth-grade reading scores were among the best in the state.

We can highlight Harrell Budd using its campus code, 57905115. You could also have filtered by name or by another column.
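
As a minimal sketch of the two lookups, assuming a small hand-built table in place of the real data (three rows copied from this notebook's output), selecting by campus code and filtering by name might look like this:

```python
import pandas as pd

# Toy stand-in for the merged table, built by hand from rows shown in this notebook.
toy = pd.DataFrame(
    {
        "CNAME_fourth": ["CAYUGA EL", "ELKHART EL", "HARRELL BUDD EL"],
        "r_all_rs_fourth": [2392.0, 2263.0, 2470.0],
        "r_all_rs_third": [2330.0, 2285.0, 2140.0],
    },
    index=pd.Index([1902103, 1903101, 57905115], name="CAMPUS"),
)

# Option 1: look the school up by its campus code (the index).
by_code = toy.loc[57905115]

# Option 2: filter on the campus name; na=False treats missing names as non-matches.
by_name = toy[toy["CNAME_fourth"].str.contains("HARRELL BUDD", na=False)]
```

With a unique index, `.loc` on a single code returns one row as a Series, while the name filter returns a DataFrame and can match several campuses, so it is worth checking the result before plotting it.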

df1 = pd.read_csv("data/cfy04e4.dat", usecols=['r_all_rs', 'CAMPUS', 'CNAME'])
df1 = df1.set_index('CAMPUS').add_suffix('_fourth')
df2 = pd.read_csv("data/cfy03e3.dat", usecols=['r_all_rs', 'CAMPUS'])
df2 = df2.set_index('CAMPUS').add_suffix('_third')

merged = df1.join(df2)
merged.head(3)

	CNAME_fourth	r_all_rs_fourth	r_all_rs_third
CAMPUS
1902103	CAYUGA EL	2392.0	2330.0
1903101	ELKHART EL	2263.0	2285.0
1904102	FRANKSTON EL	2242.0	2299.0
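
The join matches rows on the CAMPUS index. A minimal sketch of that behavior on toy data follows; campus 9999999 and its score are made up for illustration, and the other values are copied from the output above. It also shows one possible way to drop campuses that are missing a score in either grade, which the original notebook does not do.

```python
import pandas as pd

# Fourth-grade scores; campus 9999999 is hypothetical and has no third-grade row.
fourth = (
    pd.DataFrame({"CAMPUS": [1902103, 1903101, 9999999],
                  "r_all_rs": [2392.0, 2263.0, 2300.0]})
    .set_index("CAMPUS")
    .add_suffix("_fourth")
)

third = (
    pd.DataFrame({"CAMPUS": [1902103, 1903101],
                  "r_all_rs": [2330.0, 2285.0]})
    .set_index("CAMPUS")
    .add_suffix("_third")
)

# join() aligns on the index and defaults to how="left", so every
# fourth-grade campus is kept and unmatched third-grade scores become NaN.
joined = fourth.join(third)

# Keep only campuses that have a score for both grades.
complete = joined.dropna(subset=["r_all_rs_fourth", "r_all_rs_third"])
```

Because the default is a left join, the order of the two frames matters: campuses that appear only in the third-grade file would be dropped, while campuses that appear only in the fourth-grade file stay with a missing third-grade score.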

fig, ax = plt.subplots(figsize=(4,4))

ax.set_xlim(2000, 2500)
ax.set_ylim(1900, 2500)
ax.set_facecolor('lightgrey')
ax.grid(True, color='white')
ax.set_axisbelow(True)

# Pass x and y by keyword; recent seaborn versions reject them as positional arguments
sns.regplot(x='r_all_rs_third',
            y='r_all_rs_fourth',
            data=merged,
            ax=ax,
            marker='.',
            line_kws={"color": "black", "linewidth": 1},
            scatter_kws={"color": "grey"})

highlight = merged.loc[57905115]
plt.plot(highlight.r_all_rs_third, highlight.r_all_rs_fourth, 'ro')

[<matplotlib.lines.Line2D at 0x121762320>]

highlight

CNAME_fourth       HARRELL BUDD EL
r_all_rs_fourth               2470
r_all_rs_third                2140
Name: 57905115, dtype: object

2004 fifth-grade math scores vs fifth-grade reading scores

This time we'll read in only one year of data, 2004, and compare the math and reading scores at each school. From The Dallas Morning News:

Sanderson's fourth-grade math scores were exceedingly low. Its fifth-grade scores were No. 1 in the state.

We can highlight Sanderson using its campus code, 101912236. You could also have filtered by name or by another column.

df = pd.read_csv("data/cfy04e5.dat", usecols=['m_all_rs', 'r_all_rs', 'CAMPUS', 'CNAME'])
df = df.set_index('CAMPUS').add_suffix('_fifth')
df.head(3)

	CNAME_fifth	r_all_rs_fifth	m_all_rs_fifth
CAMPUS
1902103	CAYUGA EL	2308.0	2317.0
1903101	ELKHART EL	2193.0	2153.0
1904102	FRANKSTON EL	2288.0	2256.0

fig, ax = plt.subplots(figsize=(4,4))

ax.set_xlim(1900, 2500)
ax.set_ylim(1800, 2750)
ax.set_facecolor('lightgrey')
ax.grid(True, color='white')
ax.set_axisbelow(True)

# Pass x and y by keyword; recent seaborn versions reject them as positional arguments
sns.regplot(x='r_all_rs_fifth',
            y='m_all_rs_fifth',
            data=df,
            ax=ax,
            marker='.',
            line_kws={"color": "black", "linewidth": 1},
            scatter_kws={"color": "grey"})

highlight = df.loc[101912236]
plt.plot(highlight.r_all_rs_fifth, highlight.m_all_rs_fifth, 'ro')

[<matplotlib.lines.Line2D at 0x12033e6d8>]

highlight

CNAME_fifth       SANDERSON EL
r_all_rs_fifth            2235
m_all_rs_fifth            2696
Name: 101912236, dtype: object

2004 third-grade reading scores vs 2004 fourth-grade reading scores

This time we'll see how third- and fourth-graders performed at the same school in the same year. From The Dallas Morning News:

Garza's third-grade students, most of whom have problems with English, finished in the top 2 percent of the state in reading.

df1 = pd.read_csv("data/cfy04e4.dat", usecols=['r_all_rs', 'CAMPUS', 'CNAME'])
df1 = df1.set_index('CAMPUS').add_suffix('_fourth')
df2 = pd.read_csv("data/cfy04e3.dat", usecols=['r_all_rs', 'CAMPUS'])
df2 = df2.set_index('CAMPUS').add_suffix('_third')
merged = df1.join(df2)
merged.head(3)

	CNAME_fourth	r_all_rs_fourth	r_all_rs_third
CAMPUS
1902103	CAYUGA EL	2392.0	2410.0
1903101	ELKHART EL	2263.0	2256.0
1904102	FRANKSTON EL	2242.0	2284.0

fig, ax = plt.subplots(figsize=(4,4))

ax.set_xlim(2000, 2600)
ax.set_ylim(1900, 2500)
ax.set_facecolor('lightgrey')
ax.grid(True, color='white')
ax.set_axisbelow(True)

# Pass x and y by keyword; recent seaborn versions reject them as positional arguments
sns.regplot(x='r_all_rs_third',
            y='r_all_rs_fourth',
            data=merged,
            ax=ax,
            marker='.',
            line_kws={"color": "black", "linewidth": 1},
            scatter_kws={"color": "grey"})

highlight = merged.loc[[31901124, 57905115, 57920108]]
plt.plot(highlight.r_all_rs_third, highlight.r_all_rs_fourth, 'ro')

[<matplotlib.lines.Line2D at 0x1211d7c18>]

highlight

	CNAME_fourth	r_all_rs_fourth	r_all_rs_third
CAMPUS
31901124	GARZA EL	2142.0	2398.0
57905115	HARRELL BUDD EL	2470.0	2160.0
57920108	WILMER EL	2168.0	2501.0
