Regression: What's the point?

Let's take a look at how we can use regression to find relationships within our dataset.

What is regression?

Regression is a way to describe how two (or more) things are related to each other. You might notice it in sentences like:

  • "An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy," from the Associated Press. As unemployment goes up, life expectancy goes down.
  • "Reveal’s analysis also showed that the greater the number of African Americans or Latinos in a neighborhood, the more likely a loan application would be denied there – even after accounting for income and other factors," from Reveal. As the number of African Americans or Latinos goes up, the likelihood of a loan being denied goes up.
  • "In Boston, Asian and Latino residents were more likely to be ticketed than were out-of-towners of the same race, when cited for the same offense," from the Boston Globe.

Notice that we've defined it as how a change in one variable is related to a change in another, not that a change in one causes a change in another. Regression can tell you the relationship, but not the "why."

Types of regression

While there are many kinds of regression out there, the two major ones journalists care about are linear regression and logistic regression.

Linear regression is used to predict numbers. Life expectancy is a number, so the Associated Press story above uses linear regression. We'll also see linear regression often used with standardized test scores in education.
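To make "predicting a number" concrete, here is a minimal sketch of a one-input linear regression in plain Python. The neighborhoods and their values are invented for illustration (they are not the Associated Press data), and `fit_line` is a helper name made up for this example. The numbers were built to sit on a straight line so the result is easy to read.

```python
# Hypothetical neighborhoods: these numbers are invented for illustration
# and are NOT the Associated Press data.
unemployment_rate = [4.0, 6.0, 8.0, 10.0, 14.0, 20.0]   # percent
life_expectancy = [79.4, 79.1, 78.8, 78.5, 77.9, 77.0]  # years


def fit_line(xs, ys):
    """Ordinary least squares with one input. Returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(unemployment_rate, life_expectancy)

# The slope is the change in life expectancy for each 1-point rise in
# unemployment, so multiplying by 10 gives the change per 10 points.
change_per_10_points = slope * 10

# The fitted line can also predict a value for an unemployment rate we
# never observed.
predicted_at_12 = intercept + slope * 12.0

print(f"Change per 10 points of unemployment: {change_per_10_points:.2f} years")
print(f"Predicted life expectancy at 12% unemployment: {predicted_at_12:.1f} years")
```

Because the made-up data lies on a line, the slope comes out to about -0.15 years per percentage point, which is the same shape of sentence as the quote above: ten more points of unemployment, roughly a year and a half less life expectancy. Real data is noisier, and a real analysis would likely use a statistics library rather than a hand-written helper.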

Logistic regression is used to predict categories such as yes/no or accepted/rejected. A loan being denied is a yes/no, so the Reveal story above uses logistic regression. Logistic regression is also very common when looking at bias or discrimination.
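For comparison, here is a rough sketch of logistic regression with one input, again in plain Python. Everything in it is an assumption made for illustration: the loan applications are invented (they are not Reveal's data), the debt-to-income input is just a convenient stand-in, and `fit_logistic` is a hypothetical helper that uses simple gradient descent rather than whatever method a statistics package would use.

```python
import math

# Hypothetical loan applications, invented for illustration (NOT Reveal's data).
# debt_ratio: applicant's debt-to-income ratio in percent
# denied: 1 = application denied, 0 = application approved
debt_ratio = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
denied = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=5000, learning_rate=0.5):
    """Fit P(y = 1) = sigmoid(intercept + slope * x) by gradient descent."""
    # Center and rescale x so gradient descent takes sensible steps.
    mean_x = sum(xs) / len(xs)
    scale = max(xs) - min(xs)
    zs = [(x - mean_x) / scale for x in xs]

    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b0 + b1 * z) - y for z, y in zip(zs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * z for e, z in zip(errors, zs)) / n

    # Convert the coefficients back to the original units of x.
    slope = b1 / scale
    intercept = b0 - slope * mean_x
    return slope, intercept


slope, intercept = fit_logistic(debt_ratio, denied)

# Instead of a number like "78.2 years" we get a probability of "yes".
p_low = sigmoid(intercept + slope * 15)
p_high = sigmoid(intercept + slope * 50)

print(f"Estimated chance of denial at a 15% ratio: {p_low:.2f}")
print(f"Estimated chance of denial at a 50% ratio: {p_high:.2f}")
```

The output is a probability of landing in the "yes" category, which we can turn into a yes/no prediction by picking a cutoff such as 0.5. A positive slope means the chance of denial rises as the input rises.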

When to use regression

Before we get used to it, we might not realize there are times that regression would be useful. The biggest clue that we'll want to use regression is when we're looking at the "relationship between" or "correlation between" two (or more) different things.

Note that correlation is actually a real stats thing that's separate from regression. But it seems like when most people talk about correlation they're looking for a "when X goes up, Y changes such-and-such amount" kind of description, which comes from performing a regression.
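A small sketch of that difference, using made-up numbers: both outcomes below move in perfect lockstep with `x`, so their correlation is the same, but the "such-and-such amount" is ten times larger for one of them. The helper names `pearson_r` and `slope_of` are invented for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: how tightly x and y move together (-1 to 1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope_of(xs, ys):
    """Regression slope: how much y changes when x goes up by one."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented data: both outcomes rise perfectly in step with x.
x = [1, 2, 3, 4, 5]
small_effect = [52, 54, 56, 58, 60]      # goes up 2 for every 1 in x
large_effect = [70, 90, 110, 130, 150]   # goes up 20 for every 1 in x

r_small = pearson_r(x, small_effect)
r_large = pearson_r(x, large_effect)
slope_small = slope_of(x, small_effect)
slope_large = slope_of(x, large_effect)

print(f"Correlations: {r_small:.2f} vs {r_large:.2f}")
print(f"Slopes: {slope_small:.1f} vs {slope_large:.1f}")
```

Correlation tells us how tightly two things move together; the regression slope tells us by how much.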

We can also recognize regression from the phrases "all other factors being equal" or "controlling for differences in." Notice how in the Reveal example above the African American/Latino population matters "even after accounting for income and other factors."
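One way to see what "controlling for" does is the sketch below. It uses a standard result (the Frisch-Waugh-Lovell theorem) rather than anything from the stories above: strip out the part of both the outcome and the factor that the control explains, then regress what is left over. The data is invented so that the outcome is exactly 2 × factor + 3 × control, and `fit_line` and `leftover` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one input. Returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leftover(xs, ys):
    """The part of ys that a straight-line fit on xs does not explain."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Invented data: the factor we care about tends to rise along with the
# control, and the outcome is built as exactly 2 * factor + 3 * control.
control = [1, 2, 3, 4, 5, 6]
factor = [2, 1, 4, 3, 6, 5]
outcome = [2 * f + 3 * c for f, c in zip(factor, control)]

# Ignoring the control, the factor gets credit for the control's effect too.
naive_slope, _ = fit_line(factor, outcome)

# Controlling for it: remove what the control explains from both sides,
# then relate the leftovers to each other.
adjusted_slope, _ = fit_line(leftover(control, factor), leftover(control, outcome))

print(f"Slope ignoring the control: {naive_slope:.2f}")
print(f"Slope controlling for it:   {adjusted_slope:.2f}")
```

The naive slope overstates the factor's role because the factor and the control tend to rise together. After accounting for the control, the slope recovers the 2 that was built into the data. A multiple regression in a statistics library does this kind of adjustment for every input at once.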

We'll use both linear and logistic regression for two major things:

  • Understanding the impact of different factors: In ProPublica's criminal sentencing bias piece, they showed that race played an outsize role in determining sentencing suggestions, controlling for other possibly-explanatory factors. In the Associated Press piece, they show the relationship between unemployment and life expectancy.
  • Finding unexpected outliers: In this piece from the Dallas Morning News, they used regression analysis to predict a school's 4th grade standardized test scores based on their 3rd grade scores. Schools that did much much much better than expected were suspected of cheating.
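The outlier idea can be sketched in a few lines: fit a line, compare each actual score to its predicted score, and look for unusually large gaps. The schools and scores below are invented (they are not the Dallas Morning News data), and the "more than two standard deviations above expected" cutoff is an arbitrary choice for this example, not the newsroom's method.

```python
import statistics

# Hypothetical schools with invented (3rd grade, 4th grade) average scores.
scores = {
    "School A": (60, 62),
    "School B": (65, 66),
    "School C": (70, 73),
    "School D": (75, 76),
    "School E": (80, 83),
    "School F": (85, 86),
    "School G": (90, 93),
    "School H": (72, 95),
}


def fit_line(xs, ys):
    """Ordinary least squares with one input. Returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


names = list(scores)
third_grade = [scores[name][0] for name in names]
fourth_grade = [scores[name][1] for name in names]

slope, intercept = fit_line(third_grade, fourth_grade)

# Residual = actual 4th grade score minus the score the line predicts.
residuals = [
    actual - (intercept + slope * previous)
    for previous, actual in zip(third_grade, fourth_grade)
]

# Flag schools that beat their prediction by more than two standard
# deviations of the residuals (an arbitrary cutoff for this sketch).
spread = statistics.stdev(residuals)
flagged = [name for name, gap in zip(names, residuals) if gap > 2 * spread]

print("Did much better than expected:", flagged)
```

A flagged school is a lead to report out, not proof of anything. It is also worth remembering that an extreme point pulls the fitted line toward itself, which is one more reason to run this sort of analysis past an expert.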

While regression will show up in other situations - automatically classifying documents, for example! - we'll cover that separately.

How to stay careful

If you're doing anything involving statistics or large datasets, you need to check your results with an expert. While running a regression can be pretty straightforward, there's always the possibility of problems hiding in the details. For example, we might have forgotten some useful variable, or maybe a few variables are fighting with each other and ruining our results (which sounds more fun than "multicollinearity").

Almost every single person I interviewed who performed a regression analysis for a story leaned heavily on other members of their team, as well as at least one outside expert. Some newsrooms even pitted multiple academics against each other over the results, having statisticians and subject-matter experts battle it out over the "right" way to do it!

Even if we hire sixty statisticians and get back a hundred different contradictory answers, even if we can't be 100% certain it's all 100% perfect, even if we're eventually convinced statistics is more art than science, at least we'll have an idea of possible issues with our approach and how they might affect our story.

Review

In this section we introduced regression, which describes the relationship between one or more input variables and an output variable. For example, the effect of unemployment and education (inputs) on life expectancy (output). We focused on two kinds of regression: linear regression, which is used to predict numbers, and logistic regression, which is used to predict categories (typically yes/no answers).

You can use the result of a regression to make predictions (how well should this school have scored in math, given it scored XXX in reading?), or simply explain how two things are related ("all other factors being equal...").

While performing a regression can be pretty easy, you'll always want to double-check with someone who has more statistics and/or domain-specific knowledge than you. Since people really trust numbers, you'll want to make sure you're doing everything right!