Detecting special interest model legislation in state laws

Special interest groups use model legislation to push their agendas in state government. How can we find bills based on these "cut and paste" models?

natural language processing, Apache Tika, n-grams, text reuse, legislation

Summary

USA Today, The Arizona Republic, and the Center for Public Integrity examined over one million bills from state legislatures, using techniques similar to plagiarism detection. This effort uncovered over 10,000 uses of model legislation: bills drafted and distributed by special interest groups and passed by elected officials.

For this project, we will use Apache Tika to process all types of documents (PDFs, web pages, images) into a computer-readable format. Once they're ingested, we'll use Apache Solr to find documents similar to source legislation, and finally use Python text analysis tools to find the exact overlaps.

This is a massive project, so it's broken into a lot of notebooks below. If you'd like to skip ahead, maybe try "Checking for legislative text reuse using Python, Solr, and ngrams."

Notebooks, Assignments, and Walkthroughs

Downloading one million pieces of legislation from LegiScan

Taking a million pieces of legislation from a CSV and inserting them into Postgres

Download Word, PDF and HTML content and process it into text with Tika

Import content into Solr for advanced text searching

Apache Solr is a great resource for advanced text searching. It isn't a database: it sits alongside your data and lets you run powerful queries against it.

Checking for legislative text reuse using Python, Solr, and ngrams

Using an n-gram index in Solr to quickly track down duplicated text between bills and model legislation, then using Python to find the extent of the overlaps.
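
To make the n-gram idea concrete, here is a minimal sketch in plain Python. It is not the notebook's code and does not touch Solr: it assumes word-level n-grams (the Solr index may be configured differently), and the function names and sample sentences are made up for illustration. It measures what share of a model's n-grams also show up in a bill.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of word n-grams (as tuples) found in text."""
    # Lowercase and keep only letters and digits so punctuation doesn't block matches
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(model_text, bill_text, n=5):
    """Share of the model's n-grams that also appear in the bill (0.0 to 1.0)."""
    model_grams = word_ngrams(model_text, n)
    if not model_grams:
        return 0.0
    bill_grams = word_ngrams(bill_text, n)
    return len(model_grams & bill_grams) / len(model_grams)


# Invented sample text, for illustration only
model = "No person shall be required to retreat before using force in self defense"
bill = "Section 1. No person shall be required to retreat before using force in self defense within this state."
unrelated = "The commissioner shall publish an annual report on highway maintenance costs"

print(containment(model, bill))       # 1.0
print(containment(model, unrelated))  # 0.0
```

A score near 1.0 means nearly all of the model's wording appears in the bill. Smaller values of n catch more reworded passages but also more coincidental matches.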

Checking for legislative text reuse using Python, Solr, and simple text search

Using a simple text index in Solr to quickly track down duplicated text between bills and model legislation, then using Python to find the extent of the overlaps.
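
Once a search has surfaced a candidate bill, the "extent of the overlaps" step might look something like the sketch below. It uses difflib from Python's standard library, which is an assumption on my part and not necessarily the tool the notebook uses. The function name and the sample sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def shared_passages(model_text, bill_text, min_words=5):
    """Return runs of at least min_words consecutive words found in both texts."""
    model_words = model_text.lower().split()
    bill_words = bill_text.lower().split()
    # Compare word lists, not characters; autojunk=False keeps common words in play
    matcher = SequenceMatcher(None, model_words, bill_words, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(model_words[block.a:block.a + block.size]))
    return passages


# Invented sample text, for illustration only
model = "The department shall not adopt any rule that is more stringent than federal law"
bill = ("Be it enacted that the department shall not adopt any rule that "
        "exceeds what is more stringent than federal law allows")

for passage in shared_passages(model, bill):
    print(passage)
```

Here the bill inserts two extra words in the middle of the model sentence, so the overlap comes back as two separate passages instead of one. Raising min_words filters out short, coincidental matches.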

Search for model legislation in over one million bills using Postgres and Solr

In the other notebooks we look for text reuse one piece of model legislation at a time. In this one, we'll be comparing all of our model legislation against our database of 1.2 million bills.
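
The shape of that all-against-all comparison can be sketched without Postgres or Solr. The example below is a small in-memory stand-in, not the notebook's approach: it builds an inverted index from word n-grams to bill ids (roughly the job a search index does), then looks up every model's n-grams in one pass. The ids, texts, function names, and thresholds are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def word_ngrams(text, n=4):
    """Return the set of word n-grams (as tuples) found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(bills, n=4):
    """Map each n-gram to the set of bill ids that contain it."""
    index = defaultdict(set)
    for bill_id, text in bills.items():
        for gram in word_ngrams(text, n):
            index[gram].add(bill_id)
    return index


def find_candidates(models, index, n=4, min_shared=3):
    """For every model, list (bill_id, shared n-gram count), most shared first."""
    results = {}
    for model_id, text in models.items():
        hits = Counter()
        for gram in word_ngrams(text, n):
            for bill_id in index.get(gram, ()):
                hits[bill_id] += 1
        results[model_id] = [(b, c) for b, c in hits.most_common() if c >= min_shared]
    return results


# Hypothetical ids and invented text, for illustration only
bills = {
    "HB-1": "Notwithstanding any other law a person has no duty to retreat "
            "and may stand his or her ground",
    "SB-2": "The commissioner shall publish an annual report on highway maintenance costs",
}
models = {
    "model-a": "A person has no duty to retreat and may stand his or her ground",
}

print(find_candidates(models, build_index(bills)))
# {'model-a': [('HB-1', 11)]}
```

The index is built once and reused for every model, which is what makes the batch comparison practical. At the scale of a million-plus bills, this role is better filled by a real search index than by a Python dictionary.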

Using topic modeling to categorize legislation

About the site

Hi, I'm Soma, welcome to Data Science for Journalism a.k.a. investigate.ai!

There's been a lot of buzz about machine learning and "artificial intelligence" being used in stories over the past few years. It's mostly not that complicated - a little stats, a classifier here or there - but it's hard to know where to start without a little help.

If you know a little Python programming, hopefully this site can be that help!

Thanks to Columbia Journalism School, the Knight Foundation, and many others.