Detecting special interest model legislation in state laws

Special interest groups use model legislation to push their agendas in state government. How can we find bills based on these "cut and paste" models?

natural language processing, Apache Tika, n-grams, text reuse, legislation

Summary

USA Today, The Arizona Republic, and the Center for Public Integrity examined over one million bills from state legislatures, using techniques similar to plagiarism detection. The effort uncovered more than 10,000 uses of model legislation: bills put together and distributed by special interest groups and passed by elected officials.

For this project, we will use Apache Tika to process all types of documents (PDFs, web pages, images) into a computer-readable format. Once they're ingested, we'll use Apache Solr to find documents similar to source legislation, and finally use Python text analysis tools to find the exact overlaps.

This is a massive project, so it's broken into many notebooks, listed below. If you'd like to skip ahead, try "Checking for legislative text reuse using Python, Solr, and ngrams."

Notebooks, Assignments, and Walkthroughs

Downloading one million pieces of legislation from LegiScan

Taking a million pieces of legislation from a CSV and inserting them into Postgres

Download Word, PDF and HTML content and process it into text with Tika

Import content into Solr for advanced text searching

Apache Solr is a great resource for advanced text searching. It isn't a database - it sits alongside your data - but allows you to conduct powerful queries.
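
As a rough illustration of what a query against Solr can look like, the sketch below builds a URL for Solr's standard `select` handler without sending it. The host, the core name `bills`, and the field names `id`, `title`, and `text` are assumptions made up for this example, not the project's actual setup.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query(phrase, core="bills", host="http://localhost:8983", rows=10):
    """Build a select URL that searches a hypothetical `text` field for an exact phrase."""
    # Escape backslashes and double quotes so the phrase stays inside its quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'text:"{escaped}"',  # phrase query on the assumed `text` field
        "fl": "id,title,score",    # fields to return (assumed field names)
        "rows": rows,              # maximum number of results
        "wt": "json",              # ask for a JSON response
    }
    return f"{host}/solr/{core}/select?{urlencode(params)}"


url = build_solr_query("the attorney general shall")
print(url)
```

The resulting URL could then be fetched with any HTTP client. The bill text itself stays wherever it already lives; Solr only holds the searchable index.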

Checking for legislative text reuse using Python, Solr, and ngrams

Using an n-gram index in Solr to quickly track down duplicated text between bills and model legislation, then using Python to find the extent of the overlaps.
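
To give a feel for the n-gram idea, here is a minimal pure-Python sketch, assuming word-level 5-grams (the notebook's actual n-gram settings may differ). It measures what share of a model's n-grams also appear in a bill. The function names and the sample sentences are invented for illustration.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(model_text, bill_text, n=5):
    """Share of the model's n-grams that also appear in the bill (0.0 to 1.0)."""
    model = word_ngrams(model_text, n)
    if not model:
        return 0.0
    return len(model & word_ngrams(bill_text, n)) / len(model)


# Made-up sample text, not taken from any real bill.
model = ("The department shall adopt rules to implement the provisions "
         "of this act within ninety days.")
bill = ("Section 2. The department shall adopt rules to implement the "
        "provisions of this act no later than July 1.")

score = containment(model, bill)
print(f"{score:.2f} of the model's 5-grams appear in the bill")
```

A high score suggests the bill reuses long runs of the model's wording, while a score near zero suggests the two only share common words.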

Checking for legislative text reuse using Python, Solr, and simple text search

Using a simple text index in Solr to quickly track down duplicated text between bills and model legislation, then using Python to find the extent of the overlaps.
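
One way to measure the extent of an overlap once a candidate bill has been found is Python's built-in `difflib`. The sketch below is an assumption about how that step could look, not necessarily the notebook's method. The helper name `shared_passages`, the `min_words` threshold, and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def shared_passages(model_text, bill_text, min_words=4):
    """Return runs of at least `min_words` consecutive words found in both texts."""
    a = model_text.lower().split()
    b = bill_text.lower().split()
    # autojunk=False keeps frequent words from being ignored in long documents.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(a[block.a:block.a + block.size]))
    return passages


# Made-up sample text, not taken from any real bill.
model = "the department shall adopt rules to implement this act"
bill = ("within ninety days the department shall adopt rules "
        "as needed to implement this act")

for passage in shared_passages(model, bill):
    print(passage)
```

Comparing word lists rather than raw characters keeps the matches aligned to whole words, which makes the shared passages easier to read back against the original bill.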

Search for model legislation in over one million bills using Postgres and Solr

In the other notebooks we look for text reuse one bill at a time. In this one, we'll be comparing all of our model legislation against our database of 1.2 million bills.
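
Comparing every model against every bill directly would be slow, which is why a search index helps. The toy sketch below imitates that idea in memory, with a dictionary standing in for Postgres and Solr: it maps each word 4-gram to the bills containing it, then counts shared n-grams per model. All names, ids, thresholds, and sample texts are assumptions for illustration.

```python
from collections import Counter, defaultdict


def shingles(text, n=4):
    """Return the set of word n-grams (as strings) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(bills, n=4):
    """Map each n-gram to the set of bill ids that contain it."""
    index = defaultdict(set)
    for bill_id, text in bills.items():
        for gram in shingles(text, n):
            index[gram].add(bill_id)
    return index


def find_candidates(models, index, n=4, min_shared=2):
    """For each model, list (bill_id, shared n-gram count), most shared first."""
    results = {}
    for model_id, text in models.items():
        counts = Counter()
        for gram in shingles(text, n):
            for bill_id in index.get(gram, ()):
                counts[bill_id] += 1
        results[model_id] = [(b, c) for b, c in counts.most_common() if c >= min_shared]
    return results


# Hypothetical ids and made-up text standing in for the real database.
bills = {
    "bill-1": "the department shall adopt rules to implement this act",
    "bill-2": "this act takes effect on the first day of september",
}
models = {
    "model-A": "the department shall adopt rules to implement this act within ninety days",
}

candidates = find_candidates(models, build_index(bills))
print(candidates)
```

Only bills that share at least `min_shared` n-grams with a model come back as candidates, so the more expensive passage-by-passage comparison can be limited to those.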

Using topic modeling to categorize legislation
