Detecting special interest model legislation in state laws
Special interest groups use model legislation to push their agendas in state government. How can we find bills based on these "cut and paste" models?
natural language processing, Apache Tika, n-grams, text reuse, legislation
Readings and links
- You elected them to write new laws. They’re letting corporations do it instead.
- What is ALEC? 'The most effective organization' for conservatives, says Newt Gingrich
- How we uncovered 10,000 times lawmakers introduced copycat model bills — and why it matters
- Copy, paste, legislate
- A Reddit AMA about the series
- Tweets from Rob O'Dell on the project
USA Today, The Arizona Republic, and the Center for Public Integrity examined over one million bills from state legislatures, using techniques similar to plagiarism detection. The effort uncovered more than 10,000 uses of model legislation: bills drafted and distributed by special interest groups, then introduced by elected officials.
For this project, we will use Apache Tika to process all types of documents (PDFs, web pages, images) into a computer-readable format. Once they're ingested, we'll use Apache Solr to find documents similar to source legislation, and finally use Python text analysis tools to find the exact overlaps.
This is a massive project, so it's broken into a lot of notebooks below. If you'd like to skip ahead, maybe try "Checking for legislative text reuse using Python, Solr, and ngrams."
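To make the n-gram idea concrete before diving into the notebooks, here is a minimal sketch in plain Python with no Solr involved. It splits two texts into word 5-grams and measures how much of the model text reappears in a bill. The function names (`word_ngrams`, `shared_ngrams`, `containment`), the 5-word window, and the sample sentences are assumptions made for illustration, not the method used in the original investigation or in the notebooks.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams found in text."""
    # Keep only letters, digits and apostrophes so punctuation
    # differences between documents don't hide a match.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(model_text, bill_text, n=5):
    """Return the n-grams that appear in both texts."""
    return word_ngrams(model_text, n) & word_ngrams(bill_text, n)


def containment(model_text, bill_text, n=5):
    """Fraction of the model's n-grams that also appear in the bill."""
    model = word_ngrams(model_text, n)
    if not model:
        return 0.0
    return len(model & word_ngrams(bill_text, n)) / len(model)


# Made-up sample text, for illustration only.
model = ("No person shall be required to join a labor organization "
         "as a condition of employment.")
bill = ("Section 1. No person shall be required to join a labor organization "
        "as a condition of continued employment in this state.")

print(sorted(shared_ngrams(model, bill))[:3])
print(round(containment(model, bill), 2))
```

A score near 1.0 suggests most of the model text was reused, while a score near 0 suggests little or no shared wording. Dividing by the model's n-gram count rather than the bill's is a design choice: a long bill that pastes in a short model section would still score high.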
Notebooks, Assignments, and Walkthroughs
Downloading one million pieces of legislation from LegiScan
Taking a million pieces of legislation from a CSV and inserting them into Postgres
Download Word, PDF and HTML content and process it into text with Tika
Apache Solr is a great resource for advanced text searching. It isn't a database - it sits alongside your data - but allows you to conduct powerful queries.
Using an n-gram index in Solr to quickly track down duplicated text between bills and model legislation, then using Python to find the extent of the overlaps.
Using a simple text index in Solr to quickly track down duplicated text between bills and model legislation, then using Python to find the extent of the overlaps.
In the other notebooks we're looking for text reuse one at a time. In this one, we'll be comparing all of our model legislation against our database of 1.2 million bills.
Using topic modeling to categorize legislation
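For the text-reuse notebooks above, "finding the extent of the overlaps" means pulling out the actual shared passages once a candidate bill has been found. One possible approach, sketched here with Python's standard-library `difflib`, is to compare the two texts word by word and keep the long matching runs. The function name `overlapping_passages`, the `min_words` threshold, and the sample sentences are hypothetical. The notebooks may use different tools.

```python
from difflib import SequenceMatcher


def overlapping_passages(model_text, bill_text, min_words=5):
    """Return runs of at least min_words consecutive words shared by both texts."""
    model_words = model_text.split()
    bill_words = bill_text.split()
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent items, which would otherwise skip common words in long texts.
    matcher = SequenceMatcher(None, model_words, bill_words, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(model_words[block.a:block.a + block.size]))
    return passages


# Made-up sample text, for illustration only.
model = "the state shall not require any person to join a union"
bill = ("Be it enacted that the state shall not require any person "
        "to pay dues or join a union")

for passage in overlapping_passages(model, bill):
    print(passage)
```

Raising `min_words` filters out short, coincidental matches such as "join a union", at the cost of missing reuse that has been lightly reworded.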