Comparing 1.2 million bills to thousands of pieces of model legislation

In our mission to reproduce this piece on model legislation, we need to find all examples of "cut and paste" legislation in our database.

Our previous approach searched for one piece of model legislation at a time. This time we'll process all of them in one batch.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import pysolr
import requests
from sqlalchemy import create_engine
import tqdm

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_colwidth", 1000)

Read in model bills

model_df = pd.read_csv("data/alec-model-policies.csv")
model_df = model_df.rename(columns={'text': 'content'})
model_df.head()
title url content
0 Resolution Supporting Congressional Approval of the United States-Mexico-Canada Agreement (USMCA) https://www.alec.org/model-policy/resolution-supporting-congressional-approval-of-the-united-states-mexico-canada-agreement-usmca/ \n\nDraft\nResolution Supporting Congressional Approval of the United States-Mexico-Canada Agreement (USMCA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhereas, the imposition of artificial barriers to free and open trade are harmful to American economic interests; and\nWhereas, together, the United States, Canada and Mexico promote a shared belief in freedom, representative democracy and market principles as recognized in the U.S. Constitution; and\nWhereas, a longstanding, close tri-lateral relationship, codified in the North American Free Trade Agreement (NAFTA), has existed between the United States, Canada, and Mexico for more than 25 years and has proven economically, culturally and strategically important for all parties and this relationship will continue with ratification of USMCA; and\nWhereas, trade with Canada and Mexico supports nearly 12 million American jobs, and nearly 5 million of those jobs are supported by increased trade generated by NAFTA and these benefits will co...
1 Resolution Supporting the Intellectual Property (IP) Provisions in the United States-Mexico-Canada Agreement (USMCA) https://www.alec.org/model-policy/draft-resolution-supporting-the-intellectual-property-ip-provisions-in-the-united-states-mexico-canada-agreement-usmca/ \n\nDraft\nResolution Supporting the Intellectual Property (IP) Provisions in the United States-Mexico-Canada Agreement (USMCA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhereas, the American Legislative Exchange Council (ALEC) policy on free trade acknowledges that, “the imposition of artificial barriers to free and open trade…are deterrents to American economic interests;” and\nWhereas, the United States, Canada and Mexico share a belief in freedom, representative democracy and market principles as recognized in the U.S. Constitution; and\nWhereas, trade among our North American trading partners is made up predominantly of intellectual property (IP)-intensive goods and services that employ millions of Americans in high paying jobs and generate billions of dollars in economic output; and\nWhereas, many of the IP-intensive goods, services and exchanges through which trade is facilitated in the NAFTA bloc did not exist when the agreement was drafted and this situation has resulted in u...
2 Victims of Communism Memorial Day Resolution https://www.alec.org/model-policy/draft-victims-of-communism-memorial-day-resolution/ \n\nDraft\nVictims of Communism Memorial Day Resolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel Policy\nWHEREAS, the year 2017 marked 100 years since the Bolshevik Revolution in Russia resulting in the world’s first communist regime under Vladimir Lenin, which led to decades of oppression and violence under communist regimes throughout the world; and\nWHEREAS, based on the philosophy of Karl Marx, communism has proven incompatible with the ideals of liberty, prosperity, and dignity of human life and has given rise to such infamous totalitarian dictators as Joseph Stalin, Mao Zedong, Ho Chi Minh, Pol Pot, Nicolae Ceaușescu, the Castro brothers, and the Kim dynasty; and\nWHEREAS, President Donald Trump declared November 7, 2017 a National Day for the Victims of Communism, condemning communism as a political philosophy “incompatible with liberty, prosperity, and the dignity of human life;” and\nWHEREAS, the bipartisan U.S. Congressional Caucus for the Victims of Communism stated ...
3 Resolution in Support of the Taiwan Travel Act https://www.alec.org/model-policy/draft-resolution-in-support-of-the-taiwan-travel-act/ \n\nDraft\nResolution in Support of the Taiwan Travel Act\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel Policy\nWhereas, a longstanding, close bilateral relationship, codified in the Taiwan Relations Act, has existed between the United States and Taiwan and has proven economically, culturally and strategically important to both; and\nWhereas, Taiwan is a robust democracy, significant American trading partner and U.S. ally; and\nWhereas, together, Taiwan and the United States promote a shared belief in freedom, democracy and free market principles; and\nWhereas, Taiwan has consistently ranked among the top 12 U.S. trading partners for more than two decades; and\nWhereas, Taiwan serves as a free market, democratic beacon and protector of the rules-based international order in the region.\nTherefore be it resolved, that ALEC applauds the adoption of the Taiwan Travel Act which will encourage the high-level official to official exchanges facilitated by the Act.\nBe it further resolved, ...
4 Draft Resolution Urging the Presidential Administration and Congress to Support Continued U.S. Participation in the U.S.-Korea Free Trade Agreement (KORUS FTA) https://www.alec.org/model-policy/draft-resolution-urging-the-presidential-administration-and-congress-to-support-continued-u-s-participation-in-the-u-s-korea-free-trade-agreement-korus-fta/ \n\nDraft\nDraft Resolution Urging the Presidential Administration and Congress to Support Continued U.S. Participation in the U.S.-Korea Free Trade Agreement (KORUS FTA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWHEREAS, the American Legislative Exchange Council (ALEC) policy on free trade acknowledges that “the imposition of artificial barriers to free and open trade…are deterrents to American economic interests;” and\nWHEREAS, KORUS FTA was entered into force on March 15, 2012; and\nWHEREAS, KORUS FTA has been the largest U.S. FTA in more than 16 years and is the highest standard trade framework the U.S. currently has in force; and\nWHEREAS, retaining the KORUS FTA at this time would send a strong signal to U.S. trading partners that America’s historic commitment to free trade and economic liberalization remains strong; and\nWHEREAS, the Republic of Korea is the 15th largest economy in the world; and\nWHEREAS, the Republic of Korea is the United States’ seventh largest export marke...

Find matches

SOLR_RESULTS = 500

solr = pysolr.Solr('http://localhost:8983/solr/legislation', always_commit=True)
engine = create_engine('postgresql://localhost:5432/legislation')

def find_matches(target):
    # If there are leftovers from a previous match search, remove them
    solr.delete(q='bill_id:0')
    # Insert the model legislation to do a MLT search
    solr.add([{ 'content': target['content'], 'bill_id': 0 }])

    # What's like the one we just added?
    response = requests.get(f'http://localhost:8983/solr/legislation/mlt?q=bill_id:0&rows={SOLR_RESULTS}')
    data = response.json()

    # Extract bill ids, pass to postgres database
    bill_ids = [result['bill_id'] for result in data['response']['docs']]
    query = "select * from bills where bill_id = ANY(%(ids)s)"
    matches_df = pd.read_sql_query(query, engine, params={"ids": bill_ids})

    # Vectorize original and compare to search results
    vectorizer = CountVectorizer(binary=True, ngram_range=(6,6))
    vectorizer.fit([target['content']])
    matrix = vectorizer.transform(matches_df.content)

    # Count up matches
    sums = matrix.sum(axis=1)

    # Delete the model legislation now that we're done with it
    solr.delete(q='bill_id:0')

    return pd.DataFrame({
        'matches': np.squeeze(np.asarray(sums)),
        'code': matches_df.state_code + "-" + matches_df.basename,
        'matched_with': target['title']
    })
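
To make the "count up matches" step concrete, here is a minimal pure-Python sketch of what a binary 6-gram count measures: the number of distinct six-word sequences from the model text that also appear in a bill. It is an approximation for illustration, not the notebook's code. The helper names (`word_ngrams`, `shared_ngram_count`) and the sample sentences are made up, and the tokenizer only imitates `CountVectorizer`'s default behavior (lowercasing, words of two or more characters).

```python
import re


def word_ngrams(text, n=6):
    """Return the set of distinct n-word sequences in text.

    Tokenization imitates CountVectorizer's defaults: lowercase,
    keep only tokens made of two or more word characters.
    """
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_count(model_text, bill_text, n=6):
    """Count the model text's distinct n-grams that also occur in the bill."""
    return len(word_ngrams(model_text, n) & word_ngrams(bill_text, n))


# Hypothetical texts, for illustration only
model_text = "the imposition of artificial barriers to free and open trade"
copied_bill = ("Whereas, the imposition of artificial barriers to free "
               "and open trade harms us")
unrelated_bill = "an act relating to the licensing of barbers and cosmetologists"

print(shared_ngram_count(model_text, copied_bill))     # 5
print(shared_ngram_count(model_text, unrelated_bill))  # 0
```

The ten-word model text contains five distinct six-word sequences, and the copied bill reproduces all of them, so its score is 5. Long n-grams like these rarely match by chance, which is why a high count is a reasonable signal of copied language.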
# We can use iterrows because the speed of this part is really not that important

results = []
model_df = model_df.head(20)
for index, row in tqdm.tqdm_notebook(model_df.iterrows(), total=model_df.shape[0]):
    result = find_matches(row)
    results.append(result)    
df = pd.concat(results)
---------------------------------------------------------------------------
timeout                                   Traceback (most recent call last)
~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    420                     # Otherwise it looks like a bug in the code.
--> 421                     six.raise_from(e, None)
    422         except (SocketTimeout, BaseSSLError, SocketError) as e:

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/packages/six.py in raise_from(value, from_value)

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    415                 try:
--> 416                     httplib_response = conn.getresponse()
    417                 except BaseException as e:

~/.pyenv/versions/3.6.8/lib/python3.6/http/client.py in getresponse(self)
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:

~/.pyenv/versions/3.6.8/lib/python3.6/http/client.py in begin(self)
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:

~/.pyenv/versions/3.6.8/lib/python3.6/http/client.py in _read_status(self)
    257     def _read_status(self):
--> 258         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    259         if len(line) > _MAXLINE:

~/.pyenv/versions/3.6.8/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

timeout: timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    719             retries = retries.increment(
--> 720                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    721             )

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    399             if read is False or not self._is_method_retryable(method):
--> 400                 raise six.reraise(type(error), error, _stacktrace)
    401             elif read is not None:

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    734                 raise value.with_traceback(tb)
--> 735             raise value
    736         finally:

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    671                 headers=headers,
--> 672                 chunked=chunked,
    673             )

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    422         except (SocketTimeout, BaseSSLError, SocketError) as e:
--> 423             self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
    424             raise

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/urllib3/connectionpool.py in _raise_timeout(self, err, url, timeout_value)
    330             raise ReadTimeoutError(
--> 331                 self, url, "Read timed out. (read timeout=%s)" % timeout_value
    332             )

ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8983): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

ReadTimeout                               Traceback (most recent call last)
~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/pysolr.py in _send_request(self, method, path, body, headers, files)
    384             resp = requests_method(url, data=bytes_body, headers=headers, files=files,
--> 385                                    timeout=self.timeout, auth=self.auth)
    386         except requests.exceptions.Timeout as err:

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/requests/sessions.py in post(self, url, data, json, **kwargs)
    580 
--> 581         return self.request('POST', url, data=data, json=json, **kwargs)
    582 

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    532         send_kwargs.update(settings)
--> 533         resp = self.send(prep, **send_kwargs)
    534 

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    645         # Send the request
--> 646         r = adapter.send(request, **kwargs)
    647 

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    528             elif isinstance(e, ReadTimeoutError):
--> 529                 raise ReadTimeout(e, request=request)
    530             else:

ReadTimeout: HTTPConnectionPool(host='localhost', port=8983): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

SolrError                                 Traceback (most recent call last)
<ipython-input-12-5fb60887ce6d> in <module>
      4 model_df = model_df.head(20)
      5 for index, row in tqdm.tqdm_notebook(model_df.iterrows(), total=model_df.shape[0]):
----> 6     result = find_matches(row)
      7     results.append(result)
      8 df = pd.concat(results)

<ipython-input-7-dd03a394198e> in find_matches(target)
     28 
     29     # Delete the model legislation that we're done
---> 30     solr.delete(q='bill_id:0')
     31 
     32     return pd.DataFrame({

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/pysolr.py in delete(self, id, q, commit, softCommit, waitFlush, waitSearcher, handler)
    958             m = '<delete><query>%s</query></delete>' % q
    959 
--> 960         return self._update(m, commit=commit, softCommit=softCommit, waitFlush=waitFlush, waitSearcher=waitSearcher, handler=handler)
    961 
    962     def commit(self, softCommit=False, waitFlush=None, waitSearcher=None, expungeDeletes=None, handler='update'):

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/pysolr.py in _update(self, message, clean_ctrl_chars, commit, softCommit, waitFlush, waitSearcher, overwrite, handler)
    498             message = sanitize(message)
    499 
--> 500         return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
    501 
    502     def _extract_error(self, resp):

~/.local/share/virtualenvs/algos-book-zMo2shYq/lib/python3.6/site-packages/pysolr.py in _send_request(self, method, path, body, headers, files)
    387             error_message = "Connection to server '%s' timed out: %s"
    388             self.log.error(error_message, url, err, exc_info=True)
--> 389             raise SolrError(error_message % (url, err))
    390         except requests.exceptions.ConnectionError as err:
    391             error_message = "Failed to connect to server at '%s', are you sure that URL is correct? Checking it in a browser might help: %s"

SolrError: Connection to server 'http://localhost:8983/solr/legislation/update/?commit=true' timed out: HTTPConnectionPool(host='localhost', port=8983): Read timed out. (read timeout=60)
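
The traceback above shows the final `solr.delete` call giving up after a 60-second read timeout while Solr committed the change. One possible way to make the batch more tolerant of a slow commit is to retry the call with an increasing pause. The sketch below is an assumption about how that could look, not something the original notebook does. `call_with_retries` and `flaky_delete` are hypothetical names, and the simulated failure stands in for a real Solr request.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a slow Solr call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_delete():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated read timeout")
    return "deleted"


# Record the waits instead of sleeping so the demo runs instantly.
waits = []
outcome = call_with_retries(flaky_delete, attempts=3, base_delay=1.0,
                            exceptions=(TimeoutError,), sleep=waits.append)
print(outcome, calls["count"], waits)
```

In the notebook this could wrap the Solr calls, for example `call_with_retries(lambda: solr.delete(q='bill_id:0'), exceptions=(pysolr.SolrError,))`. Passing a larger `timeout` when constructing `pysolr.Solr` is another option worth trying, since the 60 seconds in the traceback appears to be the client's default.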
df.sort_values(by='matches', ascending=False).head(100)