Importing content in Solr for advanced text searching#

We've imported over a million pieces of legislation into a postgres database, but that just isn't good enough! While our database system can do a lot, we have some intense text searching in our future, and postgres just isn't up to the task.

Instead, we're going to use another Apache product - Apache Solr - as a search tool that sits next to our postgres database and performs lightning-fast text searches.


Why do we need Solr?#

Asking why we need to use Solr for this is an excellent question! With over a million documents, it's going to be very, very, very slow to do the kinds of fancy searches we want to do using Python or postgres. We're going to feed our documents to Solr in order to speed up searching.

Solr isn't a database, though! We'll just use it to say, "hey, do you recognize any legislation like this one?" and it will give us some bill identifiers in return. We'll take those identifiers back to postgres to find the actual content of the bills.
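As a rough sketch of the second half of that round trip, here is one way to turn a list of identifiers from Solr into a postgres lookup. The table name `bills`, the column name `id`, and the sample identifiers are all hypothetical, since the actual schema isn't shown here, and the `%s` placeholders assume a psycopg2-style driver.

```python
def build_bill_lookup(bill_ids, table="bills", id_column="id"):
    """Build a parameterized SQL query that fetches bills by identifier.

    `table` and `id_column` are hypothetical names; swap in your own schema.
    """
    bill_ids = list(bill_ids)
    if not bill_ids:
        raise ValueError("Solr returned no identifiers, nothing to look up")
    # One %s placeholder per id, so the database driver handles the quoting
    placeholders = ", ".join(["%s"] * len(bill_ids))
    sql = f"SELECT * FROM {table} WHERE {id_column} IN ({placeholders})"
    return sql, bill_ids


# Pretend Solr handed back these identifiers (made-up values)
ids_from_solr = ["bill-001", "bill-042"]
sql, params = build_bill_lookup(ids_from_solr)
print(sql)
print(params)
```

The query and its parameters would then be passed together to something like `cursor.execute(sql, params)`.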

What magic can Solr do? For example, take the sentence "Put taxes on fishing." Even though "PLACE A TAX ON FISH" might seem very similar, once we ignore punctuation and capitalization "on" is the only word the two technically share. Solr can do magic like automatically lowercasing, removing boring words like "on," "a," and "and," and stemming words like "fish" and "fishes" and "fishing" so they're all treated as the same word.

This sort of pre-processing allows us to get more accurate results more quickly in the next step.
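To make that concrete, here is a toy version of the same three steps in plain Python. This is only an illustration, not how Solr actually does it: the stopword list is hand-picked for this example and the "stemmer" just chops off a few suffixes, while Solr's real analyzers are far more careful.

```python
import re

# Tiny hand-picked stopword list, just for this example
STOPWORDS = {"a", "an", "and", "on", "the"}

# Naive suffix stripping, longest suffix first (real stemmers are much smarter)
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Chop a known suffix off a word, as long as at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop punctuation and stopwords, then stem what's left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


first = normalize("Put taxes on fishing.")
second = normalize("PLACE A TAX ON FISH")
print(first)
print(second)
print(set(first) & set(second))
```

After normalizing, the two sentences share "tax" and "fish" instead of just a throwaway "on," which is much closer to how a human would compare them.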

Create the legislation database#

First we'll need to start solr.

Because indexing 6-grams is demanding from a hardware point of view, we're going to assign Solr 5GB of RAM. It won't use all of the RAM the entire time, but if you don't grant it all five gigs it will mysteriously halt partway through the process.

If you aren't using the ngrams technique you should be able to use the default solr start command (which assigns 512MB of RAM).
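If you haven't run into n-grams before, here is a small sketch of what a 6-gram looks like. It assumes the n-grams are built from words rather than characters (this page doesn't say which), and the sample sentence is made up.

```python
def word_ngrams(text, n=6):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    # Slide a window of n words across the text, one word at a time
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "a tax shall be placed on all fish caught in state waters"
grams = word_ngrams(sentence)
for gram in grams:
    print(gram)
print(len(grams))
```

A 12-word sentence already produces 7 overlapping 6-grams, so most words get stored several times over. Multiply that by a million documents and it's easier to see where the memory goes.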

# Stop solr if it's running
!solr stop

Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 10755 to stop gracefully.
 [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\]

# Start solr with 5 gigs of RAM

!solr start -m 5g

*** [WARN] ***  Your Max Processes Limit is currently 2048. 
 It should be set to 65000 to avoid operational disruption. 
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-] [\] [|] [/] [-]  
Started Solr server on port 8983 (pid=11858). Happy searching!

Are we re-running this to recreate our database? If so, the delete command below will destroy the existing legislation core (a "core" is Solr's name for what we've been calling a database). Either way, we then create a fresh one called legislation.

We're also going to use the data/solrconfig folder as the configuration for the new core. It's faster than trying to set up new fields and imports manually.

# Delete index if already exists
!solr delete -c legislation

Deleting core 'legislation' using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=legislation&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true

# Use the settings in data/solrconfig to initialize our setup
!solr create -c legislation -d data/solrconfig

Created new core 'legislation'

Connect to Solr#

Let's connect to our Solr database and see if it works. We're going to be using two different ways of talking to Solr: the pysolr library when it's convenient, and plain requests when we want to use a feature that pysolr doesn't support.

In this case we don't actually use the library at all. We create the pysolr connection just to show what it looks like, then do the health check with requests, since it can't be done with the current version of pysolr!

import requests
import pysolr

# Connecting just so you see what it looks like
solr_url = 'http://localhost:8983/solr/legislation'
solr = pysolr.Solr(solr_url, always_commit=True)

# Health check
response = requests.get('http://localhost:8983/solr/legislation/admin/ping')
response.json()

{'responseHeader': {'zkConnected': None,
  'status': 0,
  'QTime': 148,
  'params': {'q': '{!lucene}*:*',
   'distrib': 'false',
   'df': '_text_',
   'rows': '10',
   'echoParams': 'all'}},
 'status': 'OK'}
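If you'd rather not eyeball the response every time, a small helper can check it for you. This is just a sketch based on the response shown above: it assumes a healthy ping always has a top-level `status` of `'OK'` and a `responseHeader` status of `0`. The `solr_is_healthy` name is made up for this example.

```python
def solr_is_healthy(ping_payload):
    """Return True if a ping response looks like the healthy one above."""
    header = ping_payload.get("responseHeader", {})
    return ping_payload.get("status") == "OK" and header.get("status") == 0


# The same response we got back from the health check
sample = {
    "responseHeader": {
        "zkConnected": None,
        "status": 0,
        "QTime": 148,
        "params": {
            "q": "{!lucene}*:*",
            "distrib": "false",
            "df": "_text_",
            "rows": "10",
            "echoParams": "all",
        },
    },
    "status": "OK",
}

print(solr_is_healthy(sample))
```

In the notebook you would call it as `solr_is_healthy(response.json())`.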

Now that solr is set up, we want to do a data import from postgres. You can start that by visiting the Solr web interface at http://localhost:8983/solr/#/legislation/dataimport/. Click Execute and we're good to go!

You could use the API for this, but I find that you can read error messages more easily if you use the web interface (and it's fun to see how quickly things are filling up!).
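If you do want to go the API route, it might look something like the sketch below. It assumes the import handler is registered at `/dataimport` in the core's configuration (which the web interface URL suggests, but check your own solrconfig) and that it accepts the usual `full-import` and `status` commands. It uses only the standard library so it runs as-is, but `requests.get` on the same URL would work just as well. The function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_CORE_URL = "http://localhost:8983/solr/legislation"


def dataimport_url(command, base_url=SOLR_CORE_URL):
    """Build the URL for a data import command (handler path is an assumption)."""
    params = {"command": command, "wt": "json"}
    return f"{base_url}/dataimport?{urlencode(params)}"


def run_dataimport_command(command):
    """Send the command to Solr and return the parsed JSON response."""
    with urlopen(dataimport_url(command)) as response:
        return json.load(response)


print(dataimport_url("full-import"))
print(dataimport_url("status"))

# With Solr running, you would kick off the import and then poll its progress:
# run_dataimport_command("full-import")
# run_dataimport_command("status")
```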

Stopping solr#

Once you're done with your import, you can stop Solr. When you're ready to do some searching you can restart it with the plain solr start command (no -m 5g), so it doesn't tie up 5 gigs of memory.

!solr stop

Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 622 to stop gracefully.
 [|] [/] [-] [\] [|] [/] [-] [\]
