# Recognizing people and places with Named Entity Recognition

Sometimes instead of just words you're looking for _real-life things_ - people, places, companies, objects with _names_. This is called **named entity recognition** (NER), and it's a useful technique for extracting information from text.


## Using NER with spaCy

The natural language processing library spaCy has [great NER support](https://spacy.io/usage/linguistic-features#named-entities), allowing us to extract entities from any sort of text.

Before you use spaCy in a notebook, you need to load in a language model. We're going to be using `en_core_web_sm`, because it's nice and small and fast to import.

In [113]:
import pandas as pd
import spacy
import requests
from bs4 import BeautifulSoup

nlp = spacy.load("en_core_web_sm")

pd.set_option("display.max_rows", 200)

Once the language model is loaded, we can process a single sentence. After we feed the text to spaCy, all the magic happens behind the scenes - all that's left for us to do is loop through `doc.ents` and see the entities inside!

In [107]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
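
The two numbers on each row are character offsets into the original string: `ent.start_char` is where the entity begins and `ent.end_char` is one past where it ends, so slicing the text with them gives back `ent.text`. As a small sketch, the snippet below reuses the offsets printed above to mark up the sentence by hand. The `annotate` helper is a hypothetical name for illustration, not part of spaCy.

```python
# Offsets and labels copied from the spaCy output above.
text = "Apple is looking at buying U.K. startup for $1 billion"
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]


def annotate(text, spans):
    """Wrap each (start, end, label) span in brackets along with its label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])                # plain text before the entity
        pieces.append(f"[{text[start:end]} {label}]")    # the entity plus its label
        cursor = end
    pieces.append(text[cursor:])                         # whatever follows the last entity
    return "".join(pieces)


annotated = annotate(text, spans)
print(annotated)
```

Because the offsets point into the raw text, they keep working even when you store the entities somewhere else and come back to the document later.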


We can also visualize the entities in the sentence using the wonderfully-named `displacy` inside of spaCy.

In [108]:
from spacy import displacy

displacy.render(doc, style="ent")

Here's what spaCy found:

* `Apple` is an **organization**
* `U.K.` is a **geo-political entity**
* `$1 billion` is **money**

If we want to be a little more computational about it, we can throw these results into a dataframe. Along with the `text` we're also adding the lemma of the text, so that capitalization and the like can be normalized.

In [109]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
df

| | text | type | lemma |
|---|---|---|---|
| 0 | Apple | ORG | Apple |
| 1 | U.K. | GPE | U.K. |
| 2 | $1 billion | MONEY | $ 1 billion |


## Processing longer texts

Processing longer texts is the exact same thing as processing something shorter! Let's look at a [sample article from the Washington Post](https://www.washingtonpost.com/local/crime/ballistic-helmet-and-vest-provider-charged-with-passing-off-chinese-gear-as-american-made/2019/12/22/23e7799a-24df-11ea-b2ca-2e72667c1741_story.html).

In [115]:
content = """The owner of a Virginia company that provides the U.S. Navy with ballistic vests, protective helmets and riot gear is facing a federal wire fraud charge, accused of misleading authorities about where the products were made.

Prosecutors said Arthur Morgan, the 67-year-old chief executive of Surveillance Equipment Group Inc. and its division SEG Armor, falsely claimed the equipment was made in Hong Kong and the United States when it in fact was made in mainland China.

According to federal court records, Morgan’s company was an authorized seller of law enforcement and security supplies to federal agencies. Such sales must comply with the Trade Agreements Act, which requires products to be made or “substantially transformed” in a “designated country.”

The United States includes Hong Kong, a special administrative region of China, on its list of countries designated to make equipment under the act but excludes the mainland. If a contractor wants to supply products from non-designated countries, it must specifically disclose that information in an initial offer.

“A contractor’s failure to do so disqualifies the contractor from eligibility for the contract,” a federal affidavit said, “and a contractor who falsely certifies cannot lawfully seek payment from the United States.”

Surveillance Equipment Group became an authorized equipment provider in 2003 and has fulfilled multiple federal orders over the past 16 years, the U.S. attorney for the District of Maryland said. In 2014, prosecutors said, the company’s price list stated that concealable body armor and helmets were made in Hong Kong.

When a General Services Administration contracting officer asked Morgan whether his items complied with the trade act, prosecutors said, Morgan emailed saying his products were made in the “United States/Hong Kong.” In 2017, he submitted a spreadsheet to the GSA stating that the ballistic helmets, anti-riot suits and shields originated in Louisa, Va.

A federal prosecutor visited the Louisa address, on Mount Airy Road, which property records listed as a home. “Specifically, no manufacturing facility was observed at the address, which appeared, from the point of my observation, to be a field and/or forested area containing several vehicles,” the affidavit said.

An Army special agent then checked pictures Morgan’s firm had filed with the GSA and detected that a photo of a helmet for sale had been altered, court records said. When the agent did a reverse image search, he found the unaltered photo on Alibaba.com, a Chinese e-commerce site, indicating the helmet was made by a Chinese firm. Another reverse image search found the same was true for a ballistic vest.

Federal prosecutors researched shipping records from U.S. Customs and Border Protection and found that Morgan’s company received 14 shipments of vests or helmets between 2015 and 2017 from the same Chinese firm that made the equipment agents found in the unaltered photos.

Investigators also found serial numbers on equipment that traced back to the Chinese company and Mandarin handwriting on ballistic material, prosecutors said. Emails Morgan had sent, however, indicated that orders were made in the United States and were being shipped from a factory in southern Virginia.

Between 2015 and July of this year, prosecutors said, five federal agencies placed nine orders for ballistic vests, helmets or riot gear from Surveillance Equipment Group, totaling about $640,000.

On Thursday, a magistrate at U.S. District Court in Greenbelt, Md., ordered that Morgan be released to home confinement after he paid a $75,000 bond. An order of detention filed Friday showed he hadn’t found a suitable custodian and hadn’t posted bond.

An email and phone message left for Morgan on Sunday were not returned, and no lawyer was listed in court records.

If convicted, prosecutors said, Morgan could face a maximum sentence of 20 years in federal prison. Actual sentences are typically less than that, they said.
"""

In [116]:
doc = nlp(content)

That's all it takes! Let's throw the entities into a dataframe to see if we can figure out what this article is about.

In [117]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
df.head()

| | text | type | lemma |
|---|---|---|---|
| 0 | Virginia | GPE | Virginia |
| 1 | the U.S. Navy | ORG | the U.S. Navy |
| 2 | Arthur Morgan | PERSON | Arthur Morgan |
| 3 | Surveillance Equipment Group Inc. | ORG | Surveillance Equipment Group Inc. |
| 4 | SEG Armor | ORG | SEG Armor |


Let's see which **geopolitical entities** are the most common.

In [120]:
df[df.type == 'GPE'].lemma.value_counts()

the United States 4
Virginia 3
Hong Kong 3
China 2
U.S. 2
Louisa 1
Greenbelt 1
the " United States / Hong Kong 1
Md. 1
Name: lemma, dtype: int64
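
Notice that `the United States` and `U.S.` are counted separately, and one quoted mention came through as a messy string. One possible cleanup, sketched below under the assumption that you're willing to maintain a small alias table by hand, is to normalize each name before counting. The `ALIASES` mapping and the `normalize` helper are hypothetical, and the counts are copied from the output above.

```python
from collections import Counter

# GPE counts copied from the value_counts() output above.
gpe_counts = {
    "the United States": 4,
    "Virginia": 3,
    "Hong Kong": 3,
    "China": 2,
    "U.S.": 2,
    "Louisa": 1,
    "Greenbelt": 1,
    'the " United States / Hong Kong': 1,
    "Md.": 1,
}

# Hand-maintained alias table (an assumption for this sketch, not a spaCy feature).
ALIASES = {
    "U.S.": "United States",
    "Md.": "Maryland",
}


def normalize(name):
    """Drop a leading 'the' and map known aliases to one canonical name."""
    if name.lower().startswith("the "):
        name = name[4:]
    return ALIASES.get(name, name)


merged = Counter()
for name, count in gpe_counts.items():
    merged[normalize(name)] += count

for name, count in merged.most_common():
    print(name, count)
```

The garbled `United States / Hong Kong` row survives the cleanup, which is a good reminder that simple rules only get you part of the way and the leftovers still need a human look.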

## Trying one more time

Let's try the same approach on another article. Below we'll look at [a piece by Reveal](https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/) about judges ruling on cases involving companies they own stock in.

The story itself actually used named entity recognition! Here's part of a description of the process from Jonathan Stray's [What do journalists do with documents?](http://jonathanstray.com/papers/What%20do%20journalists%20do%20with%20documents.pdf):

> Shifflet exhaustively transcribed California federal judges’ “statement of economic interest” disclosures to generate lists of companies in which they owned stock. He then scraped the PACER database for every case those judges presided over (robust import) and used NER to generate a per-judge list of the entities involved [42][personal communication]. By comparing these lists the reporters were able to find cases in which judges had ruled favorably for companies in which they owned stock.
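
The comparison step in that description boils down to a set intersection per judge. Here is a rough sketch of the idea using made-up placeholder names. `holdings`, `case_entities`, and `find_overlaps` are hypothetical and are not taken from Reveal's actual code or data.

```python
# Placeholder data for illustration only: none of these names are real.
holdings = {
    "Judge A": {"Acme Corp", "Globex"},
    "Judge B": {"Initech"},
}

# Organizations that NER pulled out of each judge's cases (also placeholders).
case_entities = {
    "Judge A": {"Globex", "Umbrella LLC"},
    "Judge B": {"Acme Corp"},
}


def find_overlaps(holdings, case_entities):
    """Return, per judge, the companies that show up in both lists."""
    overlaps = {}
    for judge, stocks in holdings.items():
        shared = stocks & case_entities.get(judge, set())
        if shared:
            overlaps[judge] = shared
    return overlaps


overlaps = find_overlaps(holdings, case_entities)
print(overlaps)
```

In practice company names rarely match character for character, so a real version would likely need some name normalization plus a manual check of every hit.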

Which judges is the story about? Let's try to use NER to find out!

In [123]:
# Download and parse the article
url = "https://www.revealnews.org/article/federal-judges-rulings-favored-companies-in-which-he-owned-stock/"
response = requests.get(url)
bs_doc = BeautifulSoup(response.text, 'lxml')

In [124]:
# Pull out the article content and run it through spaCy
content = bs_doc.select_one("#content_body").text
doc = nlp(content)

Now that we've processed it, it's just a matter of building our entities dataframe and **just looking at the people**.

In [125]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
df.head()

| | text | type | lemma |
|---|---|---|---|
| 0 | Manuel Real | PERSON | Manuel Real |
| 1 | the U.S. District Court | ORG | the U.S. District Court |
| 2 | Los Angeles | GPE | Los Angeles |
| 3 | 1966.Photo | CARDINAL | 1966.photo |
| 4 | Virginia Lee Hunter | PERSON | Virginia Lee Hunter |


In [126]:
df[df.type == 'PERSON'].lemma.value_counts().head()

Herndon 8
Patel 6
Schneider 6
Dwyer 4
Anderson 4
Name: lemma, dtype: int64

Going by the counts, the story looks like it's all about a handful of judges: Herndon, Patel, Schneider, and Dwyer, at least. **Because we shouldn't just blindly trust an algorithm,** let's use our eyes and brain and actually read the spaCy-annotated article.

In [127]:
displacy.render(doc, style="ent")

Here's the thing: **after the first two hits for Manuel Real, it misses almost every other mention of him.** And he's the _main character of the piece!_

There's also a great point where `Verizon` is tagged as a person, and then as an organization in the very next sentence. It's trying its best, I guess!
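
One way to catch this kind of gap is to compare how often NER tagged a name against how often the name appears as plain text. The sketch below does that with a regular expression. The sample sentence and the `ner_hits` number are made up for illustration, and `count_mentions` is a hypothetical helper.

```python
import re


def count_mentions(text, name):
    """Count case-sensitive, whole-word occurrences of name in text."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return len(re.findall(pattern, text))


# Illustrative text, not a quote from the article.
sample = "Judge Manuel Real heard the case. Real later ruled on it. The real issue came afterward."

raw_hits = count_mentions(sample, "Real")
ner_hits = 1  # pretend NER only tagged the first mention

if raw_hits > ner_hits:
    print(f"NER found {ner_hits} of {raw_hits} mentions of 'Real': time to read the text")
```

A surname like `Real` is a hard case because it doubles as an ordinary word, so the plain-text count can overshoot too. Treat a mismatch as a prompt to go look, not as proof of an error.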

## Addressing the issues

While you [can train spaCy further](https://spacy.io/usage/training#ner) or use a different analyzer altogether, all NER systems have weaknesses and are apt to make mistakes. Instead of pretending you're going to hit 100% with your tool, it's best to design your process knowing that you aren't likely to get everything! If you don't check for gaps, _you might completely miss out on some entities_, like we did with Manuel Real up above.
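
If you already know a name matters, one lightweight alternative to retraining is spaCy's rule-based `EntityRuler`, which tags exact matches for patterns you supply. This is a minimal sketch assuming spaCy v3, where the component is added with `nlp.add_pipe("entity_ruler")` (older versions set it up differently). It uses a blank English pipeline so the rule is the only thing producing entities, and the example sentence is made up.

```python
import spacy

# A blank pipeline has a tokenizer but no statistical NER,
# so every entity below comes from the rule we add.
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Real"}])

# Illustrative sentence, not a quote from the article.
rule_doc = nlp_rules("Real presided over the case, and Real later ruled on it.")
found = [(ent.text, ent.label_) for ent in rule_doc.ents]
print(found)
```

A bare pattern like `Real` will also match phrases such as "Real estate" at the start of a sentence, so rule-based hits need the same review as everything else.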

Spot-checking and reviewing a random sample of results is always a good idea, just to see what tweaks you might need to make.
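
A reproducible random sample is easy to pull with the standard library. In this sketch the `rows` list stands in for the `(text, type)` pairs in the entities dataframe (the values are copied from outputs earlier on this page), `sample_for_review` is a hypothetical helper, and the seed value is arbitrary.

```python
import random

# (text, type) pairs copied from the outputs shown earlier.
rows = [
    ("Virginia", "GPE"),
    ("the U.S. Navy", "ORG"),
    ("Arthur Morgan", "PERSON"),
    ("Surveillance Equipment Group Inc.", "ORG"),
    ("SEG Armor", "ORG"),
    ("Manuel Real", "PERSON"),
    ("Los Angeles", "GPE"),
    ("1966.Photo", "CARDINAL"),
]


def sample_for_review(rows, k, seed=0):
    """Pick k rows at random; a fixed seed returns the same sample every run."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


to_check = sample_for_review(rows, k=3)
for text, label in to_check:
    print(f"{label:10} {text}")
```

If the entities are already in a dataframe, `df.sample(n=3, random_state=0)` should give you the same kind of repeatable sample.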

It also might be healthier to think of NER as a **search engine** that's friendly enough to give you a suggested list of automatically-generated search terms, **not as an authoritative list of what's in the text.**
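
In that spirit, here is a small sketch that treats entity names as search terms and prints each hit with some surrounding context so a human can judge it. The `keyword_in_context` helper and its `width` parameter are hypothetical, and the sample text is the sentence from the first spaCy example.

```python
import re


def keyword_in_context(text, term, width=20):
    """Return each match of term with up to `width` characters on either side."""
    snippets = []
    for match in re.finditer(re.escape(term), text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


text = "Apple is looking at buying U.K. startup for $1 billion"
for term in ["Apple", "U.K."]:
    for snippet in keyword_in_context(text, term, width=10):
        print(f"{term}: ...{snippet}...")
```

The search is a plain substring match on purpose: it finds mentions whether or not NER tagged them, which is exactly the safety net we were missing with Manuel Real.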

## More reading

If you're interested in taking NER further, the spaCy documentation has some great examples. [One in particular sticks out](https://spacy.io/usage/examples#entity-relations):

> Here, we extract money and currency values (entities labelled as MONEY) and then check the dependency tree to find the noun phrase they are referring to – for example: `"$9.4 million"` → `"Net income"`.

Amazing!

## Review

In this section we covered **named entity recognition**, which can be used to extract "real world" objects from text. We used the spaCy library to find companies, people, countries, and more.

We quickly came up against some issues with NER as a Source of Truth, as it often misclassifies or misses entities that humans would easily understand. While it isn't a replacement for actually reading documents, with a healthy dose of skepticism and spot-checking it's sure to aid in research and analysis.
