The goal of NER is to label names of people, places, organizations, and other entities of interest in text documents. There are three major approaches to NER: lexicon-based, rule-based, and machine learning-based, and a single NER system may combine more than one of these (Keretna et al., 2014). Some approaches to NER rely on POS tagging, and NER itself often serves as a preprocessing step for downstream tasks such as information or relationship extraction.
For example, NER can tell us which companies, organizations, people, or products are mentioned in an article or any other piece of text.
Shown in the table are tools commonly used for NER tagging. All of these tools are based primarily on statistical approaches.
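As a toy illustration of the lexicon-based approach mentioned above, the sketch below looks each token up in a small hand-built gazetteer; the lexicon entries and the tag_with_lexicon helper are hypothetical and purely for illustration.
# A minimal lexicon-based tagger: look up each token in a hand-built gazetteer.
# The lexicon below is a toy example, not a real resource.
lexicon = {'Google': 'ORG', 'Brussels': 'LOC', 'Wednesday': 'DATE'}

def tag_with_lexicon(tokens, lexicon):
    # tag tokens found in the lexicon; everything else gets 'O'
    return [(tok, lexicon.get(tok, 'O')) for tok in tokens]

print(tag_with_lexicon('European authorities fined Google on Wednesday'.split(), lexicon))
# [('European', 'O'), ('authorities', 'O'), ('fined', 'O'), ('Google', 'ORG'), ('on', 'O'), ('Wednesday', 'DATE')]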
NER pipeline
A typical NER pipeline preprocesses the data (tokenization, sentence splitting), extracts features, applies an ML model to tag the tokens, and then post-processes the output to remove tagging inconsistencies. Fig. 5 illustrates this pipeline.
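The post-processing stage is the one not shown in the code that follows, so here is a minimal sketch of what it might look like: a pass that repairs IOB inconsistencies where an I- tag appears without a preceding tag of the same entity type. The fix_iob helper is hypothetical, not part of NLTK or spaCy.
# Sketch of a post-processing pass: an orphan 'I-' tag (no preceding tag of
# the same type) is rewritten as a 'B-' tag. fix_iob is a hypothetical helper.
def fix_iob(tagged):
    # tagged: list of (token, iob_tag) pairs, e.g. [('Google', 'I-ORG'), ...]
    fixed, prev_type = [], None
    for token, tag in tagged:
        if tag.startswith('I-') and prev_type != tag[2:]:
            tag = 'B-' + tag[2:]
        prev_type = tag[2:] if tag != 'O' else None
        fixed.append((token, tag))
    return fixed

print(fix_iob([('European', 'I-NORP'), ('authorities', 'O'), ('Google', 'I-ORG')]))
# [('European', 'B-NORP'), ('authorities', 'O'), ('Google', 'B-ORG')]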
With NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
sentence = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
# define a function for preprocessing the sentence: tokenize, then POS-tag
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent
sent = preprocess(sentence)
sent
#Our chunk pattern consists of one rule, that a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.
pattern = 'NP: {<DT>?<JJ>*<NN>}'
#Chunking
#Using this pattern, we create a chunk parser and test it on our sentence.
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
print(iob_tagged)
from nltk.chunk import ne_chunk
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(ne_tree)
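The result of ne_chunk is an nltk.Tree whose entity chunks are labeled subtrees; a short loop like the sketch below (reusing ne_tree from the previous cell) pulls out the entity strings and their labels.
# Walk the ne_chunk tree and collect (entity text, label) pairs.
named_entities = []
for subtree in ne_tree:
    if hasattr(subtree, 'label'):  # entity chunks are subtrees with a label()
        entity = " ".join(token for token, pos in subtree.leaves())
        named_entities.append((entity, subtree.label()))
print(named_entities)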
With spaCy
spaCy’s named entity recognizer has been trained on the OntoNotes 5 corpus and supports entity types such as PERSON, ORG, GPE, DATE, and MONEY.
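The full label set of the installed model can also be listed at runtime; a quick check, assuming the en_core_web_sm model has been downloaded:
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.get_pipe('ner').labels)   # the NER labels this model was trained with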
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
print([(X.text, X.label_) for X in doc.ents])
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
# Extracting named entities from an article
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)
labels = [x.label_ for x in article.ents]
Counter(labels)
items = [x.text for x in article.ents]
Counter(items).most_common(3)
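Counting all entities mixes types together; restricting the counter to a single label, say PERSON, shows who the article mentions most often (a small variation on the cells above, reusing article).
# Count only PERSON entities to see who is mentioned most often.
person_counts = Counter(x.text for x in article.ents if x.label_ == 'PERSON')
print(person_counts.most_common(5))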
# Let’s pick one sentence to look at in more detail.
sentences = [x for x in article.sents]
print(sentences[20])
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')
displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})
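Outside a notebook, displacy.render can return the raw HTML markup instead of displaying it inline; writing it to a file is one way to inspect the visualization (the file name below is arbitrary).
html = displacy.render(nlp(str(sentences[20])), style='ent', jupyter=False)
with open('sentence_entities.html', 'w', encoding='utf-8') as f:
    f.write(html)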
# parts of speech for the non-stopword, non-punctuation tokens in the sentence
[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(sentences[20]))
 if not x.is_stop and x.pos_ != 'PUNCT']
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])
Final output: