The goal of named entity recognition (NER) is to label names of people, places, organizations, and other entities of interest in text documents. There are three major approaches to NER: lexicon-based, rule-based, and machine-learning-based. A NER system may, however, combine more than one of these approaches (Keretna et al., 2014). Some approaches to NER rely on POS tagging. NER is also a preprocessing step for tasks such as information extraction and relationship extraction.
For example, NER can identify which companies or organizations are mentioned in an article, or whether a text contains the name of a person or a product.
The Table lists tools used for NER tagging. All of these tools are based primarily on statistical approaches.
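To make the lexicon-based and rule-based approaches concrete, here is a minimal sketch, not a production tagger. The lexicon entries, the money regular expression, and the function name `tag_entities` are illustrative assumptions and are not taken from any of the tools discussed here.

```python
import re

# Hypothetical lexicon (gazetteer): surface form -> entity label.
LEXICON = {
    "Google": "ORG",
    "European": "NORP",
    "Wednesday": "DATE",
}

# Hypothetical rule: a dollar amount with an optional magnitude word.
MONEY_RULE = re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?")


def tag_entities(text):
    """Return sorted (start, surface, label) tuples from lexicon lookup and one regex rule."""
    found = []
    # Lexicon-based: whole-word lookup of known names.
    for surface, label in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            found.append((match.start(), surface, label))
    # Rule-based: a pattern for entities that cannot be enumerated in a list.
    for match in MONEY_RULE.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    return sorted(found)


text = "European authorities fined Google a record $5.1 billion on Wednesday"
for start, surface, label in tag_entities(text):
    print(start, surface, label)
```

A machine-learning-based system would replace the hand-written lexicon and rule with a model trained on annotated text, which is why the three approaches are often combined.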

NER pipeline
A typical NER pipeline preprocesses the data (sentence splitting and tokenization), extracts features, applies an ML model to tag the tokens, and then post-processes the output to remove tagging inconsistencies. Fig. 5 illustrates this pipeline.
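The stages can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the function names are hypothetical, the splitter and tokenizer are naive regular expressions, and `tag` is a capitalization heuristic standing in for a trained ML model.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def extract_features(tokens):
    # One small feature dict per token; real systems use many more features.
    return [
        {"word": tok, "is_title": tok.istitle(), "is_first": i == 0}
        for i, tok in enumerate(tokens)
    ]


def tag(features):
    # Stand-in for a trained model: capitalized, non-initial tokens are entities.
    return ["I-ENT" if f["is_title"] and not f["is_first"] else "O" for f in features]


def postprocess(tags):
    # Remove one inconsistency: an I- tag that starts an entity becomes a B- tag.
    fixed = []
    for i, t in enumerate(tags):
        if t.startswith("I-") and (i == 0 or tags[i - 1] == "O"):
            t = "B-" + t[2:]
        fixed.append(t)
    return fixed


def run_pipeline(text):
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        tags = postprocess(tag(extract_features(tokens)))
        results.append(list(zip(tokens, tags)))
    return results


for tagged_sentence in run_pipeline("Regulators fined Google on Wednesday. The company will appeal."):
    print(tagged_sentence)
```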

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# The NLTK tokenizer and tagger models must be downloaded (via nltk.download) before this runs.
sentence = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

# Preprocess the sentence: tokenize it, then attach a part-of-speech tag to each token.
def preprocess(sent):
    sent = word_tokenize(sent)
    sent = pos_tag(sent)
    return sent

sent = preprocess(sentence)

# The chunk grammar has a single rule: form a noun phrase (NP) whenever the chunker
# finds an optional determiner (DT), followed by any number of adjectives (JJ),
# and then a noun (NN).
pattern = 'NP: {<DT>?<JJ>*<NN>}'
# Using this pattern, create a chunk parser and apply it to the tagged sentence.
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
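
To see what the tag pattern matches, the sketch below applies the same idea with Python's `re` module: the POS tags of a sentence are written out as one string, and `<DT>?<JJ>*<NN>` is run over it as an ordinary regular expression. This is a simplified illustration and not NLTK's implementation; the hand-tagged input and the helper name `find_noun_phrases` are assumptions.

```python
import re

# Same shape as the chunk rule: optional determiner, any adjectives, one noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def find_noun_phrases(tagged):
    """Return the word sequences whose tag sequence matches NP_PATTERN."""
    tag_string = "".join("<" + tag + ">" for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Convert character offsets back to token indices by counting tags.
        first = tag_string[: match.start()].count("<")
        last = first + match.group().count("<")
        phrases.append([word for word, _ in tagged[first:last]])
    return phrases


# Hand-tagged example (an assumption, not tagger output).
tagged = [("the", "DT"), ("mobile", "JJ"), ("phone", "NN"), ("market", "NN"), ("grew", "VBD")]
print(find_noun_phrases(tagged))
```

Because the rule ends in exactly one `NN`, "phone market" is split into two noun phrases; a pattern ending in `<NN>+` would keep consecutive nouns together.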

from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
# Convert the chunk tree to (word, POS tag, IOB tag) triples and display them.
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)
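
`tree2conlltags` returns one `(word, POS tag, IOB tag)` triple per token, where a `B-` prefix marks the first token of a chunk, `I-` marks a continuation, and `O` marks a token outside any chunk. A sketch of how such triples can be grouped back into chunks follows; the sample triples are hand-written for illustration, and `group_chunks` is a hypothetical helper, not part of NLTK.

```python
def group_chunks(iob_triples):
    """Group (word, pos, iob) triples into (label, [words]) chunks."""
    chunks = []
    current = None  # (label, words) for the chunk being built, or None
    for word, _pos, iob in iob_triples:
        if iob.startswith("B-"):
            # A begin tag always opens a new chunk.
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            # An inside tag extends the open chunk of the same label.
            current[1].append(word)
        else:
            # "O" tags and stray inside tags close the open chunk.
            current = None
    return chunks


# Hand-written sample in the triple format described above.
sample = [
    ("fined", "VBD", "O"),
    ("a", "DT", "B-NP"),
    ("record", "NN", "I-NP"),
    ("in", "IN", "O"),
    ("the", "DT", "B-NP"),
    ("mobile", "JJ", "I-NP"),
    ("phone", "NN", "I-NP"),
]
print(group_chunks(sample))
```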

# Run NLTK's built-in named entity chunker on the same sentence.
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(sentence)))

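With spaCy
spaCy's named entity recognizer has been trained on the OntoNotes 5 corpus and supports the following entity types: PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, and CARDINAL.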

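import spacy
from spacy import displacy
from collections import Counter
from pprint import pprint
import en_core_web_sm

# Load the small English model; it must be installed separately.
nlp = en_core_web_sm.load()
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
# Entity-level view: the text and label of each recognized entity.
print([(ent.text, ent.label_) for ent in doc.ents])

# Token-level view: each token with its IOB code and entity type.
pprint([(token, token.ent_iob_, token.ent_type_) for token in doc])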


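# Extracting named entities from an article
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    # Download the page and return its visible text as a single string.
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # Remove script, style, and aside elements so that only the article text remains.
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

# Note: only the tail of the article URL is present here; supply the full URL before running.
ny_bb = url_to_string('&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)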
labels = [x.label_ for x in article.ents]

items = [x.text for x in article.ents]
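
With the labels and surface forms collected, `collections.Counter` gives a quick summary of which entity types and which entities dominate a text. The lists in this sketch are placeholders standing in for `labels` and `items`, since the real values depend on the article and on the model; the names `sample_labels` and `sample_items` are assumptions.

```python
from collections import Counter

# Placeholder data standing in for the `labels` and `items` lists.
sample_labels = ["ORG", "DATE", "ORG", "MONEY", "ORG", "DATE"]
sample_items = ["Google", "Wednesday", "Google", "$5.1 billion", "Google", "Wednesday"]

# How often each entity type occurs.
label_counts = Counter(sample_labels)
print(label_counts.most_common(2))

# The most frequently mentioned entity.
item_counts = Counter(sample_items)
print(item_counts.most_common(1))
```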

# Pick one sentence (index 20) to look at in more detail.
sentences = list(article.sents)
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

# Parts of speech and lemmas, skipping stop words and punctuation
[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(sentences[20]))
 if not x.is_stop and x.pos_ != 'PUNCT']

dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])

Final output: