Named Entity Recognition

Author

Janpieter van der Pol

Introduction

Named Entity Recognition (NER) is a text-analysis technique that automatically finds and labels “things” mentioned in written text. Those “things” are called entities: for example, people (Greta Thunberg), organizations (Utrecht University), locations (Amsterdam), dates (5 January 2026), or sometimes more specific categories such as products, events, or laws, depending on the model you use.

A practical way to think about NER is that it turns unstructured text into structured data. Humans can easily read a sentence and notice who it is about, where it happens, and which organizations are involved. Computers do not “understand” text in that way by default. NER provides a bridge by converting text into a list of detected entities, their types, and their positions in the text.

For example, in the sentence:

“Utrecht University announced a new partnership in Amsterdam on 5 January 2026.”

an NER system might extract something like:

  • “Utrecht University” is an Organization
  • “Amsterdam” is a Location
  • “5 January 2026” is a Date

Once entities are identified, you can start doing analyses that would otherwise be tedious or impossible at scale, such as:

  • Counting how often certain people or organizations are mentioned across thousands of documents
  • Mapping which organizations are discussed together (co-occurrence networks)
  • Tracking where attention is geographically concentrated (locations over time)
  • Filtering or summarizing documents by which entities appear in them

In this course, you will learn how to run NER in R and transform the results into tidy, analysis-ready tables. The key idea is not just to “detect entities,” but to use them as building blocks for downstream tasks such as visualization, trend analysis, and network analysis. By the end, you should be able to take a collection of texts and answer concrete questions like: Which actors are most prominent? Which places are most frequently discussed? And how do these patterns change over time or differ across sources?

Setup

First, we need to activate the Python environment so that we can run the models.

# Activate the python environment we need to run the models
reticulate::use_virtualenv("bertopic_r_env_spacy", required = TRUE)
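
If you have not yet created this virtual environment on your machine, a minimal one-time setup sketch is shown below. The environment name matches the call above, but the Python package list (transformers and torch, which the Hugging Face pipelines rely on) is an assumption and may differ from the course setup files.

# One-time setup (sketch): create the environment and install the Python
# packages the pipelines rely on (assumed here: transformers and torch)
reticulate::virtualenv_create("bertopic_r_env_spacy")
reticulate::py_install(c("transformers", "torch"), envname = "bertopic_r_env_spacy")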

Now we can start downloading the models. Let’s start by loading a BERT model trained for NER. We add the argument aggregation_strategy = "simple" to get output with readable words rather than a list of subword-level tokens (cf. slides from the lecture). For NER, we will use dslim/bert-large-NER; if you have less patience or computing power, the smaller dslim/bert-base-NER is also an option. hf_load_pipeline() downloads the model directly and puts it to use.

NER_extract <- huggingfaceR::hf_load_pipeline(
  model = "dslim/bert-large-NER", 
  task = "ner", 
  aggregation_strategy = "simple")


dslim/bert-large-NER is ready for ner
text <- c("The 2024 edition of The European 5G Conference will take place on 30-31 January at the Hotel nhow Brussels Bloom. Now, in its 8th year, the conference has an established reputation as Brussels’ leading meeting place for discussion on 5G policy. Registration is now available – secure your place today. The event will, once again, provide the opportunity to hear from high-level policymakers and industry stakeholders on key themes such as investment, security, sustainability, emerging business models, and connectivity. It will provide an update on progress that has been made towards the 2030 ‘Path to the Digital Decade’ targets, as well as offering a first opportunity to examine the outcomes from WRC-23 and at what this may mean for the future connectivity environment around 5G and future technologies. By looking back at the lessons learnt to date and forward to the path towards 5G Advanced and 6G, the event will provide a comprehensive insight into all the key policy aspects that are shaping the 5G ecosystem in Europe.")
extracted_NE <- NER_extract(text)
# Transform the output into a readable data frame:
extracted_NE <- plyr::ldply(extracted_NE, data.frame)
extracted_NE
   entity_group     score                       word start  end
1          MISC 0.9934977                        The    20   23
2          MISC 0.8844852     European 5G Conference    24   46
3           LOC 0.8335104                      Hotel    87   92
4           LOC 0.4823326                      ##how    94   97
5           LOC 0.9036462             Brussels Bloom    98  112
6           LOC 0.9993755                   Brussels   184  192
7          MISC 0.5425270                          5   234  235
8          MISC 0.6118023                          ‘   595  596
9          MISC 0.9522035 Path to the Digital Decade   596  622
10         MISC 0.9099926                        WRC   702  705
11         MISC 0.5324209                          5   888  889
12          LOC 0.9994928                     Europe  1024 1030
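
The score column reports the model’s confidence in each span. Several of the low-scoring rows (the lone The, the subword fragment ##how, the stray 5) are tokenization artifacts rather than real entities. A simple cleaning step is to drop low-confidence rows; the sketch below uses an arbitrary 0.8 cutoff that you should tune for your own data.

# Keep only entities the model is reasonably confident about
# (the 0.8 threshold is an illustrative choice, not a recommendation)
extracted_NE_clean <- dplyr::filter(extracted_NE, score >= 0.8)
extracted_NE_clean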

We can do the same with a model that can handle multiple languages. The basic structure is the same: we change the model identifier and, in this case, also pass the matching tokenizer explicitly:

multilanguage_NER <- huggingfaceR::hf_load_pipeline(
  model = "Babelscape/wikineural-multilingual-ner",
  tokenizer = "Babelscape/wikineural-multilingual-ner",
  task = "ner",
  aggregation_strategy = "simple")


Babelscape/wikineural-multilingual-ner is ready for ner
test_multi <- multilanguage_NER(text)
test_multi <- plyr::ldply(test_multi, data.frame)
test_multi
  entity_group     score                       word start  end
1         MISC 0.8881885 The European 5G Conference    20   46
2          LOC 0.9967667  Hotel nhow Brussels Bloom    87  112
3          LOC 0.8897376                   Brussels   184  192
4         MISC 0.9636980 Path to the Digital Decade   596  622
5          LOC 0.9381566                     Europe  1024 1030
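
Two things are worth noting. First, this model aggregates more cleanly here: where bert-large-NER split the hotel name into fragments (Hotel, ##how, Brussels Bloom), the multilingual model returns Hotel nhow Brussels Bloom as a single location. Second, because the model is multilingual, the same pipeline works on non-English text. A quick sketch with a Dutch sentence (the sentence is just an illustration):

# The same pipeline applied to Dutch text
text_nl <- "De Universiteit Utrecht kondigde op 5 januari 2026 een samenwerking aan in Amsterdam."
extracted_nl <- multilanguage_NER(text_nl)
plyr::ldply(extracted_nl, data.frame)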

Illustration

load("/Users/janpieter/Desktop/Teachings/NetworkIsLife2/NetworkIsLife/LN_dataframe.rdata")
# the text we want to analyse is in the "Article" column
safe_ner <- purrr::possibly(NER_extract, otherwise = list())

res <- climate_change_speeches %>%
  mutate(
    entities = map(text, safe_ner)
  ) %>%
  select(speech_id, entities) %>%
  unnest_longer(entities) %>%                 # one list element per entity
  unnest_wider(entities) %>%                  # columns like entity_group, word, score, start, end
  relocate(speech_id)

NER_results_climatechangespeeches <- res
save(NER_results_climatechangespeeches, file = "NER_results_climatechangespeeches.rdata")
load("NER_results_climatechangespeeches.rdata")
head(NER_results_climatechangespeeches)
  speech_id entity_group     score                  word start  end
1         1          ORG 0.4085696                 House    29   34
2         1          ORG 0.3730169                 House   165  170
3         1          LOC 0.3669575                 House   338  343
4         1         MISC 0.9975488   European Green Deal   932  951
5         1         MISC 0.9970733 Net Zero Industry Act   962  983
6         1          ORG 0.3932655                 House  1869 1874
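
With the entities in a tidy table, the analyses sketched in the introduction come down to a few lines of dplyr. The sketch below assumes only the columns shown above: it first counts the most frequently mentioned organizations (the score >= 0.8 cutoff is again an arbitrary cleaning choice), then builds a co-occurrence edge list of organizations named in the same speech, which can feed directly into a network analysis.

library(dplyr)

# Keep reasonably confident organization mentions
orgs <- NER_results_climatechangespeeches %>%
  filter(entity_group == "ORG", score >= 0.8)

# Which organizations are mentioned most often?
orgs %>%
  count(word, sort = TRUE)

# Co-occurrence edge list: pairs of organizations named in the same speech
org_by_speech <- orgs %>% distinct(speech_id, word)
org_by_speech %>%
  inner_join(org_by_speech, by = "speech_id",
             relationship = "many-to-many") %>%  # self-join within each speech
  filter(word.x < word.y) %>%                    # keep each unordered pair once
  count(word.x, word.y, sort = TRUE)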