# Activate the python environment we need to run the models
reticulate::use_virtualenv("bertopic_r_env_spacy", required = TRUE)

Named Entity Recognition
Introduction
Named Entity Recognition (NER) is a text-analysis technique that automatically finds and labels “things” mentioned in written text. Those “things” are called entities—for example people (Greta Thunberg), organizations (Utrecht University), locations (Amsterdam), dates (5 January 2026), or sometimes more specific categories such as products, events, or laws, depending on the model you use. A practical way to think about NER is: it turns unstructured text into structured data. Humans can easily read a sentence and notice who it is about, where it happens, and which organizations are involved. Computers do not “understand” text in that way by default. NER provides a bridge by converting text into a list of detected entities plus their types and positions in the text.
For example, in the sentence:
“Utrecht University announced a new partnership in Amsterdam on 5 January 2026.”

an NER system might extract something like:

- “Utrecht University” is an Organization
- “Amsterdam” is a Location
- “5 January 2026” is a Date
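In R, this structured output is naturally represented as a data frame with one row per detected entity. A hand-typed sketch for the sentence above (the column names mirror what the Hugging Face pipeline returns later in this document; note that the real pipeline reports 0-based character offsets, while the positions here are 1-based for readability):

```r
# Hand-made illustration of NER output: one row per entity,
# with its type and character positions in the sentence.
entities <- data.frame(
  word         = c("Utrecht University", "Amsterdam", "5 January 2026"),
  entity_group = c("ORG", "LOC", "DATE"),
  start        = c(1, 51, 64),
  end          = c(18, 59, 77)
)
entities
```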
Once entities are identified, you can start doing analyses that would otherwise be tedious or impossible at scale, such as:
- Counting how often certain people or organizations are mentioned across thousands of documents
- Mapping which organizations are discussed together (co-occurrence networks)
- Tracking where attention is geographically concentrated (locations over time)
- Filtering or summarizing documents by which entities appear in them
In this course, you will learn how to run NER in R and transform the results into tidy, analysis-ready tables. The key idea is not just to “detect entities,” but to use them as building blocks for downstream tasks such as visualization, trend analysis, and network analysis. By the end, you should be able to take a collection of texts and answer concrete questions like: Which actors are most prominent? Which places are most frequently discussed? And how do these patterns change over time or differ across sources?
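Once entities live in a tidy table, the analyses listed above reduce to ordinary data manipulation. A base-R sketch on a made-up entity table (the data here is invented purely for illustration):

```r
# Toy entity table: one row per detected entity, across several documents
ner <- data.frame(
  doc          = c(1, 1, 2, 2, 3),
  word         = c("Amsterdam", "Utrecht University", "Amsterdam",
                   "Greta Thunberg", "Amsterdam"),
  entity_group = c("LOC", "ORG", "LOC", "PER", "LOC")
)

# How often is each entity mentioned across all documents?
sort(table(ner$word), decreasing = TRUE)

# Which documents mention a given location?
ner$doc[ner$word == "Amsterdam"]
```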
Setup
First we need to start the Python environment so that we can run the models. This is what the reticulate::use_virtualenv() call at the top of this document does.
Now we can start downloading the models. Let’s start by loading a BERT model trained for NER. We add the argument aggregation_strategy = "simple" to get output that includes readable tokens rather than a list of syllable-level tokens (cf. slides from the lecture). For NER, we will use the following model: bert-large-NER. If you have less patience or computation power, bert-base-NER is a lighter alternative. hf_load_pipeline will download the model directly and put it to use.
NER_extract <- huggingfaceR::hf_load_pipeline(
model = "dslim/bert-large-NER",
task = "ner",
aggregation_strategy = "simple")
dslim/bert-large-NER is ready for ner
text <- c("The 2024 edition of The European 5G Conference will take place on 30-31 January at the Hotel nhow Brussels Bloom. Now, in its 8th year, the conference has an established reputation as Brussels’ leading meeting place for discussion on 5G policy. Registration is now available – secure your place today. The event will, once again, provide the opportunity to hear from high-level policymakers and industry stakeholders on key themes such as investment, security, sustainability, emerging business models, and connectivity. It will provide an update on progress that has been made towards the 2030 ‘Path to the Digital Decade’ targets, as well as offering a first opportunity to examine the outcomes from WRC-23 and at what this may mean for the future connectivity environment around 5G and future technologies. By looking back at the lessons learnt to date and forward to the path towards 5G Advanced and 6G, the event will provide a comprehensive insight into all the key policy aspects that are shaping the 5G ecosystem in Europe.")
extracted_NE <- NER_extract(text)
#transform output into something readable:
extracted_NE <- plyr::ldply(extracted_NE, data.frame)
extracted_NE

   entity_group     score                       word start  end
1 MISC 0.9934977 The 20 23
2 MISC 0.8844852 European 5G Conference 24 46
3 LOC 0.8335104 Hotel 87 92
4 LOC 0.4823326 ##how 94 97
5 LOC 0.9036462 Brussels Bloom 98 112
6 LOC 0.9993755 Brussels 184 192
7 MISC 0.5425270 5 234 235
8 MISC 0.6118023 ‘ 595 596
9 MISC 0.9522035 Path to the Digital Decade 596 622
10 MISC 0.9099926 WRC 702 705
11 MISC 0.5324209 5 888 889
12 LOC 0.9994928 Europe 1024 1030
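The score column is the model’s confidence in each detection; low-confidence rows (such as the stray “##how” subword fragment above) can be removed with a simple threshold. A base-R sketch on a toy table (the 0.8 cutoff is an arbitrary choice for illustration, not a recommendation from the model authors):

```r
# Toy version of the extracted_NE table
extracted <- data.frame(
  entity_group = c("LOC", "LOC", "MISC"),
  score        = c(0.48, 0.99, 0.54),
  word         = c("##how", "Brussels", "5")
)

# Keep only entities the model is reasonably confident about
confident <- extracted[extracted$score >= 0.8, ]
confident$word
```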
We can do the same with a different model that is capable of handling multiple languages. The basic structure is the same: we change the model name, and adjust which weights we download (TensorFlow or PyTorch):
multilanguage_NER <- huggingfaceR::hf_load_pipeline(
  model = "Babelscape/wikineural-multilingual-ner",
  tokenizer = "Babelscape/wikineural-multilingual-ner",
  task = "ner",
  aggregation_strategy = "simple")
Babelscape/wikineural-multilingual-ner is ready for ner
test_multi <- multilanguage_NER(text)
test_multi <- plyr::ldply(test_multi, data.frame)
test_multi

  entity_group     score                       word start  end
1 MISC 0.8881885 The European 5G Conference 20 46
2 LOC 0.9967667 Hotel nhow Brussels Bloom 87 112
3 LOC 0.8897376 Brussels 184 192
4 MISC 0.9636980 Path to the Digital Decade 596 622
5 LOC 0.9381566 Europe 1024 1030
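Because both models return tables of the same shape, comparing them is a set operation on the extracted surface forms. A base-R sketch using the words hand-copied from the two output tables above (exact string matching only, so partial overlaps like "Brussels Bloom" vs. "Hotel nhow Brussels Bloom" count as differences):

```r
# Surface forms extracted by each model (copied from the tables above)
bert_words  <- c("The", "European 5G Conference", "Hotel", "##how",
                 "Brussels Bloom", "Brussels", "5",
                 "Path to the Digital Decade", "WRC", "Europe")
multi_words <- c("The European 5G Conference", "Hotel nhow Brussels Bloom",
                 "Brussels", "Path to the Digital Decade", "Europe")

# Entities found by both models
intersect(bert_words, multi_words)

# Found only by the first (monolingual BERT) model
setdiff(bert_words, multi_words)
```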
Illustration
load("/Users/janpieter/Desktop/Teachings/NetworkIsLife2/NetworkIsLife/LN_dataframe.rdata")
# the text we want to analyse is in the "text" column of climate_change_speeches
library(dplyr)
library(purrr)
library(tidyr)

# If the pipeline errors on a document, return an empty list instead of stopping
safe_ner <- purrr::possibly(NER_extract, otherwise = list())

res <- climate_change_speeches %>%
  mutate(
    entities = map(text, safe_ner)
  ) %>%
  select(speech_id, entities) %>%
  unnest_longer(entities) %>%   # one row per detected entity
  unnest_wider(entities) %>%    # columns: entity_group, word, score, start, end
  relocate(speech_id)
NER_results_climatechangespeeches <- res
save(NER_results_climatechangespeeches, file = "NER_results_climatechangespeeches.rdata")

load("NER_results_climatechangespeeches.rdata")
head(NER_results_climatechangespeeches)

  speech_id entity_group     score                  word start  end
1 1 ORG 0.4085696 House 29 34
2 1 ORG 0.3730169 House 165 170
3 1 LOC 0.3669575 House 338 343
4 1 MISC 0.9975488 European Green Deal 932 951
5 1 MISC 0.9970733 Net Zero Industry Act 962 983
6 1 ORG 0.3932655 House 1869 1874
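With a table like NER_results_climatechangespeeches in hand, typical downstream questions become one-liners. A base-R sketch on a toy table with the same columns (the data is invented for illustration):

```r
# Toy version of the per-speech NER results
ner_results <- data.frame(
  speech_id    = c(1, 1, 2, 2, 2),
  entity_group = c("ORG", "MISC", "ORG", "LOC", "MISC"),
  word         = c("House", "European Green Deal", "House",
                   "Europe", "European Green Deal")
)

# Which entities are mentioned most often overall?
sort(table(ner_results$word), decreasing = TRUE)

# How many entities of each type appear in each speech?
table(ner_results$speech_id, ner_results$entity_group)
```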