State of Data

This document describes the current state of the data available in the ESMEE project. Because we continuously add to and improve the data, we maintain this webpage so that the information is easy to update and everyone has the latest version.

In short, the data sources we currently have are:

Type	Source	Time frame	Regionalised	Ecosystem Element
Patents	Regpat/Lens.org	1800-Today	Yes - by hand	Networks, Knowledge, Talent, Leadership
Publications	Lens.org	1800-Today	Yes - by hand	Networks, Knowledge, Talent, Leadership
European Projects	EU	FP1-2024	Yes	Networks, Knowledge, Leadership
Keep/Interreg	EU		Yes	Networks, Knowledge, Leadership
Startups	Crunchbase	2010-2024	Yes	Culture, Finance
Subsidies	RVO		by hand	Formal Institutions
News	NPO	2010-2024	by hand	Culture, Leadership, Demand, Formal Institutions
Trademarks/designs	EU		by hand	Culture
Eco-label	EU		by hand	Culture

Patents

Patents are used as an indicator of technological innovation. Each patent family contains information on a specific invention that, at the time of submission, was new to the state of the art. When all patent information is aggregated into a database, we have a valuable source of knowledge on technological innovation.

Summarizing the data available in patents and all the research questions they can answer is quite complicated. A short answer would be: patents allow us to measure and understand who developed (inventors/assignees) which technologies (classifications, text), where (addresses of inventors and assignees), and based on what knowledge (citations).

This data has many complexities and limitations, which we do not describe here because there are too many to address.

What data do we have

For the measurement of technological innovation, we use two patent databases. The first is REGPAT, which is provided by the OECD and in which the patents are already regionalised. The second is Lens.org.

How are regions attributed to a patent? There are two ways to attribute a region to a patent document. The first is to use the address of the patent applicant: if the applicant has supplied an address on the patent, we can use it to attach a region to the patent. The second is to use the address of the inventor.

In scientific research, the address of the inventor is usually used to regionalise a patent. The reasoning is that we want to approximate the region in which the knowledge is located. Inventors usually live close to the place where they work, and hence to where the knowledge is created. Companies, on the other hand, can file patents on behalf of subsidiaries in other countries, and there are many fiscal incentives for companies to have other structures file and manage their patents. Depending on the level of detail supplied in the address, attributing a region is more or less easy.

In REGPAT this work has been done for us: each patent document has a NUTS-3 code attached to it. Note: the REGPAT database only contains patents that went through the EPO, WIPO, or JPO. A patent filed by a Dutch company at the DPO, or even the GPO/FPO, will not be present in this database. Other than the restrictions per office, there are no limits on the geographical origin of the inventors or the applicants.
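As a minimal sketch of how the NUTS-3 code can be used once it is attached, the snippet below keeps the Dutch records and derives a coarser regional code by truncation. It relies on NUTS codes being hierarchical: the first two characters identify the country and the first four the NUTS-2 region, which in the Netherlands corresponds to the province. The toy data frame and its values are made up for illustration, and the lowercase column names are an assumption about how the fields are stored.

```r
# Toy data: one row per patent document (values are illustrative only)
patents <- data.frame(
  appln_id = c(1, 2, 3, 4),
  reg_code = c("NL326", "NL414", "DE212", "NL329"),
  stringsAsFactors = FALSE
)

# Keep records whose NUTS-3 code starts with the Dutch country prefix
nl_patents <- patents[startsWith(patents$reg_code, "NL"), ]

# NUTS codes are hierarchical: the first four characters give the NUTS-2 region
nl_patents$nuts2 <- substr(nl_patents$reg_code, 1, 4)

# Number of patent documents per NUTS-2 region
docs_per_nuts2 <- table(nl_patents$nuts2)
print(docs_per_nuts2)
```

Truncating the code avoids a separate lookup table, but it only works as long as all records use the same NUTS version.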

Caution: A patent can be filed in different offices at the same time, so there is a difference between a patent family and a patent document. One cannot simply count the number of patent documents in a region: if we do not regroup them at the family level, we overestimate the number of patents in the region.
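The difference between counting documents and counting families can be illustrated with a small sketch. The data below are invented, and `family_id` is a hypothetical column name for the patent family identifier; the actual identifier depends on the database used.

```r
# Toy data: the same invention (family F1) filed at several offices.
# `family_id` is a hypothetical column name; values are illustrative only.
docs <- data.frame(
  appln_id  = c(101, 102, 103, 104, 105),
  family_id = c("F1", "F1", "F1", "F2", "F3"),
  reg_code  = c("NL326", "NL326", "NL326", "NL326", "NL414"),
  stringsAsFactors = FALSE
)

# Naive count: every document counts, so family F1 is counted three times
n_documents <- table(docs$reg_code)

# Family-level count: keep one row per (family, region) pair before counting
families <- unique(docs[, c("family_id", "reg_code")])
n_families <- table(families$reg_code)

print(n_documents)  # document counts per region
print(n_families)   # family counts per region
```

In this toy example the first region has four documents but only two families, which is the overestimation described above.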

We have the following data distribution of the patents:

library(dplyr)
library(ggplot2)

# First and last year in which each harmonised applicant name appears in REGPAT
companies_new_per_year = regpat %>% group_by(app_name_harmonised) %>% summarise(first = min(year), last = max(year))
# Number of applicants whose first patent falls in each year
companies_new_per_year = companies_new_per_year %>% group_by(first) %>% summarise(freq = n())
couleur = "#4CAF50"
p = ggplot(companies_new_per_year, aes(x = first, y = freq)) + geom_col(fill = couleur) +
  xlab("") + ylab("") + theme(
    text = element_text(size=10),
    plot.title = element_text(hjust = 0.5),
    panel.background = element_rect(fill = "transparent"), # bg of the panel
    plot.background = element_rect(fill = "transparent", color = NA), # bg of the plot
    legend.background = element_rect(fill = "transparent"), # get rid of legend bg
    legend.box.background = element_rect(fill = "transparent")) +
  scale_y_continuous(breaks = seq(0, max(companies_new_per_year$freq), by = 100)) +
  # year is numeric, so the yearly breaks need a continuous x scale
  scale_x_continuous(breaks = seq(2010, max(companies_new_per_year$first), by = 1))

New companies with patents per year:

Geographical distribution of patents in the Netherlands

Patents per 100 companies per province

plot(p)


Players per province


Patents per province per year

Missing value analysis

# missing values with the naniar package
library(naniar)
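The analysis code itself is not shown on this page. As a rough sketch of what a per-field missing-value summary looks like, the base R snippet below computes the count and percentage of missing values per column on an invented data frame. The naniar package offers comparable summaries (for example `miss_var_summary()`) and plots (for example `gg_miss_upset()`); whether those are the calls used here is an assumption.

```r
# Toy data with missing values (illustrative only, not the real REGPAT extract)
patents <- data.frame(
  appln_id    = c(1, 2, 3, 4),
  city        = c("Utrecht", NA, "Delft", NA),
  postal_code = c(NA, NA, "2628", NA),
  reg_code    = c("NL310", "NL326", NA, "NL414"),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
n_miss   <- colSums(is.na(patents))
pct_miss <- 100 * n_miss / nrow(patents)

miss_summary <- data.frame(
  variable = names(n_miss),
  n_miss   = as.integer(n_miss),
  pct_miss = as.numeric(pct_miss),
  stringsAsFactors = FALSE
)

# Most incomplete fields first
miss_summary <- miss_summary[order(-miss_summary$n_miss), ]
print(miss_summary)
```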


Information in the patent database

Patents contain a variety of variables that we can use for the analysis of innovation and entrepreneurship. As noted above, they allow us to measure and understand who developed (inventors/assignees) which technologies (classifications, text), where (addresses of inventors and assignees), and based on what knowledge (citations).

All this data sounds great, but patents are a complex data source and not all information is perfectly available. In the following table we summarise the different fields of interest in patent data.

Category	Field	Description
Dates	Priority
	Application
	Publication
Applicants	App_name
Assignees
Owners
Inventors
Classifications	IPC	International Patent Classification
	CPC	Cooperative Patent Classification
	USC
Identifiers	App_nbr	Application number
	Appln_id	Internal patent number
	Person_id	Internal applicant number
	Pub_nbr	Publication number
	Pct_nbr	WO application number (?)
	Internat_appln_nr
Localisation	Address	Full address as written by the assignee (or the inventor)
	City	City of the assignee (or inventor)
	Postal_code	Postcode of the assignee (or the inventor)
	Reg_code	NUTS-3
	Ctry_code	ISO 2-letter country code
	Reg_share	Share of the patent attributed to each region. When the assignees on a patent come from different regions, each region is assigned only a fraction of the patent: 2 regions = 0.5 each, 3 regions = 0.333 each, etc.
	App_share	Share of the patent attributed to each assignee. When a patent has multiple assignees, each assignee is counted for a fraction of the patent: with two assignees, each counts as 0.5.
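The share fields in the table lend themselves to fractional counting. The sketch below sums the regional share per region instead of counting rows, so that a patent spread over several regions still contributes exactly 1 in total. The data are invented, and the lowercase column names are an assumption about how the fields are stored.

```r
# Toy data: one row per patent and region (values are illustrative only)
docs <- data.frame(
  appln_id  = c(1, 1, 2, 3, 3, 3),
  reg_code  = c("NL326", "NL414", "NL326", "NL326", "NL414", "NL333"),
  reg_share = c(0.5, 0.5, 1, 1/3, 1/3, 1/3),
  stringsAsFactors = FALSE
)

# Fractional count: sum the regional shares instead of counting rows
frac_count <- aggregate(reg_share ~ reg_code, data = docs, FUN = sum)

# The shares of each patent sum to 1, so the total equals the number of patents
total <- sum(frac_count$reg_share)
print(frac_count)
print(total)
```

The same pattern applies to the assignee share: replace the regional share and region code with the assignee share and an assignee identifier.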