Data import and preparation

How to import different data formats in Python and R.

Importing in R

TXT, CSV, TSV

To import data from txt, csv or tsv formats we will use the read_delim() function from the readr package. The main difference between these data formats resides in the separator of the data. For txt data, this is often undefined and needs to be specified by the user. The main arguments of read_delim() are the file and the delimiter. The delimiter is the character string that defines where a line should be cut. For example, if our raw data looks like this:

Year;Class;Value
2002;B60C;42
2003;B29K;21
2009;C08K;12

Then we see that each column is separated by a “;”. By splitting the data whenever there is a “;” we create the following dataframe:

Year Class Value
2002 B60C 42
2003 B29K 21
2009 C08K 12

Separators that we commonly find in data are the semicolon ";", the comma ",", the vertical bar (or pipe) "|", the space " ", and the tab, which is written "\t".

Pay Attention

Even though csv stands for Comma Separated Values, this does not mean that the separator is always a comma, often enough it is in fact a semicolon. Always check your data to make sure you have the right one.
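
If you are unsure which separator a file uses, a quick check is to print the first few raw lines of the file. A minimal sketch using base R (the file name is hypothetical):

# inspect the first three raw lines to identify the separator
readLines("sample_data.csv", n = 3)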

library(readr)
# csv and various other separators
data <- read_delim(file, delim = ",")

data <- read_delim(file, delim = "|")

data <- read_delim(file, delim = ";")

# txt (space separated values)
data <- read_delim(file, delim = " ")

# tsv (tab separated values)
data <- read_delim(file, delim = "\t")

XLS, XLSX

Excel files can also easily be imported into R with the readxl package.

library(readxl)
read_excel("sample_data.xlsx")
# A tibble: 5 × 3
  Pat_num       Year Domain
  <chr>        <dbl> <chr> 
1 WO200214562   2002 B60C  
2 WO2023738962  2023 B60C  
3 EP2023778962  2023 B29D  
4 FR2019272698  2019 C08K  
5 FR201922671   2019 C08K  

RData

# use the Import Dataset dialog in RStudio, or double-click the file
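
Alternatively, .RData files can be loaded programmatically with load(), and single-object .rds files with readRDS(). A minimal sketch (the file names are hypothetical):

# restores all saved objects into the current workspace
load("my_objects.RData")

# reads a single saved object and assigns it to a name
data <- readRDS("my_data.rds")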

Bib

.bib files are a popular bibliographic data format. This data can be imported and transformed into a dataframe using the bibliometrix package.

library(bibliometrix)
convert2df(file, dbsource = "scopus", format = "bibtex")
Pay Attention

The bibliometrix package is designed for bibliometric analysis; it may change the names of the columns and the format of some of the data to fit what it is supposed to do.

JSON

library(jsonlite)

# Specify the path to the JSON file
json_file <- "example_1.json"

# Import the JSON file
data <- fromJSON(json_file)

# Display the imported JSON data
print(data)
$fruit
[1] "Apple"

$size
[1] "Large"

$color
[1] "Red"

JSONL

JSONL (JSON Lines) files contain one JSON object per line. We find this format, for example, in the lens.org database. Since each line is a JSON object, we first read the file as a text file and then apply the fromJSON() function to each line to extract the information. The result is a list of data objects.

library(jsonlite)
tmp<-readLines("data_test.jsonl")
Warning in readLines("data_test.jsonl"): incomplete final line found on
'data_test.jsonl'
tmp <- lapply(tmp, jsonlite::fromJSON)
print(tmp)
[[1]]
[[1]]$lens_id
[1] "000-136-263-451-165"

[[1]]$title
[1] "The influence of displaced femoral neck fractures on the surgical incision for hemiarthroplasty using the Hardinge approach"

[[1]]$publication_type
[1] "journal article"

[[1]]$year_published
[1] 2002

[[1]]$date_published_parts
[1] 2002    1

[[1]]$created
[1] "2018-05-12T00:07:16.368000+00:00"

[[1]]$external_ids
   type              value
1 magid         2083116342
2   doi 10.1007/bf03170381

[[1]]$authors
  first_name last_name initials               ids
1         M.    Yousef        M magid, 2518579655
2         E. Masterson        E magid, 2607514729
                                                                                                                                                                                    affiliations
1 Mid-Western Regional Hospital, Department of Orthopaedic Surgery, Mid-Western Regional Hospital, Limerick, Ireland., magid, grid, ror, 2799956115, grid.415964.b, 00r5kgd36, grid.415964.b, IE
2 Mid-Western Regional Hospital, Department of Orthopaedic Surgery, Mid-Western Regional Hospital, Limerick, Ireland., magid, grid, ror, 2799956115, grid.415964.b, 00r5kgd36, grid.415964.b, IE

[[1]]$source
[[1]]$source$title
[1] "Irish Journal of Medical Science"

[[1]]$source$type
[1] "Journal"

[[1]]$source$publisher
[1] "Springer Science and Business Media LLC"

[[1]]$source$issn
        type    value
1      print 00211265
2 electronic 18634362
3      print 03321029
4      print 0790231x
5      print 07902328

[[1]]$source$country
[1] "Ireland"

[[1]]$source$asjc_codes
[1] "2700"

[[1]]$source$asjc_subjects
[1] "General Medicine"


[[1]]$fields_of_study
[1] "Surgery"                "Surgical incision"      "Hardinge approach"     
[4] "Femoral Neck Fractures" "Medicine"              

[[1]]$volume
[1] "171"

[[1]]$issue
[1] "1"

[[1]]$languages
[1] "en"

[[1]]$source_urls
  type                                                          url
1 html         https://link.springer.com/article/10.1007/BF03170381
2 <NA> https://link.springer.com/content/pdf/10.1007/BF03170381.pdf

[[1]]$start_page
[1] "31"

[[1]]$end_page
[1] "31"

[[1]]$author_count
[1] 2


[[2]]
[[2]]$lens_id
[1] "000-150-493-965-760"

[[2]]$title
[1] "Peroxidase activity in soybeans following inoculation with Phytophthora sojae."

[[2]]$publication_type
[1] "journal article"

[[2]]$year_published
[1] 2006

[[2]]$date_published_parts
[1] 2006    1

[[2]]$created
[1] "2018-05-12T03:57:41.820000+00:00"

[[2]]$external_ids
   type                     value
1   doi 10.1007/s11046-005-0721-y
2  pmid                  16389483
3 magid                2006874664

[[2]]$authors
  first_name last_name initials               ids
1    Jose C.    Melgar       JC magid, 2634470543
2  Thomas S.     Abney       TS magid, 2677788953
3 Richard A.  Vierling       RA magid, 1981461697
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       affiliations
1                                                                                                                                                                                                                                                                      Purdue University, Botany and Plant Pathology Department, Purdue University, West Lafayette, IN 47907, USA., magid, grid, fundref, wikidata, ror, 219193219, grid.169077.e, 100006377, q217741, 02dqehb95, grid.169077.e, US
2 United States Department of Agriculture, Agricultural Research Service, Crop Production and Pathology Unit, Agricultural Research Service, U.S. Department of Agriculture, West Lafayette, USA, Crop Production and Pathology Unit, Agricultural Research Service, U.S. Department of Agriculture, West Lafayette, USA, magid, grid, fundref, ror, 1336096307, grid.417548.b, 100000199, 01na82s61, grid, fundref, ror, grid.463419.d, 100007917, 02d2m2044, grid.417548.b, grid.463419.d, US, US
3                                                                                                                                                                                                                                                           Purdue University, Botany & Plant Pathology Department and Agronomy Department, Purdue University, West Lafayette, USA, magid, grid, fundref, wikidata, ror, 219193219, grid.169077.e, 100006377, q217741, 02dqehb95, grid.169077.e, US

[[2]]$source
[[2]]$source$title
[1] "Mycopathologia"

[[2]]$source$type
[1] "Journal"

[[2]]$source$publisher
[1] "Springer Science and Business Media LLC"

[[2]]$source$issn
        type    value
1      print 0301486x
2 electronic 15730832

[[2]]$source$country
[1] "Netherlands"

[[2]]$source$asjc_codes
[1] "2404" "2402" "1110" "1102" "3401"

[[2]]$source$asjc_subjects
[1] "Agronomy and Crop Science"             
[2] "Applied Microbiology and Biotechnology"
[3] "Microbiology"                          
[4] "Plant Science"                         
[5] "Veterinary (miscalleneous)"            
[6] "Veterinary (miscellaneous)"            


[[2]]$fields_of_study
[1] "Gene"               "Horticulture"       "Peroxidase"        
[4] "Inoculation"        "Phytophthora sojae" "Phytophthora"      
[7] "Botany"             "Biology"           

[[2]]$volume
[1] "161"

[[2]]$issue
[1] "1"

[[2]]$languages
[1] "en"

[[2]]$references
               lens_id
1  004-042-571-377-321
2  005-317-066-947-203
3  020-042-216-048-371
4  032-446-103-377-888
5  034-601-439-886-731
6  062-700-560-648-136
7  066-443-980-641-011
8  088-133-232-334-221
9  106-456-828-269-00X
10 168-522-110-138-272
11 192-341-655-765-695

[[2]]$mesh_terms
    mesh_heading       qualifier_name mesh_id qualifier_id
1    Peroxidases           metabolism D010544      Q000378
2   Phytophthora growth & development D010838      Q000254
3 Plant Diseases         microbiology D010935      Q000382
4 Plant Proteins           metabolism D010940      Q000378
5      Seedlings           enzymology D036226      Q000201
6       Soybeans           enzymology D013025      Q000201

[[2]]$chemicals
  substance_name registry_number mesh_id
1 Plant Proteins               0 D010940
2    Peroxidases     EC 1.11.1.- D010544

[[2]]$source_urls
  type                                                         url
1 html https://link.springer.com/article/10.1007/s11046-005-0721-y
2 html                https://www.ncbi.nlm.nih.gov/pubmed/16389483
3 html                  https://europepmc.org/article/MED/16389483

[[2]]$references_count
[1] 11

[[2]]$scholarly_citations_count
[1] 12

[[2]]$start_page
[1] "37"

[[2]]$end_page
[1] "42"

[[2]]$scholarly_citations
 [1] "027-668-154-909-322" "031-369-742-281-943" "034-566-544-087-375"
 [4] "041-477-983-598-256" "041-552-411-498-281" "042-968-597-273-042"
 [7] "045-790-949-621-105" "062-868-088-589-29X" "107-788-244-832-10X"
[10] "132-744-976-798-754" "153-187-658-237-284" "160-817-571-772-644"

[[2]]$author_count
[1] 3


[[3]]
[[3]]$lens_id
[1] "000-161-224-328-417"

[[3]]$title
[1] "Astronomy: Planets in chaos"

[[3]]$publication_type
[1] "journal article"

[[3]]$year_published
[1] 2014

[[3]]$date_published
[1] "2014-07-02T00:00:00.000000+00:00"

[[3]]$date_published_parts
[1] 2014    7    2

[[3]]$created
[1] "2018-05-12T07:57:28.771000+00:00"

[[3]]$external_ids
   type           value
1 magid      2014980602
2  pmid        24990727
3   doi 10.1038/511022a

[[3]]$open_access
[[3]]$open_access$colour
[1] "bronze"


[[3]]$authors
  first_name  last_name initials               ids affiliations
1        Ann Finkbeiner        A magid, 2633978770         NULL

[[3]]$source
[[3]]$source$title
[1] "Nature"

[[3]]$source$type
[1] "Journal"

[[3]]$source$publisher
[1] "Springer Science and Business Media LLC"

[[3]]$source$issn
        type    value
1 electronic 14764687
2      print 00280836

[[3]]$source$country
[1] "United Kingdom"

[[3]]$source$asjc_codes
[1] "1000"

[[3]]$source$asjc_subjects
[1] "Multidisciplinary"


[[3]]$fields_of_study
[1] "Exoplanetology"           "Physics"                 
[3] "Astronomy"                "Star (graph theory)"     
[5] "Planetary science"        "CHAOS (operating system)"
[7] "Planet"                   "Astrobiology"            

[[3]]$volume
[1] "511"

[[3]]$issue
[1] "7507"

[[3]]$languages
[1] "en"

[[3]]$references
              lens_id
1 002-401-641-595-121
2 002-872-990-703-478
3 031-880-830-498-497
4 082-960-570-095-64X
5 149-177-951-839-803

[[3]]$source_urls
     type                                                           url
1 unknown                   https://www.nature.com/articles/511022a.pdf
2    html                       https://www.nature.com/articles/511022a
3    html http://ui.adsabs.harvard.edu/abs/2014Natur.511...22F/abstract
4    html                    https://europepmc.org/article/MED/24990727

[[3]]$abstract
[1] "The discovery of thousands of star systems wildly different from our own has demolished ideas about how planets form. Astronomers are searching for a whole new theory."

[[3]]$references_count
[1] 5

[[3]]$scholarly_citations_count
[1] 3

[[3]]$start_page
[1] "22"

[[3]]$end_page
[1] "24"

[[3]]$scholarly_citations
[1] "020-515-110-880-837" "084-952-180-195-058" "086-879-116-036-663"

[[3]]$author_count
[1] 1

[[3]]$is_open_access
[1] TRUE
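
Since the result is a nested list rather than a dataframe, individual fields can be pulled out of each record with sapply(). A minimal sketch, based on the fields shown in the output above:

# extract one field from every record in the list
titles <- sapply(tmp, function(x) x$title)
years  <- sapply(tmp, function(x) x$year_published)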

Importing in Python

TXT, CSV, TSV

Required packages: pandas (and openpyxl, which pandas uses to read Excel files).

import pandas as pd

# Import CSV File
csv_file = "sample_data.csv"
df_csv = pd.read_csv(csv_file, sep = ";")


# Import TXT File (Space-Separated)
txt_file = "sample_data.txt"
df_txt = pd.read_csv(txt_file, sep=" ")

# Import TSV File (Tab-Separated)
tsv_file = "sample_data.tsv"
df_tsv = pd.read_csv(tsv_file, sep="\t")



XLS/XLSX (Excel data format)

Excel files can be imported with help from the pandas package. If you do not specify a sheet to import, the first sheet will be taken. You can specify the sheet with the argument sheet_name='Sheet1'.

import pandas as pd
# Import XLS (Excel) File
df_xls = pd.read_excel("sample_data.xlsx")
print(df_xls)
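
For example, to read a specific sheet instead of the first one (the sheet name "Sheet2" is hypothetical):

import pandas as pd

# read the sheet named "Sheet2" from the workbook
df_sheet = pd.read_excel("sample_data.xlsx", sheet_name="Sheet2")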

JSON and JSONL

import pandas as pd
import jsonlines

# Import JSON File
json_file = "sample_data.json"
df_json = pd.read_json(json_file)

# Import JSONL File
# We open the file; the with-statement closes it again automatically
with jsonlines.open('data_test.jsonl', 'r') as data_from_jsonl:
    # extract the information from each row and put it into a list
    results = [row for row in data_from_jsonl]
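
If the rows share a reasonably flat structure, the resulting list can then be turned into a dataframe; pd.json_normalize() also flattens nested fields into dotted column names. A minimal sketch:

import pandas as pd

# one dataframe row per parsed JSON object
df = pd.json_normalize(results)
print(df.head())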

Data conformity and preparation

Data conformity in R

Before you start working on your analysis, you want to make sure the data conforms to your expectations. This means:

  • The number of observations is correct, and the variables you need are included.

  • Making sure each column is of the right type (years are dates, text are factors, numbers are integers or reals, …)

  • There are no trailing spaces

  • “.” and “,” aren’t mixed up

  • The units of measurement are correct (km vs. miles, K€ vs 1.000, celsius vs fahrenheit etc.)

  • Any missing values have been dealt with, or you have a valid reason to ignore them

  • Headers are easy to use (no spaces or other characters that will make your life difficult).

In this section we will look into checking your data and adjusting it in preparation for your analysis.

Checking data types

There are many different data formats. Most frequently we encounter numbers, strings, factors, dates and booleans. Consider the following dataset:

ID  Year Cool Domain Frequency Alive
ID1 2002 B60C        42        TRUE
ID2 2003 B29K        21        FALSE
ID3 2009 C08K        12        TRUE

The first column has textual data that is used as a key for each row. The Year should be considered either as an integer or a date, depending on the type of analysis you want to perform. Time series analysis will require it to be a date, while for other analyses it is enough to ensure that it is an integer. The most common issue with this type of observation is that the year is considered as text instead of a number. If you see quotation marks ("") next to your numbers, this is a bad sign indicating that they are, in fact, text.

The “Cool Domain” column contains classifications which we want to be textual values. “Frequency” will need to be an integer, while “Alive” should be a boolean. For each of these formats there is a function to test whether or not the format is the one we expect. These functions usually take the form is.numeric(), is.logical(), is.character(). The argument is then simply the column of the dataframe or an observation. Suppose we create the previous dataframe in R:

data <- data.frame(
  ID = c("ID1", "ID2", "ID3"),
  Year = c(2002, 2003, 2009),
  `Cool Domain` = c("B60C", "B29K", "B29K"),
  Frequency = c(42, 21, 12),
  Alive = c(TRUE, FALSE, TRUE)
)


print(data)
   ID Year Cool.Domain Frequency Alive
1 ID1 2002        B60C        42  TRUE
2 ID2 2003        B29K        21 FALSE
3 ID3 2009        B29K        12  TRUE

We can then check the format of each variable:

is.numeric(data$Frequency)
[1] TRUE
is.character(data$ID)
[1] TRUE
is.logical(data$Alive)
[1] TRUE

This can also be done in a more concise way with the summary() function. This computes, for each of the columns, some statistics and provides the class of the variables. It comes in handy as it allows us to verify not only the class of the variables but also their distribution:

summary(data)
      ID                 Year      Cool.Domain          Frequency   
 Length:3           Min.   :2002   Length:3           Min.   :12.0  
 Class :character   1st Qu.:2002   Class :character   1st Qu.:16.5  
 Mode  :character   Median :2003   Mode  :character   Median :21.0  
                    Mean   :2005                      Mean   :25.0  
                    3rd Qu.:2006                      3rd Qu.:31.5  
                    Max.   :2009                      Max.   :42.0  
   Alive        
 Mode :logical  
 FALSE:1        
 TRUE :2        
                
                
                

When the variable is a character, the output shows the length (number of observations) and the class. For numeric variables the output shows the distribution of the data (min, max, etc.). Not all text variables are created equal. It is possible to ensure that text is considered as a factor. A factor is basically a category; dummy variables in regressions are usually programmed as factors.

data2 <- data.frame(
  ID = c("ID1", "ID2", "ID3", "ID4"),
  Year = c(2002, 2003, 2009, 2010),
  `Cool Domain` = c("B60C", "B29K", "B29K", "B60C"),
  Frequency = c(42, 21, 12, NA),
  Alive = c(TRUE, FALSE, FALSE, FALSE),
  Country = as.factor(c("France", "France", "Germany", "Netherlands"))
)

summary(data2)
      ID                 Year      Cool.Domain          Frequency   
 Length:4           Min.   :2002   Length:4           Min.   :12.0  
 Class :character   1st Qu.:2003   Class :character   1st Qu.:16.5  
 Mode  :character   Median :2006   Mode  :character   Median :21.0  
                    Mean   :2006                      Mean   :25.0  
                    3rd Qu.:2009                      3rd Qu.:31.5  
                    Max.   :2010                      Max.   :42.0  
                                                      NA's   :1     
   Alive                Country 
 Mode :logical   France     :2  
 FALSE:3         Germany    :1  
 TRUE :1         Netherlands:1  
                                
                                
                                
                                

In this output we see that the NA is taken into account and counted separately in the data. It also shows that “Country” is considered as a factor. The summary then shows how many observations we have for each level of the factor.

Given that we created the dataframe ourselves, we can be relatively sure that the data is in the right format. However, often enough when we download data from online sources, we run into issues. If the format is not as expected, there are different functions that allow us to transform the format of the data. Mainly we use as.numeric(), as.character() and as.logical(). These functions require only one argument, which is the column we are trying to convert:

# transform into a numeric value
data$Frequency <- as.numeric(data$Frequency)

# transform into a character string
data$ID <- as.character(data$ID)

# transform into a boolean
data$Alive <- as.logical(data$Alive)

As an example, we will create a dataframe that mimics data that was not imported correctly:

Messy_data <- data.frame(
  ID = c("ID1", "ID2", "ID3", "ID4"),
  Year = c("2002", "2003", "2009", "2010"),
  `Cool Domain` = c("B60C", "B29K", "B29K", "B60C"),
  Frequency = c("42", "21", "12", "NA"),
  Alive = c("TRUE", "FALSE", "FALSE", "FALSE"),
  Country = c("France", "France", "Germany", "Netherlands")
)

summary(Messy_data)
      ID                Year           Cool.Domain         Frequency        
 Length:4           Length:4           Length:4           Length:4          
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
    Alive             Country         
 Length:4           Length:4          
 Class :character   Class :character  
 Mode  :character   Mode  :character  

The output clearly shows that all variables are considered as characters, which we do not want. We cannot compute the average of Frequency while R thinks that it is text. This also shows that there is a difference between the string "NA" and the missing-value constant NA.

# transform into a numerical value
Messy_data$Year <- as.numeric(Messy_data$Year)
Messy_data$Frequency <- as.numeric(Messy_data$Frequency)
Warning: NAs introduced by coercion
# transform into a boolean
Messy_data$Alive <- as.logical(Messy_data$Alive)
# transform into a factor (category)
Messy_data$Country <- as.factor(Messy_data$Country)

summary(Messy_data)
      ID                 Year      Cool.Domain          Frequency   
 Length:4           Min.   :2002   Length:4           Min.   :12.0  
 Class :character   1st Qu.:2003   Class :character   1st Qu.:16.5  
 Mode  :character   Median :2006   Mode  :character   Median :21.0  
                    Mean   :2006                      Mean   :25.0  
                    3rd Qu.:2009                      3rd Qu.:31.5  
                    Max.   :2010                      Max.   :42.0  
                                                      NA's   :1     
   Alive                Country 
 Mode :logical   France     :2  
 FALSE:3         Germany    :1  
 TRUE :1         Netherlands:1  
                                
                                
                                
                                

Now that our data has the right format and we are sure we have all the observations, we might want to do some work to normalise the column names. This is done to make programming and referencing the columns easier. Whenever there is a space in a column name, we need to refer to the column using `Column name with spaces`, which is not ideal. It's generally good practice to not have any spaces. We can use the colnames() function to change some (or all) of the column names.

# change one column name:
colnames(Messy_data)[3] <- "Domain"

# change multiple names:
colnames(Messy_data)[c(1,2,5)] <- c("Identifier", "Date", "Active")

# check
summary(Messy_data)
  Identifier             Date         Domain            Frequency   
 Length:4           Min.   :2002   Length:4           Min.   :12.0  
 Class :character   1st Qu.:2003   Class :character   1st Qu.:16.5  
 Mode  :character   Median :2006   Mode  :character   Median :21.0  
                    Mean   :2006                      Mean   :25.0  
                    3rd Qu.:2009                      3rd Qu.:31.5  
                    Max.   :2010                      Max.   :42.0  
                                                      NA's   :1     
   Active               Country 
 Mode :logical   France     :2  
 FALSE:3         Germany    :1  
 TRUE :1         Netherlands:1  
                                
                                
                                
                                

The janitor package offers an automated solution for many header issues.

library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Messy_data <- data.frame(
  ID = c("ID1", "ID2", "ID3", "ID4"),
  Year = c("2002", "2003", "2009", "2010"),
  `Cool Domain` = c("B60C", "B29K", "B29K", "B60C"),
  Frequency = c("42", "21", "12", "NA"),
  Alive = c("TRUE", "FALSE", "FALSE", "FALSE"),
  Country = c("France", "France", "Germany", "Netherlands")
)
# the column names from the initial dataframe
print(colnames(Messy_data))
[1] "ID"          "Year"        "Cool.Domain" "Frequency"   "Alive"      
[6] "Country"    
# Cleaning the column names
Messy_data <- clean_names(Messy_data)
print(colnames(Messy_data))
[1] "id"          "year"        "cool_domain" "frequency"   "alive"      
[6] "country"    

You can see that capital letters have been adjusted, some characters (such as "µ" or ")") have been replaced, and spaces have been removed and replaced with "_". This makes it easier and faster to refer to the columns and the dataframes when manipulating the data on a day-to-day basis.

Once we are happy with the data formats, we can move on to take care of other issues.

Data transformation

It often happens that we are not fully happy with the data that we receive, or simply that we need to apply some modifications to suit our needs. The most frequent of these operations are:

  • Mathematical transformation of numerical data (multiplication, rounding)
  • Extracting years, days or months from dates
  • Trimming whitespace from the data
  • Renaming categories within the data

The main difficulty with these actions is to identify how to apply the functions. Some functions can be applied directly to a whole column, while others need to be applied using mapping functions. Let's start with mathematical transformations:

# we start by creating a dataset to work with
data <- data.frame(
  Identifier = c("ID1", "ID2", "ID3", "ID4 "),
  Date = c(2002, 2003, 2009, 2010),
  Domain = c("B60C", "B29K", "B29K ", "B60C"),
  Distance = c(62.1371, 124.274, 93.2057, 186.411),
  Active = c(TRUE, FALSE, FALSE, FALSE),
  Country = as.factor(c("France", "France", "Germany", "Netherlands"))
)

# The Distance is given in miles; we would like to have kilometers.
# To convert miles to kilometers, multiply by 1.609344.
data$KM <- data$Distance * 1.609344

print(data)
  Identifier Date Domain Distance Active     Country        KM
1        ID1 2002   B60C  62.1371   TRUE      France  99.99997
2        ID2 2003   B29K 124.2740  FALSE      France 199.99962
3        ID3 2009  B29K   93.2057  FALSE     Germany 150.00003
4       ID4  2010   B60C 186.4110  FALSE Netherlands 299.99942

We can also round this. The round() function takes two arguments: the first is the data, the second the number of decimals:

data$KM <- round(data$KM, 2)
print(data)
  Identifier Date Domain Distance Active     Country  KM
1        ID1 2002   B60C  62.1371   TRUE      France 100
2        ID2 2003   B29K 124.2740  FALSE      France 200
3        ID3 2009  B29K   93.2057  FALSE     Germany 150
4       ID4  2010   B60C 186.4110  FALSE Netherlands 300

We notice that for some of the domains there is a trailing whitespace; for example we have "B29K " in the data. These can be removed with the trimws() function. Such whitespace can also be created when we split data. It's good practice to always check this and remove any spaces to avoid unnecessary mistakes.

data$Domain <- trimws(data$Domain)
print(data)
  Identifier Date Domain Distance Active     Country  KM
1        ID1 2002   B60C  62.1371   TRUE      France 100
2        ID2 2003   B29K 124.2740  FALSE      France 200
3        ID3 2009   B29K  93.2057  FALSE     Germany 150
4       ID4  2010   B60C 186.4110  FALSE Netherlands 300

Suppose that we want to change the labels for the countries. It's much more common to use ISO-2 codes for countries. The gsub() function can be used to replace text. It requires three arguments: the pattern to search for, the pattern to replace it with, and the data. Be cautious while using this function, since it searches for the pattern even within words. If one searches for the string "land" to replace it with "country", then the string "netherlands" will become "nethercountrys", as the sketch below shows.
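
A minimal sketch of this pitfall, together with one way to avoid it using the regular-expression word boundary \b:

x <- c("lapland", "netherlands", "land")
gsub("land", "country", x)        # "lapcountry" "nethercountrys" "country"
gsub("\\bland\\b", "country", x)  # "lapland" "netherlands" "country"

Note that gsub() interprets its first argument as a regular expression by default; pass fixed = TRUE to search for a literal string instead.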

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  data$Country <- gsub("France", "FR", data$Country)
  data$Country <- gsub("Netherlands" , "NL",data$Country)
  data$Country <- gsub("Germany" ,"DE",data$Country)

print(data)
  Identifier Date Domain Distance Active Country  KM
1        ID1 2002   B60C  62.1371   TRUE      FR 100
2        ID2 2003   B29K 124.2740  FALSE      FR 200
3        ID3 2009   B29K  93.2057  FALSE      DE 150
4       ID4  2010   B60C 186.4110  FALSE      NL 300

In the special case of categorical variables, we can also use the dplyr package to replace strings using the case_when() function. It is a flexible and powerful function that allows you to perform conditional operations on a vector or data frame, similar to a series of nested if-else statements. It's especially useful when you need to create or modify variables in your data based on specific conditions. The syntax is as follows:

case_when(
  condition1 ~ result1,
  condition2 ~ result2,
  condition3 ~ result3,
  ...
)

In the case of changing categories, if we want to switch the name of the country for the ISO identifiers of the countries:

library(tidyverse)
data <- data.frame(
  Identifier = c("ID1", "ID2", "ID3", "ID4 "),
  Date = c(2002, 2003, 2009, 2010),
  Domain = c("B60C", "B29K", "B29K ", "B60C"),
  Distance = c(62.1371, 124.274, 93.2057, 186.411),
  Active = c(TRUE, FALSE, FALSE, FALSE),
  Country = as.factor(c("France", "France", "Germany", "Netherlands"))
)

data$Country <- case_when(
  data$Country == "France" ~ "FR",
  data$Country == "Netherlands" ~ "NL",
  data$Country == "Germany" ~ "DE"
)

print(data)
  Identifier Date Domain Distance Active Country
1        ID1 2002   B60C  62.1371   TRUE      FR
2        ID2 2003   B29K 124.2740  FALSE      FR
3        ID3 2009  B29K   93.2057  FALSE      DE
4       ID4  2010   B60C 186.4110  FALSE      NL

Extract substrings
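
Substrings can be extracted with base R's substr(), which takes the string, a start position and a stop position. A minimal sketch, pulling the one-letter section out of the Domain codes from the previous example:

# extract the first character (the section letter) of each domain code
data$Section <- substr(data$Domain, 1, 1)
print(data$Section)
# [1] "B" "B" "B" "B"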

Missing values

It is common to encounter missing values in your data. There are multiple methods to deal with missing data, ranging from simply removing the observations to using statistical methods to replace the missing values with estimates. Whatever you decide to do with your missing data, think thoroughly about the consequences. Always consider why the data is missing: are the missing observations random, or is there a pattern to them? For example, suppose we are working on a dataset with pollution measurements around the world. It's possible that missing values come from countries in a specific region of the world. Removing these observations will result in excluding a whole region from the analysis. If the missing observations are more randomly dispersed over different regions of the world, this might be less of an issue. The first step to take when dealing with missing values is to check for patterns in the data. Based on the type of missing values you are dealing with, you can decide on the most efficient method to treat them (a short sketch of the first two options follows the reading suggestions below):

  • Remove the missing values
  • Linear interpolation
  • K-means

See this book on missing values, and this book with a chapter on how to handle this kind of data.
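
As a minimal sketch of the first two options (using the zoo package for interpolation, which is an addition not used elsewhere in this chapter):

# remove all rows that contain at least one missing value
complete_data <- na.omit(data)

# linear interpolation: fill an NA with the value implied by its neighbours
library(zoo)
x <- c(1, 2, NA, 4)
na.approx(x)
# [1] 1 2 3 4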

Data conformity in Python

Before you start working on your analysis, you want to make sure the data conforms to your expectations. This means:

  • The number of observations is correct, and the variables you need are included.

  • Making sure each column is of the right type (years are dates, text are factors, numbers are integers or reals, …)

  • There are no trailing spaces

  • “.” and “,” aren’t mixed up

  • The units of measurement are correct (km vs. miles, K€ vs 1.000, celsius vs fahrenheit etc.)

  • Any missing values have been dealt with, or you have a valid reason to ignore them

  • Headers are easy to use (no spaces or other characters that will make your life difficult).

In this section we will look into checking your data and adjusting it in preparation for your analysis.

Checking data types

There are many different data formats. Most frequently we encounter numbers, strings, factors, dates and booleans. Consider the following dataset:

ID  Year Cool Domain Frequency Alive
ID1 2002 B60C        42        TRUE
ID2 2003 B29K        21        FALSE
ID3 2009 C08K        12        TRUE

The first column has textual data that is used as a key for each row. The Year should be considered either as an integer or a date, depending on the type of analysis you want to perform. Time series analysis will require it to be a date, while for other analyses it is enough to ensure that it is an integer. The most common issue with this type of observation is that the year is considered as text instead of a number. If you see quotation marks ("") next to your numbers, this is a bad sign indicating that they are, in fact, text.

The “Cool Domain” column contains classifications which we want to be textual values. “Frequency” will need to be an integer, while “Alive” should be a boolean. For each of these formats there is a function to test whether or not the format is the one we expect. These functions usually take the form is_numeric_dtype(), is_bool_dtype(), is_string_dtype(). The argument is then simply the column of the dataframe or an observation. Suppose we create the previous dataframe in Python:

import pandas as pd
data = {
  'ID' : ["ID1", "ID2", "ID3"], 
  'Year' : [2002, 2003, 2009], 
  'Cool Domain' : ["B60C", "B29K","B29K"], 
  'Frequency' : [42, 21, 12],
  'Alive' : [True, False, True]
}

dataframe = pd.DataFrame(data)
# check if the information in the column Year is numeric
pd.api.types.is_numeric_dtype(dataframe['Year'])
# check if the information in the column is logic
pd.api.types.is_bool_dtype(dataframe['Alive'])
# check if the information in the column is character
pd.api.types.is_string_dtype(dataframe['ID'])
  • pd: This is an alias for the pandas library, which is a popular data manipulation library in Python. The alias is used for convenience.
  • api: This is the namespace within the pandas library where various utility functions and types are organized. It’s considered part of the public API of the library, which means that it’s stable and can be used by developers without concerns about major changes between different versions of pandas.
  • types: This is a subnamespace within the api namespace that provides functions and types related to data types and data type checking.
  • is_numeric_dtype(): This is a function provided by the pandas library in the types subnamespace. It’s used to check if a given data type is numeric.

This can also be done in a more concise way with the dataframe.describe() method. This computes, for each numeric column, some summary statistics, which comes in handy as it allows us to verify the distribution of the variables:

data = {
  'ID' : ["ID1", "ID2", "ID3"], 
  'Year' : [2002, 2003, 2009], 
  'Cool Domain' : ["B60C", "B29K","B29K"], 
  'Frequency' : [42, 21, 12],
  'Alive' : [True, False, True]
}
dataframe = pd.DataFrame(data)
dataframe.describe()

Missing values in a dataframe will be automatically ignored. We can also check the type of each variable to ensure that it corresponds to what we want (integers are integers, text is text, etc.). The latter can be done with the .dtypes attribute.

import pandas as pd
data = {
  'ID' : ["ID1", "ID2", "ID3", "ID4"], 
  'Year' : [2002, 2003, 2009, 2010], 
  'Cool Domain' : ["B60C", "B29K","B29K", "B60C"], 
  'Frequency' : [42, 21, 12, None],
  'Weight' : [12.1, 34.2, 21.3, 93.2],
  'Alive' : [True, False, False, False],
  'Country' : ["France", "France", "Germany", "Netherlands"]
}
df = pd.DataFrame(data)
df.describe()
df.dtypes

In this output we can see that the None value is ignored and the statistics are computed. The format of the variables is shown: ID is considered "object", signifying that it has no specific nature (object is a generic format for anything in Python). int64 represents an integer, float64 represents a real number, and bool is a boolean.

Given that we created the dataframe ourselves, we can be relatively sure that the data is in the right format. However, often enough when we download data from online sources, we run into issues. If the format is not as expected, there are different methods that allow us to transform the format of the data. Mainly we use .astype(float), .astype(int) and .astype(str).

data = {
  'ID' : ["ID1", "ID2", "ID3", "ID4"], 
  'Year' : [2002, 2003, 2009, 2010], 
  'Cool Domain' : ["B60C", "B29K","B29K", "B60C"], 
  'Frequency' : [42, 21, 12, 12],
  'Weight' : [12.1, 34.2, 21.3, 93.2],
  'Alive' : [True, False, False, False],
  'Country' : ["France", "France", "Germany", "Netherlands"]
}
df = pd.DataFrame(data)
# Change the data type of 'ID' to string
df['ID'] = df['ID'].astype(str)

# Change the data type of 'Weight' to float
df['Weight'] = df['Weight'].astype(float)

# Change the data type of 'Frequency' to int
# note: this would fail if the column still contained missing values
df['Frequency'] = df['Frequency'].astype(int)
df.dtypes
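
When the column does contain missing values, .astype(int) raises an error. Two common workarounds, as a minimal sketch, are filling the gaps first or using pandas' nullable integer dtype:

# option 1: replace missing values before converting
df['Frequency'] = df['Frequency'].fillna(0).astype(int)

# option 2: convert to the nullable integer dtype, which supports <NA>
df['Frequency'] = df['Frequency'].astype('Int64')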

Now that our data has the right format and we are sure we have all the observations, we might want to do some work to normalise the column names. This is done to make programming and referencing the columns easier. Whenever there is a space in a column name, we cannot use the convenient df.column_name attribute access and must write df['Column name with spaces'] instead, which is not ideal. It's generally good practice to not have any spaces. We can use the pyjanitor package to automate this process:

import pandas as pd
import janitor

data = {'Column 1': [1, 2, None, 4],
        'Column (2)': [None, 6, 7, 8],
        'PM(µ >3)': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
# Using clean_names from pyjanitor
df = df.clean_names()
print(df)

Data transformation