1. Table of Contents


This project explores various methods for assessing Data Quality, implementing Data Preprocessing and conducting Data Exploration for prediction problems with categorical responses, using various helpful packages in R. A non-exhaustive set of methods to detect missing data, extreme outlying points, near-zero variance, multicollinearity, linear dependencies and skewed distributions was evaluated. Remedial procedures for addressing data quality issues, including missing data imputation, centering and scaling transformation, shape transformation and outlier treatment, were similarly considered, as applicable. All results were consolidated in a Summary presented at the end of the document.

Data quality assessment involves profiling and assessing the data to understand its suitability for machine learning tasks. The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stages. Issues such as incorrect labels, synonymous categories in a categorical variable or heterogeneity in columns, among others, can go undetected by standard preprocessing routines and lead to sub-optimal model performance, inaccurate analysis and unreliable decisions.

Data preprocessing involves changing the raw feature vectors into a representation that is more suitable for the downstream modelling and estimation processes, including data cleaning, integration, reduction and transformation. Data cleaning aims to identify and correct errors in the dataset that may negatively impact a predictive model such as removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data. Data integration addresses potential issues with redundant and inconsistent data obtained from multiple sources through approaches such as detection of tuple duplication and data conflict. The purpose of data reduction is to have a condensed representation of the data set that is smaller in volume, while maintaining the integrity of the original data set. Data transformation converts the data into the most appropriate form for data modelling.

Data exploration involves analyzing and investigating data sets to summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to discover patterns, spot anomalies, test a hypothesis, or check assumptions. This process is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them.

1.1 Sample Data


The schedulingData dataset from the AppliedPredictiveModeling package was used for this illustrative example.

Preliminary dataset assessment:

[A] 4331 rows (observations)

[B] 8 columns (variables)
     [B.1] 1/8 response = Class variable (factor)
     [B.2] 7/8 predictors = All remaining variables (2/7 factor + 5/7 numeric)

Code Chunk | Output
##################################
# Loading R libraries
##################################
library(AppliedPredictiveModeling)
library(caret)
library(moments)
library(skimr)
library(dplyr)
library(RANN)
library(corrplot)
library(lares)
library(DMwR2)

##################################
# Loading dataset
##################################
data(schedulingData)

##################################
# Performing a general exploration of the dataset
##################################
dim(schedulingData)
## [1] 4331    8
str(schedulingData)
## 'data.frame':    4331 obs. of  8 variables:
##  $ Protocol   : Factor w/ 14 levels "A","C","D","E",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Compounds  : num  997 97 101 93 100 100 105 98 101 95 ...
##  $ InputFields: num  137 103 75 76 82 82 88 95 91 92 ...
##  $ Iterations : num  20 20 10 20 20 20 20 20 20 20 ...
##  $ NumPending : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hour       : num  14 13.8 13.8 10.1 10.4 ...
##  $ Day        : Factor w/ 7 levels "Mon","Tue","Wed",..: 2 2 4 5 5 3 5 5 5 3 ...
##  $ Class      : Factor w/ 4 levels "VF","F","M","L": 2 1 1 1 1 1 1 1 1 1 ...
summary(schedulingData)
##     Protocol      Compounds        InputFields      Iterations    
##  J      : 989   Min.   :   20.0   Min.   :   10   Min.   : 10.00  
##  O      : 581   1st Qu.:   98.0   1st Qu.:  134   1st Qu.: 20.00  
##  N      : 536   Median :  226.0   Median :  426   Median : 20.00  
##  M      : 451   Mean   :  497.7   Mean   : 1537   Mean   : 29.24  
##  I      : 381   3rd Qu.:  448.0   3rd Qu.:  991   3rd Qu.: 20.00  
##  H      : 321   Max.   :14103.0   Max.   :56671   Max.   :200.00  
##  (Other):1072                                                     
##    NumPending           Hour           Day      Class    
##  Min.   :   0.00   Min.   : 0.01667   Mon:692   VF:2211  
##  1st Qu.:   0.00   1st Qu.:10.90000   Tue:900   F :1347  
##  Median :   0.00   Median :14.01667   Wed:903   M : 514  
##  Mean   :  53.39   Mean   :13.73376   Thu:720   L : 259  
##  3rd Qu.:   0.00   3rd Qu.:16.60000   Fri:923            
##  Max.   :5605.00   Max.   :23.98333   Sat: 32            
##                                       Sun:161
##################################
# Formulating a data type assessment summary
##################################
PDA <- schedulingData
(PDA.Summary <- data.frame(
  Column.Index=c(1:length(names(PDA))),
  Column.Name= names(PDA), 
  Column.Type=sapply(PDA, function(x) class(x)), 
  row.names=NULL)
)
##   Column.Index Column.Name Column.Type
## 1            1    Protocol      factor
## 2            2   Compounds     numeric
## 3            3 InputFields     numeric
## 4            4  Iterations     numeric
## 5            5  NumPending     numeric
## 6            6        Hour     numeric
## 7            7         Day      factor
## 8            8       Class      factor

1.2 Data Quality Assessment


[A] No missing observations noted for any variable.

[B] Low variance observed for 2 variables with First.Second.Mode.Ratio>5.
     [B.1] Iterations variable (numeric)
     [B.2] NumPending variable (numeric)

[C] Low variance observed for 1 variable with Unique.Count.Ratio<0.01.
     [C.1] Iterations variable (numeric)

[D] High skewness observed for 4 variables with Skewness>3 or Skewness<(-3).
     [D.1] Compounds variable (numeric)
     [D.2] InputFields variable (numeric)
     [D.3] Iterations variable (numeric)
     [D.4] NumPending variable (numeric)

Code Chunk | Output
##################################
# Loading dataset
##################################
DQA <- schedulingData

##################################
# Listing all predictors
##################################
DQA.Predictors <- DQA[,!names(DQA) %in% c("Class")]

##################################
# Formulating an overall data quality assessment summary
##################################
(DQA.Summary <- data.frame(
  Column.Index=c(1:length(names(DQA))),
  Column.Name= names(DQA), 
  Column.Type=sapply(DQA, function(x) class(x)), 
  Row.Count=sapply(DQA, function(x) nrow(DQA)),
  NA.Count=sapply(DQA,function(x)sum(is.na(x))),
  Fill.Rate=sapply(DQA,function(x)format(round((sum(!is.na(x))/nrow(DQA)),3),nsmall=3)),
  row.names=NULL)
)
##   Column.Index Column.Name Column.Type Row.Count NA.Count Fill.Rate
## 1            1    Protocol      factor      4331        0     1.000
## 2            2   Compounds     numeric      4331        0     1.000
## 3            3 InputFields     numeric      4331        0     1.000
## 4            4  Iterations     numeric      4331        0     1.000
## 5            5  NumPending     numeric      4331        0     1.000
## 6            6        Hour     numeric      4331        0     1.000
## 7            7         Day      factor      4331        0     1.000
## 8            8       Class      factor      4331        0     1.000
##################################
# Listing all numeric predictors
##################################
DQA.Predictors.Numeric <- DQA.Predictors[,sapply(DQA.Predictors, is.numeric)]

if (length(names(DQA.Predictors.Numeric))>0) {
    print(paste0("There are ",
               (length(names(DQA.Predictors.Numeric))),
               " numeric predictor variable(s)."))
} else {
  print("There are no numeric predictor variables.")
}
## [1] "There are 5 numeric predictor variable(s)."
##################################
# Listing all factor predictors
##################################
DQA.Predictors.Factor <- DQA.Predictors[,sapply(DQA.Predictors, is.factor)]

if (length(names(DQA.Predictors.Factor))>0) {
    print(paste0("There are ",
               (length(names(DQA.Predictors.Factor))),
               " factor predictor variable(s)."))
} else {
  print("There are no factor predictor variables.")
}
## [1] "There are 2 factor predictor variable(s)."
##################################
# Formulating a data quality assessment summary for factor predictors
##################################
if (length(names(DQA.Predictors.Factor))>0) {
  
  ##################################
  # Formulating a function to determine the first mode
  ##################################
  FirstModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    ux[tab == max(tab)]
  }

  ##################################
  # Formulating a function to determine the second mode
  ##################################
  SecondModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    fm = ux[tab == max(tab)]
    sm = na.omit(x)[!(na.omit(x) %in% fm)]
    usm <- unique(sm)
    tabsm <- tabulate(match(sm, usm))
    usm[tabsm == max(tabsm)]
  }
  
  (DQA.Predictors.Factor.Summary <- data.frame(
  Column.Name= names(DQA.Predictors.Factor), 
  Column.Type=sapply(DQA.Predictors.Factor, function(x) class(x)), 
  Unique.Count=sapply(DQA.Predictors.Factor, function(x) length(unique(x))),
  First.Mode.Value=sapply(DQA.Predictors.Factor, function(x) as.character(FirstModes(x)[1])),
  Second.Mode.Value=sapply(DQA.Predictors.Factor, function(x) as.character(SecondModes(x)[1])),
  First.Mode.Count=sapply(DQA.Predictors.Factor, function(x) sum(na.omit(x) == FirstModes(x)[1])),
  Second.Mode.Count=sapply(DQA.Predictors.Factor, function(x) sum(na.omit(x) == SecondModes(x)[1])),
  Unique.Count.Ratio=sapply(DQA.Predictors.Factor, function(x) format(round((length(unique(x))/nrow(DQA.Predictors.Factor)),3), nsmall=3)),
  First.Second.Mode.Ratio=sapply(DQA.Predictors.Factor, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
  row.names=NULL)
  )
  
} 
##   Column.Name Column.Type Unique.Count First.Mode.Value Second.Mode.Value
## 1    Protocol      factor           14                J                 O
## 2         Day      factor            7              Fri               Wed
##   First.Mode.Count Second.Mode.Count Unique.Count.Ratio First.Second.Mode.Ratio
## 1              989               581              0.003                   1.702
## 2              923               903              0.002                   1.022
##################################
# Formulating a data quality assessment summary for numeric predictors
##################################
if (length(names(DQA.Predictors.Numeric))>0) {
  
  ##################################
  # Formulating a function to determine the first mode
  ##################################
  FirstModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    ux[tab == max(tab)]
  }

  ##################################
  # Formulating a function to determine the second mode
  ##################################
  SecondModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    fm = ux[tab == max(tab)]
    sm = na.omit(x)[!(na.omit(x) %in% fm)]
    usm <- unique(sm)
    tabsm <- tabulate(match(sm, usm))
    usm[tabsm == max(tabsm)]
  }
  
  (DQA.Predictors.Numeric.Summary <- data.frame(
  Column.Name= names(DQA.Predictors.Numeric), 
  Column.Type=sapply(DQA.Predictors.Numeric, function(x) class(x)), 
  Unique.Count=sapply(DQA.Predictors.Numeric, function(x) length(unique(x))),
  Unique.Count.Ratio=sapply(DQA.Predictors.Numeric, function(x) format(round((length(unique(x))/nrow(DQA.Predictors.Numeric)),3), nsmall=3)),
  First.Mode.Value=sapply(DQA.Predictors.Numeric, function(x) format(round((FirstModes(x)[1]),3),nsmall=3)),
  Second.Mode.Value=sapply(DQA.Predictors.Numeric, function(x) format(round((SecondModes(x)[1]),3),nsmall=3)),
  First.Mode.Count=sapply(DQA.Predictors.Numeric, function(x) sum(na.omit(x) == FirstModes(x)[1])),
  Second.Mode.Count=sapply(DQA.Predictors.Numeric, function(x) sum(na.omit(x) == SecondModes(x)[1])),
  First.Second.Mode.Ratio=sapply(DQA.Predictors.Numeric, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
  Minimum=sapply(DQA.Predictors.Numeric, function(x) format(round(min(x,na.rm = TRUE),3), nsmall=3)),
  Mean=sapply(DQA.Predictors.Numeric, function(x) format(round(mean(x,na.rm = TRUE),3), nsmall=3)),
  Median=sapply(DQA.Predictors.Numeric, function(x) format(round(median(x,na.rm = TRUE),3), nsmall=3)),
  Maximum=sapply(DQA.Predictors.Numeric, function(x) format(round(max(x,na.rm = TRUE),3), nsmall=3)),
  Skewness=sapply(DQA.Predictors.Numeric, function(x) format(round(skewness(x,na.rm = TRUE),3), nsmall=3)),
  Kurtosis=sapply(DQA.Predictors.Numeric, function(x) format(round(kurtosis(x,na.rm = TRUE),3), nsmall=3)),
  Percentile25th=sapply(DQA.Predictors.Numeric, function(x) format(round(quantile(x,probs=0.25,na.rm = TRUE),3), nsmall=3)),
  Percentile75th=sapply(DQA.Predictors.Numeric, function(x) format(round(quantile(x,probs=0.75,na.rm = TRUE),3), nsmall=3)),
  row.names=NULL)
  )  
  
}
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 1   Compounds     numeric          858              0.198           20.000
## 2 InputFields     numeric         1730              0.399           10.000
## 3  Iterations     numeric           11              0.003           20.000
## 4  NumPending     numeric          303              0.070            0.000
## 5        Hour     numeric          924              0.213           13.083
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 1            31.000               96                29                   3.310
## 2           466.000               82                27                   3.037
## 3            10.000             3568               272                  13.118
## 4             1.000             3275               165                  19.848
## 5            21.067               28                25                   1.120
##   Minimum     Mean  Median   Maximum Skewness Kurtosis Percentile25th
## 1  20.000  497.742 226.000 14103.000    6.568   69.486         98.000
## 2  10.000 1537.055 426.000 56671.000    5.870   54.919        134.000
## 3  10.000   29.244  20.000   200.000    3.937   18.510         20.000
## 4   0.000   53.389   0.000  5605.000    9.718  105.594          0.000
## 5   0.017   13.734  14.017    23.983   -0.546    3.747         10.900
##   Percentile75th
## 1        448.000
## 2        991.000
## 3         20.000
## 4          0.000
## 5         16.600
##################################
# Identifying potential data quality issues
##################################

##################################
# Checking for missing observations
##################################
if ((nrow(DQA.Summary[DQA.Summary$NA.Count>0,]))>0){
  print(paste0("Missing observations noted for ",
               (nrow(DQA.Summary[DQA.Summary$NA.Count>0,])),
               " variable(s) with NA.Count>0 and Fill.Rate<1.0."))
  DQA.Summary[DQA.Summary$NA.Count>0,]
} else {
  print("No missing observations noted.")
}
## [1] "No missing observations noted."
##################################
# Checking for zero or near-zero variance predictors
##################################
if (length(names(DQA.Predictors.Factor))==0) {
  print("No factor predictors noted.")
} else if (nrow(DQA.Predictors.Factor.Summary[as.numeric(as.character(DQA.Predictors.Factor.Summary$First.Second.Mode.Ratio))>5,])>0){
  print(paste0("Low variance observed for ",
               (nrow(DQA.Predictors.Factor.Summary[as.numeric(as.character(DQA.Predictors.Factor.Summary$First.Second.Mode.Ratio))>5,])),
               " factor variable(s) with First.Second.Mode.Ratio>5."))
  DQA.Predictors.Factor.Summary[as.numeric(as.character(DQA.Predictors.Factor.Summary$First.Second.Mode.Ratio))>5,]
} else {
  print("No low variance factor predictors due to high first-second mode ratio noted.")
}
## [1] "No low variance factor predictors due to high first-second mode ratio noted."
if (length(names(DQA.Predictors.Numeric))==0) {
  print("No numeric predictors noted.")
} else if (nrow(DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$First.Second.Mode.Ratio))>5,])>0){
  print(paste0("Low variance observed for ",
               (nrow(DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$First.Second.Mode.Ratio))>5,])),
               " numeric variable(s) with First.Second.Mode.Ratio>5."))
  DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$First.Second.Mode.Ratio))>5,]
} else {
  print("No low variance numeric predictors due to high first-second mode ratio noted.")
}
## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>5."
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 3  Iterations     numeric           11              0.003           20.000
## 4  NumPending     numeric          303              0.070            0.000
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 3            10.000             3568               272                  13.118
## 4             1.000             3275               165                  19.848
##   Minimum   Mean Median  Maximum Skewness Kurtosis Percentile25th
## 3  10.000 29.244 20.000  200.000    3.937   18.510         20.000
## 4   0.000 53.389  0.000 5605.000    9.718  105.594          0.000
##   Percentile75th
## 3         20.000
## 4          0.000
if (length(names(DQA.Predictors.Numeric))==0) {
  print("No numeric predictors noted.")
} else if (nrow(DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$Unique.Count.Ratio))<0.01,])>0){
  print(paste0("Low variance observed for ",
               (nrow(DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$Unique.Count.Ratio))<0.01,])),
               " numeric variable(s) with Unique.Count.Ratio<0.01."))
  DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$Unique.Count.Ratio))<0.01,]
} else {
  print("No low variance numeric predictors due to low unique count ratio noted.")
}
## [1] "Low variance observed for 1 numeric variable(s) with Unique.Count.Ratio<0.01."
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 3  Iterations     numeric           11              0.003           20.000
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 3            10.000             3568               272                  13.118
##   Minimum   Mean Median Maximum Skewness Kurtosis Percentile25th Percentile75th
## 3  10.000 29.244 20.000 200.000    3.937   18.510         20.000         20.000
##################################
# Checking for skewed predictors
##################################
if (length(names(DQA.Predictors.Numeric))==0) {
  print("No numeric predictors noted.")
} else if (nrow(DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$Skewness))>3 |
                                               as.numeric(as.character(DQA.Predictors.Numeric.Summary$Skewness))<(-3),])>0){
  print(paste0("High skewness observed for ",
  (nrow(DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$Skewness))>3 |
                                               as.numeric(as.character(DQA.Predictors.Numeric.Summary$Skewness))<(-3),])),
  " numeric variable(s) with Skewness>3 or Skewness<(-3)."))
  DQA.Predictors.Numeric.Summary[as.numeric(as.character(DQA.Predictors.Numeric.Summary$Skewness))>3 |
                                 as.numeric(as.character(DQA.Predictors.Numeric.Summary$Skewness))<(-3),]
} else {
  print("No skewed numeric predictors noted.")
}
## [1] "High skewness observed for 4 numeric variable(s) with Skewness>3 or Skewness<(-3)."
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 1   Compounds     numeric          858              0.198           20.000
## 2 InputFields     numeric         1730              0.399           10.000
## 3  Iterations     numeric           11              0.003           20.000
## 4  NumPending     numeric          303              0.070            0.000
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 1            31.000               96                29                   3.310
## 2           466.000               82                27                   3.037
## 3            10.000             3568               272                  13.118
## 4             1.000             3275               165                  19.848
##   Minimum     Mean  Median   Maximum Skewness Kurtosis Percentile25th
## 1  20.000  497.742 226.000 14103.000    6.568   69.486         98.000
## 2  10.000 1537.055 426.000 56671.000    5.870   54.919        134.000
## 3  10.000   29.244  20.000   200.000    3.937   18.510         20.000
## 4   0.000   53.389   0.000  5605.000    9.718  105.594          0.000
##   Percentile75th
## 1        448.000
## 2        991.000
## 3         20.000
## 4          0.000

1.3 Data Preprocessing

1.3.1 Missing Data Imputation


K-Nearest Neighbors Imputation identifies a sample with one or more missing values and determines the K most similar samples in the training data that are complete. Similarity of samples for this method is defined by either the Euclidean distance (numeric predictors) or Gower’s distance (numeric and categorical predictors) metrics. The distance measure is computed for each predictor and the average distance is used as the overall distance. The K closest samples to the sample with the missing value are identified. Once the K neighbors are found, their values are used to impute the missing data. The mode is used to impute qualitative predictors while the average or median is used to impute quantitative predictors.

Bagged Trees Imputation combines bootstrapping and decision trees to construct an ensemble for each predictor as a function of all the others. Tree-based models are a reasonable choice for an imputation technique since a tree can be constructed in the presence of other missing data, aside from generally having good accuracy without extrapolating values beyond the bounds of the training data. The modeling process involves generating bootstrap samples of the original data, training an unpruned decision tree for each bootstrap subset of the data and implementing an ensemble voting for all the individual decision tree predictions to formulate the final prediction.

Median Imputation replaces missing numeric values with the sample median, providing a fast and simple fix for the missing data. However, this process will change the distributional shape, underestimate the variance, disturb the relations between variables, bias almost any estimate other than the median, and bias the estimate of the median itself when data are not missing completely at random, especially when the missing values are large in number. This method can be slightly improved by stratifying the data into subgroups.

[A] 100% fill rate with no missing data identified from the previous data quality assessment.

[B] 100% fill rate with no missing data confirmed using a descriptive statistics summary.

[C] The caret package includes three imputation methods (a minimal demonstration follows this list):
     [C.1] The knnImpute method finds the k closest samples (by Euclidean distance) in the training set and imputes from their values.
     [C.2] The bagImpute method fits a bagged tree model for each predictor (as a function of all the others).
     [C.3] The medianImpute method takes the median of each predictor in the training set and uses it to fill missing values.
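Since the dataset is complete, the three imputation methods are not exercised by the main workflow. The following minimal sketch injects NA values into a copy of the numeric predictors to show how each method might be invoked; the Imputation.Demo object and the injected missingness pattern are illustrative assumptions, not part of the original analysis.

##################################
# Demonstrating the three imputation methods
# on a copy of the numeric predictors with
# artificially injected missing values
##################################
library(caret)
library(RANN)

Imputation.Demo <- schedulingData[,sapply(schedulingData, is.numeric)]
set.seed(12345)
Imputation.Demo[sample(nrow(Imputation.Demo), 50), "Compounds"] <- NA

# Note: knnImpute implicitly centers and scales the predictors before imputing
KNN.Imputer    <- preProcess(Imputation.Demo, method = "knnImpute", k = 5)
Bag.Imputer    <- preProcess(Imputation.Demo, method = "bagImpute")
Median.Imputer <- preProcess(Imputation.Demo, method = "medianImpute")

Imputation.Demo.KNN    <- predict(KNN.Imputer, Imputation.Demo)
Imputation.Demo.Bag    <- predict(Bag.Imputer, Imputation.Demo)
Imputation.Demo.Median <- predict(Median.Imputer, Imputation.Demo)

# All injected NA values should now be filled
sum(is.na(Imputation.Demo.Median))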

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA))
Data summary
Name DPA
Number of rows 4331
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Protocol 0 1 FALSE 14 J: 989, O: 581, N: 536, M: 451
Day 0 1 FALSE 7 Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class 0 1 FALSE 4 VF: 2211, F: 1347, M: 514, L: 259

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
##################################
# Identifying columns with missing data
#################################
DPA %>%
skim() %>%
dplyr::filter(n_missing > 0)
## # A tibble: 0 × 15
## # ℹ 15 variables: skim_type <chr>, skim_variable <chr>, n_missing <int>,
## #   complete_rate <dbl>, factor.ordered <lgl>, factor.n_unique <int>,
## #   factor.top_counts <chr>, numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>

1.3.2 Outlier Treatment


Spatial Sign Transformation takes a set of predictor variables and transforms them so that the new values have the same distance to the center of the distribution. This approach is also referred to as global contrast normalization, where the data are projected onto a multidimensional sphere. The transformation requires computing the norm of the data and benefits from a prior centering and scaling operation, as well as from transformation methods that induce symmetry, before the spatial sign process is applied.
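As a quick sanity check on this property, the minimal sketch below (with illustrative object names) applies a center, scale and spatial sign transformation and verifies that every transformed sample has unit norm.

##################################
# Verifying the spatial sign property:
# every transformed sample (row) should
# lie on the unit sphere with norm 1
##################################
library(caret)

Numeric.Predictors <- schedulingData[,sapply(schedulingData, is.numeric)]
SS <- preProcess(Numeric.Predictors, method = c("center","scale","spatialSign"))
SS.Transformed <- predict(SS, Numeric.Predictors)

# All row norms should equal 1 (up to floating-point error)
summary(sqrt(rowSums(SS.Transformed^2)))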

[A] Outliers noted for 5 variables. Outlier treatment for numerical stability remains optional depending on potential model requirements for the subsequent steps.

[B] Numeric data can be visualized through a boxplot including observations classified as suspected outliers using the IQR criterion. The IQR criterion means that all observations above the (75th percentile + 1.5 x IQR) or below the (25th percentile - 1.5 x IQR) are suspected outliers, where IQR is the difference between the third quartile (75th percentile) and first quartile (25th percentile).

[C] The caret package includes one method for outlier treatment:
     [C.1] The spatialSign method from the caret package projects each sample onto the unit sphere in p dimensions by dividing it by its norm, where p is the number of predictors.

[D] The spatialSign method was applied to the dataset:
     [D.1] While data distribution generally improved with the number of remaining outliers reduced, there are still 4 variables noted with outliers using the IQR criterion.
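As a concrete illustration of the IQR criterion described in [B], the sketch below computes the fences by hand for a single predictor chosen for illustration; note that boxplot.stats uses hinges rather than quantiles, so counts may differ slightly from the boxplot annotations.

##################################
# Computing the IQR fences for a single
# numeric predictor (illustrative)
##################################
x <- schedulingData$Compounds
Quartiles <- quantile(x, probs = c(0.25, 0.75))
IQR.Value <- Quartiles[2] - Quartiles[1]
Lower.Fence <- Quartiles[1] - 1.5 * IQR.Value
Upper.Fence <- Quartiles[2] + 1.5 * IQR.Value

# Observations outside the fences are suspected outliers
sum(x < Lower.Fence | x > Upper.Fence)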

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Identifying outliers for the numeric predictors
##################################
OutlierCountList <- c()

for (i in 1:ncol(DPA.Predictors.Numeric)) {
  Outliers <- boxplot.stats(DPA.Predictors.Numeric[,i])$out
  OutlierCount <- length(Outliers)
  OutlierCountList <- append(OutlierCountList,OutlierCount)
  OutlierIndices <- which(DPA.Predictors.Numeric[,i] %in% c(Outliers))
  boxplot(DPA.Predictors.Numeric[,i], 
          ylab = names(DPA.Predictors.Numeric)[i], 
          main = names(DPA.Predictors.Numeric)[i],
          horizontal=TRUE)
  mtext(paste0(OutlierCount, " Outlier(s) Detected"))
}

OutlierCountSummary <- as.data.frame(cbind(names(DPA.Predictors.Numeric),(OutlierCountList)))
names(OutlierCountSummary) <- c("NumericPredictors","OutlierCount")
OutlierCountSummary$OutlierCount <- as.numeric(as.character(OutlierCountSummary$OutlierCount))
NumericPredictorWithOutlierCount <- nrow(OutlierCountSummary[OutlierCountSummary$OutlierCount>0,])
print(paste0(NumericPredictorWithOutlierCount, " numeric variable(s) were noted with outlier(s)." ))
## [1] "5 numeric variable(s) were noted with outlier(s)."
##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA.Predictors.Numeric))
Data summary
Name DPA.Predictors.Numeric
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
##################################
# Applying a center, scale and spatial sign data transformation
##################################
DPA_CenteredScaledSpatialSigned <- preProcess(DPA.Predictors.Numeric, method = c("center","scale","spatialSign"))
DPA_CenteredScaledSpatialSignedTransformed <- predict(DPA_CenteredScaledSpatialSigned, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_CenteredScaledSpatialSignedTransformedSkimmed <- skim(DPA_CenteredScaledSpatialSignedTransformed))
Data summary
Name DPA_CenteredScaledSpatial…
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 -0.14 0.37 -0.82 -0.39 -0.21 -0.04 1.00 ▂▇▃▁▁
InputFields 0 1 -0.16 0.41 -0.81 -0.41 -0.25 -0.07 1.00 ▃▇▂▁▂
Iterations 0 1 -0.18 0.37 -0.92 -0.38 -0.25 -0.14 1.00 ▁▇▂▁▁
NumPending 0 1 -0.09 0.21 -0.46 -0.18 -0.13 -0.06 1.00 ▃▇▁▁▁
Hour 0 1 0.04 0.65 -0.99 -0.63 0.08 0.70 0.99 ▇▃▅▃▇
##################################
# Identifying outliers for the numeric predictors
##################################
OutlierCountList <- c()

for (i in 1:ncol(DPA.Predictors.Numeric)) {
  Outliers <- boxplot.stats(DPA_CenteredScaledSpatialSignedTransformed[,i])$out
  OutlierCount <- length(Outliers)
  OutlierCountList <- append(OutlierCountList,OutlierCount)
  OutlierIndices <- which(DPA_CenteredScaledSpatialSignedTransformed[,i] %in% c(Outliers))
  boxplot(DPA_CenteredScaledSpatialSignedTransformed[,i], 
          ylab = names(DPA.Predictors.Numeric)[i], 
          main = names(DPA.Predictors.Numeric)[i],
          horizontal=TRUE)
  mtext(paste0(OutlierCount, " Outlier(s) Detected"))
}

OutlierCountSummary <- as.data.frame(cbind(names(DPA.Predictors.Numeric),(OutlierCountList)))
names(OutlierCountSummary) <- c("NumericPredictors","OutlierCount")
OutlierCountSummary$OutlierCount <- as.numeric(as.character(OutlierCountSummary$OutlierCount))
NumericPredictorWithOutlierCount <- nrow(OutlierCountSummary[OutlierCountSummary$OutlierCount>0,])
print(paste0(NumericPredictorWithOutlierCount, " numeric variable(s) were noted with outlier(s)." ))
## [1] "4 numeric variable(s) were noted with outlier(s)."

1.3.3 Zero and Near-Zero Variance


[A] Low variance noted for 1 variable from the previous data quality assessment.

[B] Low variance noted for 2 variables confirmed using a preprocessing summary from the caret package.

[C] The caret package's nearZeroVar method offers two criteria for detecting low variance variables:
     [C.1] The freqCut criterion, with a default setting of 95/5, computes the frequency of the most prevalent value over the second most frequent value (the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data.
     [C.2] The uniqueCut criterion, with a default setting of 10, computes the percent of unique values, i.e. the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.

[D] The nearZeroVar method, with the freqCut and uniqueCut criteria set at 80/20 and 10 respectively, was applied to the dataset (a manual cross-check of both criteria follows this list):
     [D.1] 2 variables may be optionally removed from the dataset for the subsequent analysis.
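The reported freqRatio of 13.118 and percentUnique of 0.254 for the Iterations variable can be reproduced by hand. The following minimal sketch cross-checks both criteria using base R only.

##################################
# Cross-checking the nearZeroVar criteria
# by hand for the Iterations variable
# (illustrative)
##################################
Frequencies <- sort(table(schedulingData$Iterations), decreasing = TRUE)

# Frequency ratio: most prevalent value count over second most prevalent
(Freq.Ratio <- as.numeric(Frequencies[1] / Frequencies[2]))

# Percent unique: number of unique values over total samples, times 100
(Percent.Unique <- 100 * length(unique(schedulingData$Iterations)) / nrow(schedulingData))

# Flagged as near-zero variance when both criteria are met
Freq.Ratio > 80/20 & Percent.Unique < 10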

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA))
Data summary
Name DPA
Number of rows 4331
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Protocol 0 1 FALSE 14 J: 989, O: 581, N: 536, M: 451
Day 0 1 FALSE 7 Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class 0 1 FALSE 4 VF: 2211, F: 1347, M: 514, L: 259

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
##################################
# Identifying columns with low variance
###################################
DPA_LowVariance <- nearZeroVar(DPA,
                               freqCut = 80/20,
                               uniqueCut = 10,
                               saveMetrics= TRUE)
(DPA_LowVariance[DPA_LowVariance$nzv,])
##            freqRatio percentUnique zeroVar  nzv
## Iterations  13.11765     0.2539829   FALSE TRUE
## NumPending  19.84848     6.9960748   FALSE TRUE
if ((nrow(DPA_LowVariance[DPA_LowVariance$nzv,]))==0){
  
  print("No low variance predictors noted.")
  
} else {

  print(paste0("Low variance observed for ",
               (nrow(DPA_LowVariance[DPA_LowVariance$nzv,])),
               " numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."))
  
  DPA_LowVarianceForRemoval <- (nrow(DPA_LowVariance[DPA_LowVariance$nzv,]))
  
  print(paste0("Low variance can be resolved by removing ",
               (nrow(DPA_LowVariance[DPA_LowVariance$nzv,])),
               " numeric variable(s)."))
  
  for (j in 1:DPA_LowVarianceForRemoval) {
  DPA_LowVarianceRemovedVariable <- rownames(DPA_LowVariance[DPA_LowVariance$nzv,])[j]
  print(paste0("Variable ",
               j,
               " for removal: ",
               DPA_LowVarianceRemovedVariable))
  }
  
  DPA %>%
  skim() %>%
  dplyr::filter(skim_variable %in% rownames(DPA_LowVariance[DPA_LowVariance$nzv,]))

  ##################################
  # Filtering out columns with low variance
  #################################
  DPA_ExcludedLowVariance <- DPA[,!names(DPA) %in% rownames(DPA_LowVariance[DPA_LowVariance$nzv,])]
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_ExcludedLowVariance_Skimmed <- skim(DPA_ExcludedLowVariance))
}
## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."
## [1] "Low variance can be resolved by removing 2 numeric variable(s)."
## [1] "Variable 1 for removal: Iterations"
## [1] "Variable 2 for removal: NumPending"
Data summary
Name DPA_ExcludedLowVariance
Number of rows 4331
Number of columns 6
_______________________
Column type frequency:
factor 3
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Protocol 0 1 FALSE 14 J: 989, O: 581, N: 536, M: 451
Day 0 1 FALSE 7 Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class 0 1 FALSE 4 VF: 2211, F: 1347, M: 514, L: 259

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁

1.3.4 Collinearity


[A] No high correlation noted for any variable pair confirmed using the preprocessing summaries from the caret and lares packages.

[B] The caret and lares packages include methods for detecting highly correlated variables:
     [B.1] The findCorrelation method using the cutoff criteria with default setting at 0.90 from the caret package searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.
     [B.2] The corr_cross method using the top criteria from the lares package lists out and ranks the variable-pairs with the highest correlation coefficients which were statistically significant.

[C] The findCorrelation method with the cutoff criterion set at 0.95 and the corr_cross method with the top criterion set at 10 were applied to the dataset (a synthetic demonstration of findCorrelation follows this list):
     [C.1] No highly correlated variables were detected, so none need to be removed from the dataset for the subsequent analysis.
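Because no variable pair in this dataset crosses the 0.95 cutoff, the removal branch in the code chunk below never executes. The following minimal sketch runs findCorrelation on synthetic data (hypothetical variables X1 to X3) so the behavior of the method can still be observed.

##################################
# Demonstrating findCorrelation on synthetic
# data with a deliberately near-duplicated column
##################################
library(caret)

set.seed(12345)
Synthetic <- data.frame(X1 = rnorm(100))
Synthetic$X2 <- Synthetic$X1 + rnorm(100, sd = 0.01)  # near-duplicate of X1
Synthetic$X3 <- rnorm(100)                            # independent predictor

Synthetic.Correlation <- cor(Synthetic)

# Returns the indices of columns suggested for removal
(Columns.To.Remove <- findCorrelation(Synthetic.Correlation, cutoff = 0.95))
colnames(Synthetic)[Columns.To.Remove]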

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Visualizing pairwise correlation between predictors
##################################
DPA_CorrelationTest <- cor.mtest(DPA.Predictors.Numeric,
                       method = "pearson",
                       conf.level = .95)

corrplot(cor(DPA.Predictors.Numeric,
             method = "pearson",
             use="pairwise.complete.obs"), 
         method = "circle",
         type = "upper", 
         order = "original", 
         tl.col = "black", 
         tl.cex = 0.75,
         tl.srt = 90, 
         sig.level = 0.05, 
         p.mat = DPA_CorrelationTest$p,
         insig = "blank")

##################################
# Identifying the highly correlated variables
##################################
DPA_Correlation <-  cor(DPA.Predictors.Numeric, 
                        method = "pearson",
                        use="pairwise.complete.obs")
(DPA_HighlyCorrelatedCount <- sum(abs(DPA_Correlation[upper.tri(DPA_Correlation)]) > 0.95))
## [1] 0
if (DPA_HighlyCorrelatedCount == 0) {
  print("No highly correlated predictors noted.")
} else {
  print(paste0("High correlation observed for ",
               (DPA_HighlyCorrelatedCount),
               " pairs of numeric variable(s) with Correlation.Coefficient>0.95."))
  
  (DPA_HighlyCorrelatedPairs <- corr_cross(DPA.Predictors.Numeric,
  max_pvalue = 0.05, 
  top = DPA_HighlyCorrelatedCount,
  rm.na = TRUE,
  grid = FALSE
))
  
}
## [1] "No highly correlated predictors noted."
if (DPA_HighlyCorrelatedCount > 0) {
  DPA_HighlyCorrelated <- findCorrelation(DPA_Correlation, cutoff = 0.95)
  
  (DPA_HighlyCorrelatedForRemoval <- length(DPA_HighlyCorrelated))
  
  print(paste0("High correlation can be resolved by removing ",
               (DPA_HighlyCorrelatedForRemoval),
               " numeric variable(s)."))
  
  for (j in 1:DPA_HighlyCorrelatedForRemoval) {
  DPA_HighlyCorrelatedRemovedVariable <- colnames(DPA.Predictors.Numeric)[DPA_HighlyCorrelated[j]]
  print(paste0("Variable ",
               j,
               " for removal: ",
               DPA_HighlyCorrelatedRemovedVariable))
  }
  
  ##################################
  # Filtering out columns with high correlation
  #################################
  DPA_ExcludedHighCorrelation <- DPA[,-DPA_HighlyCorrelated]
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_ExcludedHighCorrelation_Skimmed <- skim(DPA_ExcludedHighCorrelation))

}

1.3.5 Linear Dependencies


[A] No linear dependency noted for any subset of variables using the preprocessing summary from the caret package.

[B] The caret package includes one method for detecting linearly dependent variables:
     [B.1] The findLinearCombos method from the caret package uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist).

[C] The findLinearCombos method was applied to the dataset (a synthetic demonstration follows this list):
     [C.1] No linearly dependent variables were identified, so none need to be removed from the dataset for the subsequent analysis.
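Since the actual dataset contains no linear dependencies, the removal branch in the code chunk below never executes. The following minimal sketch runs findLinearCombos on synthetic data (hypothetical variables X1 to X3) where one column is an exact linear combination of the others.

##################################
# Demonstrating findLinearCombos on synthetic
# data with an exact linear combination
##################################
library(caret)

set.seed(12345)
Synthetic <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
Synthetic$X3 <- 2*Synthetic$X1 + 3*Synthetic$X2  # exact linear combination

# linearCombos lists the dependent subset; remove suggests dropping X3
(Synthetic.LinearCombos <- findLinearCombos(Synthetic))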

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Finding and identifying the linearly dependent variables
##################################
DPA_LinearlyDependent <- findLinearCombos(DPA.Predictors.Numeric)

(DPA_LinearlyDependentCount <- length(DPA_LinearlyDependent$linearCombos))
## [1] 0
if (DPA_LinearlyDependentCount == 0) {
  print("No linearly dependent predictors noted.")
} else {
  print(paste0("Linear dependency observed for ",
               (DPA_LinearlyDependentCount),
               " subset(s) of numeric variable(s)."))
  
  for (i in 1:DPA_LinearlyDependentCount) {
    DPA_LinearlyDependentSubset <- colnames(DPA.Predictors.Numeric)[DPA_LinearlyDependent$linearCombos[[i]]]
    print(paste0("Linear dependent variable(s) for subset ",
                 i,
                 " include: ",
                 DPA_LinearlyDependentSubset))
  }
  
}
## [1] "No linearly dependent predictors noted."
##################################
# Identifying the linearly dependent variables for removal
##################################

if (DPA_LinearlyDependentCount > 0) {
  DPA_LinearlyDependent <- findLinearCombos(DPA.Predictors.Numeric)
  
  DPA_LinearlyDependentForRemoval <- length(DPA_LinearlyDependent$remove)
  
  print(paste0("Linear dependency can be resolved by removing ",
               (DPA_LinearlyDependentForRemoval),
               " numeric variable(s)."))
  
  for (j in 1:DPA_LinearlyDependentForRemoval) {
  DPA_LinearlyDependentRemovedVariable <- colnames(DPA.Predictors.Numeric)[DPA_LinearlyDependent$remove[j]]
  print(paste0("Variable ",
               j,
               " for removal: ",
               DPA_LinearlyDependentRemovedVariable))
  }
  
  ##################################
  # Filtering out columns with linear dependency
  #################################
  DPA_ExcludedLinearlyDependent <- DPA.Predictors.Numeric[,-DPA_LinearlyDependent$remove]
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_ExcludedLinearlyDependent_Skimmed <- skim(DPA_ExcludedLinearlyDependent))

}

1.3.6 Centering and Scaling


[A] Centering and scaling transformation for numerical stability remains optional depending on potential model requirements for the subsequent steps.

[B] The caret package includes three methods for centering and scaling variables:
     [B.1] The center method from the caret package subtracts the average value of a numeric variable from all of its values. As a result of centering, the variable has a zero mean.
     [B.2] The scale method from the caret package divides each value of the variable by its standard deviation, and is typically combined with centering. Scaling the data coerces the values to have a common standard deviation of one.
     [B.3] The range method from the caret package scales the data to lie within the defined numeric range bound.

[C] The center, scale and range methods were evaluated on the dataset.
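As a minimal sketch (with illustrative object names), the equivalence between the caret center and scale methods and the base R scale() function can be verified on a single predictor before walking through the full transformations below.

##################################
# Verifying that the center and scale methods
# match the base R scale() function
##################################
library(caret)

Numeric.Predictors <- schedulingData[,sapply(schedulingData, is.numeric)]
CS <- preProcess(Numeric.Predictors, method = c("center","scale"))
CS.Transformed <- predict(CS, Numeric.Predictors)

# Should return TRUE: identical standardization of the Hour variable
all.equal(as.numeric(scale(Numeric.Predictors$Hour)),
          CS.Transformed$Hour)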

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA.Predictors.Numeric))
Data summary
Name DPA.Predictors.Numeric
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
##################################
# Applying a center transformation
##################################
DPA_Centered <- preProcess(DPA.Predictors.Numeric, method = c("center"))
DPA_CenteredTransformed <- predict(DPA_Centered, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_CenteredTransformedSkimmed <- skim(DPA_CenteredTransformed))
Data summary
Name DPA_CenteredTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 0 1020.17 -477.74 -399.74 -271.74 -49.74 13605.26 ▇▁▁▁▁
InputFields 0 1 0 3650.08 -1527.06 -1403.06 -1111.06 -546.06 55133.94 ▇▁▁▁▁
Iterations 0 1 0 34.42 -19.24 -9.24 -9.24 -9.24 170.76 ▇▁▁▁▁
NumPending 0 1 0 355.96 -53.39 -53.39 -53.39 -53.39 5551.61 ▇▁▁▁▁
Hour 0 1 0 3.98 -13.72 -2.83 0.28 2.87 10.25 ▁▂▇▇▁
##################################
# Applying a center and scale data transformation
##################################
DPA_CenteredScaled <- preProcess(DPA.Predictors.Numeric, method = c("center","scale"))
DPA_CenteredScaledTransformed <- predict(DPA_CenteredScaled, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_CenteredScaledTransformedSkimmed <- skim(DPA_CenteredScaledTransformed))
Data summary
Name DPA_CenteredScaledTransfo…
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 0 1 -0.47 -0.39 -0.27 -0.05 13.34 ▇▁▁▁▁
InputFields 0 1 0 1 -0.42 -0.38 -0.30 -0.15 15.10 ▇▁▁▁▁
Iterations 0 1 0 1 -0.56 -0.27 -0.27 -0.27 4.96 ▇▁▁▁▁
NumPending 0 1 0 1 -0.15 -0.15 -0.15 -0.15 15.60 ▇▁▁▁▁
Hour 0 1 0 1 -3.45 -0.71 0.07 0.72 2.57 ▁▂▇▇▁
##################################
# Applying a range transformation
##################################
DPA_Ranged <- preProcess(DPA.Predictors.Numeric, method = c("range"), rangeBounds = c(0, 1))
DPA_RangedTransformed <- predict(DPA_Ranged, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_RangedTransformedSkimmed <- skim(DPA_RangedTransformed))
Data summary
Name DPA_RangedTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 0.03 0.07 0 0.01 0.01 0.03 1 ▇▁▁▁▁
InputFields 0 1 0.03 0.06 0 0.00 0.01 0.02 1 ▇▁▁▁▁
Iterations 0 1 0.10 0.18 0 0.05 0.05 0.05 1 ▇▁▁▁▁
NumPending 0 1 0.01 0.06 0 0.00 0.00 0.00 1 ▇▁▁▁▁
Hour 0 1 0.57 0.17 0 0.45 0.58 0.69 1 ▁▂▇▇▁

1.3.7 Shape Transformation


Box-Cox Transformation applies a family of power transformations such that the modified values are a monotonic function of the observations over some admissible range, indexed by an optimal parameter lambda. The process involves determining the value of lambda which would result in the highest correlation between the Box-Cox-transformed values and the z-scores of the observations' ordered indices. The original data set is then recomputed through the same power transformations given the optimized lambda value. The method is restricted to strictly positive values.
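For a single strictly positive predictor, the fitted lambda can be inspected directly with the BoxCoxTrans function from the caret package, as in the minimal sketch below; the choice of the Compounds variable is illustrative.

##################################
# Estimating the Box-Cox lambda for a single
# strictly positive predictor (illustrative)
##################################
library(caret)
library(moments)

# Box-Cox: y(lambda) = (y^lambda - 1)/lambda for lambda != 0, log(y) for lambda = 0
(Compounds.BoxCox <- BoxCoxTrans(schedulingData$Compounds))

# Applying the fitted transformation and re-checking skewness
Compounds.Transformed <- predict(Compounds.BoxCox, schedulingData$Compounds)
skewness(Compounds.Transformed)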

Yeo-Johnson Transformation applies a new family of distributions that can be used without restrictions, extending many of the good properties of the Box-Cox power family. Similar to the Box-Cox transformation, the method also estimates the optimal value of lambda, but it can transform both positive and negative values by inflating low variance data and deflating high variance data to create a more uniform data set. While there are no restrictions on the applicable values, the interpretability of the transformed values is diminished compared to that of the other methods.

Exponential Transformation applies a new family of exponential distributions that can be used without restrictions, providing a good alternative for other power transformation families. While there are no restrictions in terms of the applicable values, the method’s effectiveness is sensitive to the quality of the data as it needs to assume a common mean and estimate the transformation parameter by directly maximizing the likelihood.

[A] Shape transformation to remove skewness for data distribution stability remains optional depending on potential model requirements for the subsequent steps.

[B] The caret package includes three methods for transforming variables:
     [B.1] The BoxCox method from the caret package transforms the distributional shape for variables with strictly positive values.
     [B.2] The YeoJohnson method from the caret package transforms the distributional shape for variables with zero and/or negative values.
     [B.3] The expoTrans method from the caret package applies an exponential transformation to adjust the distributional shape and likewise accommodates variables with zero and/or negative values.

[C] The BoxCox, YeoJohnson and expoTrans methods were evaluated on the dataset.

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA.Predictors.Numeric))
Data summary
Name DPA.Predictors.Numeric
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
##################################
# Applying a Box-Cox transformation
##################################
DPA_BoxCox <- preProcess(DPA.Predictors.Numeric, method = c("BoxCox"))
DPA_BoxCoxTransformed <- predict(DPA_BoxCox, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_BoxCoxTransformedSkimmed <- skim(DPA_BoxCoxTransformed))
Data summary
Name DPA_BoxCoxTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 5.36 1.25 3.00 4.58 5.42 6.10 9.55 ▅▇▇▂▁
InputFields 0 1 5.98 1.65 2.30 4.90 6.05 6.90 10.95 ▂▆▇▃▁
Iterations 0 1 0.95 0.02 0.90 0.95 0.95 0.95 1.00 ▁▁▇▁▁
NumPending 0 1 53.39 355.96 0.00 0.00 0.00 0.00 5605.00 ▇▁▁▁▁
Hour 0 1 22.83 8.41 -0.77 16.40 23.04 28.89 47.09 ▁▆▇▆▁
##################################
# Applying a Yeo-Johnson transformation
##################################
DPA_YeoJohnson <- preProcess(DPA.Predictors.Numeric, method = c("YeoJohnson"))
DPA_YeoJohnsonTransformed <- predict(DPA_YeoJohnson, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_YeoJohnsonTransformedSkimmed <- skim(DPA_YeoJohnsonTransformed))
Data summary
Name DPA_YeoJohnsonTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 4.32 0.80 2.70 3.84 4.40 4.83 6.67 ▃▅▇▂▁
InputFields 0 1 5.40 1.35 2.31 4.53 5.49 6.17 9.19 ▂▅▇▃▁
Iterations 0 1 0.92 0.01 0.88 0.92 0.92 0.92 0.95 ▁▁▇▁▁
NumPending 0 1 0.19 0.35 0.00 0.00 0.00 0.00 0.91 ▇▁▁▁▂
Hour 0 1 33.41 12.40 0.02 23.80 33.53 42.30 70.46 ▁▆▇▅▁
##################################
# Applying an exponential transformation
##################################
DPA_ExpoTrans <- preProcess(DPA.Predictors.Numeric, method = c("expoTrans"))
DPA_ExpoTransTransformed <- predict(DPA_ExpoTrans, DPA.Predictors.Numeric)

##################################
# Gathering descriptive statistics
##################################
(DPA_ExpoTransTransformedSkimmed <- skim(DPA_ExpoTransTransformed))
Data summary
Name DPA_ExpoTransTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.00 226.00 448.00 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.00 426.00 991.00 56671.00 ▇▁▁▁▁
Iterations 0 1 12.31 2.08 7.64 12.00 12.00 12.00 17.75 ▁▁▇▁▁
NumPending 0 1 53.39 355.96 0.00 0.00 0.00 0.00 5605.00 ▇▁▁▁▁
Hour 0 1 19.04 6.86 0.02 13.76 18.98 23.83 40.95 ▁▆▇▃▁

1.3.8 Dummy Variables


[A] Dummy variable creation (or one-hot encoding) for factor variables remains optional depending on potential model requirements for the subsequent steps.

[B] The caret package includes one method for creating dummy variables:
     [B.1] The dummyVars method from the caret package generates a complete (less than full rank parameterized) set of dummy variables from one or more factors.

[C] The dummyVars method was evaluated on the dataset.

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all factor predictors
##################################
DPA.Predictors.Factor <- DPA.Predictors[,sapply(DPA.Predictors, is.factor)]

##################################
# Applying dummy variable creation
##################################
if (length(names(DPA.Predictors.Factor))>0) {
  print(paste0("There are ",
               (length(names(DPA.Predictors.Factor))),
               " factor variables for dummy variable creation."))
  DPA_DummyVariables <- dummyVars(Class ~ ., data = DPA)
  DPA_DummyVariablesCreated <- predict(DPA_DummyVariables, DPA)
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_DummyVariablesCreatedSkimmed <- skim(DPA_DummyVariablesCreated))
  
} else {
  print("There are no factor variables for dummy variable creation.")
}
## [1] "There are 2 factor variables for dummy variable creation."
Data summary
Name DPA_DummyVariablesCreated
Number of rows 4331
Number of columns 26
_______________________
Column type frequency:
numeric 26
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Protocol.A 0 1 0.02 0.15 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.C 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.D 0 1 0.03 0.18 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.E 0 1 0.02 0.15 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.F 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.G 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.H 0 1 0.07 0.26 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.I 0 1 0.09 0.28 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.J 0 1 0.23 0.42 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Protocol.K 0 1 0.00 0.04 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.L 0 1 0.06 0.23 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.M 0 1 0.10 0.31 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.N 0 1 0.12 0.33 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.O 0 1 0.13 0.34 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
Day.Mon 0 1 0.16 0.37 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Tue 0 1 0.21 0.41 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Wed 0 1 0.21 0.41 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Thu 0 1 0.17 0.37 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Fri 0 1 0.21 0.41 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Sat 0 1 0.01 0.09 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Day.Sun 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
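##################################
# Applying full-rank dummy variable creation
##################################
# The default parameterization above retains
# every factor level (for example, all seven
# Day columns), which introduces linear
# dependencies in the design matrix. Minimal
# sketch: the fullRank argument of dummyVars
# drops one reference level per factor for
# models requiring a full-rank parameterization.
DPA_DummyVariablesFullRank <- dummyVars(Class ~ ., data = DPA, fullRank = TRUE)
DPA_DummyVariablesFullRankCreated <- predict(DPA_DummyVariablesFullRank, DPA)
ncol(DPA_DummyVariablesFullRankCreated)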

1.4 Data Exploration


[A] Data exploration for classification modelling problems involves bivariate analysis between the factor response variable and the numeric predictor variables.

[B] The caret package includes one method for performing data exploration:
     [B.1] The featurePlot method from the caret package generates various graphs (box, strip, density, correlation pairs and correlation ellipse plots) for exploring and visualizing the potential relationships between the response and predictor variables.

[C] The featurePlot method was evaluated on the dataset through box, strip and density plots, with a correlation pairs plot sketched after them.

Code Chunk | Output
##################################
# Loading dataset
##################################
DPA <- schedulingData

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]
ncol(DPA.Predictors.Numeric)
## [1] 5
##################################
# Converting response variable data type to factor
##################################
DPA$Class <- as.factor(DPA$Class)
length(levels(DPA$Class))
## [1] 4
##################################
# Formulating the box plots
##################################
featurePlot(x = DPA.Predictors.Numeric, 
            y = DPA$Class,
            plot = "box",
            scales = list(x = list(relation="free", rot = 90), 
                          y = list(relation="free")),
            adjust = 1.5, 
            pch = "|", 
            layout = c(1, (ncol(DPA.Predictors.Numeric))))

##################################
# Formulating the strip plots
##################################
featurePlot(x = DPA.Predictors.Numeric, 
            y = DPA$Class,
            plot = "strip",
            jitter = TRUE,
            scales = list(x = list(relation="free", rot = 90), 
                          y = list(relation="free")),
            adjust = 1.5, 
            pch = "|", 
            layout = c(1, (ncol(DPA.Predictors.Numeric))))

##################################
# Formulating the density plots
##################################
featurePlot(x = DPA.Predictors.Numeric, 
            y = DPA$Class,
            plot = "density",
            scales = list(x = list(relation="free", rot = 90), 
                          y = list(relation="free")),
            adjust = 1.5, 
            pch = "|", 
            layout = c(1, (ncol(DPA.Predictors.Numeric))),
            auto.key = list(columns = (length(levels(DPA$Class)))))
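##################################
# Formulating the correlation pairs plot
##################################
# Minimal sketch for the remaining featurePlot
# options noted in [B.1]; the "ellipse" variant
# additionally requires the ellipse package
# to be installed.
featurePlot(x = DPA.Predictors.Numeric, 
            y = DPA$Class,
            plot = "pairs",
            auto.key = list(columns = (length(levels(DPA$Class)))))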

2. Summary



3. References


[Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
[Book] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
[Book] R For Data Science by Hadley Wickham and Garrett Grolemund
[Book] Exploratory Data Analysis with R by Roger Peng
[Book] Multivariate Data Visualization with R by Deepayan Sarkar
[Book] Machine Learning by Samuel Jackson
[Book] Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson
[Book] Data Modeling Methods by Jacob Larget
[Book] Introduction to R by Sara Bonnin
[R Package] AppliedPredictiveModeling by Max Kuhn
[R Package] caret by Max Kuhn
[R Package] dplyr by Hadley Wickham
[R Package] moments by Lukasz Komsta and Frederick Novomestky
[R Package] skimr by Elin Waring
[R Package] RANN by Sunil Arya, David Mount, Samuel Kemp and Gregory Jefferis
[R Package] corrplot by Taiyun Wei
[R Package] lares by Bernardo Lares
[R Package] DMwR2 by Luis Torgo
[Article] The caret Package by Max Kuhn
[Article] A Short Introduction to the caret Package by Max Kuhn
[Article] Caret Package – A Practical Guide to Machine Learning in R by Selva Prabhakaran
[Article] Lattice Graphs by STHDA Team
[Article] Exploratory Data Analysis in R with Tidyverse by Nishant Kumar Singh
[Article] Exploratory Data Analysis in R (Introduction) by Data Science Hero Team
[Article] How to do Exploratory Data Analysis (EDA) in R (With Examples) by Talha Khalid
[Article] Packages for Exploratory Data Analysis in R by Luke DiMartino
[Article] DataExplorer: Exploratory Data Analysis in R by Matt Dancho
[Article] How to Perform Exploratory Data Analysis in R (With Example) by Statology Team
[Article] Exploratory Data Analysis in R Programming by Geeks For Geeks Team
[Article] Exploratory Data Analysis by IBM Team
[Article] What Is Exploratory Data Analysis? by IBM Team
[Article] Data Exploration in R (9 Examples) | Exploratory Analysis & Visualization by Joachim Schork
[Article] Exploratory Data Analysis in Python by Geeks For Geeks Team
[Article] Data Quality Assessment for Machine Learning by IBM Team
[Article] Data Quality for Machine Learning Tasks by IBM Team
[Publication] A Review: Data Pre-processing and Data Augmentation Techniques by Kiran Maharana, Surajit Mondal and Bhushankumar Nemade
[Publication] Data Quality for Machine Learning Tasks by Nitin Gupta, Shashank Mujumdar, Hima Patel, Satoshi Masuda, Naveen Panwar, Sambaran Bandyopadhyay, Sameep Mehta, Shanmukha Guttula, Shazia Afzal, Ruhi Sharma Mittal and Vitobha Munigala (KDD ’21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining)
[Publication] Overview and Importance of Data Quality for Machine Learning Tasks by Abhinav Jain, Hima Patel, Lokesh Nagalapatti, Nitin Gupta, Sameep Mehta, Shanmukha Guttula, Shashank Mujumdar, Shazia Afzal, Ruhi Sharma Mittal and Vitobha Munigala (KDD ’20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining)
[Publication] Spatial Sign Preprocessing: A Simple Way To Impart Moderate Robustness to Multivariate Estimators by Sven Serneels, Evert De Nolf, and Pierre Van Espen (Journal of Chemical Information and Modeling)
[Publication] An Analysis of Transformations by George Box and David Cox (Journal of the Royal Statistical Society)
[Publication] A New Family of Power Transformations to Improve Normality or Symmetry by In-Kwon Yeo and Richard Johnson (Biometrika)
[Publication] Exponential Data Transformations by Bryan Manly (Journal of the Royal Statistical Society)
[Course] Applied Data Mining and Statistical Learning by Penn State Eberly College of Science