1. Table of Contents

This project explores different performance metrics which are adaptable to censoring conditions for evaluating survival model predictions using various helpful packages in R. Using the Survival Random Forest and Cox Proportional Hazards Regression model structures, metrics applied in the analysis to estimate the generalization performance of survival models on out-of-sample data included the Concordance Index, Brier Score, Integrated Absolute Error and Integrated Square Error. The resulting split-sample cross-validated estimations derived from the metrics were evaluated in terms of their performance consistency across the candidate models. Considering the differences in their intended applications and current data restrictions, a comparison of each metric’s strengths and limitations was briefly discussed. All results were consolidated in a Summary presented at the end of the document.

Survival analysis aims to study the relationship between independent covariates and survival time and event outcomes. Evaluation metrics in survival analysis are used to ascertain the agreement between the predicted and actual measures for each observation (expressed in survival probability, status or time), while considering the censoring condition. The metrics applied in this study (mostly contained in the SurvMetrics package) attempt to evaluate the performance of survival model predictions on unseen data.

1.1 Sample Data

The peakVO2 dataset from the randomForestSRC package was used for this illustrated example.

Preliminary dataset assessment:

[A] 2231 rows (observations)

[B] 41 columns (variables)
     [B.1] 1/41 survival event status = died variable (factor)
            [B.1.1] Event = died=1
            [B.1.2] Censored or Non-Event = died=0
     [B.2] 1/41 survival time = ttodead variable (numeric)
     [B.3] 39/41 predictors = All remaining variables (26/39 factor + 13/39 numeric)

Code Chunk | Output

##################################
# Loading R libraries
##################################
library(survival)
library(survminer)
library(mice)
library(foreign)
library(rms)
library(Hmisc)
library(VIM)
library(gridExtra)
library(finalfit)
library(knitr)
library(dplyr)
library(gtsummary)
library(tidyr)
library(purrr)
library(moments)
library(skimr)
library(caret)
library(corrplot)
library(randomForestSRC)
library(SurvMetrics)
library(pec)

##################################
# Loading source and
# formulating the train set
##################################
data(peakVO2)
peakVO2 <- as.data.frame(peakVO2)

##################################
# Converting survival time
# to discrete monthly data
##################################
peakVO2$ttodead <- ceiling(peakVO2$ttodead*12)
peakVO2.Source <- peakVO2

##################################
# Setting the appropriate data types
# for exploratory data analysis
##################################
peakVO2.Assessment <- (as.data.frame(sapply(peakVO2, function(x) max(x))))
names(peakVO2.Assessment) <- c("Max.Value")
peakVO2.Assessment$Variables <- rownames(peakVO2.Assessment)
peakVO2.Assessment

##                          Max.Value               Variables
## age                       82.00000                     age
## betablok                   1.00000                betablok
## dilver                     1.00000                  dilver
## nifed                      1.00000                   nifed
## acei                       1.00000                    acei
## angioten.II                1.00000             angioten.II
## anti.arrhy                 1.00000              anti.arrhy
## anti.coag                  1.00000               anti.coag
## aspirin                    1.00000                 aspirin
## digoxin                    1.00000                 digoxin
## nitrates                   1.00000                nitrates
## vasodilator                1.00000             vasodilator
## diuretic.loop              1.00000           diuretic.loop
## diuretic.thiazide          1.00000       diuretic.thiazide
## diuretic.potassium.spar    1.00000 diuretic.potassium.spar
## lipidrx.statin             1.00000          lipidrx.statin
## insulin                    1.00000                 insulin
## surgery.pacemaker          1.00000       surgery.pacemaker
## surgery.cabg               1.00000            surgery.cabg
## surgery.pci                1.00000             surgery.pci
## surgery.aicd.implant       1.00000    surgery.aicd.implant
## resting.systolic.bp      184.00000     resting.systolic.bp
## resting.hr               124.00000              resting.hr
## smknow                     1.00000                  smknow
## q.wave.mi                  1.00000               q.wave.mi
## bmi                       59.61593                     bmi
## niddm                      1.00000                   niddm
## lvef.metabl               40.00000             lvef.metabl
## peak.rer                   1.80000                peak.rer
## peak.vo2                  43.80000                peak.vo2
## interval                1415.00000                interval
## cad                        1.00000                     cad
## died                       1.00000                    died
## ttodead                  127.00000                 ttodead
## bun                      129.00000                     bun
## sodium                   152.00000                  sodium
## hgb                       19.20000                     hgb
## glucose                  424.00000                 glucose
## male                       1.00000                    male
## black                      1.00000                   black
## crcl                     385.84097                    crcl

(peakVO2.FactorNames <- peakVO2.Assessment[peakVO2.Assessment[,1]==1,2])

##  [1] "betablok"                "dilver"                 
##  [3] "nifed"                   "acei"                   
##  [5] "angioten.II"             "anti.arrhy"             
##  [7] "anti.coag"               "aspirin"                
##  [9] "digoxin"                 "nitrates"               
## [11] "vasodilator"             "diuretic.loop"          
## [13] "diuretic.thiazide"       "diuretic.potassium.spar"
## [15] "lipidrx.statin"          "insulin"                
## [17] "surgery.pacemaker"       "surgery.cabg"           
## [19] "surgery.pci"             "surgery.aicd.implant"   
## [21] "smknow"                  "q.wave.mi"              
## [23] "niddm"                   "cad"                    
## [25] "died"                    "male"                   
## [27] "black"

peakVO2.Factor <- peakVO2[,peakVO2.FactorNames]
peakVO2.Factor <- as.data.frame(lapply(peakVO2.Factor, factor))

peakVO2.Numeric <- as.data.frame(peakVO2[,!names(peakVO2) %in% peakVO2.FactorNames])

peakVO2 <- data.frame(peakVO2.Numeric,peakVO2.Factor)

##################################
# Performing a general exploration of the data set
##################################
dim(peakVO2)

## [1] 2231   41

str(peakVO2)

## 'data.frame':    2231 obs. of  41 variables:
##  $ age                    : int  74 77 79 67 79 51 56 69 41 45 ...
##  $ resting.systolic.bp    : int  90 134 108 122 130 122 100 104 108 100 ...
##  $ resting.hr             : int  74 53 104 96 94 62 79 72 71 84 ...
##  $ bmi                    : num  26.1 20.8 20.7 26.7 23.6 ...
##  $ lvef.metabl            : num  15 20 40 25 20 10 10 15 20 25 ...
##  $ peak.rer               : num  1.12 0.98 1.15 1.21 1.06 0.98 0.91 1.2 1.17 0.89 ...
##  $ peak.vo2               : num  11.9 24 11.2 15.6 17.2 15.7 6 15.9 23.8 15 ...
##  $ interval               : int  374 755 225 405 580 635 145 480 840 453 ...
##  $ ttodead                : num  15 44 3 127 45 15 13 11 122 9 ...
##  $ bun                    : num  25 20 29.7 27.9 39 ...
##  $ sodium                 : num  141 138 140 139 143 ...
##  $ hgb                    : num  14.6 13.1 12.9 13.2 12.5 ...
##  $ glucose                : num  89 90 98.7 136.9 104 ...
##  $ crcl                   : num  49.9 60.4 33.6 69.3 44.7 ...
##  $ betablok               : Factor w/ 2 levels "0","1": 2 2 1 1 2 1 1 1 1 1 ...
##  $ dilver                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nifed                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ acei                   : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 2 2 2 ...
##  $ angioten.II            : Factor w/ 2 levels "0","1": 2 1 1 1 2 1 1 1 1 1 ...
##  $ anti.arrhy             : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 2 2 1 1 ...
##  $ anti.coag              : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 2 ...
##  $ aspirin                : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 1 2 1 1 ...
##  $ digoxin                : Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 2 2 2 ...
##  $ nitrates               : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ vasodilator            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ diuretic.loop          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
##  $ diuretic.thiazide      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ diuretic.potassium.spar: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ lipidrx.statin         : Factor w/ 2 levels "0","1": 2 1 1 1 2 1 1 2 1 1 ...
##  $ insulin                : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
##  $ surgery.pacemaker      : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 2 ...
##  $ surgery.cabg           : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ surgery.pci            : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ surgery.aicd.implant   : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
##  $ smknow                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ q.wave.mi              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ niddm                  : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
##  $ cad                    : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 1 1 1 ...
##  $ died                   : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 2 2 1 2 ...
##  $ male                   : Factor w/ 2 levels "0","1": 2 1 2 2 2 1 1 2 1 1 ...
##  $ black                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...

summary(peakVO2)

##       age        resting.systolic.bp   resting.hr          bmi       
##  Min.   :18.00   Min.   : 80.0       Min.   : 39.00   Min.   :14.18  
##  1st Qu.:48.00   1st Qu.: 98.0       1st Qu.: 66.00   1st Qu.:24.52  
##  Median :55.00   Median :110.0       Median : 75.00   Median :27.78  
##  Mean   :54.22   Mean   :110.5       Mean   : 76.18   Mean   :28.50  
##  3rd Qu.:62.00   3rd Qu.:122.0       3rd Qu.: 86.00   3rd Qu.:31.89  
##  Max.   :82.00   Max.   :184.0       Max.   :124.00   Max.   :59.62  
##   lvef.metabl       peak.rer        peak.vo2        interval     
##  Min.   : 5.00   Min.   :0.079   Min.   : 4.20   Min.   :  21.0  
##  1st Qu.:15.00   1st Qu.:1.020   1st Qu.:12.80   1st Qu.: 345.0  
##  Median :20.00   Median :1.100   Median :15.70   Median : 480.0  
##  Mean   :20.26   Mean   :1.079   Mean   :16.27   Mean   : 503.3  
##  3rd Qu.:25.00   3rd Qu.:1.140   3rd Qu.:19.30   3rd Qu.: 641.0  
##  Max.   :40.00   Max.   :1.800   Max.   :43.80   Max.   :1415.0  
##     ttodead            bun              sodium           hgb       
##  Min.   :  1.00   Min.   :  4.322   Min.   :120.0   Min.   : 7.20  
##  1st Qu.: 30.00   1st Qu.: 17.000   1st Qu.:138.0   1st Qu.:12.80  
##  Median : 62.00   Median : 23.000   Median :139.8   Median :13.58  
##  Mean   : 62.75   Mean   : 25.278   Mean   :139.5   Mean   :13.54  
##  3rd Qu.: 95.00   3rd Qu.: 29.334   3rd Qu.:141.0   3rd Qu.:14.40  
##  Max.   :127.00   Max.   :129.000   Max.   :152.0   Max.   :19.20  
##     glucose            crcl        betablok dilver   nifed    acei    
##  Min.   : 37.00   Min.   :  5.76   0: 802   0:2215   0:2132   0: 520  
##  1st Qu.: 87.00   1st Qu.: 60.93   1:1429   1:  16   1:  99   1:1711  
##  Median : 96.93   Median : 83.13                                      
##  Mean   :109.35   Mean   : 90.71                                      
##  3rd Qu.:114.88   3rd Qu.:110.33                                      
##  Max.   :424.00   Max.   :385.84                                      
##  angioten.II anti.arrhy anti.coag aspirin  digoxin  nitrates vasodilator
##  0:1941      0:1722     0:1332    0:1193   0: 661   0:1492   0:2095     
##  1: 290      1: 509     1: 899    1:1038   1:1570   1: 739   1: 136     
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  diuretic.loop diuretic.thiazide diuretic.potassium.spar lipidrx.statin
##  0: 351        0:1952            0:1582                  0:1381        
##  1:1880        1: 279            1: 649                  1: 850        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  insulin  surgery.pacemaker surgery.cabg surgery.pci surgery.aicd.implant
##  0:2016   0:1729            0:1637       0:1755      0:1584              
##  1: 215   1: 502            1: 594       1: 476      1: 647              
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  smknow   q.wave.mi niddm    cad      died     male     black   
##  0:1772   0:1952    0:1881   0:1325   0:1505   0: 602   0:1965  
##  1: 459   1: 279    1: 350   1: 906   1: 726   1:1629   1: 266  
##                                                                 
##                                                                 
##                                                                 
##

##################################
# Formulating a data type assessment summary
##################################
PDA <- peakVO2
(PDA.Summary <- data.frame(
  Column.Index=c(1:length(names(PDA))),
  Column.Name= names(PDA), 
  Column.Type=sapply(PDA, function(x) class(x)), 
  row.names=NULL)
)

##    Column.Index             Column.Name Column.Type
## 1             1                     age     integer
## 2             2     resting.systolic.bp     integer
## 3             3              resting.hr     integer
## 4             4                     bmi     numeric
## 5             5             lvef.metabl     numeric
## 6             6                peak.rer     numeric
## 7             7                peak.vo2     numeric
## 8             8                interval     integer
## 9             9                 ttodead     numeric
## 10           10                     bun     numeric
## 11           11                  sodium     numeric
## 12           12                     hgb     numeric
## 13           13                 glucose     numeric
## 14           14                    crcl     numeric
## 15           15                betablok      factor
## 16           16                  dilver      factor
## 17           17                   nifed      factor
## 18           18                    acei      factor
## 19           19             angioten.II      factor
## 20           20              anti.arrhy      factor
## 21           21               anti.coag      factor
## 22           22                 aspirin      factor
## 23           23                 digoxin      factor
## 24           24                nitrates      factor
## 25           25             vasodilator      factor
## 26           26           diuretic.loop      factor
## 27           27       diuretic.thiazide      factor
## 28           28 diuretic.potassium.spar      factor
## 29           29          lipidrx.statin      factor
## 30           30                 insulin      factor
## 31           31       surgery.pacemaker      factor
## 32           32            surgery.cabg      factor
## 33           33             surgery.pci      factor
## 34           34    surgery.aicd.implant      factor
## 35           35                  smknow      factor
## 36           36               q.wave.mi      factor
## 37           37                   niddm      factor
## 38           38                     cad      factor
## 39           39                    died      factor
## 40           40                    male      factor
## 41           41                   black      factor

1.2 Data Quality Assessment

[A] No missing observations noted for any variable.

[B] Low variance observed for 10 variables with First.Second.Mode.Ratio>5.
     [B.1] dilver variable (factor)
     [B.2] nifed variable (factor)
     [B.3] angioten.II variable (factor)
     [B.4] vasodilator variable (factor)
     [B.5] diuretic.loop variable (factor)
     [B.6] diuretic.thiazide variable (factor)
     [B.7] insulin variable (factor)
     [B.8] q.wave.mi variable (factor)
     [B.9] niddm variable (factor)
     [B.10] black variable (factor)

[C] No low variance observed for any variable with Unique.Count.Ratio<0.01.

[D] No high skewness observed for any variable with Skewness>3 or Skewness<(-3).

Code Chunk | Output

##################################
# Loading dataset
##################################
DQA <- peakVO2

##################################
# Formulating an overall data quality assessment summary
##################################
(DQA.Summary <- data.frame(
  Column.Index=c(1:length(names(DQA))),
  Column.Name= names(DQA),
  Column.Type=sapply(DQA, function(x) class(x)),
  Row.Count=sapply(DQA, function(x) nrow(DQA)),
  NA.Count=sapply(DQA,function(x)sum(is.na(x))),
  Fill.Rate=sapply(DQA,function(x)format(round((sum(!is.na(x))/nrow(DQA)),3),nsmall=3)),
  row.names=NULL)
)

##    Column.Index             Column.Name Column.Type Row.Count NA.Count
## 1             1                     age     integer      2231        0
## 2             2     resting.systolic.bp     integer      2231        0
## 3             3              resting.hr     integer      2231        0
## 4             4                     bmi     numeric      2231        0
## 5             5             lvef.metabl     numeric      2231        0
## 6             6                peak.rer     numeric      2231        0
## 7             7                peak.vo2     numeric      2231        0
## 8             8                interval     integer      2231        0
## 9             9                 ttodead     numeric      2231        0
## 10           10                     bun     numeric      2231        0
## 11           11                  sodium     numeric      2231        0
## 12           12                     hgb     numeric      2231        0
## 13           13                 glucose     numeric      2231        0
## 14           14                    crcl     numeric      2231        0
## 15           15                betablok      factor      2231        0
## 16           16                  dilver      factor      2231        0
## 17           17                   nifed      factor      2231        0
## 18           18                    acei      factor      2231        0
## 19           19             angioten.II      factor      2231        0
## 20           20              anti.arrhy      factor      2231        0
## 21           21               anti.coag      factor      2231        0
## 22           22                 aspirin      factor      2231        0
## 23           23                 digoxin      factor      2231        0
## 24           24                nitrates      factor      2231        0
## 25           25             vasodilator      factor      2231        0
## 26           26           diuretic.loop      factor      2231        0
## 27           27       diuretic.thiazide      factor      2231        0
## 28           28 diuretic.potassium.spar      factor      2231        0
## 29           29          lipidrx.statin      factor      2231        0
## 30           30                 insulin      factor      2231        0
## 31           31       surgery.pacemaker      factor      2231        0
## 32           32            surgery.cabg      factor      2231        0
## 33           33             surgery.pci      factor      2231        0
## 34           34    surgery.aicd.implant      factor      2231        0
## 35           35                  smknow      factor      2231        0
## 36           36               q.wave.mi      factor      2231        0
## 37           37                   niddm      factor      2231        0
## 38           38                     cad      factor      2231        0
## 39           39                    died      factor      2231        0
## 40           40                    male      factor      2231        0
## 41           41                   black      factor      2231        0
##    Fill.Rate
## 1      1.000
## 2      1.000
## 3      1.000
## 4      1.000
## 5      1.000
## 6      1.000
## 7      1.000
## 8      1.000
## 9      1.000
## 10     1.000
## 11     1.000
## 12     1.000
## 13     1.000
## 14     1.000
## 15     1.000
## 16     1.000
## 17     1.000
## 18     1.000
## 19     1.000
## 20     1.000
## 21     1.000
## 22     1.000
## 23     1.000
## 24     1.000
## 25     1.000
## 26     1.000
## 27     1.000
## 28     1.000
## 29     1.000
## 30     1.000
## 31     1.000
## 32     1.000
## 33     1.000
## 34     1.000
## 35     1.000
## 36     1.000
## 37     1.000
## 38     1.000
## 39     1.000
## 40     1.000
## 41     1.000

##################################
# Listing all descriptors
##################################
DQA.Descriptors <- DQA

##################################
# Listing all numeric Descriptors
##################################
DQA.Descriptors.Numeric <- DQA.Descriptors[,sapply(DQA.Descriptors, is.numeric)]

if (length(names(DQA.Descriptors.Numeric))>0) {
    print(paste0("There are ",
               (length(names(DQA.Descriptors.Numeric))),
               " numeric descriptor variable(s)."))
} else {
  print("There are no numeric descriptor variables.")
}

## [1] "There are 14 numeric descriptor variable(s)."

##################################
# Listing all factor Descriptors
##################################
DQA.Descriptors.Factor <- DQA.Descriptors[,sapply(DQA.Descriptors, is.factor)]

if (length(names(DQA.Descriptors.Factor))>0) {
    print(paste0("There are ",
               (length(names(DQA.Descriptors.Factor))),
               " factor descriptor variable(s)."))
} else {
  print("There are no factor descriptor variables.")
}

## [1] "There are 27 factor descriptor variable(s)."

##################################
# Formulating a data quality assessment summary for factor Descriptors
##################################
if (length(names(DQA.Descriptors.Factor))>0) {

  ##################################
  # Formulating a function to determine the first mode
  ##################################
  FirstModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    ux[tab == max(tab)]
  }

  ##################################
  # Formulating a function to determine the second mode
  ##################################
  SecondModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    fm = ux[tab == max(tab)]
    sm = x[!(x %in% fm)]
    usm <- unique(sm)
    tabsm <- tabulate(match(sm, usm))
    ifelse(is.na(usm[tabsm == max(tabsm)])==TRUE,
           return("x"),
           return(usm[tabsm == max(tabsm)]))
  }

  (DQA.Descriptors.Factor.Summary <- data.frame(
  Column.Name= names(DQA.Descriptors.Factor),
  Column.Type=sapply(DQA.Descriptors.Factor, function(x) class(x)),
  Unique.Count=sapply(DQA.Descriptors.Factor, function(x) length(unique(x))),
  First.Mode.Value=sapply(DQA.Descriptors.Factor, function(x) as.character(FirstModes(x)[1])),
  Second.Mode.Value=sapply(DQA.Descriptors.Factor, function(x) as.character(SecondModes(x)[1])),
  First.Mode.Count=sapply(DQA.Descriptors.Factor, function(x) sum(na.omit(x) == FirstModes(x)[1])),
  Second.Mode.Count=sapply(DQA.Descriptors.Factor, function(x) sum(na.omit(x) == SecondModes(x)[1])),
  Unique.Count.Ratio=sapply(DQA.Descriptors.Factor, function(x) format(round((length(unique(x))/nrow(DQA.Descriptors.Factor)),3), nsmall=3)),
  First.Second.Mode.Ratio=sapply(DQA.Descriptors.Factor, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
  row.names=NULL)
  )

}

##                Column.Name Column.Type Unique.Count First.Mode.Value
## 1                 betablok      factor            2                1
## 2                   dilver      factor            2                0
## 3                    nifed      factor            2                0
## 4                     acei      factor            2                1
## 5              angioten.II      factor            2                0
## 6               anti.arrhy      factor            2                0
## 7                anti.coag      factor            2                0
## 8                  aspirin      factor            2                0
## 9                  digoxin      factor            2                1
## 10                nitrates      factor            2                0
## 11             vasodilator      factor            2                0
## 12           diuretic.loop      factor            2                1
## 13       diuretic.thiazide      factor            2                0
## 14 diuretic.potassium.spar      factor            2                0
## 15          lipidrx.statin      factor            2                0
## 16                 insulin      factor            2                0
## 17       surgery.pacemaker      factor            2                0
## 18            surgery.cabg      factor            2                0
## 19             surgery.pci      factor            2                0
## 20    surgery.aicd.implant      factor            2                0
## 21                  smknow      factor            2                0
## 22               q.wave.mi      factor            2                0
## 23                   niddm      factor            2                0
## 24                     cad      factor            2                0
## 25                    died      factor            2                0
## 26                    male      factor            2                1
## 27                   black      factor            2                0
##    Second.Mode.Value First.Mode.Count Second.Mode.Count Unique.Count.Ratio
## 1                  0             1429               802              0.001
## 2                  1             2215                16              0.001
## 3                  1             2132                99              0.001
## 4                  0             1711               520              0.001
## 5                  1             1941               290              0.001
## 6                  1             1722               509              0.001
## 7                  1             1332               899              0.001
## 8                  1             1193              1038              0.001
## 9                  0             1570               661              0.001
## 10                 1             1492               739              0.001
## 11                 1             2095               136              0.001
## 12                 0             1880               351              0.001
## 13                 1             1952               279              0.001
## 14                 1             1582               649              0.001
## 15                 1             1381               850              0.001
## 16                 1             2016               215              0.001
## 17                 1             1729               502              0.001
## 18                 1             1637               594              0.001
## 19                 1             1755               476              0.001
## 20                 1             1584               647              0.001
## 21                 1             1772               459              0.001
## 22                 1             1952               279              0.001
## 23                 1             1881               350              0.001
## 24                 1             1325               906              0.001
## 25                 1             1505               726              0.001
## 26                 0             1629               602              0.001
## 27                 1             1965               266              0.001
##    First.Second.Mode.Ratio
## 1                    1.782
## 2                  138.438
## 3                   21.535
## 4                    3.290
## 5                    6.693
## 6                    3.383
## 7                    1.482
## 8                    1.149
## 9                    2.375
## 10                   2.019
## 11                  15.404
## 12                   5.356
## 13                   6.996
## 14                   2.438
## 15                   1.625
## 16                   9.377
## 17                   3.444
## 18                   2.756
## 19                   3.687
## 20                   2.448
## 21                   3.861
## 22                   6.996
## 23                   5.374
## 24                   1.462
## 25                   2.073
## 26                   2.706
## 27                   7.387

##################################
# Formulating a data quality assessment summary for numeric Descriptors
##################################
if (length(names(DQA.Descriptors.Numeric))>0) {

  ##################################
  # Formulating a function to determine the first mode
  ##################################
  FirstModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    ux[tab == max(tab)]
  }

  ##################################
  # Formulating a function to determine the second mode
  ##################################
  SecondModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    fm = ux[tab == max(tab)]
    sm = na.omit(x)[!(na.omit(x) %in% fm)]
    usm <- unique(sm)
    tabsm <- tabulate(match(sm, usm))
    ifelse(is.na(usm[tabsm == max(tabsm)])==TRUE,
           return(0.00001),
           return(usm[tabsm == max(tabsm)]))
  }

  (DQA.Descriptors.Numeric.Summary <- data.frame(
  Column.Name= names(DQA.Descriptors.Numeric),
  Column.Type=sapply(DQA.Descriptors.Numeric, function(x) class(x)),
  Unique.Count=sapply(DQA.Descriptors.Numeric, function(x) length(unique(x))),
  Unique.Count.Ratio=sapply(DQA.Descriptors.Numeric, function(x) format(round((length(unique(x))/nrow(DQA.Descriptors.Numeric)),3), nsmall=3)),
  First.Mode.Value=sapply(DQA.Descriptors.Numeric, function(x) format(round((FirstModes(x)[1]),3),nsmall=3)),
  Second.Mode.Value=sapply(DQA.Descriptors.Numeric, function(x) format(round((SecondModes(x)[1]),3),nsmall=3)),
  First.Mode.Count=sapply(DQA.Descriptors.Numeric, function(x) sum(na.omit(x) == FirstModes(x)[1])),
  Second.Mode.Count=sapply(DQA.Descriptors.Numeric, function(x) sum(na.omit(x) == SecondModes(x)[1])),
  First.Second.Mode.Ratio=sapply(DQA.Descriptors.Numeric, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
  Minimum=sapply(DQA.Descriptors.Numeric, function(x) format(round(min(x,na.rm = TRUE),3), nsmall=3)),
  Mean=sapply(DQA.Descriptors.Numeric, function(x) format(round(mean(x,na.rm = TRUE),3), nsmall=3)),
  Median=sapply(DQA.Descriptors.Numeric, function(x) format(round(median(x,na.rm = TRUE),3), nsmall=3)),
  Maximum=sapply(DQA.Descriptors.Numeric, function(x) format(round(max(x,na.rm = TRUE),3), nsmall=3)),
  Skewness=sapply(DQA.Descriptors.Numeric, function(x) format(round(skewness(x,na.rm = TRUE),3), nsmall=3)),
  Kurtosis=sapply(DQA.Descriptors.Numeric, function(x) format(round(kurtosis(x,na.rm = TRUE),3), nsmall=3)),
  Percentile25th=sapply(DQA.Descriptors.Numeric, function(x) format(round(quantile(x,probs=0.25,na.rm = TRUE),3), nsmall=3)),
  Percentile75th=sapply(DQA.Descriptors.Numeric, function(x) format(round(quantile(x,probs=0.75,na.rm = TRUE),3), nsmall=3)),
  row.names=NULL)
  )

}

##            Column.Name Column.Type Unique.Count Unique.Count.Ratio
## 1                  age     integer           65              0.029
## 2  resting.systolic.bp     integer           55              0.025
## 3           resting.hr     integer           82              0.037
## 4                  bmi     numeric         2144              0.961
## 5          lvef.metabl     numeric           38              0.017
## 6             peak.rer     numeric           75              0.034
## 7             peak.vo2     numeric          252              0.113
## 8             interval     integer          594              0.266
## 9              ttodead     numeric          127              0.057
## 10                 bun     numeric          936              0.420
## 11              sodium     numeric          881              0.395
## 12                 hgb     numeric         1004              0.450
## 13             glucose     numeric         1060              0.475
## 14                crcl     numeric         2213              0.992
##    First.Mode.Value Second.Mode.Value First.Mode.Count Second.Mode.Count
## 1            55.000            60.000               94                90
## 2           110.000           100.000              174               173
## 3            70.000            75.000               87                81
## 4            27.070            25.417                3                 2
## 5            20.000            15.000              565               529
## 6             1.100             1.120              173               157
## 7            14.800            16.000               28                25
## 8           480.000           600.000              146               129
## 9            14.000            96.000               30                29
## 10           18.000            14.000               78                76
## 11          139.000           140.000              173               163
## 12           13.500            14.300               45                34
## 13           94.000            87.000               40                34
## 14           94.444           104.833                3                 2
##    First.Second.Mode.Ratio Minimum    Mean  Median  Maximum Skewness Kurtosis
## 1                    1.044  18.000  54.223  55.000   82.000   -0.428    3.045
## 2                    1.006  80.000 110.509 110.000  184.000    0.650    3.336
## 3                    1.074  39.000  76.182  75.000  124.000    0.352    2.887
## 4                    1.500  14.183  28.499  27.777   59.616    0.782    4.113
## 5                    1.068   5.000  20.261  20.000   40.000    0.655    3.067
## 6                    1.102   0.079   1.079   1.100    1.800   -1.490   12.741
## 7                    1.120   4.200  16.275  15.700   43.800    0.732    4.113
## 8                    1.132  21.000 503.255 480.000 1415.000    0.514    3.219
## 9                    1.034   1.000  62.750  62.000  127.000    0.049    1.771
## 10                   1.026   4.322  25.278  23.000  129.000    2.250   11.535
## 11                   1.061 120.000 139.451 139.804  152.000   -0.834    7.204
## 12                   1.324   7.200  13.542  13.585   19.200   -0.395    4.308
## 13                   1.176  37.000 109.352  96.929  424.000    2.814   13.907
## 14                   1.500   5.760  90.709  83.131  385.841    1.453    6.734
##    Percentile25th Percentile75th
## 1          48.000         62.000
## 2          98.000        122.000
## 3          66.000         86.000
## 4          24.523         31.888
## 5          15.000         25.000
## 6           1.020          1.140
## 7          12.800         19.300
## 8         345.000        641.000
## 9          30.000         95.000
## 10         17.000         29.334
## 11        138.000        141.000
## 12         12.803         14.400
## 13         87.000        114.880
## 14         60.935        110.333

##################################
# Identifying potential data quality issues
##################################

##################################
# Checking for missing observations
##################################
if ((nrow(DQA.Summary[DQA.Summary$NA.Count>0,]))>0){
  print(paste0("Missing observations noted for ",
               (nrow(DQA.Summary[DQA.Summary$NA.Count>0,])),
               " variable(s) with NA.Count>0 and Fill.Rate<1.0."))
  DQA.Summary[DQA.Summary$NA.Count>0,]
} else {
  print("No missing observations noted.")
}

## [1] "No missing observations noted."

##################################
# Checking for zero or near-zero variance Descriptors
##################################
if (length(names(DQA.Descriptors.Factor))==0) {
  print("No factor descriptors noted.")
} else if (nrow(DQA.Descriptors.Factor.Summary[as.numeric(as.character(DQA.Descriptors.Factor.Summary$First.Second.Mode.Ratio))>5,])>0){
  print(paste0("Low variance observed for ",
               (nrow(DQA.Descriptors.Factor.Summary[as.numeric(as.character(DQA.Descriptors.Factor.Summary$First.Second.Mode.Ratio))>5,])),
               " factor variable(s) with First.Second.Mode.Ratio>5."))
  DQA.Descriptors.Factor.Summary[as.numeric(as.character(DQA.Descriptors.Factor.Summary$First.Second.Mode.Ratio))>5,]
} else {
  print("No low variance factor descriptors due to high first-second mode ratio noted.")
}

## [1] "Low variance observed for 10 factor variable(s) with First.Second.Mode.Ratio>5."

##          Column.Name Column.Type Unique.Count First.Mode.Value
## 2             dilver      factor            2                0
## 3              nifed      factor            2                0
## 5        angioten.II      factor            2                0
## 11       vasodilator      factor            2                0
## 12     diuretic.loop      factor            2                1
## 13 diuretic.thiazide      factor            2                0
## 16           insulin      factor            2                0
## 22         q.wave.mi      factor            2                0
## 23             niddm      factor            2                0
## 27             black      factor            2                0
##    Second.Mode.Value First.Mode.Count Second.Mode.Count Unique.Count.Ratio
## 2                  1             2215                16              0.001
## 3                  1             2132                99              0.001
## 5                  1             1941               290              0.001
## 11                 1             2095               136              0.001
## 12                 0             1880               351              0.001
## 13                 1             1952               279              0.001
## 16                 1             2016               215              0.001
## 22                 1             1952               279              0.001
## 23                 1             1881               350              0.001
## 27                 1             1965               266              0.001
##    First.Second.Mode.Ratio
## 2                  138.438
## 3                   21.535
## 5                    6.693
## 11                  15.404
## 12                   5.356
## 13                   6.996
## 16                   9.377
## 22                   6.996
## 23                   5.374
## 27                   7.387

if (length(names(DQA.Descriptors.Numeric))==0) {
  print("No numeric descriptors noted.")
} else if (nrow(DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$First.Second.Mode.Ratio))>5,])>0){
  print(paste0("Low variance observed for ",
               (nrow(DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$First.Second.Mode.Ratio))>5,])),
               " numeric variable(s) with First.Second.Mode.Ratio>5."))
  DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$First.Second.Mode.Ratio))>5,]
} else {
  print("No low variance numeric descriptors due to high first-second mode ratio noted.")
}

## [1] "No low variance numeric descriptors due to high first-second mode ratio noted."

if (length(names(DQA.Descriptors.Numeric))==0) {
  print("No numeric descriptors noted.")
} else if (nrow(DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Unique.Count.Ratio))<0.01,])>0){
  print(paste0("Low variance observed for ",
               (nrow(DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Unique.Count.Ratio))<0.01,])),
               " numeric variable(s) with Unique.Count.Ratio<0.01."))
  DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Unique.Count.Ratio))<0.01,]
} else {
  print("No low variance numeric descriptors due to low unique count ratio noted.")
}

## [1] "No low variance numeric descriptors due to low unique count ratio noted."

##################################
# Checking for skewed Descriptors
##################################
if (length(names(DQA.Descriptors.Numeric))==0) {
  print("No numeric descriptors noted.")
} else if (nrow(DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Skewness))>3 |
                                               as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Skewness))<(-3),])>0){
  print(paste0("High skewness observed for ",
  (nrow(DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Skewness))>3 |
                                               as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Skewness))<(-3),])),
  " numeric variable(s) with Skewness>3 or Skewness<(-3)."))
  DQA.Descriptors.Numeric.Summary[as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Skewness))>3 |
                                 as.numeric(as.character(DQA.Descriptors.Numeric.Summary$Skewness))<(-3),]
} else {
  print("No skewed numeric descriptors noted.")
}

## [1] "No skewed numeric descriptors noted."

1.3 Data Preprocessing

1.3.1 Outlier

[A] Outliers noted for 12 variables with the numeric data visualized through a boxplot including observations classified as suspected outliers using the IQR criterion. The IQR criterion means that all observations above the (75th percentile + 1.5 x IQR) or below the (25th percentile - 1.5 x IQR) are suspected outliers, where IQR is the difference between the third quartile (75th percentile) and first quartile (25th percentile). Outlier treatment for numerical stability remains optional depending on potential model requirements for the subsequent steps.
     [A.1] age variable (33 outliers detected)
     [A.2] resting.systolic.bp variable (26 outliers detected)
     [A.3] resting.hr variable (9 outliers detected)
     [A.4] bmi variable (34 outliers detected)
     [A.5] peak.rer variable (102 outliers detected)
     [A.6] peak.vo2 variable (40 outliers detected)
     [A.7] interval variable (25 outliers detected)
     [A.8] bun variable (116 outliers detected)
     [A.9] sodium variable (140 outliers detected)
     [A.10] hgb variable (83 outliers detected)
     [A.11] glucose variable (222 outliers detected)
     [A.12] crcl variable (82 outliers detected)

Code Chunk | Output

##################################
# Loading dataset
##################################
DPA <- peakVO2

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("died","ttodead")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Identifying outliers for the numeric predictors
##################################
OutlierCountList <- c()

for (i in 1:ncol(DPA.Predictors.Numeric)) {
  Outliers <- boxplot.stats(DPA.Predictors.Numeric[,i])$out
  OutlierCount <- length(Outliers)
  OutlierCountList <- append(OutlierCountList,OutlierCount)
  OutlierIndices <- which(DPA.Predictors.Numeric[,i] %in% c(Outliers))
  boxplot(DPA.Predictors.Numeric[,i], 
          ylab = names(DPA.Predictors.Numeric)[i], 
          main = names(DPA.Predictors.Numeric)[i],
          horizontal=TRUE)
  mtext(paste0(OutlierCount, " Outlier(s) Detected"))
}

OutlierCountSummary <- as.data.frame(cbind(names(DPA.Predictors.Numeric),(OutlierCountList)))
names(OutlierCountSummary) <- c("NumericPredictors","OutlierCount")
OutlierCountSummary$OutlierCount <- as.numeric(as.character(OutlierCountSummary$OutlierCount))
NumericPredictorWithOutlierCount <- nrow(OutlierCountSummary[OutlierCountSummary$OutlierCount>0,])
print(paste0(NumericPredictorWithOutlierCount, " numeric variable(s) were noted with outlier(s)." ))

## [1] "12 numeric variable(s) were noted with outlier(s)."

##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA.Predictors.Numeric))

Data summary
Name	DPA.Predictors.Numeric
Number of rows	2231
Number of columns	13
_______________________
Column type frequency:
numeric	13
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	54.22	11.03	18.00	48.00	55.00	62.00	82.00	▁▃▇▇▂
resting.systolic.bp	1	110.51	17.93	80.00	98.00	110.00	122.00	184.00	▇▇▅▁▁
resting.hr	1	76.18	14.39	39.00	66.00	75.00	86.00	124.00	▂▇▇▃▁
bmi	1	28.50	5.69	14.18	24.52	27.78	31.89	59.62	▂▇▃▁▁
lvef.metabl	1	20.26	7.34	5.00	15.00	20.00	25.00	40.00	▂▅▇▂▁
peak.rer	1	1.08	0.12	0.08	1.02	1.10	1.14	1.80	▁▁▇▆▁
peak.vo2	1	16.27	5.03	4.20	12.80	15.70	19.30	43.80	▃▇▂▁▁
interval	1	503.26	220.86	21.00	345.00	480.00	641.00	1415.00	▃▇▅▁▁
bun	1	25.28	12.57	4.32	17.00	23.00	29.33	129.00	▇▂▁▁▁
sodium	1	139.45	3.09	120.00	138.00	139.80	141.00	152.00	▁▁▆▇▁
hgb	1	13.54	1.41	7.20	12.80	13.58	14.40	19.20	▁▂▇▃▁
glucose	1	109.35	42.56	37.00	87.00	96.93	114.88	424.00	▇▂▁▁▁
crcl	1	90.71	43.05	5.76	60.93	83.13	110.33	385.84	▇▇▁▁▁

###################################
# Verifying the data dimensions
###################################
dim(DPA.Predictors.Numeric)

## [1] 2231   13

1.3.2 Zero and Near-Zero Variance

[A] Low variance noted for 10 variables from the previous data quality assessment using a lower threshold.

[B] Low variance noted for 2 variables using a preprocessing summary from the caret package. The nearZeroVar method using both the freqCut and uniqueCut criteria set at 95/5 and 10, respectively, were applied on the dataset.
[B.1] dilver variable (factor)
[B.2] nifed variable (factor)

Code Chunk | Output

##################################
# Loading dataset
##################################
DPA <- peakVO2

##################################
# Gathering descriptive statistics
##################################
(DPA_Skimmed <- skim(DPA))

Data summary
Name	DPA
Number of rows	2231
Number of columns	41
_______________________
Column type frequency:
factor	27
numeric	14
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
betablok	1	FALSE	2	1: 1429, 0: 802
dilver	1	FALSE	2	0: 2215, 1: 16
nifed	1	FALSE	2	0: 2132, 1: 99
acei	1	FALSE	2	1: 1711, 0: 520
angioten.II	1	FALSE	2	0: 1941, 1: 290
anti.arrhy	1	FALSE	2	0: 1722, 1: 509
anti.coag	1	FALSE	2	0: 1332, 1: 899
aspirin	1	FALSE	2	0: 1193, 1: 1038
digoxin	1	FALSE	2	1: 1570, 0: 661
nitrates	1	FALSE	2	0: 1492, 1: 739
vasodilator	1	FALSE	2	0: 2095, 1: 136
diuretic.loop	1	FALSE	2	1: 1880, 0: 351
diuretic.thiazide	1	FALSE	2	0: 1952, 1: 279
diuretic.potassium.spar	1	FALSE	2	0: 1582, 1: 649
lipidrx.statin	1	FALSE	2	0: 1381, 1: 850
insulin	1	FALSE	2	0: 2016, 1: 215
surgery.pacemaker	1	FALSE	2	0: 1729, 1: 502
surgery.cabg	1	FALSE	2	0: 1637, 1: 594
surgery.pci	1	FALSE	2	0: 1755, 1: 476
surgery.aicd.implant	1	FALSE	2	0: 1584, 1: 647
smknow	1	FALSE	2	0: 1772, 1: 459
q.wave.mi	1	FALSE	2	0: 1952, 1: 279
niddm	1	FALSE	2	0: 1881, 1: 350
cad	1	FALSE	2	0: 1325, 1: 906
died	1	FALSE	2	0: 1505, 1: 726
male	1	FALSE	2	1: 1629, 0: 602
black	1	FALSE	2	0: 1965, 1: 266

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	54.22	11.03	18.00	48.00	55.00	62.00	82.00	▁▃▇▇▂
resting.systolic.bp	1	110.51	17.93	80.00	98.00	110.00	122.00	184.00	▇▇▅▁▁
resting.hr	1	76.18	14.39	39.00	66.00	75.00	86.00	124.00	▂▇▇▃▁
bmi	1	28.50	5.69	14.18	24.52	27.78	31.89	59.62	▂▇▃▁▁
lvef.metabl	1	20.26	7.34	5.00	15.00	20.00	25.00	40.00	▂▅▇▂▁
peak.rer	1	1.08	0.12	0.08	1.02	1.10	1.14	1.80	▁▁▇▆▁
peak.vo2	1	16.27	5.03	4.20	12.80	15.70	19.30	43.80	▃▇▂▁▁
interval	1	503.26	220.86	21.00	345.00	480.00	641.00	1415.00	▃▇▅▁▁
ttodead	1	62.75	36.77	1.00	30.00	62.00	95.00	127.00	▇▇▇▇▇
bun	1	25.28	12.57	4.32	17.00	23.00	29.33	129.00	▇▂▁▁▁
sodium	1	139.45	3.09	120.00	138.00	139.80	141.00	152.00	▁▁▆▇▁
hgb	1	13.54	1.41	7.20	12.80	13.58	14.40	19.20	▁▂▇▃▁
glucose	1	109.35	42.56	37.00	87.00	96.93	114.88	424.00	▇▂▁▁▁
crcl	1	90.71	43.05	5.76	60.93	83.13	110.33	385.84	▇▇▁▁▁

##################################
# Identifying columns with low variance
###################################
DPA_LowVariance <- nearZeroVar(DPA,
                               freqCut = 95/5,
                               uniqueCut = 10,
                               saveMetrics= TRUE)
(DPA_LowVariance[DPA_LowVariance$nzv,])

##        freqRatio percentUnique zeroVar  nzv
## dilver 138.43750     0.0896459   FALSE TRUE
## nifed   21.53535     0.0896459   FALSE TRUE

if ((nrow(DPA_LowVariance[DPA_LowVariance$nzv,]))==0){
  
  print("No low variance predictors noted.")
  
} else {

  print(paste0("Low variance observed for ",
               (nrow(DPA_LowVariance[DPA_LowVariance$nzv,])),
               " numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."))
  
  DPA_LowVarianceForRemoval <- (nrow(DPA_LowVariance[DPA_LowVariance$nzv,]))
  
  print(paste0("Low variance can be resolved by removing ",
               (nrow(DPA_LowVariance[DPA_LowVariance$nzv,])),
               " numeric variable(s)."))
  
  for (j in 1:DPA_LowVarianceForRemoval) {
  DPA_LowVarianceRemovedVariable <- rownames(DPA_LowVariance[DPA_LowVariance$nzv,])[j]
  print(paste0("Variable ",
               j,
               " for removal: ",
               DPA_LowVarianceRemovedVariable))
  }
  
  DPA %>%
  skim() %>%
  dplyr::filter(skim_variable %in% rownames(DPA_LowVariance[DPA_LowVariance$nzv,]))

  ##################################
  # Filtering out columns with low variance
  #################################
  DPA_ExcludedLowVariance <- DPA[,!names(DPA) %in% rownames(DPA_LowVariance[DPA_LowVariance$nzv,])]
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_ExcludedLowVariance_Skimmed <- skim(DPA_ExcludedLowVariance))
}

## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."
## [1] "Low variance can be resolved by removing 2 numeric variable(s)."
## [1] "Variable 1 for removal: dilver"
## [1] "Variable 2 for removal: nifed"

Data summary
Name	DPA_ExcludedLowVariance
Number of rows	2231
Number of columns	39
_______________________
Column type frequency:
factor	25
numeric	14
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
betablok	1	FALSE	2	1: 1429, 0: 802
acei	1	FALSE	2	1: 1711, 0: 520
angioten.II	1	FALSE	2	0: 1941, 1: 290
anti.arrhy	1	FALSE	2	0: 1722, 1: 509
anti.coag	1	FALSE	2	0: 1332, 1: 899
aspirin	1	FALSE	2	0: 1193, 1: 1038
digoxin	1	FALSE	2	1: 1570, 0: 661
nitrates	1	FALSE	2	0: 1492, 1: 739
vasodilator	1	FALSE	2	0: 2095, 1: 136
diuretic.loop	1	FALSE	2	1: 1880, 0: 351
diuretic.thiazide	1	FALSE	2	0: 1952, 1: 279
diuretic.potassium.spar	1	FALSE	2	0: 1582, 1: 649
lipidrx.statin	1	FALSE	2	0: 1381, 1: 850
insulin	1	FALSE	2	0: 2016, 1: 215
surgery.pacemaker	1	FALSE	2	0: 1729, 1: 502
surgery.cabg	1	FALSE	2	0: 1637, 1: 594
surgery.pci	1	FALSE	2	0: 1755, 1: 476
surgery.aicd.implant	1	FALSE	2	0: 1584, 1: 647
smknow	1	FALSE	2	0: 1772, 1: 459
q.wave.mi	1	FALSE	2	0: 1952, 1: 279
niddm	1	FALSE	2	0: 1881, 1: 350
cad	1	FALSE	2	0: 1325, 1: 906
died	1	FALSE	2	0: 1505, 1: 726
male	1	FALSE	2	1: 1629, 0: 602
black	1	FALSE	2	0: 1965, 1: 266

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	54.22	11.03	18.00	48.00	55.00	62.00	82.00	▁▃▇▇▂
resting.systolic.bp	1	110.51	17.93	80.00	98.00	110.00	122.00	184.00	▇▇▅▁▁
resting.hr	1	76.18	14.39	39.00	66.00	75.00	86.00	124.00	▂▇▇▃▁
bmi	1	28.50	5.69	14.18	24.52	27.78	31.89	59.62	▂▇▃▁▁
lvef.metabl	1	20.26	7.34	5.00	15.00	20.00	25.00	40.00	▂▅▇▂▁
peak.rer	1	1.08	0.12	0.08	1.02	1.10	1.14	1.80	▁▁▇▆▁
peak.vo2	1	16.27	5.03	4.20	12.80	15.70	19.30	43.80	▃▇▂▁▁
interval	1	503.26	220.86	21.00	345.00	480.00	641.00	1415.00	▃▇▅▁▁
ttodead	1	62.75	36.77	1.00	30.00	62.00	95.00	127.00	▇▇▇▇▇
bun	1	25.28	12.57	4.32	17.00	23.00	29.33	129.00	▇▂▁▁▁
sodium	1	139.45	3.09	120.00	138.00	139.80	141.00	152.00	▁▁▆▇▁
hgb	1	13.54	1.41	7.20	12.80	13.58	14.40	19.20	▁▂▇▃▁
glucose	1	109.35	42.56	37.00	87.00	96.93	114.88	424.00	▇▂▁▁▁
crcl	1	90.71	43.05	5.76	60.93	83.13	110.33	385.84	▇▇▁▁▁

1.3.3 Collinearity

[A] No high correlation > 95% noted for any variable pairs as confirmed using the preprocessing summaries from the caret and lares packages.

Code Chunk | Output

##################################
# Loading dataset
##################################
DPA <- peakVO2

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("died","ttodead")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Visualizing pairwise correlation between predictors
##################################
DPA_CorrelationTest <- cor.mtest(DPA.Predictors.Numeric,
                       method = "pearson",
                       conf.level = .95)

corrplot(cor(DPA.Predictors.Numeric,
             method = "pearson",
             use="pairwise.complete.obs"), 
         method = "circle",
         type = "upper", 
         order = "original", 
         tl.col = "black", 
         tl.cex = 0.75,
         tl.srt = 90, 
         sig.level = 0.05, 
         p.mat = DPA_CorrelationTest$p,
         insig = "blank")

##################################
# Identifying the highly correlated variables
##################################
DPA_Correlation <-  cor(DPA.Predictors.Numeric, 
                        method = "pearson",
                        use="pairwise.complete.obs")
(DPA_HighlyCorrelatedCount <- sum(abs(DPA_Correlation[upper.tri(DPA_Correlation)]) > 0.95))

## [1] 0

if (DPA_HighlyCorrelatedCount == 0) {
  print("No highly correlated predictors noted.")
} else {
  print(paste0("High correlation observed for ",
               (DPA_HighlyCorrelatedCount),
               " pairs of numeric variable(s) with Correlation.Coefficient>0.95."))
  
  (DPA_HighlyCorrelatedPairs <- corr_cross(DPA.Predictors.Numeric,
  max_pvalue = 0.05, 
  top = DPA_HighlyCorrelatedCount,
  rm.na = TRUE,
  grid = FALSE
))
  
}

## [1] "No highly correlated predictors noted."

if (DPA_HighlyCorrelatedCount > 0) {
  DPA_HighlyCorrelated <- findCorrelation(DPA_Correlation, cutoff = 0.95)
  
  (DPA_HighlyCorrelatedForRemoval <- length(DPA_HighlyCorrelated))
  
  print(paste0("High correlation can be resolved by removing ",
               (DPA_HighlyCorrelatedForRemoval),
               " numeric variable(s)."))
  
  for (j in 1:DPA_HighlyCorrelatedForRemoval) {
  DPA_HighlyCorrelatedRemovedVariable <- colnames(DPA.Predictors.Numeric)[DPA_HighlyCorrelated[j]]
  print(paste0("Variable ",
               j,
               " for removal: ",
               DPA_HighlyCorrelatedRemovedVariable))
  }
  
  ##################################
  # Filtering out columns with high correlation
  #################################
  DPA_ExcludedHighCorrelation <- DPA[,-DPA_HighlyCorrelated]
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_ExcludedHighCorrelation_Skimmed <- skim(DPA_ExcludedHighCorrelation))

}

1.3.4 Linear Dependencies

[A] No linear dependencies noted for any subset of variables using the preprocessing summary from the caret package applying the findLinearCombos method which utilizes the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist).

Code Chunk | Output

##################################
# Loading dataset
##################################
DPA <- peakVO2

##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("died","ttodead")]

##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]

##################################
# Identifying the linearly dependent variables
##################################
DPA_LinearlyDependent <- findLinearCombos(DPA.Predictors.Numeric)

(DPA_LinearlyDependentCount <- length(DPA_LinearlyDependent$linearCombos))

## [1] 0

if (DPA_LinearlyDependentCount == 0) {
  print("No linearly dependent predictors noted.")
} else {
  print(paste0("Linear dependency observed for ",
               (DPA_LinearlyDependentCount),
               " subset(s) of numeric variable(s)."))
  
  for (i in 1:DPA_LinearlyDependentCount) {
    DPA_LinearlyDependentSubset <- colnames(DPA.Predictors.Numeric)[DPA_LinearlyDependent$linearCombos[[i]]]
    print(paste0("Linear dependent variable(s) for subset ",
                 i,
                 " include: ",
                 DPA_LinearlyDependentSubset))
  }
  
}

## [1] "No linearly dependent predictors noted."

##################################
# Identifying the linearly dependent variables for removal
##################################

if (DPA_LinearlyDependentCount > 0) {
  DPA_LinearlyDependent <- findLinearCombos(DPA.Predictors.Numeric)
  
  DPA_LinearlyDependentForRemoval <- length(DPA_LinearlyDependent$remove)
  
  print(paste0("Linear dependency can be resolved by removing ",
               (DPA_LinearlyDependentForRemoval),
               " numeric variable(s)."))
  
  for (j in 1:DPA_LinearlyDependentForRemoval) {
  DPA_LinearlyDependentRemovedVariable <- colnames(DPA.Predictors.Numeric)[DPA_LinearlyDependent$remove[j]]
  print(paste0("Variable ",
               j,
               " for removal: ",
               DPA_LinearlyDependentRemovedVariable))
  }
  
  ##################################
  # Filtering out columns with linear dependency
  #################################
  DPA_ExcludedLinearlyDependent <- DPA[,-DPA_LinearlyDependent$remove]
  
  ##################################
  # Gathering descriptive statistics
  ##################################
  (DPA_ExcludedLinearlyDependent_Skimmed <- skim(DPA_ExcludedLinearlyDependent))

}

1.3.5 Pre-Processed Dataset

[A] 2231 rows (observations)

[B] 39 columns (variables)
     [B.1] 1/39 survival event status = died variable (factor)
            [B.1.1] Event = died=1
            [B.1.2] Censored or Non-Event = died=0
     [B.2] 1/39 survival time = ttodead variable (numeric)
     [B.3] 37/39 predictors = All remaining variables (24/39 factor + 13/39 numeric)

[C] Pre-processing actions applied:
     [C.1] No centering, scaling and shape transformation applied to maintain interpretability
     [C.2] No outlier treatment applied since the high values noted were contextually valid and sensible
     [C.3] 2 predictors removed due to zero or near-zero variance
     [C.4] No predictors removed due to high correlation
     [C.5] No predictors removed due to linear dependencies

Code Chunk | Output

##################################
# Filtering out columns noted with data quality issues including
# zero and near-zero variance,
# to create the pre-modelling dataset
##################################
PMA_ExcludedLowVariance <- peakVO2[,!names(peakVO2) %in% c("dilver","nifed")]

PMA_PreModelling_Train <- PMA_ExcludedLowVariance

##################################
# Gathering descriptive statistics
##################################
(PMA_PreModelling_Train_Skimmed <- skim(PMA_PreModelling_Train))

Data summary
Name	PMA_PreModelling_Train
Number of rows	2231
Number of columns	39
_______________________
Column type frequency:
factor	25
numeric	14
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
betablok	1	FALSE	2	1: 1429, 0: 802
acei	1	FALSE	2	1: 1711, 0: 520
angioten.II	1	FALSE	2	0: 1941, 1: 290
anti.arrhy	1	FALSE	2	0: 1722, 1: 509
anti.coag	1	FALSE	2	0: 1332, 1: 899
aspirin	1	FALSE	2	0: 1193, 1: 1038
digoxin	1	FALSE	2	1: 1570, 0: 661
nitrates	1	FALSE	2	0: 1492, 1: 739
vasodilator	1	FALSE	2	0: 2095, 1: 136
diuretic.loop	1	FALSE	2	1: 1880, 0: 351
diuretic.thiazide	1	FALSE	2	0: 1952, 1: 279
diuretic.potassium.spar	1	FALSE	2	0: 1582, 1: 649
lipidrx.statin	1	FALSE	2	0: 1381, 1: 850
insulin	1	FALSE	2	0: 2016, 1: 215
surgery.pacemaker	1	FALSE	2	0: 1729, 1: 502
surgery.cabg	1	FALSE	2	0: 1637, 1: 594
surgery.pci	1	FALSE	2	0: 1755, 1: 476
surgery.aicd.implant	1	FALSE	2	0: 1584, 1: 647
smknow	1	FALSE	2	0: 1772, 1: 459
q.wave.mi	1	FALSE	2	0: 1952, 1: 279
niddm	1	FALSE	2	0: 1881, 1: 350
cad	1	FALSE	2	0: 1325, 1: 906
died	1	FALSE	2	0: 1505, 1: 726
male	1	FALSE	2	1: 1629, 0: 602
black	1	FALSE	2	0: 1965, 1: 266

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	54.22	11.03	18.00	48.00	55.00	62.00	82.00	▁▃▇▇▂
resting.systolic.bp	1	110.51	17.93	80.00	98.00	110.00	122.00	184.00	▇▇▅▁▁
resting.hr	1	76.18	14.39	39.00	66.00	75.00	86.00	124.00	▂▇▇▃▁
bmi	1	28.50	5.69	14.18	24.52	27.78	31.89	59.62	▂▇▃▁▁
lvef.metabl	1	20.26	7.34	5.00	15.00	20.00	25.00	40.00	▂▅▇▂▁
peak.rer	1	1.08	0.12	0.08	1.02	1.10	1.14	1.80	▁▁▇▆▁
peak.vo2	1	16.27	5.03	4.20	12.80	15.70	19.30	43.80	▃▇▂▁▁
interval	1	503.26	220.86	21.00	345.00	480.00	641.00	1415.00	▃▇▅▁▁
ttodead	1	62.75	36.77	1.00	30.00	62.00	95.00	127.00	▇▇▇▇▇
bun	1	25.28	12.57	4.32	17.00	23.00	29.33	129.00	▇▂▁▁▁
sodium	1	139.45	3.09	120.00	138.00	139.80	141.00	152.00	▁▁▆▇▁
hgb	1	13.54	1.41	7.20	12.80	13.58	14.40	19.20	▁▂▇▃▁
glucose	1	109.35	42.56	37.00	87.00	96.93	114.88	424.00	▇▂▁▁▁
crcl	1	90.71	43.05	5.76	60.93	83.13	110.33	385.84	▇▇▁▁▁

###################################
# Verifying the data dimensions
# for the train set
###################################
dim(PMA_PreModelling_Train)

## [1] 2231   39

1.4 Data Exploration

[A] Numeric variables which demonstrated differential relationships with the ttodead and died survival time and event status variables, based on trend line correlation with survival time, include:
     [A.1] peak.vo2 variable (numeric)
     [A.2] interval variable (numeric)
     [A.3] sodium variable (numeric)

[B] Factor variables which demonstrated differential relationships with the ttodead and died survival time and event status variables, based on the Log-Rank Test p-value and estimated Kaplan-Meier survival curve separations, include:
     [B.1] betablk variable (factor)
     [B.2] diuretic.thiazide variable (factor)
     [B.3] digoxin variable (factor)
     [B.4] cad variable (factor)
     [B.5] diuretic.loop variable (factor)
     [B.6] male variable (factor)
     [B.7] surgery.cabg variable (factor)
     [B.8] insulin variable (factor)

[C] To simplify the subsequent computations, only three numeric variables (which demonstrated the most intuitive trend line with survival time and consistency of distribution) and three factor variables (which demonstrated the lowest Log-Rank Test p-value and sufficient separation in the estimated Kaplan-Meier survival curves) were evaluated during survival modelling which include:
     [C.1] peak.vo2 variable (numeric)
     [C.2] bun variable (numeric)
     [C.3] sodium variable (numeric)
     [C.4] insulin variable (factor)
     [C.5] diuretic.thiazide variable (factor)
     [C.6] male variable (factor)

Code Chunk | Output

##################################
# Loading dataset
##################################
EDA <- PMA_PreModelling_Train

##################################
# Loading dataset
##################################
EDA.DescriptiveStatistics <- EDA

EDA.DescriptiveStatistics$died <- ifelse(EDA.DescriptiveStatistics$died=="0","Survived+Censored","Died")

##################################
# Formulating a preliminary summary
# of descriptive statistics 
##################################
EDA.DescriptiveStatistics %>%
  tbl_summary(
    by = died,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c("{N_nonmiss}",
                                     "{mean} ({sd})", 
                                     "{median} ({p25}, {p75})", 
                                     "{min}, {max}"),
    missing = "no") %>% 
  add_overall() %>% 
  add_p()

Characteristic	Overall N = 2,231¹	Died N = 726¹	Survived+Censored N = 1,505¹	p-value²
age				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	54 (11)	56 (11)	53 (11)
Median (Q1, Q3)	55 (48, 62)	58 (50, 64)	54 (46, 61)
Min, Max	18, 82	18, 82	19, 82
resting.systolic.bp				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	111 (18)	108 (18)	112 (18)
Median (Q1, Q3)	110 (98, 122)	108 (94, 120)	110 (100, 122)
Min, Max	80, 184	80, 184	80, 184
resting.hr				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	76 (14)	79 (14)	75 (14)
Median (Q1, Q3)	75 (66, 86)	78 (68, 88)	74 (65, 84)
Min, Max	39, 124	41, 121	39, 124
bmi				0.031
N Non-missing	2,231	726	1,505
Mean (SD)	28.5 (5.7)	28.1 (5.5)	28.7 (5.8)
Median (Q1, Q3)	27.8 (24.5, 31.9)	27.5 (24.4, 31.1)	27.9 (24.6, 32.2)
Min, Max	14.2, 59.6	15.8, 51.9	14.2, 59.6
lvef.metabl				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	20 (7)	19 (7)	21 (7)
Median (Q1, Q3)	20 (15, 25)	20 (15, 25)	20 (15, 25)
Min, Max	5, 40	5, 40	5, 40
peak.rer				0.028
N Non-missing	2,231	726	1,505
Mean (SD)	1.08 (0.12)	1.08 (0.11)	1.08 (0.12)
Median (Q1, Q3)	1.10 (1.02, 1.14)	1.11 (1.03, 1.15)	1.10 (1.02, 1.14)
Min, Max	0.08, 1.80	0.08, 1.35	0.08, 1.80
peak.vo2				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	16.3 (5.0)	14.3 (4.5)	17.2 (5.0)
Median (Q1, Q3)	15.7 (12.8, 19.3)	13.9 (11.2, 16.6)	16.7 (13.6, 20.2)
Min, Max	4.2, 43.8	5.2, 43.4	4.2, 43.8
interval				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	503 (221)	424 (192)	542 (224)
Median (Q1, Q3)	480 (345, 642)	415 (282, 540)	532 (375, 700)
Min, Max	21, 1,415	35, 1,250	21, 1,415
ttodead				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	63 (37)	37 (29)	75 (34)
Median (Q1, Q3)	62 (30, 95)	33 (12, 57)	79 (48, 104)
Min, Max	1, 127	1, 119	11, 127
bun				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	25 (13)	30 (15)	23 (11)
Median (Q1, Q3)	23 (17, 29)	27 (20, 35)	21 (16, 27)
Min, Max	4, 129	6, 129	4, 104
sodium				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	139.5 (3.1)	139.1 (3.1)	139.6 (3.1)
Median (Q1, Q3)	139.8 (138.0, 141.0)	139.4 (138.0, 140.6)	140.0 (138.0, 141.0)
Min, Max	120.0, 152.0	122.0, 152.0	120.0, 150.0
hgb				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	13.54 (1.41)	13.28 (1.41)	13.67 (1.39)
Median (Q1, Q3)	13.58 (12.80, 14.40)	13.35 (12.60, 14.08)	13.70 (12.96, 14.59)
Min, Max	7.20, 19.20	7.90, 18.80	7.20, 19.20
glucose				0.001
N Non-missing	2,231	726	1,505
Mean (SD)	109 (43)	113 (47)	108 (40)
Median (Q1, Q3)	97 (87, 115)	98 (89, 122)	96 (87, 112)
Min, Max	37, 424	37, 424	42, 403
crcl				<0.001
N Non-missing	2,231	726	1,505
Mean (SD)	91 (43)	80 (40)	96 (44)
Median (Q1, Q3)	83 (61, 110)	71 (53, 96)	88 (66, 116)
Min, Max	6, 386	8, 386	6, 329
betablok				<0.001
0	802 (36%)	349 (48%)	453 (30%)
1	1,429 (64%)	377 (52%)	1,052 (70%)
acei				0.3
0	520 (23%)	178 (25%)	342 (23%)
1	1,711 (77%)	548 (75%)	1,163 (77%)
angioten.II				0.3
0	1,941 (87%)	624 (86%)	1,317 (88%)
1	290 (13%)	102 (14%)	188 (12%)
anti.arrhy				0.002
0	1,722 (77%)	531 (73%)	1,191 (79%)
1	509 (23%)	195 (27%)	314 (21%)
anti.coag				<0.001
0	1,332 (60%)	391 (54%)	941 (63%)
1	899 (40%)	335 (46%)	564 (37%)
aspirin				<0.001
0	1,193 (53%)	430 (59%)	763 (51%)
1	1,038 (47%)	296 (41%)	742 (49%)
digoxin				<0.001
0	661 (30%)	150 (21%)	511 (34%)
1	1,570 (70%)	576 (79%)	994 (66%)
nitrates				0.039
0	1,492 (67%)	464 (64%)	1,028 (68%)
1	739 (33%)	262 (36%)	477 (32%)
vasodilator				0.14
0	2,095 (94%)	674 (93%)	1,421 (94%)
1	136 (6.1%)	52 (7.2%)	84 (5.6%)
diuretic.loop				<0.001
0	351 (16%)	82 (11%)	269 (18%)
1	1,880 (84%)	644 (89%)	1,236 (82%)
diuretic.thiazide				<0.001
0	1,952 (87%)	601 (83%)	1,351 (90%)
1	279 (13%)	125 (17%)	154 (10%)
diuretic.potassium.spar				0.012
0	1,582 (71%)	540 (74%)	1,042 (69%)
1	649 (29%)	186 (26%)	463 (31%)
lipidrx.statin				0.068
0	1,381 (62%)	469 (65%)	912 (61%)
1	850 (38%)	257 (35%)	593 (39%)
insulin				<0.001
0	2,016 (90%)	631 (87%)	1,385 (92%)
1	215 (9.6%)	95 (13%)	120 (8.0%)
surgery.pacemaker				0.8
0	1,729 (77%)	560 (77%)	1,169 (78%)
1	502 (23%)	166 (23%)	336 (22%)
surgery.cabg				<0.001
0	1,637 (73%)	490 (67%)	1,147 (76%)
1	594 (27%)	236 (33%)	358 (24%)
surgery.pci				>0.9
0	1,755 (79%)	571 (79%)	1,184 (79%)
1	476 (21%)	155 (21%)	321 (21%)
surgery.aicd.implant				0.041
0	1,584 (71%)	536 (74%)	1,048 (70%)
1	647 (29%)	190 (26%)	457 (30%)
smknow				0.14
0	1,772 (79%)	590 (81%)	1,182 (79%)
1	459 (21%)	136 (19%)	323 (21%)
q.wave.mi				0.2
0	1,952 (87%)	625 (86%)	1,327 (88%)
1	279 (13%)	101 (14%)	178 (12%)
niddm				0.2
0	1,881 (84%)	601 (83%)	1,280 (85%)
1	350 (16%)	125 (17%)	225 (15%)
cad				<0.001
0	1,325 (59%)	378 (52%)	947 (63%)
1	906 (41%)	348 (48%)	558 (37%)
male				<0.001
0	602 (27%)	149 (21%)	453 (30%)
1	1,629 (73%)	577 (79%)	1,052 (70%)
black				0.8
0	1,965 (88%)	638 (88%)	1,327 (88%)
1	266 (12%)	88 (12%)	178 (12%)
¹ n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

##################################
# Listing all predictors
##################################
EDA.Predictors <- EDA[,!names(EDA) %in% c("died","ttodead")]

##################################
# Listing all numeric predictors
##################################
EDA.Predictors.Numeric <- EDA.Predictors[,sapply(EDA.Predictors, is.numeric)]
ncol(EDA.Predictors.Numeric)

## [1] 13

names(EDA.Predictors.Numeric)

##  [1] "age"                 "resting.systolic.bp" "resting.hr"         
##  [4] "bmi"                 "lvef.metabl"         "peak.rer"           
##  [7] "peak.vo2"            "interval"            "bun"                
## [10] "sodium"              "hgb"                 "glucose"            
## [13] "crcl"

##################################
# Listing all factor predictors
##################################
EDA.Predictors.Factor <- EDA.Predictors[,sapply(EDA.Predictors, is.factor)]
ncol(EDA.Predictors.Factor)

## [1] 24

names(EDA.Predictors.Factor)

##  [1] "betablok"                "acei"                   
##  [3] "angioten.II"             "anti.arrhy"             
##  [5] "anti.coag"               "aspirin"                
##  [7] "digoxin"                 "nitrates"               
##  [9] "vasodilator"             "diuretic.loop"          
## [11] "diuretic.thiazide"       "diuretic.potassium.spar"
## [13] "lipidrx.statin"          "insulin"                
## [15] "surgery.pacemaker"       "surgery.cabg"           
## [17] "surgery.pci"             "surgery.aicd.implant"   
## [19] "smknow"                  "q.wave.mi"              
## [21] "niddm"                   "cad"                    
## [23] "male"                    "black"

##################################
# Formulating the box plots
##################################
featurePlot(x = EDA.Predictors.Numeric, 
            y = EDA$died,
            plot = "box",
            scales = list(x = list(relation="free", rot = 90), 
                          y = list(relation="free")),
            adjust = 1.5, 
            pch = "|")

##################################
# Formulating the scatter plot
##################################
featurePlot(x = EDA.Predictors.Numeric, 
            y = EDA$ttodead,
            plot = "scatter",
            type = c("p", "smooth"),
            span = .5)

##################################
# Restructuring the dataset for
# for barchart analysis
##################################
died <- EDA$died
EDA.Bar.Source <- as.data.frame(cbind(died,
                     EDA.Predictors.Factor))
ncol(EDA.Bar.Source)

## [1] 25

names(EDA.Bar.Source)

##  [1] "died"                    "betablok"               
##  [3] "acei"                    "angioten.II"            
##  [5] "anti.arrhy"              "anti.coag"              
##  [7] "aspirin"                 "digoxin"                
##  [9] "nitrates"                "vasodilator"            
## [11] "diuretic.loop"           "diuretic.thiazide"      
## [13] "diuretic.potassium.spar" "lipidrx.statin"         
## [15] "insulin"                 "surgery.pacemaker"      
## [17] "surgery.cabg"            "surgery.pci"            
## [19] "surgery.aicd.implant"    "smknow"                 
## [21] "q.wave.mi"               "niddm"                  
## [23] "cad"                     "male"                   
## [25] "black"

##################################
# Creating a function to formulate
# the proportions table
##################################
EDA.PropTable.Function <- function(FactorVar) {
  EDA.Bar.Source.FactorVar <- EDA.Bar.Source[,c("died",
                                          FactorVar)]
  EDA.Bar.Source.FactorVar.Prop <- as.data.frame(prop.table(table(EDA.Bar.Source.FactorVar), 2))
  names(EDA.Bar.Source.FactorVar.Prop)[2] <- "Structure"
  EDA.Bar.Source.FactorVar.Prop$Variable <- rep(FactorVar,nrow(EDA.Bar.Source.FactorVar.Prop))

  return(EDA.Bar.Source.FactorVar.Prop)

}


EDA.Bar.Source.FactorVar.Prop.Group <- rbind(EDA.PropTable.Function(names(EDA.Bar.Source)[2]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[3]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[4]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[5]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[6]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[7]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[8]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[9]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[10]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[11]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[12]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[13]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[14]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[15]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[16]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[17]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[18]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[19]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[20]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[21]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[22]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[23]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[24]),
                                              EDA.PropTable.Function(names(EDA.Bar.Source)[25]))

(EDA.Barchart.FactorVar <- barchart(EDA.Bar.Source.FactorVar.Prop.Group[,3] ~
                                      EDA.Bar.Source.FactorVar.Prop.Group[,2] | EDA.Bar.Source.FactorVar.Prop.Group[,4],
                                      data=EDA.Bar.Source.FactorVar.Prop.Group,
                                      groups = EDA.Bar.Source.FactorVar.Prop.Group[,1],
                                      stack=TRUE,
                                      ylab = "Proportion",
                                      xlab = "Structure",
                                      auto.key = list(adj=1, space="top", columns=2),
                                      layout=(c(6,4))))

##################################
# Formulating the Kaplan-Meier plots
# for the categorical variables
##################################
KM_PreModelling_Train <- PMA_PreModelling_Train
KM_PreModelling_Train$died <- as.numeric(KM_PreModelling_Train$died )

KM_betablok <- survfit(Surv(ttodead, died) ~ betablok,
                      data=KM_PreModelling_Train)
KMPlot_betablok <- ggsurvplot(KM_betablok,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_aspirin <- survfit(Surv(ttodead, died) ~ aspirin,
                      data=KM_PreModelling_Train)
KMPlot_aspirin <- ggsurvplot(KM_aspirin,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_anti.coag <- survfit(Surv(ttodead, died) ~ anti.coag,
                      data=KM_PreModelling_Train)
KMPlot_anti.coag <- ggsurvplot(KM_anti.coag,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_anti.arrhy <- survfit(Surv(ttodead, died) ~ anti.arrhy,
                      data=KM_PreModelling_Train)
KMPlot_anti.arrhy <- ggsurvplot(KM_anti.arrhy,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_angioten.II <- survfit(Surv(ttodead, died) ~ angioten.II,
                      data=KM_PreModelling_Train)
KMPlot_angioten.II <- ggsurvplot(KM_angioten.II,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_acei <- survfit(Surv(ttodead, died) ~ acei,
                      data=KM_PreModelling_Train)
KMPlot_acei <- ggsurvplot(KM_acei,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_diuretic.thiazide <- survfit(Surv(ttodead, died) ~ diuretic.thiazide,
                      data=KM_PreModelling_Train)
KMPlot_diuretic.thiazide <- ggsurvplot(KM_diuretic.thiazide,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_diuretic.potassium.spar <- survfit(Surv(ttodead, died) ~ diuretic.potassium.spar,
                      data=KM_PreModelling_Train)
KMPlot_diuretic.potassium.spar <- ggsurvplot(KM_diuretic.potassium.spar,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_diuretic.loop <- survfit(Surv(ttodead, died) ~ diuretic.loop,
                      data=KM_PreModelling_Train)
KMPlot_diuretic.loop <- ggsurvplot(KM_diuretic.loop,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_digoxin <- survfit(Surv(ttodead, died) ~ digoxin,
                      data=KM_PreModelling_Train)
KMPlot_digoxin <- ggsurvplot(KM_digoxin,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_cad <- survfit(Surv(ttodead, died) ~ cad,
                      data=KM_PreModelling_Train)
KMPlot_cad <- ggsurvplot(KM_cad,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_black <- survfit(Surv(ttodead, died) ~ black,
                      data=KM_PreModelling_Train)
KMPlot_black <- ggsurvplot(KM_black,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_q.wave.mi <- survfit(Surv(ttodead, died) ~ q.wave.mi,
                      data=KM_PreModelling_Train)
KMPlot_q.wave.mi <- ggsurvplot(KM_q.wave.mi,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_nitrates <- survfit(Surv(ttodead, died) ~ nitrates,
                      data=KM_PreModelling_Train)
KMPlot_nitrates <- ggsurvplot(KM_nitrates,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_niddm <- survfit(Surv(ttodead, died) ~ niddm,
                      data=KM_PreModelling_Train)
KMPlot_niddm <- ggsurvplot(KM_niddm,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_male <- survfit(Surv(ttodead, died) ~ male,
                      data=KM_PreModelling_Train)
KMPlot_male <- ggsurvplot(KM_male,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_lipidrx.statin <- survfit(Surv(ttodead, died) ~ lipidrx.statin,
                      data=KM_PreModelling_Train)
KMPlot_lipidrx.statin <- ggsurvplot(KM_lipidrx.statin,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_insulin <- survfit(Surv(ttodead, died) ~ insulin,
                      data=KM_PreModelling_Train)
KMPlot_insulin <- ggsurvplot(KM_insulin,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_vasodilator <- survfit(Surv(ttodead, died) ~ vasodilator,
                      data=KM_PreModelling_Train)
KMPlot_vasodilator <- ggsurvplot(KM_vasodilator,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_surgery.pci <- survfit(Surv(ttodead, died) ~ surgery.pci,
                      data=KM_PreModelling_Train)
KMPlot_surgery.pci <- ggsurvplot(KM_surgery.pci,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_surgery.pacemaker <- survfit(Surv(ttodead, died) ~ surgery.pacemaker,
                      data=KM_PreModelling_Train)
KMPlot_surgery.pacemaker <- ggsurvplot(KM_surgery.pacemaker,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_surgery.cabg <- survfit(Surv(ttodead, died) ~ surgery.cabg,
                      data=KM_PreModelling_Train)
KMPlot_surgery.cabg <- ggsurvplot(KM_surgery.cabg,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_surgery.aicd.implant <- survfit(Surv(ttodead, died) ~ surgery.aicd.implant,
                      data=KM_PreModelling_Train)
KMPlot_surgery.aicd.implant <- ggsurvplot(KM_surgery.aicd.implant,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KM_smknow <- survfit(Surv(ttodead, died) ~ smknow,
                      data=KM_PreModelling_Train)
KMPlot_smknow <- ggsurvplot(KM_smknow,
                            xlab="Survival Time",
                            ylab="Survival Probability",
                            break.time.by=12,
                            ggtheme=theme_bw(),
                            fontsize=3,
                            pval.size=3,
                            pval=TRUE)

KMPlot_List <- list()
KMPlot_List[[1]] <- KMPlot_betablok
KMPlot_List[[2]] <- KMPlot_aspirin
KMPlot_List[[3]] <- KMPlot_anti.coag
KMPlot_List[[4]] <- KMPlot_anti.arrhy
KMPlot_List[[5]] <- KMPlot_angioten.II
KMPlot_List[[6]] <- KMPlot_acei
KMPlot_List[[7]] <- KMPlot_diuretic.thiazide
KMPlot_List[[8]] <- KMPlot_diuretic.potassium.spar
KMPlot_List[[9]] <- KMPlot_diuretic.loop
KMPlot_List[[10]] <- KMPlot_digoxin
KMPlot_List[[11]] <- KMPlot_cad
KMPlot_List[[12]] <- KMPlot_black

arrange_ggsurvplots(KMPlot_List,
                    ncol=4,
                    nrow=3)

KMPlot_List <- list()
KMPlot_List[[1]] <- KMPlot_q.wave.mi
KMPlot_List[[2]] <- KMPlot_nitrates
KMPlot_List[[3]] <- KMPlot_niddm
KMPlot_List[[4]] <- KMPlot_male
KMPlot_List[[5]] <- KMPlot_lipidrx.statin
KMPlot_List[[6]] <- KMPlot_insulin
KMPlot_List[[7]] <- KMPlot_vasodilator
KMPlot_List[[8]] <- KMPlot_surgery.pci
KMPlot_List[[9]] <- KMPlot_surgery.pacemaker
KMPlot_List[[10]] <- KMPlot_surgery.cabg
KMPlot_List[[11]] <- KMPlot_surgery.aicd.implant
KMPlot_List[[12]] <- KMPlot_smknow

arrange_ggsurvplots(KMPlot_List,
                    ncol=4,
                    nrow=3)

1.5 Survival Prediction Evaluation Metrics

1.5.1 Concordance Index (CIN)

Concordance Index is a rank correlation measure between a predictor and a possibly censored outcome with an event or censoring indicator. In survival analysis, a pair of patients is called concordant if the risk of the event predicted by a model is lower for the patient who experiences the event at a later time point. The concordance probability (C-index) is the frequency of concordant pairs among all pairs of subjects which can be used to measure and compare the discrimination power of risk prediction models.

Random Survival Forest builds an ensemble of tree-based learners based on different bootstrap samples of the original training data and only evaluates the split criterion compatible with censored survival times at each node for a randomly selected subset of features and thresholds. Final predictions are formed by aggregating the predictions of individual trees in the ensemble.

Cox Proportional Hazards examines the relationship between a hazard function and a set of covariates by assuming that predictors act multiplicatively on the hazard function without any explicit assumption of its distributional form. The model is semi-parametric because while the baseline hazard can take any form, the covariates enter the model linearly.

[A] The CIN metric was calculated for both the formulated Random Survival Forest (RSF) and Cox Proportional Hazards (CPH) models using the SurvMetrics, randomForestSRC and survival packages.

[B] The RSF model contains 2 hyperparameters:
     [B.1] ntree = number of trees held constant at a value of 500.
     [B.2] nsplit = number of random splits for splitting a variable held constant at a value of 3.

[C] The CPH model does not contain any hyperparameter.

[D] The mean split-sample cross-validated CIN based on 25 iterations are summarized as follows: :
     [D.1] RSF = 0.68637
     [D.2] CPH = 0.69929
     [D.3] The range of cross-validated CIN metrics was relatively higher for the CPH model as compared to RSF model, indicating that CPH model had better model predictions.

Code Chunk | Output

##################################
# Creating a local object
# for the train set
# using a numeric data type
# for all variables
##################################
MA_CIN <- peakVO2.Source[,names(peakVO2.Source) %in% c("ttodead",
                                                       "died",
                                                       "peak.vo2",
                                                       "bun",
                                                       "sodium",
                                                       "insulin",
                                                       "diuretic.thiazide",
                                                       "male")]

##################################
# Initializing data containers
# and random seed
##################################
set.seed(88888888)
RSF_Model_CIN = 0
CPH_Model_CIN = 0

for (i in 1:25) {
 ##################################
 # Formulating the train and test sets
 ##################################
 MA_CIN_Index = createFolds(1:nrow(MA_CIN), 2)
 MA_CIN_Train = MA_CIN[MA_CIN_Index[[1]],]
 MA_CIN_Test = MA_CIN[MA_CIN_Index[[2]],]

 ##################################
 # Formulating the Random Survival Forest model
 # and generating model predictions
 ##################################
 RSF_Model <-  rfsrc(Surv(ttodead, died)~.,
                   data = MA_CIN_Train,
                   nsplit = 3,
                   ntree = 500)
 RSF_Model_PredObject <- predict(RSF_Model, MA_CIN_Test)

 RSF_Model_Pred <- RSF_Model_PredObject$survival

 ##################################
 # Consolidating the ordered unique death times
 ##################################
 Ordered_Unique_Death_Time = RSF_Model$time.interest

 ##################################
 # Formulating the Cox Proportional Hazards model
 # and generating model predictions
 ##################################
 CPH_Model <-  coxph(Surv(ttodead, died)~.,
                     data = MA_CIN_Train,
                     x = TRUE)
 CPH_Model_Pred = predictSurvProb(CPH_Model,
                                  MA_CIN_Test,
                                  Ordered_Unique_Death_Time)

 ##################################
 # Calculating the Concordance Indices
 ##################################
 Median_Index <- median(1:length(Ordered_Unique_Death_Time))
 Survival_Object <- Surv(MA_CIN_Test$ttodead, MA_CIN_Test$died)

 RSF_Model_CIN[i] <- Cindex(Survival_Object,
                            predicted = RSF_Model_Pred[,Median_Index])

 CPH_Model_CIN[i] <- Cindex(Survival_Object,
                            predicted = CPH_Model_Pred[,Median_Index])

}

##################################
# Consolidating the repeated
# split-sample validated Concordance Indices
##################################
(SplitValidation_CIN = data.frame('CIN' = c(CPH_Model_CIN, RSF_Model_CIN),
                                 'Model' = c(rep('CPH', 25), rep('RSF', 25)),
                                 'Metric' = c(rep('CIN', 50))))

##         CIN Model Metric
## 1  0.700112   CPH    CIN
## 2  0.714396   CPH    CIN
## 3  0.702251   CPH    CIN
## 4  0.702791   CPH    CIN
## 5  0.677873   CPH    CIN
## 6  0.711866   CPH    CIN
## 7  0.689449   CPH    CIN
## 8  0.682214   CPH    CIN
## 9  0.709481   CPH    CIN
## 10 0.703413   CPH    CIN
## 11 0.713824   CPH    CIN
## 12 0.698661   CPH    CIN
## 13 0.717211   CPH    CIN
## 14 0.698832   CPH    CIN
## 15 0.687337   CPH    CIN
## 16 0.699212   CPH    CIN
## 17 0.707390   CPH    CIN
## 18 0.688120   CPH    CIN
## 19 0.686792   CPH    CIN
## 20 0.696447   CPH    CIN
## 21 0.694751   CPH    CIN
## 22 0.700993   CPH    CIN
## 23 0.712135   CPH    CIN
## 24 0.691864   CPH    CIN
## 25 0.694867   CPH    CIN
## 26 0.689888   RSF    CIN
## 27 0.692314   RSF    CIN
## 28 0.688001   RSF    CIN
## 29 0.699220   RSF    CIN
## 30 0.670046   RSF    CIN
## 31 0.714271   RSF    CIN
## 32 0.680497   RSF    CIN
## 33 0.672799   RSF    CIN
## 34 0.688503   RSF    CIN
## 35 0.681141   RSF    CIN
## 36 0.707463   RSF    CIN
## 37 0.691930   RSF    CIN
## 38 0.696516   RSF    CIN
## 39 0.691147   RSF    CIN
## 40 0.662421   RSF    CIN
## 41 0.696633   RSF    CIN
## 42 0.694339   RSF    CIN
## 43 0.679218   RSF    CIN
## 44 0.680564   RSF    CIN
## 45 0.667703   RSF    CIN
## 46 0.665116   RSF    CIN
## 47 0.683106   RSF    CIN
## 48 0.696414   RSF    CIN
## 49 0.683141   RSF    CIN
## 50 0.686914   RSF    CIN

(CPH_Model_CIN_CV <- mean(SplitValidation_CIN[SplitValidation_CIN$Model=="CPH",1]))

## [1] 0.6992913

(RSF_Model_CIN_CV <- mean(SplitValidation_CIN[SplitValidation_CIN$Model=="RSF",1]))

## [1] 0.6863722

(CIN_Boxplot <- bwplot(CIN ~ Model | Metric,
       data = SplitValidation_CIN,
       xlab = "Survival Model",
       ylab = "Concordance Index"))

1.5.2 Brier Score (BSC)

Brier Score is the mean square difference between the true classes and predicted probabilities. Serving as a a form of cost function, it is considered a proper score which means that it takes its minimal value only when the predicted probabilities match the empirical probabilities.

Random Survival Forest builds an ensemble of tree-based learners based on different bootstrap samples of the original training data and only evaluates the split criterion compatible with censored survival times at each node for a randomly selected subset of features and thresholds. Final predictions are formed by aggregating the predictions of individual trees in the ensemble.

Cox Proportional Hazards examines the relationship between a hazard function and a set of covariates by assuming that predictors act multiplicatively on the hazard function without any explicit assumption of its distributional form. The model is semi-parametric because while the baseline hazard can take any form, the covariates enter the model linearly.

[A] The BSC metric for the median survival time was calculated for both the formulated Random Survival Forest (RSF) and Cox Proportional Hazards (CPH) models using the SurvMetrics, randomForestSRC and survival packages.

[B] The RSF model contains 2 hyperparameters:
     [B.1] ntree = number of trees held constant at a value of 500.
     [B.2] nsplit = number of random splits for splitting a variable held constant at a value of 3.

[C] The CPH model does not contain any hyperparameter.

[D] The average split-sample cross-validated BSC based on 25 iterations are summarized as follows: :
     [D.1] RSF = 2.07448
     [D.2] CPH = 1.98999
     [D.3] The range of cross-validated BSC metrics was relatively lower for the CPH model as compared to RSF model, indicating that CPH model had better model predictions.

Code Chunk | Output

##################################
# Creating a local object
# for the train set
# using a numeric data type
# for all variables
##################################
MA_BSC <- peakVO2.Source[,names(peakVO2.Source) %in% c("ttodead",
                                                       "died",
                                                       "peak.vo2",
                                                       "bun",
                                                       "sodium",
                                                       "insulin",
                                                       "diuretic.thiazide",
                                                       "male")]

##################################
# Initializing data containers
# and random seed
##################################
set.seed(88888888)
RSF_Model_BSC = 0
CPH_Model_BSC = 0

for (i in 1:25) {
 ##################################
 # Formulating the train and test sets
 ##################################
 MA_BSC_Index = createFolds(1:nrow(MA_BSC), 2)
 MA_BSC_Train = MA_BSC[MA_BSC_Index[[1]],]
 MA_BSC_Test = MA_BSC[MA_BSC_Index[[2]],]

 ##################################
 # Formulating the Random Survival Forest model
 # and generating model predictions
 ##################################
 RSF_Model <-  rfsrc(Surv(ttodead, died)~.,
                   data = MA_BSC_Train,
                   nsplit = 3,
                   ntree = 500)
 RSF_Model_PredObject <- predict(RSF_Model, MA_BSC_Test)

 RSF_Model_Pred <- RSF_Model_PredObject$survival

 ##################################
 # Consolidating the ordered unique death times
 ##################################
 Ordered_Unique_Death_Time = RSF_Model$time.interest

 ##################################
 # Formulating the Cox Proportional Hazards model
 # and generating model predictions
 ##################################
 CPH_Model <-  coxph(Surv(ttodead, died)~.,
                     data = MA_BSC_Train,
                     x = TRUE)
 CPH_Model_Pred = predictSurvProb(CPH_Model,
                                  MA_BSC_Test,
                                  Ordered_Unique_Death_Time)

 ##################################
 # Calculating the Brier Scores
 ##################################
 Median_Index <- median(1:length(Ordered_Unique_Death_Time))
 Survival_Object <- Surv(MA_BSC_Test$ttodead, MA_BSC_Test$died)
 Median_DeathTime <- median(Ordered_Unique_Death_Time)

 RSF_Model_BSC[i] <- Brier(Survival_Object,
                           pre_sp = RSF_Model_Pred[,Median_Index],
                           Median_DeathTime)

 CPH_Model_BSC[i] <- Brier(Survival_Object,
                           pre_sp = CPH_Model_Pred[,Median_Index],
                           Median_DeathTime)

}

##################################
# Consolidating the repeated
# split-sample validated Brier Scores
##################################
(SplitValidation_BSC = data.frame('BSC' = c(CPH_Model_BSC, RSF_Model_BSC),
                                 'Model' = c(rep('CPH', 25), rep('RSF', 25)),
                                 'Metric' = c(rep('BSC', 50))))

##         BSC Model Metric
## 1  3.042685   CPH    BSC
## 2  2.823948   CPH    BSC
## 3  0.163000   CPH    BSC
## 4  5.653142   CPH    BSC
## 5  2.479632   CPH    BSC
## 6  0.154938   CPH    BSC
## 7  0.158088   CPH    BSC
## 8  0.158927   CPH    BSC
## 9  4.230457   CPH    BSC
## 10 6.025826   CPH    BSC
## 11 0.150234   CPH    BSC
## 12 0.167676   CPH    BSC
## 13 5.691130   CPH    BSC
## 14 0.160928   CPH    BSC
## 15 4.188327   CPH    BSC
## 16 5.096415   CPH    BSC
## 17 4.592601   CPH    BSC
## 18 0.171468   CPH    BSC
## 19 0.181396   CPH    BSC
## 20 0.161536   CPH    BSC
## 21 0.163480   CPH    BSC
## 22 0.156256   CPH    BSC
## 23 0.154343   CPH    BSC
## 24 3.652654   CPH    BSC
## 25 0.170667   CPH    BSC
## 26 3.411727   RSF    BSC
## 27 2.930870   RSF    BSC
## 28 0.165677   RSF    BSC
## 29 5.745821   RSF    BSC
## 30 2.514371   RSF    BSC
## 31 0.154984   RSF    BSC
## 32 0.161796   RSF    BSC
## 33 0.162158   RSF    BSC
## 34 4.250684   RSF    BSC
## 35 6.372230   RSF    BSC
## 36 0.152816   RSF    BSC
## 37 0.170807   RSF    BSC
## 38 6.167940   RSF    BSC
## 39 0.164670   RSF    BSC
## 40 4.438374   RSF    BSC
## 41 5.242487   RSF    BSC
## 42 4.558322   RSF    BSC
## 43 0.172799   RSF    BSC
## 44 0.185103   RSF    BSC
## 45 0.168630   RSF    BSC
## 46 0.170620   RSF    BSC
## 47 0.160568   RSF    BSC
## 48 0.156635   RSF    BSC
## 49 3.910247   RSF    BSC
## 50 0.171887   RSF    BSC

(CPH_Model_BSC_CV <- mean(SplitValidation_BSC[SplitValidation_BSC$Model=="CPH",1]))

## [1] 1.98999

(RSF_Model_BSC_CV <- mean(SplitValidation_BSC[SplitValidation_BSC$Model=="RSF",1]))

## [1] 2.074489

(BSC_Boxplot <- bwplot(BSC ~ Model | Metric,
       data = SplitValidation_BSC,
       xlab = "Survival Model",
       ylab = "Brier Score"))

1.5.3 Integrated Absolute Error (IAE)

Integrated Absolute Error is a type of continuous-time approach to continuous-time identification based on least-absolute errors.

Random Survival Forest builds an ensemble of tree-based learners based on different bootstrap samples of the original training data and only evaluates the split criterion compatible with censored survival times at each node for a randomly selected subset of features and thresholds. Final predictions are formed by aggregating the predictions of individual trees in the ensemble.

Cox Proportional Hazards examines the relationship between a hazard function and a set of covariates by assuming that predictors act multiplicatively on the hazard function without any explicit assumption of its distributional form. The model is semi-parametric because while the baseline hazard can take any form, the covariates enter the model linearly.

[A] The IAE metric for the survival time range was calculated for both the formulated Random Survival Forest (RSF) and Cox Proportional Hazards (CPH) models using the SurvMetrics, randomForestSRC and survival packages.

[B] The RSF model contains 2 hyperparameters:
     [B.1] ntree = number of trees held constant at a value of 500.
     [B.2] nsplit = number of random splits for splitting a variable held constant at a value of 3.

[C] The CPH model does not contain any hyperparameter.

[D] The average split-sample cross-validated IAE based on 25 iterations are summarized as follows: :
     [D.1] RSF = 1.76058
     [D.2] CPH = 1.69852
     [D.3] The range of cross-validated IAE metrics was relatively lower for the CPH model as compared to RSF model, indicating that CPH model had better model predictions.

Code Chunk | Output

##################################
# Creating a local object
# for the train set
# using a numeric data type
# for all variables
##################################
MA_IAE <- peakVO2.Source[,names(peakVO2.Source) %in% c("ttodead",
                                                       "died",
                                                       "peak.vo2",
                                                       "bun",
                                                       "sodium",
                                                       "insulin",
                                                       "diuretic.thiazide",
                                                       "male")]

##################################
# Initializing data containers
# and random seed
##################################
set.seed(88888888)
RSF_Model_IAE = 0
CPH_Model_IAE = 0

for (i in 1:25) {
 ##################################
 # Formulating the train and test sets
 ##################################
 MA_IAE_Index = createFolds(1:nrow(MA_IAE), 2)
 MA_IAE_Train = MA_IAE[MA_IAE_Index[[1]],]
 MA_IAE_Test = MA_IAE[MA_IAE_Index[[2]],]

 ##################################
 # Formulating the Random Survival Forest model
 # and generating model predictions
 ##################################
 RSF_Model <-  rfsrc(Surv(ttodead, died)~.,
                   data = MA_IAE_Train,
                   nsplit = 3,
                   ntree = 500)
 RSF_Model_PredObject <- predict(RSF_Model, MA_IAE_Test)

 RSF_Model_Pred <- RSF_Model_PredObject$survival

 ##################################
 # Consolidating the ordered unique death times
 ##################################
 Ordered_Unique_Death_Time = RSF_Model$time.interest

 ##################################
 # Formulating the Cox Proportional Hazards model
 # and generating model predictions
 ##################################
 CPH_Model <-  coxph(Surv(ttodead, died)~.,
                     data = MA_IAE_Train,
                     x = TRUE)
 CPH_Model_Pred = predictSurvProb(CPH_Model,
                                  MA_IAE_Test,
                                  Ordered_Unique_Death_Time)

 ##################################
 # Calculating the Integrated Absolute Errors
 ##################################
 Median_Index <- median(1:length(Ordered_Unique_Death_Time))
 Survival_Object <- Surv(MA_IAE_Test$ttodead, MA_IAE_Test$died)

 RSF_Model_IE     <- IAEISE(Survival_Object,
                          sp_matrix = RSF_Model_Pred,
                          Ordered_Unique_Death_Time)
 
 RSF_Model_IAE[i] <- RSF_Model_IE[1]

 CPH_Model_IE     <- IAEISE(Survival_Object,
                          sp_matrix = CPH_Model_Pred,
                          Ordered_Unique_Death_Time)
 
 CPH_Model_IAE[i] <- CPH_Model_IE[1]

}

##################################
# Consolidating the repeated
# split-sample validated Integrated Absolute Errors
##################################
(SplitValidation_IAE = data.frame('IAE' = c(CPH_Model_IAE, RSF_Model_IAE),
                                 'Model' = c(rep('CPH', 25), rep('RSF', 25)),
                                 'Metric' = c(rep('IAE', 50))))

##       IAE Model Metric
## 1  3.7202   CPH    IAE
## 2  1.9355   CPH    IAE
## 3  1.2259   CPH    IAE
## 4  1.9123   CPH    IAE
## 5  0.6668   CPH    IAE
## 6  1.2670   CPH    IAE
## 7  2.0430   CPH    IAE
## 8  1.1436   CPH    IAE
## 9  1.8833   CPH    IAE
## 10 1.5042   CPH    IAE
## 11 2.7417   CPH    IAE
## 12 2.4253   CPH    IAE
## 13 1.5415   CPH    IAE
## 14 1.6989   CPH    IAE
## 15 1.7323   CPH    IAE
## 16 1.3642   CPH    IAE
## 17 2.9995   CPH    IAE
## 18 1.0918   CPH    IAE
## 19 1.3910   CPH    IAE
## 20 1.4217   CPH    IAE
## 21 0.7612   CPH    IAE
## 22 0.9284   CPH    IAE
## 23 3.4032   CPH    IAE
## 24 0.7453   CPH    IAE
## 25 0.9154   CPH    IAE
## 26 3.8247   RSF    IAE
## 27 1.8812   RSF    IAE
## 28 1.2173   RSF    IAE
## 29 2.1966   RSF    IAE
## 30 1.1124   RSF    IAE
## 31 1.3252   RSF    IAE
## 32 2.1907   RSF    IAE
## 33 1.7358   RSF    IAE
## 34 2.0110   RSF    IAE
## 35 1.6630   RSF    IAE
## 36 2.1929   RSF    IAE
## 37 3.2396   RSF    IAE
## 38 1.3760   RSF    IAE
## 39 1.5616   RSF    IAE
## 40 1.2727   RSF    IAE
## 41 1.2661   RSF    IAE
## 42 2.3665   RSF    IAE
## 43 1.3245   RSF    IAE
## 44 1.4609   RSF    IAE
## 45 1.6619   RSF    IAE
## 46 0.9673   RSF    IAE
## 47 1.3053   RSF    IAE
## 48 2.5815   RSF    IAE
## 49 1.0318   RSF    IAE
## 50 1.2482   RSF    IAE

(CPH_Model_IAE_CV <- mean(SplitValidation_IAE[SplitValidation_IAE$Model=="CPH",1]))

## [1] 1.698528

(RSF_Model_IAE_CV <- mean(SplitValidation_IAE[SplitValidation_IAE$Model=="RSF",1]))

## [1] 1.760588

(IAE_Boxplot <- bwplot(IAE ~ Model | Metric,
       data = SplitValidation_IAE,
       xlab = "Survival Model",
       ylab = "Integrated Absolute Error"))

1.5.4 Integrated Square Error (ISE)

Integrated Square Error is a type of continuous-time approach to continuous-time identification based on least-squares errors.

Random Survival Forest builds an ensemble of tree-based learners based on different bootstrap samples of the original training data and only evaluates the split criterion compatible with censored survival times at each node for a randomly selected subset of features and thresholds. Final predictions are formed by aggregating the predictions of individual trees in the ensemble.

Cox Proportional Hazards examines the relationship between a hazard function and a set of covariates by assuming that predictors act multiplicatively on the hazard function without any explicit assumption of its distributional form. The model is semi-parametric because while the baseline hazard can take any form, the covariates enter the model linearly.

[A] The ISE metric for the survival time range was calculated for both the formulated Random Survival Forest (RSF) and Cox Proportional Hazards (CPH) models using the SurvMetrics, randomForestSRC and survival packages.

[B] The RSF model contains 2 hyperparameters:
     [B.1] ntree = number of trees held constant at a value of 500.
     [B.2] nsplit = number of random splits for splitting a variable held constant at a value of 3.

[C] The CPH model does not contain any hyperparameter.

[D] The average split-sample cross-validated ISE based on 25 iterations are summarized as follows: :
     [D.1] RSF = 0.04191
     [D.2] CPH = 0.04342
     [D.3] The range of cross-validated ISE metrics was relatively lower for the CPH model as compared to RSF model, indicating that CPH model had better model predictions.

Code Chunk | Output

##################################
# Creating a local object
# for the train set
# using a numeric data type
# for all variables
##################################
MA_ISE <- peakVO2.Source[,names(peakVO2.Source) %in% c("ttodead",
                                                       "died",
                                                       "peak.vo2",
                                                       "bun",
                                                       "sodium",
                                                       "insulin",
                                                       "diuretic.thiazide",
                                                       "male")]

##################################
# Initializing data containers
# and random seed
##################################
set.seed(88888888)
RSF_Model_ISE = 0
CPH_Model_ISE = 0

for (i in 1:25) {
 ##################################
 # Formulating the train and test sets
 ##################################
 MA_ISE_Index = createFolds(1:nrow(MA_ISE), 2)
 MA_ISE_Train = MA_ISE[MA_ISE_Index[[1]],]
 MA_ISE_Test = MA_ISE[MA_ISE_Index[[2]],]

 ##################################
 # Formulating the Random Survival Forest model
 # and generating model predictions
 ##################################
 RSF_Model <-  rfsrc(Surv(ttodead, died)~.,
                   data = MA_ISE_Train,
                   nsplit = 3,
                   ntree = 500)
 RSF_Model_PredObject <- predict(RSF_Model, MA_ISE_Test)

 RSF_Model_Pred <- RSF_Model_PredObject$survival

 ##################################
 # Consolidating the ordered unique death times
 ##################################
 Ordered_Unique_Death_Time = RSF_Model$time.interest

 ##################################
 # Formulating the Cox Proportional Hazards model
 # and generating model predictions
 ##################################
 CPH_Model <-  coxph(Surv(ttodead, died)~.,
                     data = MA_ISE_Train,
                     x = TRUE)
 CPH_Model_Pred = predictSurvProb(CPH_Model,
                                  MA_ISE_Test,
                                  Ordered_Unique_Death_Time)

 ##################################
 # Calculating the Integrated Square Error
 ##################################
 Median_Index <- median(1:length(Ordered_Unique_Death_Time))
 Survival_Object <- Surv(MA_ISE_Test$ttodead, MA_ISE_Test$died)

 RSF_Model_IE     <- IAEISE(Survival_Object,
                          sp_matrix = RSF_Model_Pred,
                          Ordered_Unique_Death_Time)
 
 RSF_Model_ISE[i] <- RSF_Model_IE[2]

 CPH_Model_IE     <- IAEISE(Survival_Object,
                          sp_matrix = CPH_Model_Pred,
                          Ordered_Unique_Death_Time)
 
 CPH_Model_ISE[i] <- CPH_Model_IE[2]

}

##################################
# Consolidating the repeated
# split-sample validated Integrated Square Error
##################################
(SplitValidation_ISE = data.frame('ISE' = c(CPH_Model_ISE, RSF_Model_ISE),
                                 'Model' = c(rep('CPH', 25), rep('RSF', 25)),
                                 'Metric' = c(rep('ISE', 50))))

##       ISE Model Metric
## 1  0.1511   CPH    ISE
## 2  0.0435   CPH    ISE
## 3  0.0211   CPH    ISE
## 4  0.0496   CPH    ISE
## 5  0.0067   CPH    ISE
## 6  0.0186   CPH    ISE
## 7  0.0500   CPH    ISE
## 8  0.0199   CPH    ISE
## 9  0.0358   CPH    ISE
## 10 0.0236   CPH    ISE
## 11 0.0805   CPH    ISE
## 12 0.0672   CPH    ISE
## 13 0.0265   CPH    ISE
## 14 0.0433   CPH    ISE
## 15 0.0435   CPH    ISE
## 16 0.0280   CPH    ISE
## 17 0.1072   CPH    ISE
## 18 0.0150   CPH    ISE
## 19 0.0228   CPH    ISE
## 20 0.0324   CPH    ISE
## 21 0.0076   CPH    ISE
## 22 0.0111   CPH    ISE
## 23 0.1233   CPH    ISE
## 24 0.0096   CPH    ISE
## 25 0.0100   CPH    ISE
## 26 0.1621   RSF    ISE
## 27 0.0420   RSF    ISE
## 28 0.0260   RSF    ISE
## 29 0.0670   RSF    ISE
## 30 0.0166   RSF    ISE
## 31 0.0205   RSF    ISE
## 32 0.0528   RSF    ISE
## 33 0.0405   RSF    ISE
## 34 0.0415   RSF    ISE
## 35 0.0282   RSF    ISE
## 36 0.0569   RSF    ISE
## 37 0.1174   RSF    ISE
## 38 0.0217   RSF    ISE
## 39 0.0412   RSF    ISE
## 40 0.0215   RSF    ISE
## 41 0.0217   RSF    ISE
## 42 0.0672   RSF    ISE
## 43 0.0229   RSF    ISE
## 44 0.0292   RSF    ISE
## 45 0.0448   RSF    ISE
## 46 0.0118   RSF    ISE
## 47 0.0242   RSF    ISE
## 48 0.0719   RSF    ISE
## 49 0.0174   RSF    ISE
## 50 0.0186   RSF    ISE

(CPH_Model_ISE_CV <- mean(SplitValidation_ISE[SplitValidation_ISE$Model=="CPH",1]))

## [1] 0.041916

(RSF_Model_ISE_CV <- mean(SplitValidation_ISE[SplitValidation_ISE$Model=="RSF",1]))

## [1] 0.043424

(ISE_Boxplot <- bwplot(ISE ~ Model | Metric,
       data = SplitValidation_ISE,
       xlab = "Survival Model",
       ylab = "Integrated Square Error"))

1.6 Consolidated Findings

[A] All performance evaluation metrics for survival prediction showed consistent results in terms of identifying the CPH model as better performing than the RSF model. The conditions with which the usage for each metric will vary, will depend on their intended application and/or current data restrictions.
     [A.1] CIN : Concordance Index (SurvMetrics package) evaluates the survival model’s discriminative ability through rank correlation. It is the most commonly used survival model evaluation metric which is a generalization of the Receiver Operating Characteristics (ROC) curve in the case of a complete survival data. While it is an intuitive measure of model discrimination, it diminishes the effect of tied survival data. Updated versions of the metric apply methods to address this data restriction.
     [A.2] BSC : Brier Score (SurvMetrics package) evaluates the true survival state of an observation and the predicted survival probability at a given time point. However, it is restricted by the selection of only a specific time point (normally assigned as the median of the observation time). Different selection methods may also have large differences in the evaluation of the model’s prediction effect. There is an updated version of the metric called the Integrated Brier Score which computes the metric for the entire time range, but may be computationally exhaustive.
     [A.3] IAE | ISE : Integrated Absolute and Square Errors (SurvMetrics package) evaluates the true and predicted survival functions. However, in practical applications, the mathematical expression of the survival function is usually unknown, which results to existing IAE and ISE metrics to be only used in simulation experiments. There is however the alternative to obtain the approximate expression of the survival function using the non-parametric KM estimation method, if the current analysis allows such leniency in assumptions.

Code Chunk | Output

################################################################################
# Consolidating all performance metric results
################################################################################
grid.arrange(CIN_Boxplot, 
             BSC_Boxplot, 
             IAE_Boxplot,
             ISE_Boxplot,
             ncol = 2)

2. Summary

3. References

[Book] Clinical Prediction Models by Ewout Steyerberg
[Book] Survival Analysis: A Self-Learning Text by David Kleinbaum and Mitchel Klein
[Book] Applied Survival Analysis Using R by Dirk Moore
[Book] Regression Modeling Strategies by Frank Harrell
[Book] R for Health Data Science by Ewen Harrison and Riinu Pius
[Book] Supervised Machine Learning by Michael Foley
[Book] Data Science for Biological, Medical and Health Research by Thomas Love
[R Package] survival by Terry Therneau
[R Package] survminer by Alboukadel Kassambara
[R Package] mice by Gerko Vink and Stef van Buuren
[R Package] foreign by R Core Team
[R Package] rms by Frank Harrell
[R Package] Hmisc by Frank Harrell and Charles Dupont
[R Package] VIM by Matthias Templ
[R Package] gridExtra by Baptiste Auguie
[R Package] finalfit by Ewen Harrison
[R Package] knitr by Yihui Xie
[R Package] dplyr by Hadley Wickham
[R Package] gtsummary by Daniel Sjoberg
[R Package] tidyr by Hadley Wickham
[R Package] purrr by Lionel Henry
[R Package] moments by Lukasz Komsta and Frederick Novomestky
[R Package] skimr by Elin Waring
[R Package] caret by Max Kuhn
[R Package] corrplot by Taiyun Wei
[R Package] SurvMetrics by Hanpu Zhou
[R Package] randomForestSRC by Hemant Ishwaran and Udaya Kogalur
[R Package] pec by Thomas Gerds
[Article] Predictive Evaluation Metrics in Survival Analysis by Zhou Hanpu
[Article] Random Survival Forests by Hemant Ishwaran
[Article] Survival Analysis by Lisa Sullivan
[Article] Survival Analysis in R by Emily Zabor
[Article] Assessment of Discrimination in Survival Analysis by R-Studio Team
[Article] Cox Proportional-Hazards Regression by MedCalc Team
[Article] Cox Proportional-Hazards Model by R Studio Team
[Article] Exploring Time To Event / Survival Data by Thomas Love
[Article] Survival Analysis by Alboukadel Kassambara
[Article] Survival Analysis with R by Joseph Rickert
[Article] Cox Model Assumptions by Alboukadel Kassambara
[Article] Understanding Predictions in Survival Analysis by Scikit Team
[Publication] Survival Analysis Part I: Basic Concepts and First Analyses by Taane Clark (British Journal of Cancer)
[Publication] Survival Analysis Part II: Multivariate Data Analysis – An Introduction to Concepts and Methods by Mike Bradburn (British Journal of Cancer)
[Publication] Survival Analysis Part III: Multivariate Data Analysis – Choosing a Model and Assessing its Adequacy and Fit by Mike Bradburn (British Journal of Cancer)
[Publication] Survival Analysis Part IV: Further Concepts and Methods in Survival Analysis by Taane Clark (British Journal of Cancer)
[Publication] Evaluating the Yield of Medical Tests by Frank Harrell, Robert Califf, David Pryor, Kerry Lee and Robert Rosati (Journal of the American Medical Association)
[Publication] Verification of Forecasts Expressed in Terms of Probability by Glenn Brier (Monthly Weather Review)
[Publication] A New Measure of Prognostic Separation in Survival Data by Patrick Royston and Willi Sauerbrei (Statistics in Medicine)
[Publication] L1 Splitting Rules in Survival Forests by Hoora Moradian, Denis Larocque and François Bellavance (Lifetime Data Analysis)
[Publication] Integrated Square Error of Hazard Rate Estimation for Survival Data with Missing Censoring Indicators by Yuye Zou, Guoliang Fan and Riquan Zhang (Journal of Systems Science and Complexity)
[Publication] The Explained Variation in Proportional Hazards Regression by Michael Schemper (Biometrika)
[Publication] Random Survival Forests for High-Dimensional Data by Hemant Ishwaran, Udaya Kogalur, Xi Chen and Andy Minn (Statistical Analysis and Data Mining)
[Publication] Regression Models and Life-Tables by David Cox (Royal Statistical Society)
[Tutorial] Survival Analysis / Time-To-Event Analysis in R by Heidi Seibold (Datacamp)

Supervised Learning : Exploring Performance Evaluation Metrics for Survival Prediction

John Pauline Pineda

March 10, 2023

1. Table of Contents

1.1 Sample Data

1.2 Data Quality Assessment

1.3 Data Preprocessing

1.3.1 Outlier

1.3.2 Zero and Near-Zero Variance

1.3.3 Collinearity

1.3.4 Linear Dependencies

1.3.5 Pre-Processed Dataset

1.4 Data Exploration

1.5 Survival Prediction Evaluation Metrics

1.5.1 Concordance Index (CIN)

1.5.2 Brier Score (BSC)

1.5.3 Integrated Absolute Error (IAE)

1.5.4 Integrated Square Error (ISE)

1.6 Consolidated Findings

2. Summary

3. References